Author: Kuo, Chun-Ting
Thesis title: Index Framework for Efficient Time-Travel Phrase Queries on Versioned Documents
Advisor: Hon, Wing Kai
Oral defense committee: Lee, Cherung; Yiu, Siu-Ming
Keywords: Document Retrieval; Indexing; Versioned Document; Phrase Searching
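The volume of versioned documents has been growing rapidly in recent years. Searches over such documents are usually restricted to a time range. How to index documents of this kind and answer time-travel phrase queries has been widely discussed in the information retrieval community. In this thesis, we propose a new suffix-tree-based framework for indexing such documents. The framework stores the index of versioned documents with a guaranteed space complexity of O(n log n + n log T) bits and a query complexity of O(p log n + k), and it supports queries for any pattern. For the implementation, we discuss several practical issues with the framework and use succinct data structures to reduce the space to nearly O(n log T) bits. We also conduct experiments as a proof of concept. Finally, we briefly present extensions of the framework.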
The volume of versioned documents is growing very quickly nowadays. How to exploit the redundancy among the documents to index them compactly, while supporting efficient time-constrained keyword queries, has been a hot topic recently [Anand et al., SIGIR’11, SIGIR’12; He et al., CIKM’09, CIKM’10, SIGIR’11, SIGIR’12]. In this paper, we propose a new framework to index versioned documents, and extend the keyword queries into the more general phrase queries. Our index is based on a suffix tree, taking O(n) space and answering a one-sided time-constrained phrase query for any phrase P in O((|P| + k) log n) time, where n is the dataset size and k is the output size. We discuss how to tune our framework with realistic assumptions; our experiments show that under similar space budgets, our index supports queries five times faster than the baseline inverted lists when the query length |P| is at least four.
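To make the query type concrete, here is a minimal Python sketch of a one-sided time-constrained phrase query. It is not the thesis's index: it swaps the suffix tree for a naively sorted list of word-level suffixes and filters matches by timestamp afterwards. The names `build_index` and `time_travel_query`, the `(timestamp, text)` input format, and the "timestamp at most t_max" reading of "one-sided" are all assumptions made for illustration.

```python
def build_index(versions):
    """Return every word-level suffix of every version, sorted lexicographically.

    versions: list of (timestamp, text) pairs; the list position is the
    version id. This is a naive stand-in for a generalized suffix tree.
    """
    entries = []
    for vid, (_ts, text) in enumerate(versions):
        tokens = tuple(text.split())
        for i in range(len(tokens)):
            entries.append((tokens[i:], vid))
    entries.sort()
    return entries


def _bound(entries, phrase, past_equal):
    """Binary search on the first len(phrase) tokens of each sorted suffix.

    past_equal=False gives the first suffix whose head is >= phrase;
    past_equal=True gives the first suffix whose head is > phrase.
    """
    m = len(phrase)
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        head = entries[mid][0][:m]
        if head < phrase or (past_equal and head == phrase):
            lo = mid + 1
        else:
            hi = mid
    return lo


def time_travel_query(entries, versions, phrase, t_max):
    """Return sorted ids of versions containing phrase with timestamp <= t_max."""
    p = tuple(phrase.split())
    if not p:
        return []
    start = _bound(entries, p, past_equal=False)
    end = _bound(entries, p, past_equal=True)
    # Filter the matching suffix range by the one-sided time constraint.
    hits = {vid for _suffix, vid in entries[start:end]
            if versions[vid][0] <= t_max}
    return sorted(hits)


versions = [
    (1, "the quick brown fox"),
    (2, "the quick red fox"),
    (3, "a quick brown fox jumps"),
]
index = build_index(versions)
print(time_travel_query(index, versions, "quick brown", 2))
```

The filtering step scans every occurrence of the phrase, so this sketch does not achieve the output-sensitive bound stated in the abstract; it only illustrates what the query returns.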
1 Introduction
1.1 Organization
2 Preliminaries
2.1 Generalized Suffix Tree
2.2 2-Dimensional Orthogonal Range Query
3 The Framework
3.1 Problem Definition
3.2 Index Method
3.3 Query Method
3.4 An Example Instantiation
4 Practical Optimization
4.1 Data Preprocessing
4.2 Framework Minimization
4.3 Query Enhancement
5 Experiment
5.1 Experimenting the Index Space
5.2 Experimenting the Query Performance
6 Further Discussion
7 Conclusion
[1] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Efficient temporal keyword queries over versioned text. In Proc. of ACM CIKM Conf, 2010.
[2] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Temporal index sharding for space-time efficiency in archive search. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 545–554. ACM, 2011.
[3] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Index maintenance for time-travel text search. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 235–244. ACM, 2012.
[4] Peter G Anick and Rex A Flynn. Versioning a full-text information retrieval system. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 98–111. ACM, 1992.
[5] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, Luo Si, et al. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2-3):127–256, 2012.
[6] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, and Gerhard Weikum. A time machine for text search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 519–526. ACM, 2007.
[7] Gobinda Chowdhury. Introduction to modern information retrieval. Facet publishing, 2010.
[8] J Shane Culpepper, Matthias Petri, and Falk Scholer. Efficient in-memory top-k document retrieval. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 225–234. ACM, 2012.
[9] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S Muthukrishnan. Compressing and searching XML data via two zips. In Proceedings of the 15th international conference on World Wide Web, pages 751–760. ACM, 2006.
[10] Paolo Ferragina and Rossano Venturini. System and method for string processing and searching using a compressed permuterm index, August 29 2007. US Patent App. 11/897,427.
[11] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
[12] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[13] Antonin Guttman. R-trees: A dynamic index structure for spatial searching, volume 14. ACM, 1984.
[14] Jinru He and Torsten Suel. Faster temporal range queries over versioned text. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 565–574. ACM, 2011.
[15] Jinru He and Torsten Suel. Optimizing positional index structures for versioned document collections. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 245–254. ACM, 2012.
[16] Jinru He, Hao Yan, and Torsten Suel. Compact full-text indexing of versioned document collections. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 415–424. ACM, 2009.
[17] Jinru He, Junyuan Zeng, and Torsten Suel. Improved index compression techniques for versioned document collections. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1239–1248. ACM, 2010.
[18] Hitwise. Hitwise: Search queries are getting longer.
[19] Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Space-efficient framework for top-k string retrieval problems. In Foundations of Computer Science, 2009. FOCS’09. 50th Annual IEEE Symposium on, pages 713–722. IEEE, 2009.
[20] http://www.keyworddiscovery.com/. Keyword and search engines statistics.
[21] Stefan Kurtz. Reducing the space requirement of suffix trees. Software-Practice and Experience, 29(13):1149–71, 1999.
[22] Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009.
[23] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[24] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.
[25] Chi-Man Liu, Thomas Wong, Edward Wu, Ruibang Luo, Siu-Ming Yiu, Yingrui Li, Bingqiang Wang, Chang Yu, Xiaowen Chu, Kaiyong Zhao, et al. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics, 28(6):878–879, 2012.
[26] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
[27] Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262–272, 1976.
[28] S Muthukrishnan. Efficient algorithms for document retrieval problems. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 657–666. Society for Industrial and Applied Mathematics, 2002.
[29] Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 2013.
[30] Enno Ohlebusch, Johannes Fischer, and Simon Gog. CST++. In String Processing and Information Retrieval, pages 322–333. Springer, 2010.
[31] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[32] Manish Patil, Sharma V Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, and Sabrina Chandrasekaran. Inverted indexes for phrases and strings. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 555–564. ACM, 2011.
[33] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589–607, 2007.
[34] Gerard Salton, Edward A Fox, and Harry Wu. Extended boolean information retrieval. Communications of the ACM, 26(11):1022–1036, 1983.
[35] Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT’08. IEEE Conference Record of 14th Annual Symposium on, pages 1–11. IEEE, 1973.
[36] Ian H Witten, Alistair Moffat, and Timothy C Bell. Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.