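Author (Chinese): 郭俊廷
Author (English): Kuo, Chun-Ting
Thesis title (Chinese): 版本文件中對時間查詢最佳化之索引架構
Thesis title (English): Index Framework for Efficient Time-Travel Phrase Queries on Versioned Documents
Advisor (Chinese): 韓永楷
Advisor (English): Hon, Wing Kai
Committee members (Chinese): 李哲榮, 姚兆明
Committee members (English): Lee, Cherung; Yiu, Siu-Ming
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 101062501
Year of publication (ROC calendar): 103
Graduation academic year: 102
Language: English
Number of pages: 37
Keywords (Chinese): 文件檢索, 索引, 版本文件, 片語搜尋
Keywords (English): Document Retrieval, Indexing, Versioned Document, Phrase Searching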
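Versioned documents have been growing at a rapid pace in recent years. Searches over such documents are usually restricted to a time range. How to index these documents and answer time-travel phrase queries has been widely discussed in the information retrieval community. In this thesis, we propose a new suffix-tree-based framework for indexing such documents. The framework stores the index of versioned documents in O(n log n + n log T) bits of space, answers queries in O(p log n + k) time, and supports queries for any pattern. In the implementation, we discuss several practical issues with the framework and use succinct data structures to reduce the space to nearly O(n log T) bits. We also conduct experiments to validate our approach. Finally, we briefly present extensions of the framework.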
The volume of versioned documents is growing very quickly. How to exploit the redundancy among versions to index them compactly while supporting efficient time-constrained keyword queries has been an active research topic recently [Anand et al., SIGIR’11, SIGIR’12; He et al., CIKM’09, CIKM’10, SIGIR’11, SIGIR’12]. In this paper, we propose a new framework to index versioned documents, and extend keyword queries to the more general phrase queries. Our index is based on a suffix tree, taking O(n) space and answering a one-sided time-constrained phrase query for any phrase P in O((|P| + k) log n) time, where n is the dataset size and k is the output size. We discuss how to tune our framework under realistic assumptions; our experiments show that, under similar space budgets, our index supports queries five times faster than the baseline inverted lists when the query length |P| is at least four.
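To make the query type concrete, the following is a minimal Python sketch of a one-sided time-constrained phrase query. It is not the thesis's index: it swaps the suffix tree for a naively sorted list of word-level suffixes and filters by timestamp with a linear scan instead of a range-query structure. The data layout (a list of `(timestamp, text)` versions), the function names `build_index` and `time_travel_query`, and the reading of "one-sided" as "timestamp at most t" are all assumptions made for illustration.

```python
from bisect import bisect_left


def build_index(versions):
    """Build a sorted list of (token_suffix, version_id) pairs.

    `versions` is a list of (timestamp, text) tuples (hypothetical layout).
    Storing every suffix explicitly takes quadratic space; a real index
    would store positions only.
    """
    entries = []
    for vid, (_, text) in enumerate(versions):
        tokens = tuple(text.split())
        for i in range(len(tokens)):
            entries.append((tokens[i:], vid))
    entries.sort()
    return entries


def time_travel_query(entries, versions, phrase, t):
    """Return ids of versions that contain `phrase` and have timestamp <= t."""
    p = tuple(phrase.split())
    # Suffixes that start with p form one contiguous run in sorted order;
    # (p,) compares lower than any (suffix, vid) pair whose suffix starts with p.
    i = bisect_left(entries, (p,))
    hits = set()
    while i < len(entries) and entries[i][0][:len(p)] == p:
        vid = entries[i][1]
        if versions[vid][0] <= t:  # one-sided time constraint (assumed form)
            hits.add(vid)
        i += 1
    return sorted(hits)


versions = [
    (1, "to be or not to be"),
    (2, "to be or not"),
    (5, "not to be"),
]
index = build_index(versions)
print(time_travel_query(index, versions, "to be", 2))
print(time_travel_query(index, versions, "not to be", 10))
```

The scan visits every occurrence of the phrase, so its cost grows with the number of occurrences rather than with the output size k; avoiding that gap is what a suffix tree combined with a range-query structure is meant to address.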
1 Introduction
1.1 Organization
2 Preliminaries
2.1 Generalized Suffix Tree
2.2 2-Dimensional Orthogonal Range Query
3 The Framework
3.1 Problem Definition
3.2 Index Method
3.3 Query Method
3.4 An Example Instantiation
4 Practical Optimization
4.1 Data Preprocessing
4.2 Framework Minimization
4.3 Query Enhancement
5 Experiment
5.1 Experimenting the Index Space
5.2 Experimenting the Query Performance
6 Further Discussion
7 Conclusion
[1] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Efficient temporal keyword queries over versioned text. In Proc. of ACM CIKM Conf, 2010.
[2] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Temporal index sharding for space-time efficiency in archive search. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 545–554. ACM, 2011.
[3] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Index maintenance for time-travel text search. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 235–244. ACM, 2012.
[4] Peter G Anick and Rex A Flynn. Versioning a full-text information retrieval system. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 98–111. ACM, 1992.
[5] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, Luo Si, et al. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2-3):127–256, 2012.
[6] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, and Gerhard Weikum. A time machine for text search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 519–526. ACM, 2007.
[7] Gobinda Chowdhury. Introduction to modern information retrieval. Facet publishing, 2010.
[8] J Shane Culpepper, Matthias Petri, and Falk Scholer. Efficient in-memory top-k document retrieval. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 225–234. ACM, 2012.
[9] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S Muthukrishnan. Compressing and searching XML data via two zips. In Proceedings of the 15th international conference on World Wide Web, pages 751–760. ACM, 2006.
[10] Paolo Ferragina and Rossano Venturini. System and method for string processing and searching using a compressed permuterm index, August 29 2007. US Patent App. 11/897,427.
[11] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
[12] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[13] Antonin Guttman. R-trees: A dynamic index structure for spatial searching, volume 14. ACM, 1984.
[14] Jinru He and Torsten Suel. Faster temporal range queries over versioned text. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 565–574. ACM, 2011.
[15] Jinru He and Torsten Suel. Optimizing positional index structures for versioned document collections. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 245–254. ACM, 2012.
[16] Jinru He, Hao Yan, and Torsten Suel. Compact full-text indexing of versioned document collections. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 415–424. ACM, 2009.
[17] Jinru He, Junyuan Zeng, and Torsten Suel. Improved index compression techniques for versioned document collections. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1239–1248. ACM, 2010.
[18] Hitwise. Hitwise: Search queries are getting longer.
[19] Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Space-efficient framework for top-k string retrieval problems. In Foundations of Computer Science, 2009. FOCS’09. 50th Annual IEEE Symposium on, pages 713–722. IEEE, 2009.
[20] http://www.keyworddiscovery.com/. Keyword and search engines statistics.
[21] Stefan Kurtz. Reducing the space requirement of suffix trees. Software: Practice and Experience, 29(13):1149–71, 1999.
[22] Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009.
[23] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[24] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. Soap2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.
[25] Chi-Man Liu, Thomas Wong, Edward Wu, Ruibang Luo, Siu-Ming Yiu, Yingrui Li, Bingqiang Wang, Chang Yu, Xiaowen Chu, Kaiyong Zhao, et al. Soap3: ultra-fast gpu-based parallel alignment tool for short reads. Bioinformatics, 28(6):878–879, 2012.
[26] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
[27] Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262–272, 1976.
[28] S Muthukrishnan. Efficient algorithms for document retrieval problems. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 657–666. Society for Industrial and Applied Mathematics, 2002.
[29] Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 2013.
[30] Enno Ohlebusch, Johannes Fischer, and Simon Gog. Cst++. In String Processing and Information Retrieval, pages 322–333. Springer, 2010.
[31] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. 1999.
[32] Manish Patil, Sharma V Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, and Sabrina Chandrasekaran. Inverted indexes for phrases and strings. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 555–564. ACM, 2011.
[33] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589–607, 2007.
[34] Gerard Salton, Edward A Fox, and Harry Wu. Extended boolean information retrieval. Communications of the ACM, 26(11):1022–1036, 1983.
[35] Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT’08. IEEE Conference Record of 14th Annual Symposium on, pages 1–11. IEEE, 1973.
[36] Ian H Witten, Alistair Moffat, and Timothy C Bell. Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.
 
 
 
 