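Author (Chinese): 余家鴻
Author (English): Yu, Jia-Hong
Thesis title (Chinese): 基於Elias-Fano編碼法的反向索引:版本文件的分析與研究
Thesis title (English): Practical Inverted Index Based on Elias-Fano Encoding: A Case Study of Versioned Documents
Advisor (Chinese): 韓永楷
Advisor (English): Hon, Wing-Kai
Committee members (Chinese): 李哲榮, 盧錦隆
Committee members (English): Lee, Che-Rung; Lu, Chin-Lung
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 104062574
Year of publication (ROC calendar): 106 (2017)
Graduation academic year: 105
Language: English
Number of pages: 35
Keywords (Chinese): 反向索引, 資料壓縮, 演算法, 版本文件, Query
Keywords (English): Inverted-Index, Versioned-Document, Query, Data-Compression, Algorithm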
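The inverted index is an important technique and the one most commonly used in document retrieval. However, document collections keep growing, and a traditional inverted index consumes a great deal of space when it must support several useful keyword queries, for example: which documents contain a keyword, how important a document is with respect to that keyword, and where the keyword occurs within a document. In practice, for only about 300 MB of document data, a traditional inverted index that answers occurrence-position queries without sacrificing query time needs about 1.5 GB of disk space. To address this practical problem, we studied many data compression and index compression methods. In this thesis, we implement an index framework with better space consumption in practice, built on an existing method, Partitioned Elias-Fano encoding. We evaluate the framework on a versioned-document data set, Wikipedia, and find that it needs only about 150 MB of disk space for its index to support the different query types over about 300 MB of data. We also propose two query strategies for this framework and conduct experiments to discuss in depth their respective strengths and weaknesses and the situations each one suits.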
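The queries named in the abstract (which documents contain a keyword, and where it occurs) can be illustrated with a minimal, uncompressed positional inverted index. This is only a sketch under stated assumptions: whitespace tokenization, lowercase matching, and the hypothetical function names `build_positional_index` and `phrase_occurrences` are choices made for illustration and are not taken from the thesis.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a list of document strings.

    Assumption for illustration: terms are lowercase whitespace-separated tokens.
    """
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_occurrences(index, phrase):
    """Return {doc_id: [start positions]} for every occurrence of the phrase."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    result = {}
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        # A start position p matches if term i+1 occurs at position p + i + 1.
        starts = [p for p in index[terms[0]][doc_id]
                  if all(p + i + 1 in s for i, s in enumerate(later))]
        if starts:
            result[doc_id] = starts
    return result


if __name__ == "__main__":
    docs = [
        "the quick brown fox",
        "the brown fox jumps over the brown fox",
        "quick fox",
    ]
    index = build_positional_index(docs)
    print(sorted(index["quick"]))                  # document listing
    print(phrase_occurrences(index, "brown fox"))  # occurrence reporting
```

Storing every position of every term is what makes such an index larger than the text itself, which is the space cost the abstract describes.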
The inverted index is an important and well-known method for document retrieval. However, as the volume of documents grows rapidly, a basic inverted index becomes
costly when it must support common query functions such as document listing, time-travel, top-$k$, and occurrence reporting for phrase queries.
For example, supporting occurrence-reporting queries requires around 1.5 GB of disk space for the index on only around 300 MB of data.
To address this space problem, many index compression techniques have been studied.
In this thesis, we propose a practical, space-efficient index framework that builds an inverted index on the recently proposed partitioned Elias-Fano encoding,
and we conduct experiments on real data sets. Our index supports the query functions correctly with only around 150 MB of index space for 300 MB of real input data.
We develop two methods for querying our index and, based on our experimental results, discuss which kind of data is better suited to each method.
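As background for the encoding named in the abstract, the following is a minimal sketch of plain (non-partitioned) Elias-Fano encoding of a sorted integer list, such as the document IDs in one posting list. It is not the thesis's implementation: the function names `ef_encode` and `ef_decode` are hypothetical, bits are held in ordinary Python lists rather than packed bit vectors, and the partitioning step of partitioned Elias-Fano is omitted.

```python
def ef_encode(values, universe):
    """Encode a non-decreasing list of integers in [0, universe) with Elias-Fano.

    Each value is split into low_bits low bits, stored verbatim, and a high
    part, stored in a bit vector where element i sets bit (high_i + i).
    """
    n = len(values)
    if n == 0:
        return {"n": 0, "low_bits": 0, "lows": [], "highs": []}
    # floor(log2(universe / n)), clamped at 0 when universe < n.
    low_bits = max(0, (universe // n).bit_length() - 1)
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    highs = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        highs[(v >> low_bits) + i] = 1
    return {"n": n, "low_bits": low_bits, "lows": lows, "highs": highs}


def ef_decode(enc):
    """Recover the original list from the structure built by ef_encode."""
    out = []
    i = 0  # number of elements decoded so far
    for pos, bit in enumerate(enc["highs"]):
        if bit:
            high = pos - i  # undo the +i offset applied during encoding
            out.append((high << enc["low_bits"]) | enc["lows"][i])
            i += 1
    return out


def ef_size_bits(enc):
    """Bits a packed representation would need: low parts plus the high bit vector."""
    return enc["n"] * enc["low_bits"] + len(enc["highs"])


if __name__ == "__main__":
    posting_list = [3, 4, 7, 13, 14, 15, 21, 43]
    enc = ef_encode(posting_list, universe=44)
    assert ef_decode(enc) == posting_list
    print(enc["low_bits"], ef_size_bits(enc))
```

For n values drawn from a universe of size u, this layout uses roughly 2 + log2(u/n) bits per value, which is why it is attractive for compressing posting lists.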
1. Introduction
1.1 Document Retrieval on Versioned Document
1.2 Organization
2. Preliminaries
2.1 N-gram
2.2 Inverted Index
2.3 Elias-Fano Encoding
2.4 Partitioned Elias-Fano Encoding
3. Practical Framework
3.1 Problem Definition
3.2 Properties on Versioned Documents
3.3 Inverted Index Based on Elias-Fano Encoding
3.3.1 Data Preprocessing
3.3.2 Producing Demanded Inverted Lists
3.3.3 Producing Context Lists
3.3.4 Indexing
3.3.5 Query Method
4. Experiments
4.1 Index Compression Performance
4.2 Query Performance
4.3 Choice Between The Two Algorithms
4.4 More Discussion About Compression Performance
5. Conclusion and Future Work