Author (English): Deng, Jin-Jie
Thesis title (English): Combination of Run-Length FM-Index and Grammar Compression by Induced Suffix Sorting
Advisor (English): Hon, Wing-Kai
Committee members (English): Lee, Che-Rung
Tsai, Meng-Tsung
Keywords (Chinese): run-length compressed Burrows-Wheeler transform; grammar compression
Keywords (English): RLBWT; grammar compression
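Among compressed text indexes designed for highly repetitive texts such as biological gene sequences, the FM-index that supports backward search on the run-length compressed Burrows-Wheeler transform (RLBWT) is among the best. Although the RLBWT needs more space than grammar-compressed text indexes, it answers queries that count the occurrences of a long pattern in the text faster than grammar-compressed text indexes do. In this thesis, we bring out the advantages of both the RLBWT and grammar-compressed text indexes by building the RLBWT on a text compressed with a special grammar based on induced suffix sorting. When our proposed index is built on biological gene sequences, experimental results show that it needs less space than the RLBWT and also answers occurrence-counting queries for long patterns faster. This seems to suggest that our proposed index is well suited to long-read alignment problems in bioinformatics.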
Using the run-length compressed Burrows–Wheeler transform (RLBWT) in conjunction with backward search in the FM-index is the centerpiece of most compressed indexes for highly repetitive data sets such as biological sequences. Compared to grammar indexes, the RLBWT is often much larger, but queries such as counting the occurrences of long patterns can be answered much faster than on any existing grammar index. In this thesis, we combine the virtues of a grammar with those of the RLBWT by building the RLBWT on top of a special grammar based on induced suffix sorting. Our experiments reveal that our hybrid approach outperforms the classic RLBWT with respect to index size, and with respect to query time on biological data sets for sufficiently long patterns, which could be interesting for aligning long reads in bioinformatics.
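As a rough illustration of the two building blocks named in the abstract, the following minimal Python sketch computes a BWT, run-length encodes it, and counts pattern occurrences with backward search. It is not the thesis's GCIS-based index: the function names (`bwt`, `run_length_encode`, `count_occurrences`) are hypothetical, the suffix sort and rank queries are deliberately naive, and the search runs on the plain BWT string rather than on a compressed representation.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via a naive suffix sort (small inputs only)."""
    assert "$" not in text              # '$' is reserved as the terminator
    s = text + "$"                      # assumed smaller than every other character
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character is the one preceding each sorted suffix (cyclically).
    return "".join(s[i - 1] for i in suffix_array)


def run_length_encode(b: str) -> list[tuple[str, int]]:
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    return [(c, len(list(group))) for c, group in groupby(b)]


def count_occurrences(b: str, pattern: str) -> int:
    """Count occurrences of `pattern` by backward search on the BWT string b."""
    # first_row_start[c] = number of characters in b strictly smaller than c.
    frequency: dict[str, int] = {}
    for c in b:
        frequency[c] = frequency.get(c, 0) + 1
    first_row_start: dict[str, int] = {}
    total = 0
    for c in sorted(frequency):
        first_row_start[c] = total
        total += frequency[c]

    def rank(c: str, i: int) -> int:
        # Naive rank: occurrences of c in b[:i]; a real index answers this fast.
        return b[:i].count(c)

    lo, hi = 0, len(b)                  # half-open suffix-array interval
    for c in reversed(pattern):         # process the pattern right to left
        if c not in first_row_start:
            return 0
        lo = first_row_start[c] + rank(c, lo)
        hi = first_row_start[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("abracadabra")
runs = run_length_encode(transformed)
print(transformed, len(runs), count_occurrences(transformed, "abra"))
```

In this toy example the 12-character BWT collapses to 8 runs; on highly repetitive inputs the number of runs is typically far smaller than the text length, which is what makes the run-length representation attractive.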
Abstract (in Chinese)
Abstract
Acknowledgement
Contents
List of Figures
List of Tables
1 Introduction and Preliminaries
2 FM-Indexing with GCIS Grammar
2.1 Pattern Matching
2.2 Limiting Factor Lengths
3 Implementation and Evaluation
4 Future Work
Bibliography
