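Combination of Run-Length FM-Index and Grammar Compression by Induced Suffix Sorting (National Tsing Hua University Electronic Theses and Dissertations System)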


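Author (Chinese):	鄧晉杰
Author (English):	Deng, Jin-Jie
Thesis title (Chinese):	運行長度編碼FM索引與後綴排序誘導式文法壓縮的結合
Thesis title (English):	Combination of Run-Length FM-Index and Grammar Compression by Induced Suffix Sorting
Advisor (Chinese):	韓永楷
Advisor (English):	Hon, Wing-Kai
Committee members (Chinese):	李哲榮、蔡孟宗
Committee members (English):	Lee, Che-Rung; Tsai, Meng-Tsung
Degree:	Master's
Institution:	National Tsing Hua University
Department:	Department of Computer Science
Student ID:	108062520
Year of publication (ROC calendar):	111 (2022)
Academic year of graduation:	110
Language:	Chinese
Number of pages:	18
Keywords (translated from Chinese):	run-length compressed Burrows-Wheeler transform, grammar compression
Keywords (English):	RLBWT, grammar compression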

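Among compressed text indexes designed for highly repetitive texts such as biological gene sequences, the FM-index that supports backward search on the run-length compressed Burrows-Wheeler transform (RLBWT) stands out. Although the RLBWT needs more storage space than grammar-compressed text indexes, it answers queries for the number of occurrences of a long pattern in the text faster than they do. In this thesis, we build the RLBWT on a text that has been compressed with a special grammar based on induced suffix sorting, and thereby bring out the advantages of both the RLBWT and grammar-compressed text indexes. Experimental results show that, when built on biological gene sequences, our index needs less storage space than the RLBWT and also answers occurrence-count queries for long patterns faster. This seems to suggest that our index is suited to long-fragment alignment problems in bioinformatics.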

Using the run-length compressed Burrows–Wheeler transform (RLBWT) in conjunction with the backward search of the FM-index is the centerpiece of most compressed indexes working on highly repetitive data sets like biological sequences. Compared to grammar indexes, the RLBWT is often much larger, but queries like counting the occurrences of long patterns can be answered much faster than on any existing grammar index. In this thesis, we combine the virtues of a grammar with the RLBWT by building the RLBWT on top of a special grammar based on induced suffix sorting. Our experiments reveal that our hybrid approach outperforms the classic RLBWT with respect to index size and, on biological data sets and for sufficiently long patterns, with respect to query time, which could be interesting for aligning long reads in bioinformatics.
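To make the counting query mentioned in the abstract concrete, the following is a minimal Python sketch of backward search on a plain (uncompressed) BWT, together with a helper that shows how the BWT splits into runs. It is an illustration under simplifying assumptions, not the thesis's implementation: the BWT is built by naive suffix sorting, rank queries are answered by scanning the string, and the grammar step based on induced suffix sorting is left out. The function names `bwt`, `run_length_encode`, and `count_occurrences` are hypothetical.

```python
def bwt(text: str) -> str:
    """Naive BWT: sort all suffixes of text plus a sentinel smaller than any character."""
    t = text + "\0"  # assumption: "\0" does not occur in text
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # The BWT character of a suffix is the character preceding it (cyclically).
    return "".join(t[i - 1] for i in suffix_array)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse maximal runs of equal characters into (character, length) pairs."""
    runs: list[list] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(ch, n) for ch, n in runs]


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text by backward search."""
    # C[c] = number of characters in the text that are strictly smaller than c.
    freq: dict[str, int] = {}
    for ch in bwt_str:
        freq[ch] = freq.get(ch, 0) + 1
    C: dict[str, int] = {}
    total = 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; a real index answers this in
        # (near-)constant time with a succinct rank structure instead of a scan.
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # half-open suffix array interval [lo, hi)
    for c in reversed(pattern):  # process the pattern from right to left
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("abracadabra")
    print(run_length_encode(b))
    print(count_occurrences(b, "abra"))
```

Each backward-search step narrows the suffix array interval by one pattern character, so a pattern of length m costs m steps. This is why long patterns are the interesting case: an index that lets one step consume several characters of the pattern at once, as a grammar over the text can, needs fewer steps.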

Abstract (in Chinese) i
Abstract ii
Acknowledgement iii
Contents iv
List of Figures v
List of Tables vi
1 Introduction and Preliminaries 1
2 FM-Indexing with GCIS Grammar 5
2.1 Pattern Matching 5
2.2 Limiting Factor Lengths 8
3 Implementation and Evaluation 13
4 Future Work 16
Bibliography 17

T. Akagi, D. Köppl, Y. Nakashima, S. Inenaga, H. Bannai, and M. Takeda. Grammar index by induced suffix sorting. In Proc. SPIRE, volume 12944 of LNCS, pages 85–99, 2021.
J. Barbay, F. Claude, T. Gagie, G. Navarro, and Y. Nekrich. Efficient fully-compressed sequence representations. Algorithmica, 69(1):232–268, 2014.
M. Burrows and D. J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
Y. Chien, W. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. Geometric BWT: compressed text indexing via sparse suffixes and range searching. Algorithmica, 71(2):258–278, 2015.
F. Claude, A. Fariña, M. A. Martínez-Prieto, and G. Navarro. Universal indexes for highly repetitive document collections. Inf. Syst., 61:1–23, 2016.
G. Cormode and S. Muthukrishnan. The string edit distance matching problem with moves. ACM Trans. Algorithms, 3(1):2:1–2:19, 2007.
J. J. Deng, W. Hon, D. Köppl, and K. Sadakane. FM-indexing grammars induced by suffix sorting for long patterns. ArXiv CoRR, abs/2110.01181, 2021.
D. Díaz-Domínguez, G. Navarro, and A. Pacheco. An LMS-based grammar self-index with local consistency properties. In Proc. SPIRE, volume 12944 of LNCS, pages 100–113, 2021.
P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proc. FOCS, pages 390–398, 2000.
S. Gog, T. Beller, A. Moffat, and M. Petri. From theory to practice: Plug and play with succinct data structures. In Proc. SEA, volume 8504 of LNCS, pages 326–337, 2014.
S. Gog, J. Kärkkäinen, D. Kempa, M. Petri, and S. J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81(4):1370–1391, 2019.
S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci., 483:115–133, 2013.
A. Moffat and R. Y. K. Isal. Word-based text compression using the Burrows-Wheeler transform. Inf. Process. Manag., 41(5):1175–1192, 2005.
G. Nong, S. Zhang, and W. H. Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471–1484, 2011.
D. S. N. Nunes, F. A. da Louza, S. Gog, M. Ayala-Rincón, and G. Navarro. A grammar compression algorithm based on induced suffix sorting. In Proc. DCC, pages 42–51, 2018.
J. Sirén. Compressed suffix arrays for massive data. In Proc. SPIRE, volume 5721 of LNCS, pages 63–74, 2009.
