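Author (Chinese): 黃裕安
Author (English): Huang, Yu-An
Thesis title (Chinese): 有通配符的參數化字串比對問題
Thesis title (English): Parameterized Pattern Matching with Wildcards
Advisor (Chinese): 韓永楷
Advisor (English): Hon, Wing-Kai
Committee members (Chinese): 韓永楷; 李哲榮; 蔡孟宗
Committee members (English): Hon, Wing-Kai; Lee, Che-Rung; Tsai, Meng-Tsung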
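Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 106062586
Year of publication (ROC calendar): 109 (2020)
Academic year of graduation (ROC calendar): 108
Language: English
Number of pages: 31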
Keywords (Chinese): 文本索引; 字串比對; 近似字串比對; 有通配符的字串比對; 參數化字串比對; 後綴樹; 勘誤字典樹; Baker編碼; 空間壓縮
Keywords (English): Text indexing; Pattern matching; Approximate pattern matching; Pattern matching with wildcard; Parameterized pattern matching; Suffix tree; Errata trie; Baker's encoding; Space compression
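String matching is a well-studied area of computer science. It is widely applied in bioinformatics, for example in matching DNA sequences. Because DNA sequences can be very long, reducing space usage is an important concern in this area. String matching has many variants; this thesis studies parameterized string matching with wildcards and proposes solutions that are efficient in both time and space.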

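We begin with the case of a single wildcard. We first introduce a solution based on a wildcard tree, which uses O(n log n) words of space and answers a query pattern in O(p + occ) time, where n is the length of the text, p is the length of the query pattern, and occ is the number of occurrences of the query pattern in the text. We then give a space-compressed solution, which uses O(n log σ) words of space and answers a query in O(p(log log n + log σ) + occ log σ) time, where σ is the number of distinct symbols in the alphabet.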

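The problem is then extended to k wildcards. We first build a k-layer wildcard tree to solve it; this first solution uses O(n log^k n) words of space and answers a query in O(2^k (p + k^2) + occ) time. The second solution compresses the last layer of the wildcard tree, using O(n log^(k-1) n log σ) words of space and answering a query in O(2^k p (k + log σ) + occ log σ) time.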
Pattern matching is a well-studied problem in computer science. It is frequently applied in bioinformatics, for example to search DNA sequences. Because such sequences can be very long, space-efficient indexing is an important concern in this area. The problem has many variants. In this thesis, we focus on parameterized pattern matching with wildcards in the pattern, and provide efficient indexing solutions to this problem.
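
To make the problem concrete, the following is a brute-force reference matcher, not the indexing solution of the thesis. It assumes the usual definition of parameterized matching from the literature: static symbols must match exactly, parameter symbols must match under a one-to-one renaming, and a wildcard in the pattern matches any single text symbol. The function names, the `params` set, and the `?` wildcard character are hypothetical choices made for this sketch.

```python
def p_match_with_wildcards(pattern, window, params, wildcard="?"):
    """Check whether `pattern` parameterized-matches `window`.

    Assumption: `params` is the set of parameter symbols; every other
    symbol is static. `wildcard` may appear only in the pattern.
    """
    if len(pattern) != len(window):
        return False
    fwd, bwd = {}, {}  # renaming pattern->text and its inverse
    for a, b in zip(pattern, window):
        if a == wildcard:
            continue  # a wildcard matches any single symbol
        if a in params:
            if b not in params:
                return False
            # The renaming must be consistent and one-to-one.
            if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
                return False
        elif a != b:
            return False  # static symbols must match exactly
    return True


def find_occurrences(text, pattern, params, wildcard="?"):
    """Return all start positions where the pattern matches (naive scan)."""
    p = len(pattern)
    return [
        i
        for i in range(len(text) - p + 1)
        if p_match_with_wildcards(pattern, text[i:i + p], params, wildcard)
    ]
```

For example, with `params = {"x", "y", "z"}`, calling `find_occurrences("xAxByAyB", "z?zB", params)` reports positions 0 and 4: `z` is renamed to `x` at position 0 and to `y` at position 4. This scan takes time proportional to n times p per query, which is the cost an index is meant to avoid.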

We start with the 1-wildcard case. We first introduce a solution based on a wildcard tree. It takes O(n log n) words of index space and O(p + occ) query time, where n is the length of the text, p is the length of the pattern, and occ is the number of occurrences of the pattern in the text. We then give a second solution with compressed space, which uses O(n log σ) words of index space and O(p(log log n + log σ) + occ log σ) query time, where σ is the alphabet size.
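
The keywords of this record name Baker's encoding and the suffix tree as ingredients of the index. As background, here is a minimal sketch of the standard "prev" encoding from the parameterized matching literature; it may differ in detail from the variant used in the thesis. The function name and the `params` argument are hypothetical.

```python
def prev_encode(s, params):
    """Baker-style prev encoding (textbook form, assumed here).

    Static symbols are kept as they are. A parameter symbol becomes 0 at
    its first occurrence, and otherwise the distance back to the previous
    occurrence of the same symbol.
    """
    last_seen = {}
    encoded = []
    for i, c in enumerate(s):
        if c in params:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
        else:
            encoded.append(c)
    return encoded
```

Two strings of equal length parameterized-match exactly when their encodings are equal, so `prev_encode("xAxB", {"x", "y"})` and `prev_encode("yAyB", {"x", "y"})` agree. Note that the encoding of a suffix is not simply a suffix of the encoding: `prev_encode("xAxB"[2:], {"x"})` starts with 0, while `prev_encode("xAxB", {"x"})[2:]` starts with 2. This is one reason a parameterized suffix tree needs more care than an ordinary one.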

Next, the problem is extended to k wildcards, which we solve with a k-layer wildcard tree. The first solution uses O(n log^k n) words of index space and answers a query in O(2^k (p + k^2) + occ) time. The second solution compresses the last layer of the wildcard trees; it takes O(n log^(k-1) n log σ) words of index space, and each query takes O(2^k p (k + log σ) + occ log σ) time.
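
As a rough illustration of what the compression buys, the sketch below evaluates the leading terms of the two space bounds stated above. It ignores the constants hidden by the O-notation and assumes base-2 logarithms, so the numbers indicate growth only and are not measured index sizes. The function name is hypothetical.

```python
import math


def space_terms(n, sigma, k):
    """Leading terms of the two stated space bounds, in words.

    Assumptions: constants are dropped and logarithms are base 2.
    Returns (n * log^k n, n * log^(k-1) n * log sigma).
    """
    log_n = math.log2(n)
    log_sigma = math.log2(sigma)
    uncompressed = n * log_n ** k
    compressed = n * log_n ** (k - 1) * log_sigma
    return uncompressed, compressed
```

For any k, the ratio of the two terms is log n / log σ. With a DNA-like alphabet of σ = 4 and a text of n = 2^30 symbols, that ratio is 30 / 2 = 15. The price, according to the bounds above, is the extra log σ factors in the query time.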
1 Introduction
1.1 Problem Definition
1.2 Baker’s Encoding and Parameterized Suffix Tree
2 An O(n log n)-word Index
2.1 Handling Case 1
2.2 Handling Case 2
3 Compressing to an O(n log σ)-word Index
3.1 Mapping
3.1.1 Level 1 Mapping
3.1.2 Level 2 Mapping
3.2 Matching
3.2.1 Proof of Lemma 4
4 Extension to k-PIWC
4.1 Naive Method
4.2 Blind Search
4.3 Compressing the k-th Layer
5 Conclusion and Open Problems