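Author (Chinese): 陳昆澤
Author (English): Chen, Kun-Tze
Thesis title (Chinese): 利用基因體重組決定序列片段的次序與方向
Thesis title (English): Scaffolding Contigs Using Genome Rearrangements
Advisor (Chinese): 盧錦隆
Advisor (English): Lu, Chin-Lung
Committee members (Chinese): 李家同, 唐傳義, 邱顯泰, 林苕吟
Committee members (English): Lee, Chia-Tung; Tang, Chuan-Yi; Chiu, Hsien-Tai; Lin, Tiao-Yin
Degree: Doctoral
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 102062808
Year of publication (ROC calendar): 107 (2018)
Academic year of graduation: 106
Language: English
Number of pages: 93
Keywords (Chinese): 演算法, 生物資訊, 基因體, 序列片段, 支架, 定序與定向
Keywords (English): Algorithm, Bioinformatics, Genome, Contig, Scaffold, Ordering and orientation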
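Abstract (translated from the Chinese):

Continuing advances in DNA sequencing technology allow large numbers of draft genomes to be sequenced quickly and at low cost. These draft genomes, however, are usually only collections of partially sequenced, unassembled sequence fragments (contigs) whose positions and orientations relative to one another on the original genome remain unknown. Because of this incompleteness, the contigs cannot be used directly by existing algorithms for studying structural variation in genomes, reconstructing evolutionary relationships, or other biological applications. To obtain a more complete sequence for a draft genome, its contigs must be accurately ordered and oriented into longer, gap-containing sequences called scaffolds, so that the gaps between contigs can be filled correctly by a subsequent gap-closing procedure. One class of scaffolding methods is reference-based scaffolding, which orders and orients the contigs of a draft genome according to reference genomes. In this thesis, we first formulate scaffolding against a single complete reference genome as a one-sided block (or contig) ordering problem under the reversal and block-interchange distance. Using permutation groups from algebra, we design an efficient algorithm that solves the one-sided block ordering problem in O(δn) time, where n is the number of genes (or gene fragments) and δ is the number of reversals and block-interchanges used. We have also developed this algorithm into a scaffolding tool called CAR, which scaffolds the contigs of a target draft genome efficiently and more accurately based on one complete reference genome. On simulated and real data, our experimental results show that CAR indeed performs better than many other comparable single-complete-reference scaffolding tools in terms of sensitivity, precision, and genome coverage.

On the other hand, if rearrangement events have occurred between the target genome and the reference genome, or if the two are evolutionarily distant, scaffolding tools that rely on a single complete reference genome may produce erroneous scaffolds. This suggests that a single reference genome may not be enough to produce correct scaffolds for a draft genome. We therefore design a heuristic algorithm that extends the single-complete-reference tool CAR into a new tool, Multi-CAR, which uses multiple complete reference genomes to scaffold the contigs of a target draft genome more accurately. Our experimental results on real data with multiple complete reference genomes show that Multi-CAR outperforms two other multiple-reference scaffolding tools, Ragout and MeDuSa, in sensitivity, precision, genome coverage, number of scaffolds, and scaffold N50.

In practice, complete reference genomes are not always available for scaffolding a draft genome. We therefore further improve CAR and Multi-CAR into new single-reference and multiple-reference scaffolding tools, CSAR and Multi-CSAR, which can use a single incomplete genome and multiple incomplete genomes, respectively, as references for scaffolding a target draft genome. Finally, experimental results on real data show that CSAR and Multi-CSAR each perform better than other comparable scaffolding tools on several accuracy metrics. In addition, to make the tool convenient to use and to allow scaffolding results to be validated visually, we have developed CSAR-web, a web-service version of CSAR that gives users an easy-to-operate interface for running CSAR and presents the scaffolding results graphically.

Abstract (English):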
Continuing advances in DNA sequencing allow an increasing number of draft genomes to be produced rapidly at decreasing cost. However, these draft genomes are usually only partially sequenced, as collections of unassembled contigs whose relative positions and orientations along the genome being sequenced are still unknown. Because of this incompleteness, the contigs cannot be used directly by existing algorithms for studying genome structural variation, phylogeny reconstruction, and other biological applications. To obtain a more complete sequence of a draft genome, its contigs need to be accurately ordered and oriented into larger gap-containing sequences, called scaffolds, so that the gaps between scaffolded contigs can be correctly filled in the subsequent gap-closing process. One scaffolding approach is reference-based scaffolding, which scaffolds the contigs of a draft genome based on reference genomes. In this thesis, we first formulate single complete reference-based scaffolding as a one-sided block (or contig) ordering problem under the weighted reversal and block-interchange distance. Using permutation groups from algebra, we design an efficient algorithm that solves this one-sided block ordering problem in O(δn) time, where n is the number of genetic markers (or genes) and δ is the number of reversals and block-interchanges used. In addition, we have developed this algorithm into a scaffolding tool called CAR that can efficiently and more accurately scaffold the contigs of a target draft genome based on a complete reference genome. Our experimental results on simulated and real datasets show that CAR indeed performs better than many other similar single reference-based scaffolding tools in terms of sensitivity, precision and genome coverage.
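
To make the two rearrangement operations named in the abstract concrete, the following is a minimal sketch in Python. It is not the thesis's O(δn) permutation-group algorithm and not code from CAR. It assumes a genome is modelled as a linear list of signed integer markers, where the sign encodes orientation, and the function names `reversal` and `block_interchange` are hypothetical names chosen for this illustration.

```python
from typing import List


def reversal(genome: List[int], i: int, j: int) -> List[int]:
    """Reverse the segment genome[i..j] (inclusive) and flip each marker's sign.

    Illustrative model only: markers are signed integers on a linear genome.
    """
    if not (0 <= i <= j < len(genome)):
        raise ValueError("need 0 <= i <= j < len(genome)")
    flipped = [-marker for marker in reversed(genome[i:j + 1])]
    return genome[:i] + flipped + genome[j + 1:]


def block_interchange(genome: List[int], i: int, j: int, k: int, l: int) -> List[int]:
    """Swap the non-overlapping blocks genome[i..j] and genome[k..l] (inclusive).

    The segment between the two blocks stays in place; signs are unchanged.
    """
    if not (0 <= i <= j < k <= l < len(genome)):
        raise ValueError("need 0 <= i <= j < k <= l < len(genome)")
    return (genome[:i] + genome[k:l + 1] + genome[j + 1:k]
            + genome[i:j + 1] + genome[l + 1:])


# Hypothetical target genome, compared against the reference order 1..6.
target = [1, -3, -2, 5, 4, 6]
step1 = reversal(target, 1, 2)               # [1, 2, 3, 5, 4, 6]
step2 = block_interchange(step1, 3, 3, 4, 4)  # [1, 2, 3, 4, 5, 6]
print(step1)
print(step2)
```

In this toy case one reversal and one block-interchange transform the target into the reference order. In the one-sided block ordering problem the order and orientation of the target's contigs are themselves unknown, and the goal is to choose them so that the resulting rearrangement distance to the reference is as small as possible; the sketch only shows what the individual operations do.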

On the other hand, single reference-based scaffolding tools may produce erroneous scaffolds if there are rearrangements between the target and reference genomes or if their phylogenetic relationship is distant. This suggests that a single reference genome may not be sufficient to produce correct scaffolds of a draft genome. Therefore, we design a heuristic method that extends our single reference-based scaffolding tool CAR into a new tool called Multi-CAR, which can use multiple complete genomes as references to scaffold the contigs of a draft genome more accurately. Our experimental results on real datasets with multiple complete reference genomes show that Multi-CAR outperforms two other multiple reference-based scaffolding tools, Ragout and MeDuSa, in terms of sensitivity, precision, genome coverage, scaffold number and scaffold N50 size.
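
Scaffold N50 is one of the metrics named above. As a reminder of how it is usually computed, here is a small Python sketch using the standard definition: the length of the scaffold at which the running sum of lengths, taken from longest to shortest, first reaches at least half of the total. The function name `n50` and the example lengths are made up for illustration and are not taken from the thesis's datasets.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the loop always returns")


# Made-up scaffold lengths: total 290, so the half-way point is 145.
# 80 alone is below 145; 80 + 70 = 150 reaches it, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

A larger N50 together with a smaller scaffold number indicates that the contigs were joined into fewer, longer scaffolds, which is why the two are reported side by side.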

In practice, complete reference genomes are not always available for a draft genome to be scaffolded. Therefore, we further improve CAR and Multi-CAR into new single and multiple reference-based scaffolding tools, CSAR and Multi-CSAR, which can take a single incomplete genome and multiple incomplete genomes, respectively, as references to scaffold a target draft genome. Our experimental results on real datasets show that CSAR and Multi-CSAR outperform other similar single and multiple reference-based scaffolding tools, respectively, in terms of many accuracy metrics. In addition, for convenient use and visual validation of scaffolding results, we have developed a web server version of CSAR called CSAR-web, which provides users with an easy-to-operate interface to run CSAR and presents its scaffolding results graphically.
Contents

Abstract (in Chinese)
Abstract
Acknowledgements
List of Tables
List of Figures

1 Introduction

2 Scaffolding contigs in draft genomes using reversals and block-interchanges
2.1 Introduction
2.2 Preliminaries
2.2.1 One-sided block ordering problem
2.2.2 Permutation groups
2.2.3 A model for representing DNA molecules
2.2.4 An efficient algorithm for the one-sided block ordering problem
2.3 Experimental results
2.4 Summary

3 CAR: contig assembly of prokaryotic draft genomes using rearrangements
3.1 Introduction
3.2 Implementation
3.2.1 Overview
3.2.2 Basic idea of algorithm
3.2.3 Usage of CAR
3.3 Results and discussion
3.3.1 Testing dataset
3.3.2 Comparisons on sensitivity and precision
3.3.3 Comparison on genome coverage
3.3.4 Running time
3.4 Summary

4 Multi-CAR: a tool of contig scaffolding using multiple references
4.1 Introduction
4.2 Methods
4.2.1 Method of Multi-CAR
4.2.2 Usage of Multi-CAR
4.3 Results and discussion
4.3.1 Testing dataset
4.3.2 Comparisons on sensitivity and precision
4.3.3 Comparison on coverage, scaffold number and N50
4.3.4 Comparison on running time
4.4 Summary

5 CSAR: a contig scaffolding tool using algebraic rearrangements
5.1 Introduction
5.2 Methods
5.3 Results and discussion
5.3.1 Testing datasets
5.3.2 Results of the datasets
5.3.3 Running time
5.4 Summary

6 CSAR-web: a web server of contig scaffolding using algebraic rearrangements
6.1 Introduction
6.2 Web interface and usage
6.3 Summary

7 Multi-CSAR: a multiple reference-based contig scaffolder using algebraic rearrangements
7.1 Introduction
7.2 Methods
7.3 Results and Discussion
7.4 Summary

8 Conclusions and future works

Bibliography