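Author (Chinese): 吳琮偉
Author (English): Wu, Tsung-Wei
Thesis title (Chinese): 根據最少配對模式解決scaffolding問題的啟發式演算法
Thesis title (English): A Heuristic Algorithm for Solving Scaffolding Problem Based on Exemplar Model
Advisor (Chinese): 盧錦隆
Advisor (English): Lu, Chin-Lung
Committee members (Chinese): 邱顯泰; 林苕吟
Committee members (English): Chiu, Hsien-Tai; Lin, Tiao-Yin
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 105062631
Year of publication (ROC calendar): 108 (2019)
Academic year of graduation: 107
Language: Chinese
Number of pages: 70
Keywords (Chinese): genome assembly; heuristic algorithm; rearrangement distance; exemplar model
Keywords (English): contig scaffolding; heuristic algorithm; exemplar model; breakpoint distance; rearrangement-based approach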
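Abstract (translated from Chinese): Assembling a genome from the short reads produced by current next-generation sequencing technologies usually yields a set of contigs, which together form the draft genome of the organism. Because the order and orientation of the contigs in a draft genome are still undetermined, a scaffolding step is needed to determine its scaffolds before the genome content can be reconstructed. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that uses a complete or incomplete reference genome to determine the order and orientation of the contigs in a target draft genome. Notably, of the conserved sequence markers shared by the reference and target genomes, CSAR can use only the non-duplicated ones (singleton markers). However, duplicate markers caused by segmental duplication events are frequently found in genomes. It is therefore worthwhile to design a method that lets CSAR handle both singleton and duplicate markers so that its accuracy can be improved further. In this study, we first use the exemplar model to keep exactly one representative of each duplicate marker, so that all remaining markers are singletons. We then apply CSAR to determine the scaffolds of the target draft genome based on the reference genome. Finally, experimental results on simulated and real bacterial and plant datasets show that our heuristic scaffolding algorithm achieves better accuracy than the original rearrangement-based scaffolding method applied to genomes without duplicate markers, or than using MUMmer to handle the duplicate markers.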
Abstract (English): Assembling a genome from short reads obtained by current next-generation sequencing techniques often results in a collection of contigs, which constitute the draft genome of the species. The order and orientation of these contigs are unknown, so they must still be ordered and oriented into scaffolds before the complete genome of the species can be obtained. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that can use a complete or incomplete reference genome to scaffold a target draft genome. In particular, the conserved sequence markers between the target and reference genomes that CSAR uses are required to be singletons. However, duplicate markers (caused by segmental duplication) are commonly observed in genome sequences. It is therefore worthwhile to design an approach that allows CSAR to deal with both singleton and duplicate markers so that its accuracy can be further improved. In this study, we first use the exemplar model to delete all but one marker from each family of duplicate markers, so that the remaining markers are all singletons. Next, we apply CSAR to scaffold the target genome based on the reference genome. Finally, according to experimental results on simulated and real bacterial and plant datasets, our heuristic scaffolding algorithm is more accurate than the original rearrangement-based algorithm applied to genomes without duplicate markers, or to genomes whose duplicate markers are processed by MUMmer.
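The exemplar step described in the abstract can be illustrated with a small sketch. It is not the thesis's method: the outline lists an ILP formulation, whereas this sketch simply enumerates every exemplar choice by brute force, which is only feasible for tiny inputs. It also assumes that each genome is a single linear sequence of signed integer markers, that a marker family is identified by the absolute value, that both genomes contain the same families, and that telomeres are ignored. All function names are hypothetical.

```python
from itertools import product


def adjacencies(genome):
    """Return the set of adjacencies of a linear signed marker sequence.

    The adjacency (a, b) read on one strand equals (-b, -a) read on the
    other strand, so each one is stored in a canonical form.
    """
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}


def breakpoint_distance(g1, g2):
    """Count the adjacencies of g1 that do not occur in g2 (singletons only)."""
    return len(adjacencies(g1) - adjacencies(g2))


def exemplar_choices(genome):
    """Yield every sequence that keeps exactly one marker per family."""
    positions = {}
    for index, marker in enumerate(genome):
        positions.setdefault(abs(marker), []).append(index)
    families = sorted(positions)
    for pick in product(*(positions[f] for f in families)):
        keep = set(pick)
        yield [m for i, m in enumerate(genome) if i in keep]


def exemplar_breakpoint_distance(g1, g2):
    """Brute-force search (assumption: small inputs) for the exemplar pair
    with the fewest breakpoints. Returns (distance, exemplar1, exemplar2)."""
    best = None
    for e1 in exemplar_choices(g1):
        for e2 in exemplar_choices(g2):
            d = breakpoint_distance(e1, e2)
            if best is None or d < best[0]:
                best = (d, e1, e2)
    return best


if __name__ == "__main__":
    reference = [1, 2, 3, 4]
    target = [1, 3, 2, 3, 4]  # marker family 3 occurs twice
    print(exemplar_breakpoint_distance(reference, target))
```

In this toy input, keeping the second copy of marker 3 in the target leaves a sequence identical to the reference, so the search finds an exemplar pair with no breakpoints. The resulting singleton-only sequences are the kind of input the abstract says is then passed to CSAR.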
Chinese Abstract
Abstract
Acknowledgement
Contents
List of figures
List of tables
Chapter 1 Background
Chapter 2 Methods
2.1 Overview of CSAR
2.1.1 Marker, Adjacency, Telomere
2.1.2 Adjacency Graph, Fusion
2.1.3 Good Path, Good Fusion
2.1.4 Limitation of CSAR
2.2 Identification of sequence markers
2.2.1 MUMmer
2.2.2 Sibelia
2.3 Exemplar Model
2.3.1 Breakpoint Distance
2.3.2 Problem Statement
2.3.3 ILP Formulation
Chapter 3 Experiment Results and Discussion
3.1 Quality Metrics
3.2 Experiments of Simulation
3.2.1 Flowchart of Simulation
3.2.2 Settings of Simulation
3.2.3 Results of Simulation
3.3 Experiments of Real Datasets
3.3.1 Calculation of Duplication Ratio
3.3.2 Method of Cutting Specific Contigs
3.3.3 Settings of Used Tools
3.3.4 Testing Datasets
3.3.5 Results of Real Datasets
Chapter 4 Conclusion
References
[1] S. Assefa, T.M. Keane, T.D. Otto, C. Newbold and M. Berriman (2009) ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics, 25:1968–1969.
[2] M. Galardini, E.G. Biondi, M. Bazzicalupo and A. Mengoni (2011) CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes. Source Code for Biology and Medicine, 6:11.
[3] P. Husemann and J. Stoye (2010) r2cat: synteny plots and comparative assembly. Bioinformatics, 26:570–571.
[4] D.C. Richter, S.C. Schuster and D.H. Huson (2007) OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics, 23:1573–1579.
[5] A.I. Rissman, B. Mau, B.S. Biehl, A.E. Darling, J.D. Glasner and N.T. Perna (2009) Reordering contigs of draft genomes using the Mauve Aligner. Bioinformatics, 25:2071–2073.
[6] S.A. van Hijum, A.L. Zomer, O.P. Kuipers and J. Kok (2005) Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Research, 33:W560–W566.
[7] Z. Dias, U. Dias and J.C. Setubal (2012) SIS: a program to generate draft genome sequence scaffolds for prokaryotes. BMC Bioinformatics, 13:96.
[8] C.L. Lu, K.-T. Chen, S.-Y. Huang and H.-T. Chiu (2014) CAR: contig assembly of prokaryotic draft genomes using rearrangements. BMC Bioinformatics, 15:381.
[9] C.L. Lu (2015) An efficient algorithm for the contig ordering problem under algebraic rearrangement distance. Journal of Computational Biology, 22:975–987.
[10] K.T. Chen, C.L. Liu, S.H. Huang, H.T. Shen, Y.K. Shieh, H.T. Chiu and C.L. Lu (2018) CSAR: a contig scaffolding tool using algebraic rearrangements. Bioinformatics, 34:109–111.
[11] J. Bailey and E. Eichler (2006) Primate segmental duplications: crucibles of evolution, diversity and disease. Nature Reviews Genetics, 7:552–564.
[12] M. Lynch (2007) The Origins of Genome Architecture. Sinauer, Sunderland, MA.
[13] D. Sankoff (1999) Genome rearrangement with gene families. Bioinformatics, 15:909–917.
[14] D. Bryant (2000) The complexity of calculating exemplar distances. In D. Sankoff and J. Nadeau (eds.) Comparative Genomics, volume 1 of Computational Biology. Kluwer Academic Publishers, Dordrecht.
[15] M. Shao and B. Moret (2015) A fast and exact algorithm for the exemplar breakpoint distance. In Proceedings of the 19th International Conference on Computational Molecular Biology (RECOMB'15), volume 9029 of Lecture Notes in Computer Science, 309–322.
[16] A. Gurevich et al. (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29:1072–1075.
[17] A.L. Delcher et al. (1999) Alignment of whole genomes. Nucleic Acids Research, 27(11):2369–2376.
[18] Minkin et al. (2013) Sibelia: a scalable and comprehensive synteny block generation tool for closely related microbial genomes. arXiv preprint arXiv:1307.7941.
 
 
 
 