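Author (Chinese): 許書維
Author (English): Hsu, Shu-Wei
Thesis title (Chinese): 根據最多配對模式解決Scaffolding問題之改進演算法
Thesis title (English): An Improved Algorithm for Solving Scaffolding Problem Based on Maximum-Matching Model
Advisor (Chinese): 盧錦隆
Advisor (English): Lu, Chin-Lung
Committee members (Chinese): 邱顯泰
林苕吟
林沿妊
Committee members (English): Chiu, Hsien-Tai
Lin, Tiao-Yin
Lin, Yen-Jen
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 107065531
Year of publication (ROC calendar): 109 (2020)
Academic year of graduation: 108
Language: Chinese
Number of pages: 75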
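Keywords (Chinese, translated): algorithm; genome assembly; maximum-matching model; integer linear programming; next-generation sequencing; bioinformatics
Keywords (English): algorithm; scaffolding problem; maximum-matching model; integer linear programming; bioinformatics; next generation sequencing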
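Abstract (translated from the Chinese): Scaffolding is one of the steps in DNA sequencing. Its purpose is to determine the orientations and order of the contigs in a target draft genome. Our laboratory previously developed CSAR, a scaffolding tool based on a rearrangement-based approach, which scaffolds a target draft genome using information from a reference genome. The main limitation of CSAR is that the conserved sequence markers between the input target and reference genomes must not be duplicated. In practice, duplicate markers are very common among the genomes of species. In 2019, our laboratory therefore defined the MBD-based scaffolding problem, whose goal is to determine scaffolds for the target and reference genomes such that the maximum-matching breakpoint distance between the two scaffolds is minimized. Our laboratory also designed an algorithm based on integer linear programming (ILP) to solve the MBD-based scaffolding problem. Experimental results showed that this MBD-based scaffolding algorithm, which takes duplicate markers into account, performs well on some bacterial datasets. However, its excessive running time restricted the test data to smaller bacterial genomes. In this thesis, we greatly reduce the running time of the algorithm by reducing the types of variables in the previous ILP algorithm and adding two new constraints. Experimental results on artificial and real data further show that the improved algorithm can now efficiently scaffold both bacterial and eukaryotic genomes, and that it maintains fairly good accuracy while running more efficiently.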
Scaffolding is one of the steps in DNA sequencing. Its purpose is to determine the orientations and order of the contigs in a target draft genome. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that scaffolds a target draft genome using information from a reference genome. The main limitation of CSAR is that the conserved sequence markers between the target genome and the reference genome must be duplicate-free, although duplicate markers are in fact very common among the genomes of species. In 2019, our laboratory therefore defined the MBD-based scaffolding problem: determine scaffolds of the target and reference genomes such that the maximum-matching breakpoint distance between the resulting scaffolds is minimized. Our laboratory also designed an algorithm based on integer linear programming (ILP) to solve this problem. Experimental results showed that this MBD-based scaffolding algorithm, which takes duplicate markers into account, performed well on some bacterial datasets. However, its running time was high, so the test datasets were limited to smaller bacterial genomes. In this thesis, we reduce the types of variables in the previous ILP scaffolding algorithm and add two new constraints, which greatly reduces the running time of the algorithm. Our experimental results on artificial and real contig datasets, including eukaryotic genomes, show that the improved ILP scaffolding algorithm can efficiently scaffold the contigs of bacterial and eukaryotic genomes, maintaining fairly good accuracy while reducing the running time.
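To make the objective in the abstract concrete, the following is a minimal Python sketch of reference-based scaffolding on a toy instance. It is not the thesis's ILP formulation and it does not handle duplicate markers or maximum matchings. It assumes duplicate-free signed integer markers, a single linear reference chromosome, and a simplified breakpoint count (reference adjacencies absent from the scaffold). The function names `adjacencies`, `reverse`, and `best_scaffold` are hypothetical, and the search is brute force, so it is only feasible for a handful of contigs.

```python
from itertools import permutations, product


def adjacencies(chromosome):
    """Return the set of adjacencies of a linear sequence of signed markers.

    Each marker m has two extremities, a tail (m, 't') and a head (m, 'h').
    A positive marker is read tail-to-head, a negative one head-to-tail.
    An adjacency is the unordered pair of extremities that touch.
    """
    def ends(m):
        a = abs(m)
        return ((a, 't'), (a, 'h')) if m > 0 else ((a, 'h'), (a, 't'))

    adj = set()
    for left, right in zip(chromosome, chromosome[1:]):
        adj.add(frozenset((ends(left)[1], ends(right)[0])))
    return adj


def reverse(contig):
    """Reverse-complement a contig: reverse the order and flip every sign."""
    return tuple(-m for m in reversed(contig))


def best_scaffold(contigs, reference):
    """Try every order and orientation of the contigs (toy sizes only).

    Returns (breakpoints, scaffold), where breakpoints is the number of
    reference adjacencies that do not appear in the scaffold. This is a
    simplified, duplicate-free stand-in for the distance in the abstract.
    """
    ref_adj = adjacencies(reference)
    best = None
    for order in permutations(range(len(contigs))):
        for flips in product((False, True), repeat=len(contigs)):
            scaffold = []
            for i, flip in zip(order, flips):
                scaffold.extend(reverse(contigs[i]) if flip else contigs[i])
            shared = len(adjacencies(scaffold) & ref_adj)
            breakpoints = len(ref_adj) - shared
            if best is None or breakpoints < best[0]:
                best = (breakpoints, tuple(scaffold))
    return best


if __name__ == "__main__":
    reference = (1, 2, 3, 4, 5, 6)
    contigs = [(5, 6), (-2, -1), (3, 4)]
    print(best_scaffold(contigs, reference))
```

On this toy input the search finds a scaffold with zero breakpoints, that is, one with the same adjacencies as the reference. The number of candidates grows as n! · 2^n in the number of contigs, which is one reason exact methods such as ILP formulations are used instead of enumeration.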
Chinese Abstract.....1
Abstract.....3
Acknowledgement.....5
Contents.....6
List of Figures.....8
List of Tables.....12
Chapter 1 Introduction.....17
Chapter 2 Methods.....23
2.1 Preliminaries.....23
2.1.1 Genome, Contig and Marker.....23
2.1.2 Adjacency and Pair of Shared Adjacencies.....25
2.1.3 Breakpoint and Breakpoint Distance.....25
2.1.4 Matching, Maximum-Matching and Maximum-Matching Model.....26
2.1.5 Potential Adjacency and Pair of Shared Potential Adjacencies.....27
2.1.6 Extended Potential Adjacency and Extended Pair of Shared Potential Adjacencies.....29
2.2 Flowchart of Our Algorithm.....30
2.3 ILP Formulations.....31
2.3.1 ILP Variables.....31
2.3.2 ILP Constraints.....32
2.3.3 ILP Objective Function.....36
Chapter 3 Experiment Results and Discussion.....37
3.1 Quality Metrics.....37
3.2 Experiments of Artificial Contig Data.....39
3.2.1 Settings of Used Tools.....40
3.2.2 Artificial Contig Datasets.....41
3.2.3 Results of Artificial Contig Datasets.....45
3.2.4 Discussion.....56
3.3 Experiments of Real Contig Datasets.....58
3.3.1 Settings of Used Tools.....59
3.3.2 Real Contig Datasets.....59
3.3.3 Results of Real Contig Datasets.....62
3.3.4 Discussion.....70
Chapter 4 Conclusion.....72
References.....73
[1] Z. Dias, U. Dias and J.C. Setubal (2012) SIS: a program to generate draft genome sequence scaffolds for prokaryotes. BMC Bioinformatics, 13, 96.
[2] C.L. Li, K.T. Chen and C.L. Lu (2013) Assembling contigs in draft genomes using reversals and block-interchanges. BMC Bioinformatics, 14, S9.
[3] C.L. Lu, K.T. Chen, S.Y. Huang and H.T. Chiu (2014) CAR: contig assembly of prokaryotic draft genomes using rearrangements. BMC Bioinformatics, 15, 381.
[4] C.L. Lu (2015) An efficient algorithm for the contigs ordering problem under algebraic rearrangement distance. Journal of Computational Biology, 22, 975–987.
[5] K.T. Chen, C.L. Liu, S.H. Huang, H.T. Shen, Y.K. Shieh, H.T. Chiu and C.L. Lu (2018) CSAR: a contig scaffolding tool using algebraic rearrangements. Bioinformatics, 34, 109–111.
[6] S. Assefa, T.M. Keane, T.D. Otto, C. Newbold and M. Berriman (2009) ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics, 25, 1968–1969.
[7] M. Galardini, E.G. Biondi, M. Bazzicalupo and A. Mengoni (2011) CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes. Source Code for Biology and Medicine, 6, 11.
[8] P. Husemann and J. Stoye (2010) r2cat: synteny plots and comparative assembly. Bioinformatics, 26, 570–571.
[9] D.C. Richter, S.C. Schuster and D.H. Huson (2007) OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics, 23, 1573–1579.
[10] A.I. Rissman, B. Mau, B.S. Biehl, A.E. Darling, J.D. Glasner and N.T. Perna (2009) Reordering contigs of draft genomes using the Mauve Aligner. Bioinformatics, 25, 2071–2073.
[11] S.A. van Hijum, A.L. Zomer, O.P. Kuipers and J. Kok (2005) Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Research, 33, 560–566.
[12] M. Shao and B. Moret (2016) A fast and exact algorithm for the exemplar breakpoint distance. Journal of Computational Biology, 23, 337–346.
[13] M. Shao and B. Moret (2017) On computing breakpoint distances for genomes with duplicate genes. Journal of Computational Biology, 24, 571–580.
[14] Y.H. Chen (2019) The Study of Solving Scaffolding Problem Based on Exemplar Model. Thesis, National Tsing Hua University, Taiwan.
[15] I.H. Kao (2019) The Study of Solving Scaffolding Problem Based on Maximum-Matching Model. Thesis, National Tsing Hua University, Taiwan.
[16] D.Y. Peng (2020) The Study of Solving Scaffolding Problem Based on Exemplar Model. Thesis, National Tsing Hua University, Taiwan.
[17] I. Minkin, A. Patel, M. Kolmogorov, N. Vyahhi and S. Pham (2013) Sibelia: a scalable and comprehensive synteny block generation tool for closely related microbial genomes. In International Workshop on Algorithms in Bioinformatics, Springer, 215–229.
[18] A. Gurevich, V. Saveliev, N. Vyahhi and G. Tesler (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075.
[19] M. Hunt, C. Newbold, M. Berriman et al. (2014) A comprehensive evaluation of assembly scaffolding tools. Genome Biol, 15, R42.
(Full text open to external access after 2025-08-25)