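Author (Chinese): 黃信恩
Author (English): Huang, Hsin-En
Thesis title (Chinese): 利用稀疏負二項分配之線性判別分類器分析基因表現測序資料
Thesis title (English): Sparse Negative Binomial Linear Discriminant Analysis through generalize linear model for RNA-seq data
Advisor (Chinese): 謝文萍
Advisor (English): HSIEH, WEN-PING
Oral defense committee (Chinese): 曾建城, 盧鴻興, 張中
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Statistics
Student ID: 106024521
Year of publication (ROC calendar): 108
Academic year of graduation: 107
Language: English
Number of pages: 42
Keywords (translated from Chinese): negative binomial distribution; linear discriminant; gene expression sequencing
Keywords (English): RNA-seq; generalize linear model; Linear Discriminant Analysis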
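In recent years, with the rise of next-generation sequencing technology, RNA sequencing (RNA-seq), thanks to its higher accuracy, has gradually replaced DNA microarrays as the mainstream method for profiling gene expression. If patients' characteristics can be classified effectively from their RNA-seq data, the information available for medical diagnosis will improve accordingly. However, most current statistical methods are built on the assumption of a continuous distribution or a normal distribution. Because of the difference in measurement technique, gene expression levels measured by RNA-seq and by DNA microarrays do not share the same properties. The former are non-negative integers, usually modeled in data analysis with a Poisson or negative binomial distribution; the latter are continuous measurements, generally modeled with a normal distribution. Developing analysis methods built on the Poisson or negative binomial distribution is therefore a need that cannot be ignored at this stage. Witten (2011) proposed improving the original linear discriminant analysis, which assumes normality, by assuming a Poisson distribution instead. Under the Poisson assumption, however, the population variance must equal the population mean, which does not adequately reflect the biological characteristics underlying RNA-seq data. Dong (2016) then extended the method of Witten (2011) by replacing the Poisson assumption with a negative binomial distribution, making the variance assumption more flexible. In sequencing data, however, the number of variables is usually far larger than the number of samples, so a variable selection mechanism becomes especially important. The algorithm proposed by Dong (2016) cannot itself perform variable selection. We believe that a data analysis method that is based on the negative binomial assumption and also has a variable selection mechanism will improve classification results. In this work, we propose a negative binomial linear discriminant analysis to classify RNA-seq data, with parameters estimated through a generalized linear model. The classifier is derived from Bayes' theorem and the negative binomial distribution. The simulation results show that our negative binomial assumption combined with the variable selection mechanism effectively improves classification results. We also analyze a real dataset to show how the method performs in a real setting. Based on these comparisons, we conclude that our proposed classification method is very effective for classifying RNA-seq data.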
In recent years, RNA sequencing (RNA-seq) has become a powerful technology for characterizing the gene expression profiles of organisms, owing to the capabilities of next-generation sequencing and its better accuracy compared with microarrays. Classification of gene expression profiles is a promising approach to diagnosis and prognostic prediction for patients. Most of the statistical methods developed for microarray data are based on the assumption of a continuous distribution such as the normal distribution. Because RNA-seq produces count data, unlike the continuous measurements from microarrays, it is necessary to develop methods suited to the specific properties of RNA-seq data. Witten (2011) proposed a Poisson linear discriminant analysis for RNA-seq data. The Poisson assumption forces the variance to equal the mean, which may not be appropriate for real medical samples. Dong (2016) proposed a negative binomial linear discriminant analysis to fix this problem. However, sequencing data usually have far more features than samples, and the algorithm of Dong (2016) cannot achieve sparsity. We believe that a linear discriminant analysis based on the negative binomial assumption with a variable selection mechanism can improve classification performance. In this paper, we propose a negative binomial linear discriminant analysis under the generalized linear model framework for RNA-seq data. The classifier is constructed according to the Bayes rule by fitting a negative binomial model. Simulation results show that the model assumption and the feature selection mechanism in our method improve classifier performance. We also demonstrate the advantages of our method by analyzing a real RNA-seq dataset. Based on the comparison results, our proposed classifier can serve as an effective tool for RNA-seq data classification.
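The two ideas the abstract relies on, a variance that is allowed to exceed the mean and a classifier derived from the Bayes rule, can be illustrated with a minimal sketch. This is not the thesis's sNBLDAglm procedure: it omits the generalized linear model fitting, the sparsity (variable selection) step, and any library-size normalization, and it treats the class means and dispersions as already known. The function names (`nb_log_pmf`, `nb_variance`, `classify`) and the mean/dispersion parameterization are assumptions made for illustration.

```python
import math


def nb_log_pmf(x, mu, phi):
    """Log-probability of count x under a negative binomial distribution
    with mean mu and dispersion phi (assumed parameterization:
    variance = mu + phi * mu**2)."""
    r = 1.0 / phi  # "size" parameter of the negative binomial
    return (
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(r / (r + mu))
        + x * math.log(mu / (r + mu))
    )


def nb_variance(mu, phi):
    """Variance under the same parameterization; as phi approaches 0
    it approaches mu, the Poisson case."""
    return mu + phi * mu ** 2


def classify(x, class_means, phi, priors):
    """Bayes rule with features treated as independent (an assumption):
    return the index k maximizing
    log(prior_k) + sum_j log NB(x_j; mu_kj, phi_j)."""
    scores = []
    for mu_k, prior_k in zip(class_means, priors):
        score = math.log(prior_k)
        for x_j, mu_kj, phi_j in zip(x, mu_k, phi):
            score += nb_log_pmf(x_j, mu_kj, phi_j)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical two-gene, two-class example with made-up parameters.
class_means = [[3.0, 5.0], [3.0, 40.0]]  # mean count per class and gene
phi = [0.1, 0.1]                         # per-gene dispersion
priors = [0.5, 0.5]
label_a = classify([2, 30], class_means, phi, priors)
label_b = classify([2, 4], class_means, phi, priors)
```

In this toy setup the first gene has the same mean in both classes, so it adds the same amount to every class score and cannot affect the decision. Features of that kind are the ones a variable selection mechanism would aim to drop.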
1 Introduction
2 Existing and proposed methods
2.1 Existing method
2.1.1 Linear discriminant analysis (LDA)
2.1.2 Poisson linear discriminant analysis (PLDA)
2.1.3 Negative binomial linear discriminant analysis (NBLDA)
2.2 Sparse negative binomial linear discriminant analysis via generalized linear model approach (sNBLDAglm)
2.2.1 Model description
2.2.2 Estimation of main effect
3 Simulation study
3.1 Simulation design
3.2 Evaluation
3.3 Result
4 Application to real data
4.1 Cervical dataset
4.2 Evaluation
4.3 Result
5 Discussion and Conclusions
6 Reference
Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.

Dong, Kai, et al. "NBLDA: negative binomial linear discriminant analysis for RNA-Seq data." BMC bioinformatics 17.1 (2016): 369.

Dillies, Marie-Agnès, et al. "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis." Briefings in bioinformatics 14.6 (2013): 671-683.

Dudoit, Sandrine, Jane Fridlyand, and Terence P. Speed. "Comparison of discrimination methods for the classification of tumors using gene expression data." Journal of the American statistical association 97.457 (2002): 77-87.

Hardcastle, Thomas J., and Krystyna A. Kelly. "baySeq: empirical Bayesian methods for identifying differential expression in sequence count data." BMC bioinformatics 11.1 (2010): 422.

Ho, David D., et al. "Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection." Nature 373.6510 (1995): 123.

Landau, William Michael, and Peng Liu. "Dispersion estimation and its effect on test performance in RNA-seq data analysis: a simulation-based comparison of methods." PloS one 8.12 (2013): e81415.

Law, Charity W., et al. "voom: Precision weights unlock linear model analysis tools for RNA-seq read counts." Genome biology 15.2 (2014): R29.

Lorenz, Douglas J., et al. "Using RNA-seq data to detect differentially expressed genes." Statistical analysis of next generation sequencing data. Springer, Cham, 2014. 25-49.

Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome biology 15.12 (2014): 550.

Marioni, John C., et al. "RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays." Genome research 18.9 (2008): 1509-1517.

McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. "Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation." Nucleic acids research 40.10 (2012): 4288-4297.

Morozova, Olena, Martin Hirst, and Marco A. Marra. "Applications of new sequencing technologies for transcriptome analysis." Annual review of genomics and human genetics 10 (2009): 135-151.

Mortazavi, Ali, et al. "Mapping and quantifying mammalian transcriptomes by RNA-Seq." Nature methods 5.7 (2008): 621.

Oshlack, Alicia, Mark D. Robinson, and Matthew D. Young. "From RNA-seq reads to differential expression results." Genome biology 11.12 (2010): 220.

Robinson, Mark D., and Gordon K. Smyth. "Small-sample estimation of negative binomial dispersion, with applications to SAGE data." Biostatistics 9.2 (2007): 321-332.

Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26.1 (2010): 139-140.

Tibshirani, Robert, et al. "Diagnosis of multiple cancer types by shrunken centroids of gene expression." Proceedings of the National Academy of Sciences 99.10 (2002): 6567-6572.

Tibshirani, Robert, et al. "Class prediction by nearest shrunken centroids, with applications to DNA microarrays." Statistical Science 18.1 (2003): 104-117.

Wang, Zhong, Mark Gerstein, and Michael Snyder. "RNA-Seq: a revolutionary tool for transcriptomics." Nature reviews genetics 10.1 (2009): 57.

Wang, Zhu, et al. "Penalized count data regression with application to hospital stay after pediatric cardiac surgery." Statistical methods in medical research 25.6 (2016): 2685-2703.

Witten, Daniela, et al. "Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls." BMC biology 8.1 (2010): 58.

Witten, Daniela M. "Classification and clustering of sequencing data using a Poisson model." The Annals of Applied Statistics 5.4 (2011): 2493-2518.

Yu, Danni, Wolfgang Huber, and Olga Vitek. "Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size." Bioinformatics 29.10 (2013): 1275-1282.

Zararsız, Gökmen, et al. "A comprehensive simulation study on classification of RNA-Seq data." PloS one 12.8 (2017): e0182507.
 
 
 
 