Author: Chen, Yi An (陳奕安)
Title: Predicting Stroke based on Health Insurance Records by Using Hadoop as a Fast Feature Extraction Tool
Title (Chinese): 基於健保資料預測中風之研究並以Hadoop作為一種快速擷取特徵工具
Advisor: Lee, Chi Chun (李祈均)
Committee members: Lan, Tsuo Hung (藍祚鴻); Lin, Ching Heng (林敬恒); Tsao, Yu (曹昱)
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 102061569
Year of publication: ROC year 105 (2016)
Academic year of graduation: ROC academic year 104
Language: Chinese
Number of pages: 104
Keywords: Hadoop; GBDT; National Health Insurance Research Database; analysis of healthcare data
Abstract:
Faster computers, advances in storage technology, and developments in communications technology have greatly increased the amount of data available, giving rise to big data research. This has in turn driven the development of data mining, which makes it possible to extract useful information from large data sets. Applied to medicine, big data research could improve care, save lives, and lower costs. As data volumes keep growing, however, sequential data processing tools running on a single machine take a great deal of time, and that slowness causes further problems of its own. A distributed parallel computing framework that lets many machines process the data together can reduce computing time substantially.

Many previous studies have indicated that taking SSRIs, which are commonly prescribed for depression and related mental health conditions, increases the risk of stroke. Building on those findings, this thesis analyzes healthcare data from the National Health Insurance Research Database and builds a machine learning model to predict stroke among people who have taken SSRI-related drugs, using the distributed computing tool Hadoop to speed up data processing. Compared with our lab's previous method on the same data set, preprocessing is about 35 times faster and feature extraction is about 585 times faster. The processed data are analyzed with GBDT as the classifier, and the much higher processing speed makes it possible to extract more features. Examining the 20 most important features shows that our model reveals more risk factors than the lab's previous method, a result that may provide valuable clinical information.
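The abstract describes feature extraction as a distributed MapReduce job but does not give the algorithm itself. The following is only a minimal single-process sketch of the general map, shuffle, and reduce pattern, assuming that a feature is a per-patient count of diagnosis codes. The record layout, the sample data, and all function names are hypothetical and are not taken from the thesis or from the Hadoop API.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical claim records: (patient_id, diagnosis_code).
# Invented for illustration; not the actual database schema.
RECORDS = [
    ("P1", "401"),
    ("P1", "250"),
    ("P2", "401"),
    ("P1", "401"),
]


def map_record(record):
    """Map step: emit ((patient_id, code), 1) for one claim record."""
    patient_id, code = record
    yield (patient_id, code), 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts for one (patient_id, code) key."""
    return key, sum(values)


def extract_features(records):
    """Run map, shuffle, and reduce in-process; return {patient: {code: count}}."""
    mapped = chain.from_iterable(map_record(r) for r in records)
    features = defaultdict(dict)
    for key, values in shuffle(mapped).items():
        (patient_id, code), count = reduce_counts(key, values)
        features[patient_id][code] = count
    return dict(features)


if __name__ == "__main__":
    print(extract_features(RECORDS))
```

Because each key is reduced independently of the others, the map and reduce steps can in principle be spread across many machines, which is the property a framework such as Hadoop exploits. The resulting per-patient count dictionaries could then be turned into feature vectors for a classifier.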
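Table of contents:
Acknowledgments
Abstract (Chinese)
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Methods
2.1 Problem Definition
2.2 National Health Insurance Research Database
2.3 Hadoop
2.3.1 Hadoop Distributed File System (HDFS)
2.3.2 MapReduce
2.4 Machine Learning Model
2.4.1 Gradient Descent
2.4.2 Decision Tree
2.4.3 Adaptive Boosting (AdaBoost)
2.4.4 Gradient Boosted Decision Tree (GBDT)
Chapter 3 Experimental Framework
3.1 Preprocessing
3.1.1 Overview of Preprocessing
3.1.2 CD_OO Joined Data
3.1.3 Keys Required for Machine Learning Modeling
3.1.4 Defining the Label
3.2 Feature Extraction
3.2.1 Features Used by the New and Old Methods
3.2.2 Finding All Category Classes
3.2.3 MapReduce Algorithm Architecture for Feature Extraction
3.3 Machine Learning Modeling
Chapter 4 Results and Comparison
4.1 Preprocessing with the Old Method
4.1.1 Creating the db File from the Raw Data
4.1.2 Finding Medical Record Entries with SSRI Use
4.1.3 Extracting ID&BIRTHDAY
4.1.4 Using the Old Method to Find All Medical Records from 2006 to 2011 for People Who Took SSRIs
4.1.5 Defining Labels with the Old Method
4.2 Comparison of Preprocessing Time between the New and Old Methods
4.3 Feature Extraction with the Old Method
4.3.1 Extracting Features from the CD Table
4.3.2 Extracting Features from the OO Table
4.3.3 Extracting Features from the DD Table
4.3.4 Generating the Data File
4.4 Comparison of Feature Extraction Time between the New and Old Methods
4.5 Machine Learning Modeling Analysis
4.5.1 Data Sampling
4.5.2 Uniformity of the Data
4.5.3 Detailed Information on the Data
4.5.4 Final Results for the Nine Sets of New and Old Features
4.6 Comparison with Other Machine Learning Algorithms
4.6.1 Comparison with Stochastic Gradient Descent
4.6.2 Comparison with Random Forest
4.6.3 Conclusions of the Comparison
4.7 Tuning Results for Different Numbers of Trees and Feature Importance Analysis for GBDT Models Built on the Four Sets of New and Old Features
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
Appendix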