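Critical Issues on Machine Learning Models for Electronic Health Records (電子病歷資料庫於機器學習模型的應用及討論) | National Tsing Hua University Theses and Dissertations Full-Text System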


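Author (Chinese):	湯子萱
Author (English):	Tang, Zih-Syuan
Thesis title (Chinese):	電子病歷資料庫於機器學習模型的應用及討論
Thesis title (English):	Critical Issues on Machine Learning Models for Electronic Health Records
Advisor (Chinese):	謝文萍
Advisor (English):	Hsieh, Wen-Ping
Oral defense committee (Chinese):	張國軒
Degree:	Master's
Institution:	National Tsing Hua University
Department:	Institute of Statistics
Student ID:	107024521
Year of publication (ROC calendar):	109
Academic year of graduation:	108
Language:	English
Number of pages:	64
Keywords (Chinese):	電子病歷 (electronic health records), 中風復發 (stroke recurrence), 機器學習模型 (machine learning models)
Keywords (English):	Stroke Recurrence, Machine Learning Models, Electronic Health Records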

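An electronic health record (EHR) manages an individual's lifelong health status and health-care behavior information in electronic form. It comprises medical records, nursing records, medical history, hospitalization records, discharge records, examination reports, and ICD-9 and ICD-10 codes. In this study, we aim to use EHR data to predict stroke recurrence in five years. The classification models we use include Logistic Regression, Random Forest, AdaBoost, and XGBoost, with EHR data as predictors, and we evaluate model performance with AUC.
While building the models, we ran into several problems with their default settings, including a large number of missing values, a huge keyword set, and heterogeneity across hospital branches. We handle missing values by imputation rather than by discarding variables with many missing values. Drawing on clinicians' background knowledge, we developed an algorithm to determine the imputed value for each feature. Through feature selection, we found that patients' missingness patterns are informative. We also discuss performance across three branches of the same hospital system and show how sample size and feature count affect the different models.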

An electronic health record (EHR) is an electronic means of managing a person's lifelong health status and health-care behavior information. It is a complex record that includes medical records, nursing records, medical history, hospitalization records, discharge records, inspection reports, and ICD-9 and ICD-10 codes. In this study, we aim to predict the recurrence of stroke in five years from EHR data. The classification models we built include Logistic Regression, Random Forest, AdaBoost, and XGBoost, with EHR data as predictors. We evaluate the performance of the models with AUC.
We encountered several issues with the default settings of these general-purpose models: missing values, a huge keyword set, and heterogeneity across hospitals. Instead of dropping features with many missing values, we used imputation. Drawing on background knowledge from clinicians, we developed an algorithm to determine the imputed value for each feature. Through feature selection, we found that patients' missingness patterns are meaningful: for some variables, a missing value indicates a better condition and signals a promising outcome. We also discuss the performance of three cohorts from different hospitals and show the effects of sample size and feature count on the different models.
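
The two ideas in the abstract that lend themselves to a small illustration are imputation that keeps the missingness pattern as a feature, and evaluation by AUC. The following is a minimal sketch in plain Python, not the thesis's implementation: the function names, the fill value, and the toy data are all hypothetical, and the clinician-guided algorithm for choosing imputed values is not described in this record, so a single assumed default value stands in for it.

```python
from typing import List, Optional, Sequence, Tuple


def impute_with_indicator(
    values: Sequence[Optional[float]], fill_value: float
) -> Tuple[List[float], List[int]]:
    """Replace missing entries (None) with fill_value and return a 0/1 missing flag per entry."""
    imputed = [fill_value if v is None else v for v in values]
    missing_flag = [1 if v is None else 0 for v in values]
    return imputed, missing_flag


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical lab values for five patients; None marks a test that was never ordered.
lab_values = [7.2, None, 6.1, None, 8.4]
# fill_value is an assumed placeholder for a clinician-chosen default.
imputed, missing_flag = impute_with_indicator(lab_values, fill_value=5.5)

# Hypothetical recurrence labels and model scores, for illustration only.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8]
print(imputed, missing_flag, auc(labels, scores))
```

Keeping `missing_flag` as its own column lets a classifier pick up on whether a test was ordered at all, which is the kind of signal the abstract reports as meaningful. The pairwise AUC loop is quadratic in the number of samples and is meant only to make the definition concrete.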

1 Introduction
2 Data
2.1 Data source
2.2 Data partition
2.3 Data features
2.4 Data preprocessing
2.4.1 Imputation for laboratory data
2.4.2 Keyword generation from medical notes
3 Method
3.1 Logistic regression
3.2 Random Forest
3.3 Adaptive Boosting (AdaBoost)
3.4 EXtreme Gradient Boosting (XGBoost)
3.5 Evaluation
3.6 Variable selection
4 Result
4.1 Model performance
4.2 Sample size effect
4.3 Feature size effect
4.4 Important features
5 Conclusion and discussion
References
Appendix A: Feature list

1. Kernan WN, Viscoli CM, Brass LM, Makuch RW, Sarrel PM, Roberts RT, et al. The Stroke Prognosis Instrument II (SPI-II): A Clinical Prediction Instrument for Patients with Transient Ischemia and Nondisabling Ischemic Stroke. Stroke 2000; 31: 456-62.
2. Chen WQ, Pan YS, Zhao XQ, Liao XL, Liu LP, Wang CJ, et al. Totaled Health Risks in Vascular Events Score Predicts Clinical Outcome and Symptomatic Intracranial Hemorrhage in Chinese Patients After Thrombolysis. Stroke 2015; 46: 864-6.
3. Weimar C, Diener HC, Alberts MJ, Steg PG, Bhatt DL, Wilson PWF, Mas JL, Röther J. The Essen Stroke Risk Score Predicts Recurrent Cardiovascular Events: A Validation Within the REduction of Atherothrombosis for Continued Health (REACH) Registry. Stroke 2009; 40(2): 350-4.
4. Sumi S, Origasa H, Houkin K, Terayama Y, Uchiyama S, Daida H, et al. A modified Essen stroke risk score for predicting recurrent cardiovascular events: development and validation. International Journal of Stroke 2013; 8(6): 251-7.
5. Xu Y, Yang XL, et al. Extreme Gradient Boosting Model Has a Better Performance in Predicting the Risk of 90-Day Readmissions in Patients with Ischaemic Stroke. Journal of Stroke and Cerebrovascular Diseases 2019; 28(12): 104441.
6. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics 2018; 22(5): 1589-604.
7. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu XB, Marcus J, Sun MM, et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine 2018; 1(1): 1-18.
8. Donzé J, Aujesky D, Williams D, Schnipper JL. Potentially Avoidable 30-day Hospital Readmissions in Medical Patients: Derivation and Validation of a Prediction Model. JAMA Internal Medicine 2013; 173: 632-8.
