Author: Chen, Chun-Ying
Thesis title: Predicting Ischemic Stroke Recurrence using Electronic Health Record
Advisor: Hsieh, Wen-Ping
Keywords: Electronic Health Record; Ischemic Stroke Recurrence; Machine Learning; Random Forest; Gated Recurrent Unit; Attention Mechanism
Ischemic stroke recurrence is a long-standing problem in health care, and many models and scores have been developed to predict the likelihood of recurrence after hospitalization. Rather than modeling recurrence from clinical risk factors alone, as previous studies did, we use a feature set that includes features not previously used in the literature, such as laboratory test results, hospitalization fees, and unstructured medical notes. We applied feature engineering and preprocessing to the raw features and built several machine learning models, namely (1) logistic regression, (2) random forest, and (3) an attention-based GRU neural network, to estimate the probability of stroke recurrence, and we compared their predictive performance. Measured by the area under the receiver operating characteristic curve (AUC), our models perform satisfactorily compared with previous studies. In addition, we applied (1) permutation feature importance, (2) partial dependence plots, and (3) attention visualization to these models to avoid black-box prediction and to show which features are most predictive and in what direction and to what degree they affect the prediction.
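As a rough illustration of two ideas named in the abstract, the sketch below computes the area under the ROC curve and a permutation feature importance (the mean drop in AUC when one feature column is shuffled). It is not the thesis code: the data are synthetic, and `toy_score` is a hypothetical stand-in for a fitted model such as the random forest, chosen so that the example runs with the Python standard library only.

```python
import random

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Return the baseline AUC and, per feature, the mean AUC drop after shuffling that column."""
    rng = random.Random(seed)
    baseline = roc_auc(labels, [predict(r) for r in rows])
    drops = []
    for j in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            total += baseline - roc_auc(labels, [predict(r) for r in shuffled])
        drops.append(total / n_repeats)
    return baseline, drops

# Synthetic data (assumption): feature 0 drives the label, feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
labels = [1 if r[0] + data_rng.gauss(0, 0.5) > 0 else 0 for r in rows]

def toy_score(row):
    """Hypothetical stand-in for a trained model's predicted risk; uses feature 0 only."""
    return row[0]

baseline, drops = permutation_importance(toy_score, rows, labels)
print(f"baseline AUC = {baseline:.3f}")
for j, d in enumerate(drops):
    print(f"feature {j}: mean AUC drop = {d:.3f}")
```

Because the stand-in score ignores feature 1, shuffling that column leaves the AUC unchanged, while shuffling feature 0 pushes the AUC toward 0.5. With a real fitted model, the same procedure would normally be run on held-out data.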
Introduction
Methods
2.1 Analysis workflow
2.2 Neural network
2.3 GRU
2.4 Attention mechanism
2.5 Binary word feature generation
2.6 Word2vec text embedding
2.7 Logistic regression
2.8 Random forest
2.9 Permutation feature importance
2.10 Partial dependence plot
Results
3.1 Data
3.1.1 Data collection and filter
3.1.2 Data partition
3.1.3 Data features
3.2 Data preprocessing
3.2.1 Missing values
3.2.2 One hot encoding
3.2.3 Feature scaling
3.2.4 Text preprocessing
3.3 Feature engineering
3.4 Model performance
3.4.1 Conventional classification models
3.4.2 Different subset of training data points
3.4.3 Neural network model for text data
3.4.4 Ensemble model
3.4.5 Model performance comparison
3.5 Model interpretation
3.5.1 Features extracted by Random Forest
3.5.2 Interpretation of Neural network model on text data
Discussion
Reference
Appendix A Feature list before engineering
Appendix B Other partial dependence plots
