Detailed Record

Author (Chinese): 李修逸
Author (English): Li, Siou-Yi
Title (Chinese): 利用機器學習方法分析選擇偏誤之倖存資料
Title (English): Applying Machine Learning Approach to Analyze Survival Data with Selection Bias
Advisor (Chinese): 鄭又仁
Advisor (English): Cheng, Yu-Jen
Committee Members (Chinese): 黃冠華、邱燕楓
Committee Members (English): Huang, Guan-Hua; Chiu, Yen-Feng
Degree: Master's
University: National Tsing Hua University
Department: Institute of Statistics
Student ID: 106024501
Year of Publication (ROC calendar): 108 (2019)
Academic Year of Graduation: 107
Language: Chinese
Number of Pages: 49
Keywords (Chinese): 選擇偏誤、機器學習、傾向分數、因果推論、極限梯度提升法
Keywords (English): selection bias, machine learning, propensity score, causal inference, XGBoost
Abstract (Chinese): In survival analysis, comparing different treatments, or assessing the difference between treated and untreated patients, is usually done by estimating their survival functions. Two kinds of bias can arise during data collection. The first is selection bias, which occurs when the distribution of the explanatory variables among the selected study subjects differs from that in the population; if the data suffer from selection bias, the analysis results may be incorrect. The second is the bias caused by an imbalance in the distributions of the explanatory variables between the treatment and control groups; when this bias is present, the analysis results cannot be given a causal interpretation. The goal of this thesis is to estimate the causal survival functions of two treatments while accounting for both kinds of bias. We use inverse-probability-of-selection weighting to correct the bias from the sampling of subjects, and the propensity score to balance the covariate distributions between treatments. The causal survival functions are estimated by inverse-probability-of-treatment weighting, stratification, and kernel smoothing, each adjusted by inverse-probability-of-selection weighting. Moreover, whereas the propensity score is traditionally estimated with a simple logistic regression, in our study it is estimated with extreme gradient boosting (XGBoost), a machine learning method that can still give good predictions when the relationship between the response and the explanatory variables is very complex. Because the data are subject to selection bias, we also make some adjustments to the algorithm, and we use several model selection methods to choose the number of boosting iterations so as to obtain the most appropriate model complexity. Simulations are conducted to examine the proposed methods, which are then applied to hepatocellular carcinoma data.
Abstract (English): In survival analysis, the causal effect of different treatments can be assessed by estimating their causal survival functions. Two kinds of bias may arise during data collection and affect the estimation of the causal treatment effect. One is selection bias, which arises because some patients are restricted to receiving a specified treatment; the other is the bias caused by an imbalance in the distributions of the explanatory variables between the treatment groups. Without correcting these biases, the analysis results can be biased and admit no causal interpretation. In this paper, we use the inverse-probability-of-selection weighting method to correct the selection bias, and the propensity score to balance the distributions of the explanatory variables between treatments. The causal survival function is estimated by inverse-probability-of-treatment weighting, stratification, and kernel smoothing, each combined with the inverse-probability-of-selection weighting adjustment, where extreme gradient boosting is used to estimate the propensity score with an adjustment for the selection bias. In addition, we use several model selection approaches to determine the number of boosting iterations and thus obtain the most appropriate model complexity. The proposed methods are examined through simulation studies and applied to a real data set.
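
As a rough illustration of the propensity score step described in the abstracts, the following Python sketch fits an off-the-shelf XGBoost classifier with inverse-probability-of-selection weights supplied as sample weights. All names (X, treatment, p_select) and the simulated data are hypothetical placeholders; the thesis modifies the boosting iterations themselves to account for selection bias, which this off-the-shelf call only approximates.

    # Minimal sketch: propensity score via XGBoost with selection weights.
    # Data and variable names are simulated placeholders, not the thesis's notation.
    import numpy as np
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    n = 1000
    X = rng.normal(size=(n, 3))                      # covariates
    treatment = rng.binomial(1, 0.5, size=n)         # observed treatment indicator
    p_select = 1.0 / (1.0 + np.exp(-X[:, 0]))        # assumed selection probabilities
    ips_weight = 1.0 / p_select                      # inverse-probability-of-selection weights

    # Off-the-shelf XGBoost with sample weights; the number of boosting
    # iterations would be chosen by a model selection rule (e.g., AIC) in practice.
    model = XGBClassifier(
        n_estimators=200,
        max_depth=2,
        learning_rate=0.1,
        objective="binary:logistic",
    )
    model.fit(X, treatment, sample_weight=ips_weight)
    propensity = model.predict_proba(X)[:, 1]        # estimated P(treatment = 1 | X)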
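
The estimated selection probabilities and propensity scores can then be combined into one weight per subject, 1 / (selection probability × probability of the treatment actually received), and fed into a weighted survival curve. The sketch below shows one generic inverse-probability-weighted product-limit (Kaplan-Meier-type) curve for the treated arm; it is an illustration of the weighting idea only, not the thesis's exact IPTW, stratification, or kernel smoothing estimator, and all data are simulated placeholders.

    # Minimal sketch: weighted product-limit curve with combined
    # selection-probability and treatment-probability weights.
    import numpy as np

    def weighted_km(time, event, weight):
        """Weighted product-limit estimator evaluated at the distinct event times."""
        event_times = np.unique(time[event == 1])
        surv, s = [], 1.0
        for t in event_times:
            at_risk = weight[time >= t].sum()                   # weighted risk set at t
            deaths = weight[(time == t) & (event == 1)].sum()   # weighted events at t
            s *= 1.0 - deaths / at_risk
            surv.append(s)
        return event_times, np.array(surv)

    rng = np.random.default_rng(1)
    n = 500
    time = rng.exponential(scale=2.0, size=n)          # observed (possibly censored) times
    event = rng.binomial(1, 0.7, size=n)               # 1 = event observed, 0 = censored
    treat = rng.binomial(1, 0.5, size=n)               # treatment indicator
    p_select = rng.uniform(0.3, 0.9, size=n)           # selection probabilities (placeholder)
    propensity = rng.uniform(0.2, 0.8, size=n)         # propensity scores (e.g., from XGBoost)

    # Combined weight: inverse selection probability times inverse probability
    # of the treatment actually received.
    w = 1.0 / (p_select * np.where(treat == 1, propensity, 1.0 - propensity))

    # Weighted causal survival curve for the treated arm.
    mask = treat == 1
    t_grid, S1_hat = weighted_km(time[mask], event[mask], w[mask])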
Table of Contents
1 Introduction
2 Literature Review
2.1 Introduction to Survival Data
2.1.1 Notation
2.1.2 Kaplan-Meier Estimation of the Survival Function
2.2 Causal Inference
2.2.1 Assumptions
2.2.2 Causal Survival Function
2.3 Selection Bias
2.4 Extreme Gradient Boosting (XGBoost)
2.4.1 Model Setup
2.4.2 Gradient Boosted Trees
2.4.3 Binary Response Variables
3 Methods
3.1 Estimation of the Causal Survival Function for Data with Selection Bias
3.1.1 Propensity Score
3.1.2 Inverse Probability of Treatment Weighting
3.1.3 Stratification
3.1.4 Kernel Smoothing
3.2 Estimating the Propensity Score with Extreme Gradient Boosting
3.2.1 Standard Iteration
3.2.2 Hat-Matrix Approximate Iteration
3.3 Choosing the Number of Iterations
3.3.1 Likelihood Trend Selection
3.3.2 AIC
3.3.3 Covariance
4 Simulations
4.1 Simulation 1: Estimating the Causal Survival Function for Data with Selection Bias
4.2 Simulation 1: Parameter Settings
4.3 Simulation 1: Results
4.4 Simulation 2: Choosing the Number of XGBoost Iterations
4.5 Simulation 2: Results
5 Real Data Analysis
5.1 Data Description
5.2 Descriptive Analysis
5.3 Model Results
6 Conclusion
References
Appendix
A Supplementary Tables and Figures
A.1 Simulation 1
A.2 Simulation 2
A.3 Real Data Analysis
(The full text of this thesis is not available for public access.)