
Detailed Record

Author (Chinese): 楊長鳴
Author (English): Yang, Chang-Ming
Title (Chinese): 抽樣、權重、機率修正不平衡數據,並應用於決策樹分類
Title (English): Sampling, Weighting and Probability Correction for Classifying Imbalanced Data Using Decision Trees
Advisor (Chinese): 徐茉莉
Advisor (English): Shmueli, Galit
Committee members (Chinese): 雷松亞、林福仁
Committee members (English): Ray, Soumya; Lin, Fu-Ren
Degree: Master's
University: National Tsing Hua University (國立清華大學)
Department: Institute of Service Science (服務科學研究所)
Student ID: 104078518
Year of publication: 2017 (ROC 106)
Graduation academic year: 2016-2017 (ROC 105)
Language: English
Number of pages: 53
Keywords (Chinese): 不平衡數據、決策樹、解釋性與預測性模型、權重、機率、抽樣、修正
Keywords (English): imbalanced data, decision tree, explanatory modeling, predictive modeling, sampling, weights, probability, correction
Abstract:
In this thesis, we study three analytical goals related to imbalanced data: (1) determining the population class distribution when it is difficult to obtain sufficient observations of one of the classes; (2) comparing explanatory and predictive modeling of imbalanced data using logistic regression, discriminant analysis, and decision trees; and (3) choosing performance evaluation metrics suited to each purpose.
Our research question centers on comparing weighting and intercept correction for undersampled imbalanced data across three models: logistic regression, discriminant analysis, and decision trees. Specifically, we study:
(1) The relationship between a weighted and an unweighted decision tree built on undersampled data.
(2) Whether a rule exists for correcting the probability cutoff of a decision tree built on undersampled data so that the population model is recovered (see the correction identity sketched after this list).
(3) The difference between explanatory and predictive modeling of imbalanced data.
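For question (2), a useful anchor is the standard prior-correction identity for models trained on a sample whose event rate differs from the population's; it underlies the rare-events intercept correction for logistic regression covered in Section 2.2.1, and question (2) asks whether an analogous rule exists for trees. As a sketch: let $\tau$ be the population event rate and $\bar{y}$ the event rate in the undersampled training set. A probability estimate $p$ from the undersampled model (for a tree, the leaf proportion) maps back to the population scale via

$$p^{*} = \frac{p\,\tau/\bar{y}}{p\,\tau/\bar{y} + (1-p)\,(1-\tau)/(1-\bar{y})}.$$

For logistic regression this is equivalent to subtracting $\ln\!\left[\frac{1-\tau}{\tau}\cdot\frac{\bar{y}}{1-\bar{y}}\right]$ from the fitted intercept while leaving all other coefficients unchanged.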
We study these questions on several datasets under different corrections. We find that when the imbalanced dataset is very large, weighting and intercept correction with logistic regression and discriminant analysis lead to consistent results across training-data distributions, for both explanatory and predictive tasks. In contrast, decision trees behave differently depending on whether the goal is to examine explanatory factors (the tree's variables) or to predict and rank new observations. Imbalanced data should therefore be handled with particular care when modeled with decision trees.
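To make the setup concrete, here is a minimal R sketch of the two treatments compared in question (1): a weighted tree versus a tree fit to an undersampled training set. This is not the thesis's experiment code (that is in Appendix A.2); the data frame dat, its outcome y, and the predictors x1 and x2 are hypothetical placeholders.

# Minimal sketch: weighting vs. undersampling for an imbalanced binary outcome.
# Assumes a data frame `dat` with factor outcome y (level "1" rare) and
# predictors x1, x2 -- hypothetical names, not from the thesis.
library(rpart)

tau <- mean(dat$y == "1")  # event rate in the full training data

# (a) Weighting: case weights giving both classes equal total weight,
#     passed through rpart()'s `weights` argument (cf. Section 2.4.4).
w <- ifelse(dat$y == "1", 0.5 / tau, 0.5 / (1 - tau))
tree_weighted <- rpart(y ~ x1 + x2, data = dat, weights = w, method = "class")

# (b) Undersampling: keep every minority case, draw an equal number of
#     majority cases, and fit an unweighted tree on the balanced subset.
minority <- dat[dat$y == "1", ]
majority <- dat[dat$y == "0", ]
balanced <- rbind(minority,
                  majority[sample(nrow(majority), nrow(minority)), ])
tree_sampled <- rpart(y ~ x1 + x2, data = balanced, method = "class")

# Probabilities from the undersampled tree can be mapped back to the
# population scale with the prior-correction identity sketched above.
p_hat <- predict(tree_sampled, newdata = dat, type = "prob")[, "1"]
ybar  <- 0.5  # event rate in the balanced training set
p_pop <- (p_hat * tau / ybar) /
         (p_hat * tau / ybar + (1 - p_hat) * (1 - tau) / (1 - ybar))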
Table of contents:
Chapter 1: Introduction
1.1 Background and motivation
1.2 Research question
Chapter 2: Literature review
2.1 Models
2.1.1 Logistic regression
2.1.2 Discriminant analysis
2.1.3 Decision trees
2.2 Explanatory modeling
2.2.1 Prior correction for logistic regression with rare events
2.2.2 Prior correction for discriminant analysis
2.3 Predictive modeling
2.3.1 Predictive methods
2.3.1.1 Sampling methods
2.4 Correction of the probability cutoff
2.4.1 Calibration in logistic regression
2.4.2 Calibration in discriminant analysis
2.4.3 Proposed calibration in decision trees
2.4.4 The weights argument of the glm() and rpart() functions in R
2.5 Evaluation metrics
2.5.1 Confusion matrix
2.5.2 Lift charts
Chapter 3: Experimental framework
3.1 Experimental design
3.2 Experimental datasets
3.3 Data partition and sampling methods
3.3.1 Data partition
3.3.2 Sampling of the training data
3.4 Algorithms
3.5 Evaluation metrics using the confusion matrix and lift charts
Chapter 4: Experimental results
4.1 Summary tables of experimental results
4.1.1 Explanatory summary table (coefficients or tree variables) from the training model
4.1.2 Predictive summary table from test data
4.2 Detailed comparison for each dataset
4.2.1 Explanatory results (coefficients or tree variables) from the training model
4.2.1.1 Bank dataset - coefficients/tree variables
4.2.1.2 Letter dataset - coefficients/tree variables
4.2.1.3 Adult dataset - coefficients
4.2.2 Predictive confusion matrices and lift charts (from test data)
4.2.2.1 Bank dataset - confusion matrix
4.2.2.2 Letter dataset - confusion matrix
4.2.2.3 Adult dataset - confusion matrix
4.2.2.4 Three datasets - lift charts
Chapter 5: Conclusions and future work
References
Appendices
A.1 Experiment results on small imbalanced datasets
Abalone dataset - coefficients/tree variables
Abalone dataset - confusion matrix
Diseased Tree dataset - coefficients/tree variables
Diseased Tree dataset - confusion matrix
Two datasets - lift charts
A.2 R code for experimental models