
Detailed Record

Author (Chinese): 陳幸君
Author (English): Chen, Hsing-Chun
Title (Chinese): 決策樹演算法搭配不同抽樣法於不均衡資料集的表現
Title (English): Performance of tree algorithms on imbalanced data under different sampling strategies
Advisor (Chinese): 徐茉莉
Advisor (English): Shmueli, Galit
Committee Members (Chinese): 林福仁, 雷松亞
Committee Members (English): Lin, Fu-Ren; Ray, Soumya
Degree: Master's
University: National Tsing Hua University
Department: Institute of Service Science
Student ID: 104078508
Year of Publication (ROC calendar): 107 (2018)
Graduation Academic Year: 107
Language: English
Number of Pages: 94
Keywords (Chinese): imbalanced data set, overfitting, data characteristics, decision tree algorithms, ensemble learning algorithms, sampling methods
Keywords (English): imbalanced, data, characteristics, overfitting, bootstrap, variation, metrics, evaluation, sampling, undersampling, oversampling, bagging, boosting, ensemble, algorithms, tree, CART, C5.0
Statistics:
  • Recommendations: 0
  • Views: 68
  • Rating: *****
  • Downloads: 18
  • Bookmarks: 0
Imbalanced data sets are common in everyday applications, and methods for handling imbalanced data have been developed since around the year 2000. Sampling methods and ensemble algorithms are among the most popular techniques for classifying imbalanced data. Previous studies have compared how different algorithms perform when paired with different sampling methods. However, most of that work used existing data sets of fewer than ten thousand records, and rarely distinguished whether differences in sampling effectiveness stemmed from the algorithm or from the characteristics of the data set.
In this study we therefore use simulated data sets to examine how data characteristics (data size and degree of imbalance) affect sampling methods (non-sampling, oversampling, undersampling, and SMOTE), and we apply several decision tree algorithms (C5.0 single tree, CART single tree, boosted tree, and random forest) to understand the interaction between algorithms and sampling methods. With this setup, we aim to answer the following questions: (1a) Should we use sampling methods and tree ensemble algorithms together to address the imbalance problem? (1b) What are the benefits of doing so? We find that combining the two does improve performance, but that a sampling method paired with a single tree (C5.0 or CART) or with logistic regression, rather than an ensemble, can actually perform better. We then investigate: (2a) Is the effectiveness of sampling related to the type of algorithm? (2b) Is there synergy or interference between the two? (2c) If so, does that synergy or interference lead to overfitting? We find that the answers to (2a) and (2b) are yes; as for (2c), sampling can indeed cause overfitting, but it can also alleviate it. Finally, we ask (3) whether the effectiveness of sampling is related to data characteristics such as data size and degree of imbalance. We find that it is indeed related to data size and to the degree of imbalance after sampling, while the original degree of imbalance mainly affects the C5.0 algorithm.
Imbalanced data are common in real-world applications, and methods to tackle imbalance have been developed since around 2000. Sampling methods and ensemble algorithms are popular techniques for classifying imbalanced data. Previous work has compared the effects of different sampling methods under different algorithms. However, previous studies have mostly used small, real-world datasets, without separating the effects of the algorithms used from those of the data characteristics. In this dissertation, we conduct a suite of simulations for classifying 2-class imbalanced datasets in order to compare the effect of different sampling methods (non-sampling, oversampling, undersampling, and SMOTE) on the performance of classification trees (C5.0 Single Tree, CART Single Tree, Boosted Tree, and Random Forest) under different variations and different data characteristics (specifically, data size and imbalance ratio). Another novelty of our work is that we use the bootstrap to evaluate the sampling variation of different performance measures. With this setup, we answer the following questions: (1a) Should we use sampling methods and tree ensemble algorithms together to solve the imbalance problem? (1b) What is the benefit of doing so? We find that we can benefit from using sampling methods and tree ensemble algorithms together, but that CART, C5.0, or logistic regression (LR) combined with sampling methods performs even better. We then ask (2a) Is the effectiveness of sampling related to the type of algorithm? (2b) Is there any synergy or interaction between sampling methods and ensemble algorithms? And (2c) Will the sampling-algorithm interaction lead to overfitting? We find that the answer to questions (2a) and (2b) is yes. As for (2c), the interaction might lead to overfitting, but it can alleviate the problem as well. Finally, we ask (3) Is the effectiveness of sampling related to different types of data, in terms of data size and imbalance ratio?
We show that the effectiveness of sampling is related to the type of data, in terms of data size and the imbalance ratio after sampling. The original imbalance ratio mainly makes a difference to the effectiveness of the C5.0 algorithm.
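The experimental pipeline the abstract describes (simulate a 2-class imbalanced data set, apply a sampling method to the training partition only, fit a tree, and score a minority-class metric on the untouched test partition) can be sketched as follows. This is an illustrative reconstruction using scikit-learn, not the thesis code; the Gaussian data generator, the 5% minority fraction, and the 1:1 balance target of the oversampler are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def make_imbalanced(n=2000, minority_frac=0.05):
    # Two Gaussian clouds; the minority class is shifted and much smaller.
    n_min = int(n * minority_frac)
    X = np.vstack([rng.normal(0.0, 1.0, (n - n_min, 2)),
                   rng.normal(1.5, 1.0, (n_min, 2))])
    y = np.r_[np.zeros(n - n_min, dtype=int), np.ones(n_min, dtype=int)]
    return X, y

def random_oversample(X, y):
    # Duplicate minority rows (with replacement) until the classes are balanced.
    idx_min, idx_maj = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

X, y = make_imbalanced()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Single tree on the raw imbalanced training set vs. on an oversampled one;
# the test set is never resampled, so the comparison is honest.
raw = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
X_o, y_o = random_oversample(X_tr, y_tr)
over = DecisionTreeClassifier(random_state=1).fit(X_o, y_o)

print("minority recall, no sampling:", recall_score(y_te, raw.predict(X_te)))
print("minority recall, oversampled:", recall_score(y_te, over.predict(X_te)))
```

Swapping `random_oversample` for random undersampling of the majority class, or for SMOTE-style interpolation, and swapping the single tree for a bagged or boosted ensemble, yields the full grid of sampling-algorithm combinations the study compares.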
Abstract 1
摘要 2
Acknowledgment 3
Table of contents 4
List of Figures 5
List of Tables 6
1. Introduction 9
2. Literature Review 11
2.1 Sampling Methods 11
2.2 Tree Algorithms 12
2.3 Data Partitioning and Sampling Variation 15
3. Simulation Design 16
3.1 Data Characteristics 16
3.2 Sampling Methods 16
3.3 Capturing Sampling Variation 17
3.4 Algorithms 19
3.5 Evaluations 19
3.6 Combining the 5 components 22
4. Simulation Results 23
4.1 Comparing the Numerical Metrics 24
4.2 Comparing the Graphical Performance Charts 33
4.2.1 Comparing CART family algorithms 34
4.2.2 Comparing CART, C5.0, and XGBoost 38
5. Conclusion 41
6. References 44
Appendix 47
[1] Ali, A., Shamsuddin, S. M., & Ralescu, A. L. (2015). Classification with class imbalance problem: A review. International Journal of Advances in Soft Computing and Its Applications, 7(3).
[2] Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29.
[3] Barua, S., Islam, M. M., Yao, X., & Murase, K. (2014). MWMOTE--majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425.
[4] Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[6] Chawla, N. V., et al. (2004). Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1), 1-6.
[7] Chawla, N. V. (2005). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook (pp. 853-867). Springer US.
[8] Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). ACM.
[9] Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple Classifier Systems: First International Workshop, MCS 2000, Cagliari, Italy, June 21-23, 2000, Proceedings (pp. 1-15). Springer Berlin Heidelberg.
[10] Di Martino, M., Fernández, A., Iturralde, P., & Lecumberry, F. (2013). Novel classifier scheme for imbalanced problems. Pattern Recognition Letters, 34(10), 1146-1151.
[11] Domingos, P. (1999, August). Metacost: A general method for making classifiers cost- sensitive. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 155-164). ACM.
[12] Drummond, C., & Holte, R. C. (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. Workshop on Learning from Imbalanced Data Sets II, International Conference on Machine Learning.
[13] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.
[14] Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
[15] Gu, Q., Zhu, L., & Cai, Z. (2009, October). Evaluation measures of the classification performance of imbalanced data sets. In International Symposium on Intelligence Computation and Applications (pp. 461-471). Springer Berlin Heidelberg.
[16] Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine learning, 77(1), 103-123.
[17] He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. doi:10.1109/TKDE.2008.239
[18] Hilborn, C. G., & Lainiotis, D. G. (1968). IEEE Transactions on Information Theory.
[19] Hoens, T. R., & Chawla, N. V. (2013). Imbalanced datasets: from sampling to classifiers. Imbalanced Learning: Foundations, Algorithms and Applications. Wiley, 43-59.
[20] Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent data analysis, 6(5), 429-449.
[21] Japkowicz, N. (2000). Learning from imbalanced data sets: A comparison of various strategies. In AAAI Workshop on Learning from Imbalanced Data Sets (AAAI'00) (pp. 10-15).
[22] Japkowicz, N. (2000, June). The class imbalance problem: Significance and strategies. In Proc. of the Int'l Conf. on Artificial Intelligence.
[23] Jeatrakul, P., Wong, K., & Fung, C. (2010). Classification of imbalanced data by combining the complementary neural network and SMOTE algorithm. Neural Information Processing. Models and Applications, 152-159.
[24] Kubat, M., & Matwin, S. (1997, July). Addressing the curse of imbalanced training sets: one-sided selection. In ICML (Vol. 97, pp. 179-186).
[25] Lin, H. T. (2010). Cost-sensitive classification: Status and beyond. In Workshop on Machine Learning Research in Taiwan: Challenges and Directions.
[26] Liu, A. Y. C. (2004). The effect of oversampling and undersampling on classifying imbalanced text datasets, Doctoral dissertation, The University of Texas at Austin.
[27] Pandya, R., & Pandya, J. (2015). C5.0 algorithm to improved decision tree with feature selection and reduced error pruning. International Journal of Computer Applications, 117(16).
[28] Provost, F., & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2), 131-169.
[29] Quinlan, J. R. (1996, August). Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1 (pp. 725-730).
[30] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[31] Quinlan, R. (2004). Data mining tools See5 and C5.0.
[32] Roumani, Y. F., May, J. H., Strum, D. P., & Vargas, L. G. (2013). Classifying highly imbalanced ICU data. Health care management science, 16(2), 119-128.
[33] Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. (2008, November). Improving learner performance with data sampling and boosting. In Tools with Artificial Intelligence, 2008. ICTAI'08. 20th IEEE International Conference on (Vol. 1, pp. 452-459). IEEE.
[34] Shmueli, G., Patel, N. R., & Bruce, P. C. (2016). Data Mining for Business Analytics: Concepts, Techniques, and Applications with XLMiner. John Wiley & Sons.
[35] Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04), 687-719.
[36] Steinberg, D., & Colla, P. (2009). CART: classification and regression trees. The top ten algorithms in data mining, 9, 179.
[37] Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6, 769-772.
[38] Van Hulse, J., Khoshgoftaar, T. M., & Napolitano, A. (2007). Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning (ICML), pp. 935-942. ACM.
[39] Weiss, G. M., McCarthy, K., & Zabar, B. (2007). Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs?. DMIN, 7, 35-41.
[40] Yap, B. W., Rani, K. A., Rahman, H. A. A., Fong, S., Khairudin, Z., & Abdullah, N. N. (2014). An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013) (pp. 13-22). Springer Singapore.