Author (English): Chen, Hsing-Chun
Thesis title (English): Performance of tree algorithms on imbalanced data under different sampling strategies
Advisor (English): Shmueli, Galit
Oral defense committee (English): Lin, Fu-Ren
Ray, Soumya
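Imbalanced datasets are common in everyday applications, and methods for handling imbalanced data have been developed since around the year 2000. Sampling methods and ensemble algorithms are among the most popular approaches to classifying imbalanced datasets. Previous studies have compared how different algorithms perform when paired with different sampling methods. However, most of them used existing datasets with fewer than ten thousand records, and they rarely distinguished whether differences in sampling effectiveness came from the algorithm or from the characteristics of the dataset.
In this study, we therefore use simulated datasets to examine how data characteristics (dataset size and degree of imbalance) affect sampling methods (non-sampling, oversampling, undersampling, and SMOTE), and we use several decision tree algorithms (C5.0 Single Tree, CART Single Tree, Boosted Tree, and Random Forest) to understand how algorithms and sampling methods interact. With this experimental setup, we aim to answer the following questions: (1a) Should we use sampling methods together with tree ensemble algorithms to address the imbalanced-data problem? (1b) What is the benefit of using both? Our answer is that using both does improve performance, but sampling paired with a single tree (C5.0 or CART) or with logistic regression, rather than with an ensemble algorithm, can perform even better. We then ask: (2a) Is the effectiveness of sampling related to the type of algorithm? (2b) Is there synergy or interference between the two? (2c) If so, does it lead to overfitting? We find that the answer to (2a) and (2b) is yes. As for (2c), sampling can indeed cause overfitting, but it can also reduce it. Finally, we ask (3) whether the effect of sampling is related to dataset characteristics such as dataset size and degree of imbalance. We find that the effect of sampling is indeed related to the data size and to the imbalance ratio after sampling, whereas the original imbalance ratio of the dataset mainly affects the C5.0 algorithm.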
Imbalanced data are common in real-world applications, and methods for handling class imbalance have been developed since around 2000. Sampling methods and ensemble algorithms are popular techniques for classifying imbalanced data. Previous work has compared the effects of different sampling methods under different algorithms. However, previous research has mostly used small, real-world datasets, without separating the effects of the algorithms used from those of the data characteristics. In this dissertation, we conduct a suite of simulations for classifying 2-class imbalanced datasets in order to compare the effect of different sampling methods (non-sampling, oversampling, undersampling, and SMOTE) on the performance of classification trees (C5.0 Single Tree, CART Single Tree, Boosted Tree, and Random Forest) under different variations and different data characteristics (specifically, data size and imbalance ratio). Another novelty of our work is that we use the bootstrap to evaluate the sampling variation of different performance measures. With this setup, we answer the following questions: (1a) Should we use sampling methods and tree ensemble algorithms together to solve the imbalance problem? (1b) What is the benefit of doing so? We find that using sampling methods and tree ensemble algorithms together is beneficial, but that CART, C5.0, or logistic regression (LR) combined with sampling methods performs better. We then ask: (2a) Is the effectiveness of sampling related to the type of algorithm? (2b) Is there any synergy or interaction between sampling methods and ensemble algorithms? (2c) Will the sampling-algorithm interaction lead to overfitting? We find that the answer to questions (2a) and (2b) is yes. As for (2c), the interaction might lead to overfitting, but it can also alleviate it. Finally, we ask: (3) Is the effectiveness of sampling related to the type of data, in terms of data size and imbalance ratio?
We show that the effectiveness of sampling is related to the data size and to the imbalance ratio after sampling. The original imbalance ratio mainly affects the effectiveness of the C5.0 algorithm.
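As a rough illustration of two ingredients named in the abstract, the following Python sketch shows random oversampling and undersampling of a 2-class dataset, followed by a bootstrap estimate of the sampling variation of one performance measure. It is not the authors' implementation: the function names, the simulated data (50 positives out of 1,000), the threshold rule standing in for a classifier, and the choice of sensitivity as the measure are all assumptions made for this example, and SMOTE and the tree algorithms are not shown.

```python
import random
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, rng):
    """Drop randomly chosen majority rows until both classes are equal in size."""
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    n_minority = min(counts.values())
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept_majority = set(rng.sample(majority_idx, n_minority))
    keep = [i for i, y in enumerate(labels) if y != majority or i in kept_majority]
    return [rows[i] for i in keep], [labels[i] for i in keep]


def sensitivity(y_true, y_pred, positive=1):
    """Share of true positives that were predicted positive (NaN if there are none)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def bootstrap_metric(y_true, y_pred, metric, n_boot, rng):
    """Recompute `metric` on n_boot resamples (with replacement) of the test set."""
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return values


# Hypothetical simulated data: one numeric feature, 5% positive class.
rng = random.Random(0)
labels = [1] * 50 + [0] * 950
rows = [[rng.gauss(y, 1.0)] for y in labels]

over_rows, over_labels = random_oversample(rows, labels, rng)
under_rows, under_labels = random_undersample(rows, labels, rng)

# Stand-in for a fitted classifier: a fixed threshold on the single feature.
y_pred = [1 if r[0] > 0.5 else 0 for r in rows]
boot = bootstrap_metric(labels, y_pred, sensitivity, 200, rng)

print(Counter(over_labels), Counter(under_labels))
print(min(boot), max(boot))
```

The spread of the values in `boot` gives a sense of how much a performance measure could vary across test samples. In a fuller experiment, sampling would be applied only to the training partition, and the measure would be computed on an untouched test set.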
