
Detailed Record

Author (Chinese): 傅品誠
Author (English): Fu, Pin-Cheng
Title (Chinese): 結合裝袋算法及減少多數抽樣法於馬氏-田口系統來解決非平衡資料問題
Title (English): Integrating Bagging and Under-Sampling with MTS to Alleviate Class Imbalance Problem
Advisor (Chinese): 蘇朝墩
Advisor (English): Su, Chao-Ton
Committee Members (Chinese): 蕭宇翔、許俊欽
Committee Members (English): Hsiao, Yu-Hsiang; Hsu, Chun-Chin
Degree: Master's
University: National Tsing Hua University (國立清華大學)
Department: Department of Industrial Engineering and Engineering Management
Student ID: 104034537
Year of Publication (ROC calendar): 106 (2017)
Graduating Academic Year: 105
Language: Chinese
Number of Pages: 61
Keywords (Chinese): 馬氏-田口方法、裝袋算法、減少多數抽樣法、非平衡資料、分類問題、臨界值方法、集成學習法
Keywords (English): MTS; Bagging; Under-Sampling; Imbalanced Data; Classification; Threshold Method; Ensemble Learning
Statistics:
  • Recommendations: 0
  • Views: 438
  • Rating: *****
  • Downloads: 13
  • Bookmarks: 0
In classification and prediction tasks, the class imbalance problem is frequently encountered. When one class in a dataset has far more records than another, conventional data mining models tend to pursue overall classification accuracy and thus neglect the class that is small in number yet important. The Mahalanobis-Taguchi system (MTS), proposed by Dr. Genichi Taguchi, is a technique for classification and prediction. Because MTS builds its classification model by establishing a measurement scale, it is relatively insensitive to the data distribution. In recent years, techniques for handling imbalanced data have gradually been developed; among them, hybrid approaches that combine ensemble learning with data sampling have achieved good results in recent studies. This study therefore proposes a model called MTSbag, which combines MTS with bagging, an ensemble learning method, and applies under-sampling to the samples used when MTS determines its threshold, thereby improving the classification performance of MTS on imbalanced data. This study also proposes an attribute selection method for MTS, as well as a Gini threshold method to help MTS determine the optimal threshold. Analyses of UCI datasets and a real medical case show that MTSbag outperforms MTS in classification and attribute selection on imbalanced data; on highly imbalanced datasets and the case data, MTSbag also outperforms AdaBoost, SVM, and random forest in both respects, indicating that the proposed method offers excellent classification and attribute selection performance on imbalanced data and can be applied to practical problems.
Class imbalance is a common issue in classification and prediction problems. When the records of one class in a dataset far outnumber those of the other class(es), data mining algorithms tend to optimize overall classification accuracy and ignore the important yet rarely occurring class(es). MTS, proposed by Dr. Taguchi, is a diagnosis and prediction technique. Unlike other data mining algorithms, MTS is constructed by establishing a measurement scale, so it is relatively unaffected by the data distribution. Moreover, many techniques for imbalanced data have been developed in recent years; among them, hybrid approaches that combine ensemble learning with data sampling have achieved good results. This study therefore proposes a model called MTSbag, which combines the bagging and under-sampling approaches with MTS. It also proposes an attribute selection approach for MTSbag and a Gini threshold method to determine the threshold for MTS. Several performance metrics for imbalanced data were adopted to evaluate the proposed method. On UCI datasets and a real medical case, the results show that the classification and attribute selection performance of MTSbag is better than that of MTS, and better than AdaBoost, SVM, and random forest on highly imbalanced datasets. These results reveal that MTSbag delivers excellent classification and attribute selection performance on imbalanced data and can also handle real cases.
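The MTSbag idea described in the abstracts, building a Mahalanobis-distance measurement scale on the normal group, choosing a distance threshold, and wrapping the scorer in a bagged ensemble whose majority class is under-sampled, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: all class and function names are hypothetical, and a simple quantile rule stands in for the Gini threshold method.

```python
# Illustrative sketch of a bagged, under-sampled Mahalanobis-distance
# classifier in the spirit of MTSbag. Names and the quantile threshold
# rule are assumptions, not the method as specified in the thesis.
import numpy as np

def mahalanobis_scores(X, mean, cov_inv):
    """Squared Mahalanobis distance of each row of X from the normal group."""
    diff = X - mean
    return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

class MahalanobisBaggingClassifier:
    def __init__(self, n_estimators=25, threshold_quantile=0.95, random_state=0):
        self.n_estimators = n_estimators
        self.threshold_quantile = threshold_quantile
        self.rng = np.random.default_rng(random_state)
        self.models_ = []

    def fit(self, X, y):
        # y == 0: majority ("normal") class, y == 1: minority ("abnormal") class.
        X_maj, X_min = X[y == 0], X[y == 1]
        n_min = len(X_min)
        for _ in range(self.n_estimators):
            # Under-sample the majority class down to the minority size;
            # sampling with replacement doubles as the bagging bootstrap.
            idx = self.rng.choice(len(X_maj), size=n_min, replace=True)
            X_norm = X_maj[idx]
            # Build the measurement scale (Mahalanobis space) from the
            # normal group only, with a small ridge for numerical stability.
            mean = X_norm.mean(axis=0)
            cov = np.cov(X_norm, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            cov_inv = np.linalg.inv(cov)
            # Pick a distance threshold from the normal-group distances
            # (a quantile rule standing in for the Gini threshold method).
            d_norm = mahalanobis_scores(X_norm, mean, cov_inv)
            thr = np.quantile(d_norm, self.threshold_quantile)
            self.models_.append((mean, cov_inv, thr))
        return self

    def predict(self, X):
        # Majority vote over the base learners: distance above the
        # threshold means "abnormal" (the minority class).
        votes = np.stack([
            (mahalanobis_scores(X, mean, cov_inv) > thr).astype(int)
            for mean, cov_inv, thr in self.models_
        ])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```

Building each measurement scale from the under-sampled normal group mirrors the MTS practice of constructing the Mahalanobis space from normal samples only; the ensemble vote then damps the variance introduced by the small resampled groups.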
【Abstract (Chinese)】 I
【Abstract】 II
Table of Contents III
List of Tables V
List of Figures VI
Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 2
1.3 Research Framework 3
Chapter 2 Literature Review 5
2.1 Mahalanobis-Taguchi System 5
2.1.1 Mahalanobis Distance 5
2.1.2 Taguchi's Robust Engineering 6
2.1.3 Implementation of the Mahalanobis-Taguchi System 6
2.1.4 Threshold of the Mahalanobis-Taguchi System 9
2.2 Class Imbalance Problem 11
2.3 Ensemble Learning 12
2.3.1 Boosting 13
2.3.2 AdaBoost 13
2.3.3 Bagging 15
2.3.4 Random Forest 15
2.4 Attribute Selection 16
Chapter 3 Research Methodology 18
3.1 Gini Threshold Method 18
3.2 MTSbag Model 19
3.3 Attribute Selection Method of the MTSbag Model 22
Chapter 4 Data Analysis 23
4.1 Performance Metrics 23
4.2 Evaluation of the Gini Threshold Method 26
4.2.1 Data 26
4.2.2 Experimental Design Summary 27
4.2.3 Performance Evaluation 27
4.3 Evaluation of the MTSbag Attribute Selection Method 30
4.3.1 Data 30
4.3.2 Experimental Design Summary 31
4.3.3 Performance Evaluation 32
4.4 Evaluation of the MTSbag Model 36
4.4.1 Data 36
4.4.2 Experimental Design Summary 37
4.4.3 Classification Models and Parameter Settings 38
4.4.4 Performance Evaluation 39
Chapter 5 Case Study 46
5.1 Case Background 46
5.2 Building the Data Mining Prediction System 47
5.2.1 Analysis Procedure 48
5.2.2 Performance Evaluation of Each Method 48
5.3 Improving Key Factors to Raise the Survival Rate after IHCA 51
5.3.1 Analysis Procedure 51
5.3.2 Performance Evaluation of Each Method 52
5.3.3 Important Attributes 53
Chapter 6 Conclusions and Suggestions 55
6.1 Conclusions 55
6.2 Future Research 57
References 58