
Detailed Record

Author (Chinese): 洪子涵
Author (English): Hung, Tzu-Han
Title (Chinese): 探討不平衡預測變數對識別差異性預測變數之影響
Title (English): Investigating the effects of unbalanced predictors on identifying discriminatory predictors
Advisor (Chinese): 徐茉莉
Advisor (English): Shmueli, Galit
Committee Members (Chinese): 雷松亞, 林福仁
Committee Members (English): Ray, Soumya; Lin, Fu-Ren
Degree: Master's
University: National Tsing Hua University
Department: Institute of Service Science
Student ID: 104078503
Publication Year (ROC calendar): 107 (2018)
Graduation Academic Year: 106
Language: English
Pages: 36
Keywords (Chinese): 自動化決策、不平衡問題、不平衡預測變數、區別對待、分類、分類樹、邏輯迴歸
Keywords (English): automated decision making; data imbalance; predictor imbalance; discrimination; classification; classification tree; logistic regression
Abstract (Chinese, translated): Data imbalance is one of the key challenges currently facing the data mining field. Research to date has focused mainly on class imbalance in the outcome variable, whereas this study investigates another type of imbalance: predictor imbalance, a problem that can cause data mining results to yield decisions unfavorable to minority groups. To the best of the researcher's knowledge, this issue has not yet been addressed in the research literature.

This study uses an experimental design to examine the effects of unbalanced predictors on classification trees and logistic regression. The results show that even when an unbalanced predictor has excellent discriminatory power for certain rare subgroups in the sample (e.g., a particular ethnicity or disease), its importance is very likely to be overlooked by impurity-based trees and statistic-based trees. In today's era of rapidly expanding automated decision making, ignoring such a predictor can bias data mining results and produce decisions that disadvantage minority groups. This study also proposes approaches for detecting and handling unbalanced predictors.
Abstract (English): One key challenge for the data mining community is the problem of data imbalance. While the vast majority of research focuses on the outcome class imbalance problem, this research investigates another type of data imbalance: predictor imbalance, a problem that can lead to discrimination against minority groups in automated decision-making processes.

In this research, we examine the effects of predictor imbalance on classification trees and logistic regression. We posit that unbalanced predictors are likely to be ignored by impurity-based trees and some statistic-based trees even when they perfectly classify the observations in rare subgroups (e.g., rare ethnic groups, diseases, etc.). Ignoring such an unbalanced predictor may lead to unjust decision making. This is a particular concern today, as many managerial decisions are driven by AI and data-analytic outputs. Guidelines for detecting and addressing discrimination based on unbalanced predictors are also provided.
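The tree behavior described in the abstract can be sketched with a small synthetic example (not from the thesis; the data-generating process, group sizes, and seed are hypothetical, using scikit-learn's CART-style tree rather than the specific implementations studied):

```python
# Minimal sketch: an impurity-based tree ignores an unbalanced predictor
# that perfectly classifies a rare subgroup, because the subgroup's
# weighted impurity reduction is tiny.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_major, n_rare = 990, 10

# Majority group: outcome driven (noisily) by a balanced predictor x.
x_major = rng.normal(size=n_major)
y_major = (x_major + rng.normal(size=n_major) > 0).astype(int)

# Rare subgroup (1% of the sample): the unbalanced binary predictor is 1
# and the outcome is always 1 -- a perfect discriminator for this subgroup.
X = np.column_stack([
    np.concatenate([x_major, rng.normal(size=n_rare)]),    # balanced predictor
    np.concatenate([np.zeros(n_major), np.ones(n_rare)]),  # unbalanced predictor
])
y = np.concatenate([y_major, np.ones(n_rare)]).astype(int)

# A shallow tree picks the split with the largest Gini reduction; the
# 1% subgroup contributes too little weighted gain to win.
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(tree.tree_.feature[0])      # root splits on the balanced predictor (index 0)
print(tree.feature_importances_)  # unbalanced predictor gets zero importance
```

Splitting on the unbalanced predictor would purify only 1% of the observations, so its Gini gain is an order of magnitude smaller than a split on the balanced predictor; the tree therefore never uses it, despite its perfect discriminatory power within the rare subgroup.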
Table of Contents

1. Introduction
2. Literature Review
2.1 The Data Imbalance Problem
2.1.1 Class Imbalance
2.1.2 Within-Class Imbalance
2.1.3 Unbalanced Predictors and Rare Cases
2.2 Classification Tree
2.2.1 Classification Tree
2.2.2 Splitting Criteria
2.2.3 Impurity-based Criteria
2.2.4 Statistics-based Criteria
2.2.5 Problems of Trees
2.3 Logistic Regression
2.3.1 Logistic Regression
2.3.2 Separation
3. Effect of an Unbalanced Predictor on Trees and Logistic Regression
3.1 Impurity-based Trees
3.2 Statistics-based Trees and Logistic Regression
3.3 A Relevant Issue: Separation
4. Experimental Design
4.1 Data Example: Survival on the Titanic
4.2 Modified Titanic Dataset
4.3 Study 1: Classification Tree
4.4 Study 2: Logistic Regression
5. Experimental Results
5.1 Study 1: Classification Tree
5.1.1 Effect of Under/Over-sampling the Unbalanced Predictor
5.1.2 Effect of Unbalanced Predictor: Ratio vs. Number of Observations
5.1.3 Effect of Discriminatory Power of Unbalanced Predictor
5.1.4 Effect of Strength of Other Predictors
5.2 Study 2: Logistic Regression
5.2.1 Separation Effect
5.2.2 Effect of Under/Over-sampling the Unbalanced Predictor
6. Guidelines, Conclusions, and Future Directions
6.1 Guidelines
6.2 Conclusions
6.3 Future Directions
References