Detailed Record
Author (Chinese): 陳奕方
Author (English): Chen, Yi-Fang
Title (Chinese): 以SMOTE為基礎之改善不平衡資料集學習的過抽樣方法比較
Title (English): A Comparison of SMOTE-based Oversampling Methods for Improving Imbalanced Data Learning
Advisor (Chinese): 林華君
Advisor (English): Lin, Hwa-Chun
Committee members: 陳俊良, 蔡榮宗
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 106065529
Year of publication (ROC calendar): 108
Graduating academic year: 107
Language: Chinese
Pages: 54
Keywords (Chinese): 類別不平衡問題; 過抽樣技術
Keywords (English): Class Imbalance Problem; Over-sampling Technique; Imbalanced Data Set Learning
Statistics:
  • Recommendations: 0
  • Hits: 242
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
When one class in a dataset greatly outnumbers the others, the dataset is called imbalanced. A classification model trained on an imbalanced dataset may be biased, because the majority class carries far more information than the minority class, so the model cannot obtain enough information to learn correct classification rules for the minority.

Sampling is one way to increase the amount of information the minority class provides. Among sampling strategies, data synthesis is a relatively common one, and many algorithms have been designed around it. Most papers say little about the mechanisms these algorithms use to achieve data synthesis. This study first introduces SMOTE, the best-known data-synthesis method, together with several SMOTE-based derivative algorithms, and classifies them by the mechanisms they employ. It then runs simulation experiments on real-world datasets and evaluates the algorithms using performance metrics.
When the number of samples in one class of a dataset is much larger than in the others, we call the dataset imbalanced. In a binary classification problem, we call the larger class the majority and the smaller one the minority. A classification model trained on an imbalanced dataset may be biased, because the majority class is much more informative than the minority class.

Sampling is one solution that increases the amount of information carried by the minority class. Data synthesis is a relatively common sampling strategy, and many algorithms have been designed with this strategy. Most survey papers do not describe in detail the mechanisms these algorithms use to achieve data synthesis. This study first introduces the SMOTE algorithm, a well-known data-synthesis method, together with several SMOTE-based algorithms, and classifies them according to the mechanisms they use. We then run simulation experiments on real-world datasets to obtain performance metrics and evaluate the algorithms.
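The core mechanism shared by SMOTE and its derivatives, interpolating a new minority sample between an existing minority sample and one of its k nearest minority neighbours, can be sketched in a few lines. This is a minimal, illustrative pure-Python implementation; the function name and parameters are ours, not the thesis's experimental code:

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by SMOTE-style interpolation.

    minority: list of equal-length numeric tuples (minority-class samples).
    For each synthetic point: pick a random minority sample, find its k
    nearest minority neighbours, pick one, and interpolate between the
    two at a random ratio in [0, 1).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Other minority points, sorted by squared Euclidean distance to base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(neighbours[:k])  # one of the k nearest
        gap = rng.random()                      # interpolation ratio
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic
```

The SMOTE-based derivatives surveyed in the thesis mainly vary this scheme in where the base point and neighbourhood come from: for example, Borderline-SMOTE restricts base points to samples near the class boundary, and ADASYN weights base points by how hard they are to classify.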
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures and Tables
Chapter 1 Introduction 7
Chapter 2 Background and Motivation 9
Chapter 3 Literature Review 11
Chapter 4 Experiments 23
4.1 Performance Metrics 23
4.2 Parameter Settings of the Algorithms and Classifiers 25
4.3 K-fold Cross-Validation 25
4.4 Real-World Datasets 26
4.5 Experimental Results 28
Chapter 5 Conclusion 50
References 51