
Detailed Record

Author (Chinese): 蒂芙妮
Author (English): Fontenelle-Augustin, Tiffany Natasha
Title (Chinese): 原型的選擇以作有效率的分類
Title (English): Prototype Selection for Efficient Classification
Advisor (Chinese): 蘇豐文
Advisor (English): Soo, Von-Wun
Committee Members (Chinese): 陳宜欣、陳朝欽
Committee Members (English): Chen, Yi-Shin; Chen, Chaur-Chin
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 103065431
Publication Year (ROC): 107 (2018)
Academic Year of Graduation: 106 (2017–2018)
Language: English
Number of Pages: 46
Keywords (Chinese): 原型的選擇、分類、大數據
Keywords (English): prototype selection, classification, big data
Abstract

Big data has become ubiquitous and of great significance in academia. With its rapid growth, many problems have arisen in manipulating the data for the purpose of forecasting. In this thesis, we highlight the problem of computational complexity when dealing with big data and propose a heuristic that helps to solve it by altering the existing method of classification so that it is better suited to handling big data, thereby increasing efficiency. Our heuristic is not only better suited to big data but is also faster than traditional classification, while keeping accuracy approximately the same, if not higher. It combines prototype selection with the traditional classification process: a subset of the training data is selected as prototypes, the remaining training data is discarded, and the classifier is trained on the prototypes rather than on the entire training set, as in the conventional method. The learning algorithm used in our heuristic is the J48 decision tree algorithm. We evaluated the heuristic by comparing the classification accuracy and running time of our algorithm (using prototypes) against the traditional decision tree and naïve Bayes algorithms (using the entire training set), and we also compared the amount of data used in the respective training phases. We tested five data sets ranging in size from small to large. The findings show that, for big data, our heuristic saves memory and is 100% faster than traditional classification, with only a slight drop in accuracy.
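
The pipeline the abstract describes can be sketched in a few lines of Python. The sketch below is illustrative only: the thesis uses an adapted PSC algorithm and Weka's J48 (C4.5), whereas this stand-in selects prototypes with a generic k-means-based scheme and trains scikit-learn's CART-style DecisionTreeClassifier. The synthetic data, the per_class prototype budget, and the select_prototypes helper are all assumptions made for the demonstration, not the thesis's implementation.

# A minimal, hypothetical sketch of the prototype-selection pipeline.
# NOT the thesis's adapted PSC algorithm: prototypes here are simply the
# training instances nearest to per-class k-means centroids, and
# scikit-learn's CART tree stands in for Weka's J48 (C4.5).
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, pairwise_distances_argmin
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def select_prototypes(X, y, per_class=50, seed=0):
    """Keep only the instances closest to k-means centroids of each class."""
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        k = min(per_class, len(idx))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
        # the nearest real instance to each centroid becomes a prototype
        nearest = pairwise_distances_argmin(km.cluster_centers_, X[idx])
        keep.extend(idx[np.unique(nearest)])
    return X[keep], y[keep]

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

X_pr, y_pr = select_prototypes(X_tr, y_tr)
print(f"training instances: {len(X_tr)} -> prototypes kept: {len(X_pr)}")

# compare the prototype-trained tree against the full-set baselines
for name, clf, Xf, yf in [
    ("tree on prototypes", DecisionTreeClassifier(random_state=0), X_pr, y_pr),
    ("tree on full set", DecisionTreeClassifier(random_state=0), X_tr, y_tr),
    ("naive Bayes on full set", GaussianNB(), X_tr, y_tr),
]:
    t0 = time.perf_counter()
    clf.fit(Xf, yf)
    train_time = time.perf_counter() - t0
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: accuracy={acc:.3f}, training time={train_time:.3f}s")

Under these assumptions, the tree trained on the prototypes fits on a small fraction of the original training instances, which is where the time and memory savings the abstract claims would come from; its accuracy on the held-out split can then be compared directly against the full-set tree and naïve Bayes baselines.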
Contents

1 Introduction
1.1 Statement of the Problem
1.2 Research Objective and Contributions
1.2.1 Hypothesis
1.2.2 Contributions
1.3 Related Work

2 Methodology
2.1 Definitions and Symbols
2.2 Adapted PSC Algorithm
2.3 Experiment
2.3.1 Random Partitioning
2.3.2 Selection of Prototypes
2.3.3 Training and Testing using Prototypes in Conjunction with Decision Tree
2.3.4 Training and Testing using the Original Training Set

3 Results
3.1 Accuracy Results
3.2 Time Results
3.3 Memory Results

4 Evaluation
4.1 Datasets
4.2 Metrics
4.3 Discussion

5 Conclusion
5.1 Summary
5.2 Limitations
5.3 Future Work

References
