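Author (Chinese): 張伍賢
Author (English): Zhang, Wu Xian
Thesis title (Chinese): Instagram垃圾訊息檢測
Thesis title (English): Instagram Spam Detection
Advisor (Chinese): 孫宏民
Advisor (English): Sun, Hung Min
Committee members (Chinese): 許富皓; 黃世昆
Committee members (English): Hsu, Fu Hau; Huang, Shih Kun
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系)
Student ID: 103062506
Year of publication (ROC calendar): 105 (2016)
Academic year of graduation: 104
Language: English
Number of pages: 65
Keywords (translated from Chinese): social networks; spam detection; spam; machine learning
Keywords (English): Social networks; Instagram; Spam detection; Spam; Machine learning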
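In recent years, Instagram has become one of the top 15 social networking sites worldwide. However, Instagram's huge user base and traffic have also led to a flood of advertisements and spam. In addition, users attach large numbers of hashtags so that their posts appear more easily in search results for popular keywords. It is therefore necessary to build a spam detection model to reduce the amount of spam on Instagram.

We therefore apply a feature-based approach and supervised learning to detect spam posts on Instagram. We collect user account information and posts from Instagram. To speed up labeling, we use a two-stage clustering method to group similar posts into the same cluster, and then label the posts according to the clustering result. For features, we consider not only descriptive statistics of user account information and posts but also information implied in the posts' images. We use K-fold cross-validation to find the best pairing of supervised learning model and model parameters; the best model reaches an accuracy of 96.27%.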
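The grouping of similar posts described above can be illustrated with a small sketch. The thesis outline names MinHash clustering as the first clustering phase; the code below is only a minimal, assumption-laden illustration of that idea and is not the author's implementation. The function names, the character 3-gram shingling, the 64 hash functions, the 0.7 threshold and the sample captions are all hypothetical. The second phase (K-medoids) is not shown.

```python
import hashlib
from itertools import combinations


def shingles(text, n=3):
    """Return the set of character n-grams of a whitespace-normalized caption."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash of the set under each of num_perm salted hashes."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in items
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def group_near_duplicates(captions, threshold=0.7, num_perm=64):
    """Merge captions whose estimated similarity reaches the threshold."""
    sigs = [minhash_signature(shingles(c), num_perm) for c in captions]
    parent = list(range(len(captions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison is quadratic; acceptable only for a small sketch.
    for i, j in combinations(range(len(captions)), 2):
        if estimate_similarity(sigs[i], sigs[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(captions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical captions, for illustration only.
captions = [
    "follow me for free followers #like4like #follow",
    "follow me for free followers #like4like #follow!",
    "sunset at the beach with friends",
]
print(group_near_duplicates(captions))
```

Once near-duplicate posts share a group, a single labeling decision can be applied to the whole group, which is the speed-up the abstract refers to. At realistic data sizes the all-pairs loop would likely need to be replaced by an index such as locality-sensitive hashing.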
In recent years, Instagram has become one of the top 15 online social networks. However, Instagram's popularity has also brought a flood of advertisements and spam posts. Some accounts use large numbers of hashtags so that their posts appear in search results for popular keywords. It is therefore necessary to build a spam detection model to reduce the number of spam posts on Instagram.

We present a scheme that applies a feature-based method and supervised learning to detect spam posts on Instagram. We collect user profiles and media posts from Instagram. To label media posts quickly, we use a two-pass clustering method to group near-duplicate posts into the same cluster, and then label the posts based on the clustering result. For features, we consider not only statistics of user profiles and posts but also information implied in photos, which distinguishes our approach from other research. We use K-fold cross-validation to find the best pair of supervised learning model and model parameters; the accuracy of our best model is 96.27%.
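The model selection step mentioned in the abstract can be sketched as follows. This is a minimal illustration of K-fold cross-validation over candidate (model, parameter) pairs, written with the standard library only; it is not the thesis code. The candidate models (a k-nearest-neighbor classifier and a majority-class baseline), the single hashtag-count feature and the toy data are assumptions made for the example.

```python
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy of the model produced by fit over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)


def make_knn(k):
    """Hypothetical candidate: k-nearest neighbors with squared Euclidean distance."""
    def fit(X_train, y_train):
        def predict(x):
            order = sorted(
                range(len(X_train)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
            )
            votes = Counter(y_train[i] for i in order[:k])
            return votes.most_common(1)[0][0]
        return predict
    return fit


def fit_majority(X_train, y_train):
    """Hypothetical baseline: always predict the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda x: label


def select_best(candidates, X, y, k=5):
    """Score every candidate with K-fold CV and return the best name and all scores."""
    scores = {name: cross_val_accuracy(fit, X, y, k) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy data (assumed): one feature, the number of hashtags in a post.
X = [(h,) for h in range(20, 30)] + [(h,) for h in range(0, 10)]
y = ["spam"] * 10 + ["ham"] * 10
candidates = {"knn(k=1)": make_knn(1), "knn(k=3)": make_knn(3), "majority": fit_majority}

best, scores = select_best(candidates, X, y, k=5)
print(best, scores)
```

Each candidate is scored only on folds it was not trained on, so the comparison between model and parameter pairs is not biased toward whichever candidate memorizes the training data best.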
Table of Contents
List of Algorithms
List of Figures
List of Tables
Listings
Chapter 1 Introduction
1.1 Motivation
1.2 Our Contribution
1.3 Organization
Chapter 2 Background
2.1 Text Mining
2.1.1 n-gram
2.1.2 Minhash
2.2 Image Processing
2.2.1 Color Difference Histogram
2.3 Machine Learning
2.3.1 Unsupervised Learning
2.3.2 Supervised Learning
Chapter 3 Related Works
3.1 Image Spam Detection
3.2 OSN Spam Detection
3.2.1 Spam Account Detection
3.2.2 Spam Post Detection
Chapter 4 Scheme
4.1 Overview
4.2 Scheme
4.2.1 Data Collection
4.2.2 Clustering Media Posts
4.2.3 Labeling Data
4.2.4 Training the Classifier
4.3 Tools
4.3.1 Python Package
4.3.2 MongoDB
4.3.3 Apache Spark
Chapter 5 Implementation
5.1 Overview
5.2 Environment
5.3 Data Collection
5.3.1 Implementation Issues
5.4 Clustering Media
5.4.1 Clustering Phase 1: Minhash Clustering
5.4.2 Clustering Phase 2: K-medoids Clustering
5.5 Label the Data
5.6 Training Classifiers
5.6.1 Feature Extraction
5.6.2 Find Best Algorithm with Its Parameters
Chapter 6 Experimental Result and Analysis
6.1 Overview
6.2 Cross Validation
6.3 Execution Time and Throughput
6.4 Discussion
Chapter 7 Conclusion
7.1 Conclusion
7.2 Future Work
Appendices
Chapter A Json
Chapter B Code
Chapter C Table
(The full text has not been authorized for public access.)