
Detailed Record

Author (Chinese): 徐佳
Author (English): Xu, Jia
Title (Chinese): 基於社群專頁內容分析的用戶興趣探勘技術
Title (English): Mining User Interests from Social Media Based on Social Group Content Analysis
Advisor (Chinese): 林嘉文
Advisor (English): Lin, ChiaWen
Committee Members (Chinese): 鄭文皇; 許秋婷; 胡敏君
Committee Members (English): Cheng, WenHuang; Hsu, ChiuTing; Hu, MinChun
Degree: Master
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 102061467
Publication Year (ROC): 104 (2015)
Graduation Academic Year: 103
Language: English
Number of Pages: 47
Keywords (Chinese): 主題模型; 興趣發現; 社群網站分析
Keywords (English): Topic model; interest mining; social media analysis
In this thesis, we propose a method that learns a topic model from the analysis of social fan pages and applies it to the text and image content that ordinary users share on social networking sites, in order to discover each user's interest distribution. This distribution can then be used for precise, targeted ad recommendation.
The framework of this thesis consists of three parts: preprocessing and feature extraction of text and image content, training of a Labeled Topic Space, and user interest discovery. The study takes as training data the text and image content of Facebook fan pages that carry topic labels. First, the text content undergoes word segmentation, stop-word removal, and keyword extraction, while the image content goes through feature-point detection, clustering, and key-cluster extraction; both are thereby converted into textual words (Words) and visual words (Visual Words) that a topic model can process. A filtering step then removes text and image content whose topic is ambiguous. The words of each fan page are assembled into one text document and one image document, which are fed separately into an LDA (Latent Dirichlet Allocation) topic model for training. Through the LDA model, each fan page obtains a topic distribution for its text part and a topic distribution for its image part. Next, the dimension with the highest value in each topic distribution is identified. Fan pages sharing the same topic label vote (Voting): the dimension receiving the most votes is selected, and the label of those fan pages is assigned to that dimension. The method then checks whether every dimension has received a unique label; if not, the LDA hyperparameters (Dirichlet parameters) are adjusted and training is repeated until every dimension has a unique label. After training, each dimension of the topic space carries a concrete topic. Finally, the model is applied to ordinary users' social-network data: a user's text and image content is assembled into documents and processed by the trained topic model, yielding a personalized interest distribution in which the value of each dimension indicates the user's degree of preference for that interest. This completes the discovery of a user's interest distribution through analysis of the user's social-network content.
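The document-assembly step above (aggregating a fan page's filtered posts into one bag-of-words document for LDA) can be sketched in Python. The stop-word list and example posts are hypothetical stand-ins: the thesis works on Chinese text with proper word segmentation and keyword extraction, which the plain whitespace tokenization below does not attempt.

```python
from collections import Counter

# Hypothetical English stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def posts_to_document(posts):
    """Aggregate all posts of one fan page into a single bag-of-words
    document, dropping stop words, as input for LDA training."""
    words = []
    for post in posts:
        for token in post.lower().split():
            if token not in STOP_WORDS:
                words.append(token)
    return Counter(words)

page_posts = [
    "The new camera lens is great",
    "A review of the camera bag",
]
doc = posts_to_document(page_posts)
# doc["camera"] == 2; stop words such as "the" are removed
```

The same routine would be applied to visual words, which are just cluster IDs produced by the feature-detection and clustering stage.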
The experimental results demonstrate the practical feasibility of this improved labeled LDA topic-model framework. The main contributions of this study are fourfold: it effectively solves the problem that conventional unsupervised LDA cannot construct a concrete topic space; it automatically selects, between text and images, the data modality whose topic is clearer for training, combining their strengths and fully exploiting the advantages of multimedia; it replaces reliance on empirical hyperparameter values with a method that automatically finds suitable parameters; and the method works on the everyday language and photos that ordinary users share, achieving classification accuracy comparable to that obtained on precisely worded news content.
Keywords: Topic model; interest mining; social media analysis.
This thesis presents a model based on social group analysis that learns a specific topic space, which can be applied to a general user's posts to help mine his or her interest distribution. The distribution can serve personalized ad recommendation.
The framework consists of three steps: preprocessing and feature extraction, Labeled Topic Space learning, and user interest mining. The study chooses Facebook fan pages that have topic labels as training data. First, the text posts undergo word segmentation, stop-word removal, and keyword extraction; similarly, feature detection, clustering, and key visual-word extraction are run on the image content. Noisy and topically ambiguous posts are then filtered out. To obtain better LDA performance, the posts of each fan page are aggregated into a text document and a photo document, and the LDA (Latent Dirichlet Allocation) model is run on each. For each fan page, the LDA model outputs a topic distribution for the text document and a topic distribution for the photo document. Afterwards, the highest-valued dimension of each distribution is identified. Fan pages sharing the same topic label vote for the highest-valued dimensions of their own distributions, and the dimension receiving the most votes is assigned the topic label of those fan pages. The method then checks whether each dimension has a unique label; if not, the LDA hyperparameters (the Dirichlet parameters) are adjusted and LDA is run again. In this way, a topic space in which each dimension carries a specific, meaningful topic label is constructed. When the trained model is applied to a general user's posts, it yields a personal interest distribution, the value of each dimension representing the user's preference for a certain topic.
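The voting and uniqueness check described above can be illustrated with a minimal sketch, assuming each fan page's topic distribution is a plain list of floats (hypothetical toy data; in the thesis these distributions come from the trained LDA model):

```python
from collections import Counter

def assign_labels_by_voting(page_topic_dists, page_labels):
    """For each topic label, fan pages sharing that label vote for the
    argmax dimension of their own topic distribution; the dimension with
    the most votes receives the label. Returns (label->dim map, is_unique)."""
    votes = {}  # label -> Counter over dimension indices
    for dist, label in zip(page_topic_dists, page_labels):
        top_dim = max(range(len(dist)), key=lambda k: dist[k])
        votes.setdefault(label, Counter())[top_dim] += 1
    label_to_dim = {lbl: c.most_common(1)[0][0] for lbl, c in votes.items()}
    # Training succeeds only when every label lands on a distinct dimension;
    # otherwise the Dirichlet hyperparameters are adjusted and LDA is rerun.
    is_unique = len(set(label_to_dim.values())) == len(label_to_dim)
    return label_to_dim, is_unique

dists = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]
labels = ["sports", "sports", "music"]
mapping, ok = assign_labels_by_voting(dists, labels)
# mapping == {"sports": 0, "music": 1}, ok == True
```

If two labels collide on the same dimension, `is_unique` comes back `False`, which is the signal to retrain with different Dirichlet parameters.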
The experimental results show that the improved model can effectively mine user interests. The main contributions of this study are fourfold: it solves the problem that conventional unsupervised LDA cannot reveal the specific meaning of each dimension of the topic space; we propose a method that selects, between texts and photos, the posts that better explain the topic, taking advantage of multimedia data; the model can automatically choose appropriate hyperparameters; and the method can be applied to real data shared by users, with results comparable to those obtained on news data.
Keywords: Topic model; interest mining; social media analysis.
Abstract (Chinese)
Abstract
Contents
Chapter 1 Introduction
1.1 Research Background
1.2 Motivation and Objective
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Topic Models
2.2 Multi-Media Topic Exploration
Chapter 3 Proposed Method
3.1 Overview of Proposed Method
3.2 Feature Extraction
3.3 Model Training
3.4 Model Inference
Chapter 4 Experiments and Discussions
4.1 Data Collection
4.2 Confirmation
4.3 Comparison
4.4 Discussion
Chapter 5 Conclusion
References
