Author (English): Peng, Kuei-Hsiang
Thesis title (English): Predicting Personality Traits of Chinese Users Based on Facebook Wall Posts
Keywords (English): personality; text classification; text mining; Chinese text mining; machine learning
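Therefore, in this thesis we attempt to classify a person's personality traits from Chinese text. First, we collected the wall posts and personality trait scores of 222 Facebook users who write in Chinese. Next, we applied Jieba Chinese word segmentation to perform the segmentation task and used a support vector machine as the learning algorithm for classifying personality traits.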
Automatically recognizing personality is a promising way to infer a person's behaviors, and many studies have addressed it in recent years. However, very few of them focus on predicting personality from Chinese texts. Chinese texts differ markedly from English texts, in which words are separated by spaces: a Chinese sentence is a sequence of characters with no spaces between them, yet the meaningful unit is the word, not the character. This makes Chinese texts more difficult to analyze, since word boundaries are not obvious.
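The word-boundary problem can be illustrated with a minimal sketch. It uses a greedy forward maximum matching segmenter over a tiny, hypothetical vocabulary. This is a stand-in chosen for illustration only; it is not Jieba and not the method used in the thesis. The names `TOY_VOCAB` and `segment` and the example sentence are assumptions.

```python
from collections import Counter

# Hypothetical toy vocabulary (an assumption for illustration). A real
# segmenter relies on a large dictionary and statistical models, which
# this sketch does not attempt to reproduce.
TOY_VOCAB = {"我", "喜歡", "機器", "學習", "機器學習"}
MAX_WORD_LEN = max(len(word) for word in TOY_VOCAB)


def segment(sentence, vocab=TOY_VOCAB, max_len=MAX_WORD_LEN):
    """Greedy forward maximum matching: at each position take the longest
    vocabulary word, falling back to a single character if none matches."""
    tokens = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "我喜歡機器學習"        # "I like machine learning"
char_tokens = list(sentence)      # character-level units
word_tokens = segment(sentence)   # word-level units

# Bag-of-words counts differ depending on the chosen unit.
char_counts = Counter(char_tokens)
word_counts = Counter(word_tokens)
print(char_tokens)
print(word_tokens)
```

With character-level units the sentence yields seven separate features, whereas segmentation groups them into word-level features that a bag-of-words classifier can use more meaningfully.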
In this thesis, we attempt to classify personality traits from Chinese texts. We collected a dataset of posts and personality scores from 222 Facebook users who use Chinese as their main written language. Jieba Chinese text segmentation was then used for word segmentation, and a support vector machine (SVM) was used as the learning algorithm for personality classification.
Experimental results show that text segmentation substantially improves precision and recall, and that considering both text and friend features yields the best performance. Moreover, we find that extraverts seem to write more sentences and use more common words than introverts do. This indicates that extraverts are more willing than introverts to share their moods and lives with others.
List of Figures
List of Tables
1 Introduction
2 Background
2.1 Big Five Model
2.2 Related work
3 Methods
3.1 Text feature extraction algorithms
3.1.1 Bag-of-words model
3.1.2 Chinese text segmentation
3.1.3 Weighted schemes
3.2 Feature selection algorithms
3.2.1 Chi-squared test
3.2.2 Recursive Feature Elimination
3.3 Classification algorithms
3.3.1 Support vector machines
4 Experiments and results
4.1 Data collection
4.2 Statistical characteristics of the dataset
4.3 Evaluation Metrics
4.3.1 Accuracy
4.3.2 Precision and recall
4.3.3 Negative predictive value and True Negative Rate
4.4 Experiments and results
4.4.1 Experimental setup
4.4.2 Tokenizing with text segmentation
4.4.3 Classifying using document-term matrix in TF or TF-IDF weighted scheme
4.4.4 Classifying using different feature selection approaches
4.4.5 Classifying using both text and friend features
4.4.6 Comparing the selected features of these experiments
5 Conclusion
