Detailed Record

Author (Chinese): 陳宣佑
Author (English): Chen, Hsuan-Yu
Title (Chinese): 應用監督與半監督式學習於模糊主題之文獻探討的自動化
Title (English): Automation of Literature Review of Ambiguous Topics Using Supervised & Semi-Supervised Learning
Advisor (Chinese): 徐茉莉
Advisor (English): Shmueli, Galit
Committee Members (Chinese): 雷松亞, 李曉惠
Committee Members (English): Ray, Soumya; Lee, Hsiao-Hui
Degree: Master's
University: National Tsing Hua University
Department: Institute of Service Science
Student ID: 107078701
Publication Year (ROC calendar): 110 (2021)
Graduation Academic Year: 109
Language: English
Number of Pages: 78
Keywords (Chinese): 自然語言處理, 深度學習, 機器學習, 文獻探討自動化, 小樣本
Keywords (English): Natural Language Processing, Deep Learning, Machine Learning, Literature Review Automation, Small Data
Abstract:
Researchers in both academia and industry routinely perform literature reviews. However, literature reviews can be time-consuming, especially when the topic of interest is hard to define, such as "big data", and there are no well-defined keywords to search for it. The goal of this research is to help researchers conduct this type of literature review by automating the process with natural language processing (NLP) algorithms and machine learning predictive models. Suppose a researcher starts by manually labeling a set of relevant papers. By training a model on these labeled papers, NLP algorithms can help find related papers in the literature more quickly.

We focus on the case of searching the operations management (OM) literature for papers on "behavioral big data" (BBD), an ambiguous concept. This extends previous work by Mach (2019), which used manually selected terms as features. Using Mach's 368 papers collected from three top OM journals, Lee et al. (2019) expanded the approach to use term frequencies in the documents (TF-IDF) as features, chosen algorithmically rather than through domain knowledge.
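To illustrate the TF-IDF representation, here is a minimal sketch using scikit-learn; the placeholder texts and settings are illustrative assumptions, not necessarily the configuration used by Lee et al. (2019).

```python
# Minimal sketch: turning document text into TF-IDF feature vectors.
# The texts below are hypothetical placeholders for parsed paper text.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "big data analytics in service operations",
    "queueing models for hospital capacity planning",
    "behavioral experiments with retail transaction data",
]

# Each document becomes a sparse vector of term weights: terms frequent
# in a document but rare across the corpus receive high weights.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)             # shape: (3, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])
```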

In contrast to Mach (2019), and similar to Lee et al. (2019), we use the TF-IDF NLP algorithm to generate our features. In addition to TF-IDF, we also use two state-of-the-art embedding techniques, the Universal Sentence Encoder (USE) and BERT, to embed our documents' text as features. We then pair these features with various machine learning and deep learning models, including logistic regression, random forest, deep neural networks, LSTM (Long Short-Term Memory), and CNN (Convolutional Neural Network). Because most of our data are unlabeled documents, we also extend our experiments to semi-supervised learning.
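To make the embedding-plus-classifier setup concrete, the sketch below embeds placeholder texts with the public Universal Sentence Encoder module from TF-Hub and fits a logistic regression on the resulting vectors; the texts, labels, and hyperparameters are assumptions for illustration, not the thesis's exact pipeline.

```python
# Minimal sketch: USE embeddings fed into a simple classifier.
import tensorflow_hub as hub
from sklearn.linear_model import LogisticRegression

# Public Universal Sentence Encoder v4 module (512-dim text vectors).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

texts = [
    "abstract of a paper about behavioral big data",     # hypothetical texts
    "abstract of a paper about inventory optimization",
]
labels = [1, 0]                        # placeholder manual labels

X = embed(texts).numpy()               # shape: (2, 512) dense features
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1])      # predicted probability of relevance
```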

In our experiments, we found that among the various combinations of features and models, the LSTM performed best under the supervised learning scenario in terms of precision and recall. Recall is the more important criterion in our case, because our goal is to capture as many of the truly related documents as possible with our algorithmic solution.
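For concreteness, the toy example below computes precision and recall with scikit-learn on hypothetical labels and predictions.

```python
# Minimal sketch: precision and recall on hypothetical predictions.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 0, 0, 1]   # manual labels (1 = relevant paper)
y_pred = [1, 0, 0, 0, 1]   # model predictions

print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/2 = 1.00
print("recall:", recall_score(y_true, y_pred))        # TP/(TP+FN) = 2/3 ≈ 0.67
```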

Our experimental results show that unlabeled data do help improve performance, not only on the test set from the same journal but also on one other journal (MSOM). However, when applied to a third journal (POM), the performance did not improve, a typical instance of "data shift". Our work shows that using NLP for literature reviews of ambiguous topics can provide a useful automated solution when the training and test data are sufficiently similar. We suggest that further improvements might be achieved by improving automatic PDF parsing, studying different semi-supervised learning methods, training models on separate sections of a paper, and customizing the models' loss function to handle the imbalanced-data issue.
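As a minimal sketch of one standard semi-supervised scheme, the snippet below applies self-training with scikit-learn's SelfTrainingClassifier to synthetic placeholder features; the thesis's actual semi-supervised method and data may differ.

```python
# Minimal sketch: self-training semi-supervised learning (scikit-learn).
# Features are synthetic placeholders; unlabeled documents get label -1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))           # placeholder document features
y = np.full(100, -1)                     # -1 marks unlabeled documents
y[:20] = (X[:20, 0] > 0).astype(int)     # a small hand-labeled subset

# The base classifier pseudo-labels unlabeled documents it is confident
# about (probability above the threshold) and retrains on them.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8)
model.fit(X, y)
print(model.predict(X[:5]))
```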
Table of Contents:
Acknowledgements
List of Tables
List of Figures
Abstract
中文摘要
Chapter 1: Introduction and Motivation
Chapter 2: Data
2.1 Data Source
2.1.1 Paper Selection
2.2 Paper Labeling
2.3 Data Partitioning
2.3.1 Training / Test Partitioning & Cross Validation
2.3.2 Additional Test Sets (MS, POM & MSOM)
Chapter 3: Methodology Literature Review
3.1 Embedding
3.1.1 Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF)
3.1.2 Neural Networks (NN) and Deep Learning
3.1.3 Universal Sentence Encoder (USE)
3.1.4 BERT
3.2 Learning Schemes
3.2.1 Supervised Learning
3.2.2 Unsupervised Learning
3.2.3 Semi-supervised Learning
3.3 Classifiers
3.4 Evaluation
3.4.1 Confusion Matrix
3.4.2 Precision & Recall, and the PR Curve
3.4.3 Model Explainability
3.5 Implementation in Python
Chapter 4: Modeling & Results
4.1 Preprocessing
4.1.1 Parsing the PDF Documents
4.1.2 Text Processing
4.1.3 Text Embedding
4.2 Performance Results
4.2.1 Reproducing Performance from Previous Studies
4.2.2 Performance on Training Set (Management Science 2013-2017)
4.2.3 Performance Comparison of Abstract Only vs. Full Text
4.2.4 Comparing Performance of Manual and Automated Parsing
4.2.5 Performance Comparison of Supervised and Semi-Supervised Approaches
4.2.6 Performance on Test Set 1 (Management Science 2017 Papers)
4.2.7 Performance on Test Set 2 (Other Journals: POM, MSOM)
4.2.8 Performance Compared to Previous Work (Mach, 2019)
4.3 Model Explanation
Chapter 5: Conclusion, Limitations and Future Directions
5.1 Conclusion
5.2 Limitations & Future Directions
References
Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., ... & Kurzweil, R. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Dai, T., & Tayur, S. (2020). OM Forum—Healthcare Operations Management: A Snapshot of Emerging Research. Manufacturing & Service Operations Management, 22(5), 869-887.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Flach, P. A., & Kull, M. (2015). Precision-recall-gain curves: PR analysis done right. In Advances in Neural Information Processing Systems, Vol. 1, pp. 838-846.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets (pp. 267-285). Springer, Berlin, Heidelberg.
Ghahramani, Z. (2004). Unsupervised learning. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced Lectures on Machine Learning (ML 2003), Lecture Notes in Computer Science, Vol. 3176. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28650-9_5
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.
Greaves, F., Ramirez-Cano, D., Millett, C., Darzi, A., & Donaldson, L. (2013). Use of sentiment analysis for capturing patient experience from free-text comments posted online. Journal of Medical Internet Research, 15(11), e239.
Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 107-116.
Kanai, S., Fujiwara, Y., Yamanaka, Y., & Adachi, S. (2018). Sigsoftmax: Reanalysis of the softmax bottleneck. arXiv preprint arXiv:1805.10829.
Keskinocak, P., & Savva, N. (2020). A Review of the Healthcare-Management (Modeling) Literature Published in Manufacturing & Service Operations Management. Manufacturing & Service Operations Management, 22(1), 59-72.
Lee, H.-H., Mach, P., Shmueli, G., & Yahav, I. (2019). Automating literature reviews: Searching for behavioral big data in operations management. In 2019 IEEE 5th International Conference on Big Data Intelligence and Computing (DATACOM), Kaohsiung, Taiwan.
Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, 27, 2177-2185.
Madabushi, H. T., Kochkina, E., & Castelle, M. (2020). Cost-sensitive BERT for generalisable sentence classification with imbalanced data. arXiv preprint arXiv:2003.11563.
Nevin, B. E. (1993). A minimalist program for linguistics: The work of Zellig Harris on meaning and information. Historiographia Linguistica, 20(2-3), 355-398.
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International conference on machine learning. Proceedings of Machine Learning Research, pp. 1310-1318.
Ramos, J. (2003, December). Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, Vol. 242, No. 1, pp. 29-48.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144.
Rokach, L., & Maimon, O. (2005). Decision trees. In Data mining and knowledge discovery handbook. Springer, Boston, MA.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Terwiesch, C., Olivares, M., Staats, B. R., & Gaur, V. (2020). OM Forum—A Review of Empirical Operations Management over the Last Two Decades. Manufacturing & Service Operations Management, 22(4), 656-668.
Vijayarani, S., Ilamathi, M. J., & Nithya, M. (2015). Preprocessing techniques for text mining: An overview. International Journal of Computer Science & Communication Networks, 5(1), 7-16.
Wang, J., Yu, L. C., Lai, K. R., & Zhang, X. (2016). Dimensional sentiment analysis using a regional CNN-LSTM model. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 2: Short papers), pp. 225-230.
Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., & Lee Giles, C. (2017). Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5315-5324.
Zhou, X., & Belkin, M. (2014). Semi-supervised learning. In P. S. R. Diniz, J. A. K. Suykens, R. Chellappa, & S. Theodoridis (Eds.), Academic Press Library in Signal Processing, Vol. 1, pp. 1239-1269. Elsevier.