
Detailed Record

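Author (Chinese): 陳韋諭
Author (English): Chen, Wei-Yu
Thesis title (Chinese): 以詞嵌入與概念擷取方法進行生物醫學縮寫的詞義消歧
Thesis title (English): Disambiguation of Biomedical Abbreviations Using Word Embeddings and Concept Extraction
Advisor (Chinese): 林華君
Advisor (English): Lin, Hwa-Chun
Committee members (Chinese): 陳俊良、蔡榮宗
Committee members (English): Chen, Jiann-Liang; Tsai, Jung-Tsung
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 108062527
Year of publication (ROC calendar): 110 (2021)
Academic year of graduation: 109
Language: English
Number of pages: 50
Keywords (Chinese): 詞嵌入、概念擷取、詞義消歧、自然語言處理、機器學習、一體化醫學語言系統
Keywords (English): word embedding; concept extraction; word sense disambiguation; natural language processing; machine learning; UMLS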
Abstract: Abbreviations are common in clinical notes and biomedical articles, and many of them have more than one possible expansion. Identifying the correct expansion is therefore an important word sense disambiguation (WSD) task in natural language processing (NLP). This study proposes a supervised machine learning solution to the problem. First, a pre-trained word embedding model and a Unified Medical Language System (UMLS) concept extraction tool are used to construct four kinds of features for target sentences: word embedding features, UMLS concept preferred name features, UMLS concept n-gram features, and part-of-speech features. Next, Support Vector Machines (SVMs) are used as the machine learning models. After training and testing on a public dataset from the University of Minnesota (UMN), the best combination of features and SVM parameter settings achieved an accuracy of 97.17% on the full dataset of 75 abbreviations, 96.97% on a subset of 50 abbreviations, and 98.50% on a subset of 13 abbreviations. These results exceed those of the methods from other papers used for comparison, which indicates that the proposed solution is effective.
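The word embedding features named in the abstract, together with the SUM/AVG and window-size comparisons listed in the table of contents, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the two-dimensional toy vectors, the example sentence, and the window sizes are all hypothetical, and the toy dictionary stands in for a pre-trained embedding model. The sketch only shows one plausible way to pool the vectors of the words around a target abbreviation into a single fixed-length feature vector.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical 2-dimensional embeddings; a real pre-trained model would
# supply vectors with hundreds of dimensions.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "patient": [1.0, 0.0],
    "given": [0.0, 1.0],
    "for": [1.0, 1.0],
    "pain": [2.0, 2.0],
}

# Hypothetical tokenized sentence; "PT" (index 2) is the ambiguous abbreviation.
SENTENCE: List[str] = ["patient", "given", "PT", "for", "pain"]


def window_tokens(tokens: Sequence[str], target_index: int, window_size: int) -> List[str]:
    """Return up to window_size tokens on each side of the target, excluding the target."""
    left = tokens[max(0, target_index - window_size):target_index]
    right = tokens[target_index + 1:target_index + 1 + window_size]
    return list(left) + list(right)


def embedding_feature(
    tokens: Sequence[str],
    target_index: int,
    embeddings: Dict[str, Vector],
    dim: int,
    window_size: int = 2,
    mode: str = "avg",
) -> Vector:
    """Pool the embeddings of the context words by summing ("sum") or averaging ("avg")."""
    if mode not in ("sum", "avg"):
        raise ValueError("mode must be 'sum' or 'avg'")
    context = window_tokens(tokens, target_index, window_size)
    # Words without an embedding are skipped (an assumption of this sketch).
    vectors = [embeddings[token] for token in context if token in embeddings]
    total = [0.0] * dim
    for vector in vectors:
        for i, value in enumerate(vector):
            total[i] += value
    if mode == "sum" or not vectors:
        return total
    return [value / len(vectors) for value in total]


print(embedding_feature(SENTENCE, 2, TOY_EMBEDDINGS, dim=2, window_size=2, mode="sum"))  # [4.0, 4.0]
print(embedding_feature(SENTENCE, 2, TOY_EMBEDDINGS, dim=2, window_size=2, mode="avg"))  # [1.0, 1.0]
```

In a pipeline like the one the abstract describes, a vector of this kind would presumably be concatenated with the other feature types and passed to an SVM classifier (for example, scikit-learn's `SVC`) trained separately for each abbreviation; that arrangement is an assumption here, not a detail confirmed by this record.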
Abstract (Chinese)
Abstract
Contents
List of Tables
1 Introduction
2 Related Works
2.1 Supervised Machine Learning Approaches
2.1.1 Traditional Machine Learning Solutions
2.1.2 Deep Learning Solutions
2.2 Vector Space Model Based Approaches
2.3 Hyperdimensional Computing Approaches
2.4 Rule Based Approaches
3 Methodology
3.1 Dataset
3.2 Model
3.3 Features
3.3.1 Word Embedding Features
3.3.2 UMLS Concept Preferred Name Features
3.3.3 UMLS Concept N-gram Features
3.3.4 Part-of-speech Features
4 Experiments
5 Discussion
5.1 Effectiveness of Word Embedding Features
5.2 Effectiveness of UMLS Concept Features
5.3 Effectiveness of Combining Multiple Features
5.4 Comparing SUM and AVG Features
5.5 Comparing Window Sizes
5.6 Comparing SVM Parameter Sets
5.7 Comparing with Other Papers
5.8 Limitations
6 Conclusion
References
(Full text will be open to external viewing after 2026-10-21)