Chang, Ting-Wei
A Vector Model for Relatedness Computation of UMLS CUI
Lin, Hwa-Chun
bioinformatics, unified medical language system, biomedical concept, word embedding, relatedness
Relatedness between UMLS (Unified Medical Language System) CUIs (Concept Unique Identifiers) is useful in many biomedical NLP (Natural Language Processing) tasks. Published methods for computing CUI relatedness fall into two types: path-based approaches and corpus-driven approaches. Both share a disadvantage: neither can compute the relatedness of every possible pair of UMLS CUIs. Human-written biomedical text, such as medical records and biomedical literature, commonly contains biomedical concepts represented by many UMLS CUIs, and downstream NLP tasks may suffer if the relatedness of all CUI pairs appearing in such text cannot be computed. To solve this problem, this paper presents a CUI vector model that includes a vector for every non-obsolete CUI in UMLS. We evaluate relatedness computation on three datasets; the most reliable one (MiniMayoSRS) includes judgements made by physicians and biomedical coders. The Spearman correlation between the relatedness computed by our best model and the physicians' judgements is 0.759, and it is 0.842 for the biomedical coders' judgements. Finally, we compare our models with vector models proposed by other research groups, using correlation and a new evaluation method of our own. The results show that our best model not only achieves high CUI coverage but also maintains decent performance.
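The evaluation summarized in the abstract can be illustrated with a minimal sketch. It assumes that relatedness between two CUI vectors is measured by cosine similarity, which is a common choice for word embeddings but is not stated in the abstract. The CUI names, vectors, and human ratings below are toy placeholders, not data from MiniMayoSRS or from the model described here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed relatedness measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

# Hypothetical 2-dimensional CUI vectors (real embeddings have many more dimensions).
vectors = {
    "CUI_A": [1.0, 0.0],
    "CUI_B": [0.9, 0.1],
    "CUI_C": [0.0, 1.0],
    "CUI_D": [0.5, 0.5],
}

# Hypothetical (cui_1, cui_2, human relatedness rating) triples.
rated_pairs = [
    ("CUI_A", "CUI_B", 3.8),
    ("CUI_A", "CUI_D", 2.5),
    ("CUI_A", "CUI_C", 1.0),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation on toy data: {rho:.3f}")
```

Because Spearman correlation depends only on rank order, it rewards a model for ordering concept pairs the way human raters do, regardless of the absolute scale of the similarity scores.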
Chapter 1 Introduction
Chapter 2 Background
2.1 Unified Medical Language System (UMLS)
2.2 MetaMap
2.3 Word2Vec
Chapter 3 Related Works
Chapter 4 Method
4.1 Analysis of Human-Written Biomedical Literature Abstracts
4.2 Data Preprocessing
4.3 Generating the Training Corpus
4.4 Training the Vector Model with Word2Vec
Chapter 5 Evaluation
5.1 CUI Relatedness Datasets
5.2 Experimental Results
5.3 Models Compared
5.4 Comparison Results
Chapter 6 Conclusion
6.1 Conclusion
6.2 Future Work
