Author (English): Wang, Feng-To
Thesis title (English): Statistical Principle-based approach for De-identification of Electronic Medical Records
Advisors (English): Hsu, Wen-Lian
Chen, Yi-Shin
Oral defense committee (English): Dai, Hong-Jie
Chang, Yung-Chun
Keywords (English): De-identification; Language Pattern; Deep Learning; Semantic Encoding; Named Entity Recognition; Word Embedding
With today's internet, obtaining large amounts of training data is no longer difficult. Acquiring data that involves personal information (such as contracts or medical records), however, raises legal issues. For example, in 2016 the National Health Insurance Administration became involved in litigation over its release of health insurance data. Building models or applications on such data therefore usually requires rewriting or replacing the personal information it contains (names, ID numbers, addresses, phone numbers, ages, and so on). For instance, every age could be increased by ten years, so that an original age of 20 becomes 30. This protects the personal information of the data subjects while preserving the structure and content of the original data. This study proposes a named entity recognition (NER) model that is based on linguistic patterns and combined with deep learning. Compared with current work, which relies mostly on deep learning alone, this method is more readable and easier to maintain while retaining high accuracy. We use this method to build a de-identification model for medical records and compare it with current entity recognition models.
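The age-shifting example in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the regular expression, the function name `shift_ages`, and the English age markers it looks for are all assumptions made for illustration. Only the offset of ten years comes from the abstract.

```python
import re

# Offset taken from the abstract's example (an age of 20 becomes 30).
AGE_OFFSET = 10

# Hypothetical pattern (an assumption, not from the thesis): a 1-3 digit
# number that is immediately followed by an English age marker.
AGE_PATTERN = re.compile(
    r"\b(\d{1,3})(?=\s*(?:years? old|y/o|-year-old)\b)"
)


def shift_ages(text: str, offset: int = AGE_OFFSET) -> str:
    """Replace each detected age with the age plus a fixed offset."""

    def _replace(match: re.Match) -> str:
        return str(int(match.group(1)) + offset)

    return AGE_PATTERN.sub(_replace, text)


if __name__ == "__main__":
    print(shift_ages("The patient is 20 years old."))
```

Numbers that are not followed by an age marker (a room number, for example) are left untouched, so the surrounding text keeps its structure.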
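Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Method and Motivation
1-2 Thesis Organization
Chapter 2 Related Work
2-1 Named Entity Recognition
2-2 Word Embedding
2-3 Statistical Principle-based Approach
2-4 Ontology
2-5 Data Augmentation
2-6 De-identification
Chapter 3 A System Based on Statistical Principles and Semantic Encoding
3-1 Data Categories and Annotation
3-2 Semantic Map Knowledge Base Generation
3-3 Semantic Ambiguity Handling
3-4 Data Augmentation
3-5 Surrogate Generation
3-6 Multithreading
Chapter 4 Results
4-1 De-identification Dataset Statistics
4-2 Performance Evaluation Metrics
4-2-1 Evaluation of the Disambiguation Technique
4-2-2 Comparison of De-identification Performance Based on the Disambiguation Technique
4-3 Performance Comparison of Data Augmentation
4-4 De-identification Replacement
Chapter 5 Conclusion and Future Work
References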

