
Detailed Record

Author (Chinese): 王鳳鐸
Author (English): Wang, Feng-To
Title (Chinese): 詞嵌入增強之統計準則式方法於病例去識別化之應用
Title (English): Statistical Principle-based Approach for De-identification of Electronic Medical Records
Advisors (Chinese): 許聞廉、陳宜欣
Advisors (English): Hsu, Wen-Lian; Chen, Yi-Shin
Committee Members (Chinese): 戴鴻傑、張詠淳
Committee Members (English): Dai, Hong-Jie; Chang, Yung-Chun
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 108062585
Year of Publication (ROC calendar): 111 (2022)
Graduation Academic Year: 110
Language: Chinese
Number of Pages: 42
Keywords: De-identification, Language Pattern, Deep Learning, Semantic Encoding, Named Entity Recognition, Word Embedding
Abstract: With the development of the internet, obtaining a large amount of training data is no longer difficult. However, acquiring data that involves personal information (such as contracts or medical records) raises legal issues; for example, in 2016 the National Health Insurance Administration faced litigation over its release of health insurance data. Therefore, to build models or applications on such data, fields containing personal information (names, ID numbers, addresses, phone numbers, ages, and so on) usually have to be rewritten or replaced. For instance, every age can be shifted upward by ten years, so that an original age of 20 becomes 30; this protects the identity of the data subjects while preserving the structure and content of the original data. This study proposes an approach based on linguistic patterns, combined with deep learning, to construct a named entity recognition (NER) model. Compared with current work, which relies mostly on deep learning alone, this approach offers better readability and easier maintenance while retaining high accuracy. We use this method to build a de-identification model for medical records and compare it with existing named entity recognition models.
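To make the replacement idea concrete, the following is a minimal Python sketch of the kind of substitution described above. The regular-expression patterns and placeholder tokens are illustrative assumptions standing in for the thesis's statistical principle-based linguistic templates, and the +10 age shift simply follows the example given in the abstract.

```python
import re

# Minimal sketch of the surrogate-replacement step described in the abstract.
# The regex patterns and placeholder tokens below are illustrative assumptions,
# not the thesis's actual linguistic templates or surrogate-generation rules.

AGE_PATTERN = re.compile(r"(\d{1,3})\s*歲")    # e.g. "20歲" (20 years old)
ID_PATTERN = re.compile(r"[A-Z]\d{9}")         # Taiwanese national ID format
PHONE_PATTERN = re.compile(r"09\d{8}")         # Taiwanese mobile phone number

def de_identify(text: str) -> str:
    """Replace personal identifiers with surrogates while keeping the
    surrounding sentence structure intact."""
    # Shift every detected age upward by ten years (the abstract's example).
    text = AGE_PATTERN.sub(lambda m: f"{int(m.group(1)) + 10}歲", text)
    # Mask ID numbers and phone numbers with placeholder tokens.
    text = ID_PATTERN.sub("[ID]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text

if __name__ == "__main__":
    record = "病患20歲,身分證字號A123456789,聯絡電話0912345678。"
    print(de_identify(record))
    # -> 病患30歲,身分證字號[ID],聯絡電話[PHONE]。
```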
Table of Contents:
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
1-1  Research Methods and Motivation
1-2  Thesis Organization
Chapter 2  Related Work
2-1  Named Entity Recognition
2-2  Word Embedding
2-3  Statistical Principle-based Approach
2-4  Ontology
2-5  Data Augmentation
2-6  De-identification
Chapter 3  Statistical Principle-based and Semantic Encoding System
3-1  Data Categories and Annotation
3-2  Semantic Map Knowledge Base Generation
3-3  Semantic Disambiguation
3-4  Data Augmentation
3-5  Surrogate Generation
3-6  Multithreading
Chapter 4  Results
4-1  De-identification Dataset Statistics
4-2  Evaluation Metrics
4-2-1  Evaluation of Disambiguation Techniques
4-2-2  De-identification Performance Comparison Based on Disambiguation Techniques
4-3  Data Augmentation Performance Comparison
4-4  De-identification Replacement
Chapter 5  Conclusion and Future Work
References
