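Author (Chinese): 黃昭學
Author (English): Huang, Jau-Shiue
Thesis title (Chinese): 利用基於圖的深度學習定量結構性質關係預測溶解度性質
Thesis title (English): Graph-based Deep Learning Quantitative Structure Properties Relation for Solvents
Advisor (Chinese): 汪上曉
Advisor (English): Wong, David Shan-Hill
Committee members (Chinese): 林祥泰; 姚遠; 康嘉麟
Committee members (English): Lin, Shang Tai; Yao, Yuan; Kang, Jia-Lin
Degree: Master's
University: National Tsing Hua University
Department: Department of Chemical Engineering
Student ID: 109030605
Year of publication (ROC calendar): 111 (2022)
Graduation academic year: 110
Language: Chinese
Number of pages: 50
Keywords (Chinese): 深度學習; 定量結構性質關係; 溶解度性質; 訊息傳遞神經網絡
Keywords (English): Deep Learning; Quantitative Structure Property Relation; Solvent Properties; Message Passing Neural Networks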
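Predicting molecular properties has long been an important research topic in molecular design. The development of quantitative structure-property relations (QSPR) has reduced the experimental and computational burden in computer-aided molecular design. In traditional QSPR, such as group contribution methods, a molecular structure is represented by a series of functional-group counts (a one-dimensional integer vector). We recently developed a deep learning (DL) QSPR that takes the Simplified Molecular-Input Line-Entry System (SMILES) as input and used it to predict molecular properties. However, SMILES input does not contain the full 3D molecular structure, so predictions of properties tied to molecular geometry carry larger errors. Earlier studies have shown that molecular graphs carrying the full 3D structure perform well in molecular property prediction. In this work, molecules are converted into graphs from a molecular file format called .MOL, and a Message Passing Neural Network (MPNN) architecture is used to predict the properties of about 8800 small organic molecules in the Hansen Solubility Parameters in Practice (HSPiP) database. We found that the key to success is that the model's predictions must not depend on the order of the nodes in the molecular graph; this property, called order invariance, is provided by the readout function of the MPNN. We also modified the MPNN architecture: a model that uses only a readout function followed by a deep neural network also performs well for certain molecular properties, and we compared different readout functions. In addition to the properties already in the HSPiP database, we used Materials Studio to generate additional properties for the molecules in the database and predicted them as well. Introducing selective sampling improved the predictions.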
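The conversion from a .MOL file to a molecular graph mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common V2000 fixed-width molfile layout (three header lines, a counts line, an atom block, a bond block); the function names `parse_molfile` and `adjacency_list` and the two-atom record with made-up coordinates are hypothetical illustrations, not the thesis's actual preprocessing code.

```python
# Illustrative V2000 record: heavy atoms only, coordinates are made up.
MOL_TEXT = """\
methanol
  hypothetical example

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4300    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Return (atoms, bonds) from a V2000 molfile string.

    atoms: list of {"symbol": str, "xyz": (x, y, z)}  -> graph nodes
    bonds: list of (i, j, order) with 0-based indices -> graph edges
    """
    lines = text.splitlines()
    counts = lines[3]  # line 4: atom count in columns 1-3, bond count in 4-6
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # Three 10-character coordinate fields, one space, 3-character symbol.
        xyz = (float(line[0:10]), float(line[10:20]), float(line[20:30]))
        atoms.append({"symbol": line[31:34].strip(), "xyz": xyz})

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        i, j, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        bonds.append((i - 1, j - 1, order))  # molfile indices are 1-based
    return atoms, bonds


def adjacency_list(n_atoms, bonds):
    """Undirected neighbour lists built from the bond block."""
    neighbours = [[] for _ in range(n_atoms)]
    for i, j, _order in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    return neighbours


atoms, bonds = parse_molfile(MOL_TEXT)
graph = adjacency_list(len(atoms), bonds)
```

Here `atoms` supplies node features (element symbol, 3D coordinates) and `bonds` supplies edges with bond orders. A production pipeline would more likely rely on a cheminformatics toolkit than on hand-written column slicing.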
Development of a surrogate quantitative structure property relation (QSPR) alleviates the experimental or computational burden in computer-aided molecular design. In traditional QSPR, such as the group contribution method, molecular structure is represented by the numbers of functional groups present (a one-dimensional integer vector). Recently, a deep learning (DL) QSPR using a text-based representation of molecular structure known as the simplified molecular-input line-entry system (SMILES) has been developed by our group to predict molecular properties such as the COSMO sigma-profile. However, the text-based input itself does not include the full 3-D architecture, and inclusion of subunits is required to improve the model. Alternatively, many works have demonstrated that good predictions of various properties can be achieved by graph-based representations of molecular structure. Most of these works employed a graph neural network known as the Message Passing Neural Network (MPNN). In this work, we used a molecular representation known as the .MOL file as the predictor of molecular properties. Using the HSPiP dataset, which contains around 8800 compounds, we found that the key to success is enforcing the requirement that a prediction from a graph input be independent of the ordering of nodes in the graph. This characteristic is known as "shuffle invariance" and is guaranteed by an appropriate readout function in graph neural networks. We found that a good model for property prediction can also be developed with a readout function followed by a deep neural network; use of an MPNN is desirable but not essential. Readout functions were also compared, and some improvement can be achieved by selecting a good readout function. The proposed method was also applied, with .MOL as the representation, to predict the molecular properties in the database together with appended properties calculated by the molecular mechanics tool in Materials Studio. However, the accuracy for these properties was found to be inferior even with a good readout function, and selective sampling had to be used to improve the predictions substantially.
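The shuffle-invariance requirement in the abstract can be checked on a toy graph. The sketch below is an assumption-laden illustration: the message-passing step is a simplified, parameter-free neighbour sum (not the learned message and update functions of a real MPNN), the three-node graph and its features are invented, and all function names are hypothetical. It contrasts a sum readout, which ignores node order, with plain flattening, which does not.

```python
def message_passing_step(features, neighbours):
    """Simplified update: each node adds the sum of its neighbours' features."""
    updated = []
    for v, h_v in enumerate(features):
        agg = [0.0] * len(h_v)
        for w in neighbours[v]:
            agg = [a + b for a, b in zip(agg, features[w])]
        updated.append([a + b for a, b in zip(h_v, agg)])
    return updated


def sum_readout(features):
    """Order-invariant readout: element-wise sum over all nodes."""
    return [sum(column) for column in zip(*features)]


def flatten_readout(features):
    """Order-dependent baseline: concatenate node vectors in index order."""
    return [x for h in features for x in h]


def permute_graph(features, neighbours, perm):
    """Relabel nodes so that new node k is old node perm[k]."""
    inverse = {old: new for new, old in enumerate(perm)}
    new_features = [features[old] for old in perm]
    new_neighbours = [[inverse[w] for w in neighbours[old]] for old in perm]
    return new_features, new_neighbours


# Toy chain C-C-O with two-dimensional one-hot features (C, O).
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbours = [[1], [0, 2], [1]]

h = message_passing_step(features, neighbours)

# Same molecule with its nodes listed in a different order.
p_features, p_neighbours = permute_graph(features, neighbours, [2, 0, 1])
h_perm = message_passing_step(p_features, p_neighbours)

same_sum = sum_readout(h) == sum_readout(h_perm)              # True
same_flat = flatten_readout(h) == flatten_readout(h_perm)     # False
```

Feeding `sum_readout(h)` into a dense network gives the same prediction for any node ordering of the same molecule, whereas a network fed `flatten_readout(h)` would see different inputs for the same molecule. Mean or max over nodes are other order-invariant choices.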
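Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research background
1.2 Quantitative structure-property relation
1.3 Solubility properties
1.3.1 Solubility parameters
1.3.2 Conductor-like screening model (COSMO), sigma profile, and activity coefficients
1.4 Research motivation
1.5 Chapter organization
Chapter 2 Deep learning QSPR based on molecular graphs
2.1 Molecular representations
2.1.1 SMILES
2.1.2 Molecular fingerprints
2.1.3 Molecular graphs
2.2 SMILES-based deep learning QSPR
2.2.1 Universal digital chemical space
2.2.2 Transformer model and SMILES+k-mers
2.3 Generating molecular graphs
2.3.1 .MOL files
2.3.2 Generating molecular graphs
2.4 Message passing neural networks
2.5 Order invariance and readout functions
2.5.1 Readout functions
2.5.2 Order invariance
2.6 Chapter summary
Chapter 3 Model structure and training method
3.1 Databases
3.1.1 HSPiP database
3.1.2 Appended properties
3.2 Model architecture
3.3 Model training
3.4 Selective sampling
3.5 Chapter summary
Chapter 4 Results and discussion
4.1 Solubility properties
4.1.1 Prediction results
4.1.2 Selective sampling
4.2 Solubility property prediction model
4.2.1 Model architecture
4.2.2 Prediction results
4.2.3 Selective sampling
4.3 Other properties
4.3.1 Prediction results
4.3.2 Selective sampling
Chapter 5 Conclusions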
