Author (English name): Huang, Jau-Shiue
Thesis title (English): Graph-based Deep Learning Quantitative Structure Properties Relation for Solvents
Advisor (English name): Wong, David Shan-Hill
Oral defense committee (English names): Lin, Shang Tai
Yao, Yuan
Kang, Jia-Lin
Keywords (English): Deep Learning; Quantitative Structure Property Relation; Solvent properties; Message Passing Neural Networks
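Predicting molecular properties has long been an important research topic in molecular design. The development of quantitative structure property relations (QSPR) reduces the experimental and computational burden in computer-aided molecular design. In traditional QSPR, such as the group contribution method, molecular structure is represented by a set of functional groups (a one-dimensional integer vector). Recently we developed a deep learning (DL) QSPR based on the Simplified Molecular-Input Line-Entry System (SMILES) and used it to predict molecular properties. However, SMILES-based input does not itself contain the complete 3D molecular structure, so predictions of properties that depend on molecular geometry show larger errors. Earlier studies have shown that molecular graphs carrying the complete 3D structure perform well in molecular property prediction. In this work, a molecular file format called .MOL is used to convert molecules into graphs, and a Message Passing Neural Network (MPNN) architecture is used to predict the properties of about 8800 small organic molecules in the Hansen Solubility Parameters in Practice (HSPiP) database. We found that the key to success is that the model prediction must not be affected by the order of the nodes in the molecular graph. This property is called order invariance, and it is realized by the readout function of the MPNN. We also modified the MPNN architecture: a model that uses only the readout function followed by a deep neural network also performs well for certain molecular properties, and different readout functions are compared. In addition to the molecular properties originally in the HSPiP database, we used Material Studio to generate additional properties for the molecules in the database and predicted them as well. Selective sampling is introduced to improve the prediction performance.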
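The conversion of a .MOL file into a graph can be illustrated with a minimal sketch. It is not the thesis code. It assumes a V2000-style molfile (three header lines, a counts line, an atom block, a bond block) and, for brevity, splits fields on whitespace, which only works for small molecules whose fixed-width columns do not run together. The water record, its coordinates, and the function name `mol_to_graph` are hypothetical.

```python
import numpy as np

# Hypothetical V2000 molfile for water; coordinates are illustrative only.
MOL_TEXT = "\n".join([
    "water",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
])


def mol_to_graph(mol_text):
    """Return (atom symbols, 3D coordinates, bond-order adjacency matrix)."""
    lines = mol_text.splitlines()
    # Line 4 is the counts line: number of atoms, then number of bonds.
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    # Atom block: x, y, z, element symbol (remaining fields ignored here).
    symbols, coords = [], []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        symbols.append(symbol)
        coords.append([float(x), float(y), float(z)])

    # Bond block: 1-based atom indices and bond order, stored symmetrically.
    adjacency = np.zeros((n_atoms, n_atoms), dtype=int)
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        i, j, order = (int(tok) for tok in line.split()[:3])
        adjacency[i - 1, j - 1] = order
        adjacency[j - 1, i - 1] = order

    return symbols, np.array(coords), adjacency


symbols, coords, adjacency = mol_to_graph(MOL_TEXT)
print(symbols)
print(adjacency)
```

Here the nodes are atoms, the edges are bonds, and the coordinates carry the 3D geometry that a SMILES string lacks. A production parser would read the fixed-width columns defined by the file format rather than split on whitespace.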
Developing a surrogate quantitative structure property relation (QSPR) alleviates the experimental and computational burden in computer-aided molecular design. In traditional QSPR, such as the group contribution method, molecular structure is represented by the numbers of functional groups present (a one-dimensional integer vector). Recently, our group developed deep learning (DL) QSPRs that use a text-based representation of molecular structure, the simplified molecular-input line-entry system (SMILES), to predict molecular properties such as the COSMO sigma-profile. However, the text-based input does not itself include the full 3-D structure, and subunits must be included to improve the model. Alternatively, many works have demonstrated that good predictions of various properties can be achieved with graph-based representations of molecular structure. Most such works employ a graph neural network known as the Message Passing Neural Network (MPNN). In this work, we used a molecular representation known as the .MOL file as the predictor of molecular properties. Using the HSPiP dataset, which contains around 8800 compounds, we found that the key to success is enforcing the requirement that a prediction made from a graph input be independent of the ordering of the nodes in the graph. This characteristic is known as “shuffle invariance” and is guaranteed by an appropriate readout function in graph neural networks. A good model for property prediction can also be developed with a readout function followed by a deep neural network; use of an MPNN is desirable but not essential. Readout functions were also compared, and some improvement can be achieved by selecting a good readout function. The proposed method, with .MOL as the representation, was also applied to predict the molecular properties in the database together with appended properties calculated by the molecular mechanics tool in Material Studio. However, the accuracy was found to be inferior even with a good readout function. Selective sampling must be used to substantially improve the predictions.
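The role of the readout function in shuffle invariance can be checked numerically. The sketch below is an illustration under assumptions, not the thesis model: the node features are random placeholders, and sum, mean, and max are common readout choices that may differ from those compared in the thesis. A flattening "readout" is included only as a contrast that depends on node order.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder node features: 5 atoms, 4 features per atom.
node_features = rng.normal(size=(5, 4))


def sum_readout(h):
    """Sum over nodes: one graph-level vector, independent of node order."""
    return h.sum(axis=0)


def mean_readout(h):
    """Average over nodes: also independent of node order."""
    return h.mean(axis=0)


def max_readout(h):
    """Feature-wise maximum over nodes: also independent of node order."""
    return h.max(axis=0)


def flatten_readout(h):
    """Concatenate node vectors: depends on node order (contrast case)."""
    return h.reshape(-1)


# Relabel the atoms with a fixed, non-trivial permutation.
perm = np.array([2, 0, 4, 1, 3])
shuffled = node_features[perm]

for readout in (sum_readout, mean_readout, max_readout, flatten_readout):
    same = np.allclose(readout(node_features), readout(shuffled))
    print(f"{readout.__name__}: invariant to node order = {same}")
```

Any network placed after an order-independent readout inherits the invariance, which is consistent with the observation that a readout followed by a deep neural network can already work without the message passing steps.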
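Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Quantitative Structure Property Relation
1.3 Solubility Properties
1.3.1 Solubility Parameters
1.3.2 Conductor-like Screening Model (COSMO), Sigma Profile, and Activity Coefficients
1.4 Research Motivation
1.5 Chapter Organization
Chapter 2 Deep Learning QSPR Based on Molecular Graphs
2.1 Molecular Representations
2.1.1 SMILES
2.1.2 Molecular Fingerprints
2.1.3 Molecular Graphs
2.2 SMILES-based Deep Learning QSPR
2.2.1 Universal Digital Chemical Space
2.2.2 Transformer Model and SMILES+k-mers
2.3 Generating Molecular Graphs
2.3.1 .MOL Files
2.3.2 Generating Molecular Graphs
2.4 Message Passing Neural Networks
2.5 Order Invariance and Readout Functions
2.5.1 Readout Functions
2.5.2 Order Invariance
2.6 Chapter Summary
Chapter 3 Model Architecture and Training Method
3.1 Databases
3.1.1 HSPiP Database
3.1.2 Appended Properties
3.2 Model Architecture
3.3 Model Training
3.4 Selective Sampling
3.5 Chapter Summary
Chapter 4 Results and Discussion
4.1 Solubility Properties
4.1.1 Prediction Results
4.1.2 Selective Sampling
4.2 Solubility Property Prediction Model
4.2.1 Model Architecture
4.2.2 Prediction Results
4.2.3 Selective Sampling
4.3 Other Properties
4.3.1 Prediction Results
4.3.2 Selective Sampling
Chapter 5 Conclusions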

