Author (foreign language): Hsu, Ching-Tang
Thesis title (foreign language): Use machine learning to establish a predictive model to efficiently estimate the sizes of residue fluctuations in proteins.
Advisor (foreign language): Yang, Lee-Wei
Oral defense committee (foreign language): Lin, Che
Yang, Jinn-Moon
Keywords (foreign language): protein fluctuation; elastic network model (ENM); machine learning; deep neural networks (DNN); molecular dynamics simulation (MD); Shannon entropy; native ensemble; protein shape; nuclear magnetic resonance; X-ray crystallography; eigenvalue; random forest; linear regression
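The size of a protein's thermally driven fluctuations and its intrinsic conformational changes are critical to protein-protein interactions and to the inhibition of protein activity. Molecular dynamics (MD) simulation is now widely used to explore structural changes in proteins and to observe interactions between proteins or between a protein and a small molecule (ligand). MD can be used to observe protein fluctuations, but it demands substantial computing resources and time, especially for protein complexes. Protein fluctuation can be decomposed into direction and amplitude. Physical models such as the Elastic Network Model (ENM) already predict the direction of fluctuation effectively, but no efficient model yet predicts the absolute amplitude of fluctuation from a single three-dimensional protein structure. This study therefore uses a deep neural network (DNN) to build a fast and effective model for predicting the scale of protein fluctuations. A DNN is trained by supervised learning and requires a sufficient number of features and labels. We selected 2792 protein families with known three-dimensional structures as training data, and extracted properties of the proteins and of each residue from sequence information, structure and ENM-defined protein dynamics as input features. The training labels are the absolute fluctuation amplitude of each residue (root-mean-square fluctuation, RMSF). Because of time and computing constraints, the MD-derived RMSF (RMSFMD) of every residue could not be obtained for all protein families, so we approximated the MD results with absolute amplitudes computed by three methods: 1. RMSFNE, in which the native ensemble formed by different three-dimensional structures of the same protein is treated as the trajectory of that protein and the RMSF is computed from it. 2. RMSFB, converted from X-ray B-factors. 3. RMSFG, estimated with the Gaussian Network Model (GNM). We found that the absolute level of RMSFB is closest to RMSFMD (0.354 percentage difference). The relative per-residue fluctuation sizes of RMSFNE and RMSFG correlate well with RMSFMD (correlation coefficients of 0.632 and 0.673), but their absolute amplitudes are far from RMSFMD (0.610 and 0.467 percentage difference). We therefore shifted RMSFNE and RMSFG to the same absolute level as RMSFB, forming two new prediction targets, Shifted RMSFNE (RMSFSNE) and Shifted RMSFG (RMSFSG). Both new targets approximate RMSFMD successfully, with percentage differences of 0.477 and 0.354, respectively. We trained deep network models with RMSFB, RMSFSNE and RMSFSG as the training labels. The models trained on RMSFSG and RMSFB perform best, with correlation coefficients between predicted and experimental RMSF of 0.751 and 0.638. Comparing the predicted RMSFSG and RMSFB with RMSFMD gives percentage differences of 0.319 and 0.371, respectively, showing that the RMSFSG model predicts protein fluctuations in aqueous solution reasonably well. We tallied the weights of the 39 features in the trained models. The data show that all GNM-related features carry high weights, with Shannon entropy taking the largest share, which indicates that the GNM frequency distribution contributes greatly to predicting protein fluctuations. Protein-shape features also carry high weights, indicating that protein shape is related to fluctuation size as well. We will combine ENM (direction of motion) with the RMSFSG model (size of motion) to predict possible protein conformations and apply them to protein-ligand docking to test the applicability of this approach.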
Thermal fluctuations and intrinsic conformational changes of proteins are critical to protein-protein interaction, inhibition and function. Although molecular dynamics (MD) simulation is widely used to explore structural changes in proteins and to observe protein-protein and protein-ligand interactions, simulating protein fluctuations with MD is usually costly and energy-inefficient, especially for large protein complexes. A fluctuation has both a size and a direction. Existing physical models such as the Elastic Network Model (ENM) can efficiently estimate the direction of protein fluctuations. However, there is not yet a well-accepted method, more efficient than MD, for predicting the absolute sizes of fluctuations. In this study, we use a deep neural network (DNN) to establish a predictive model that efficiently estimates the sizes of residue fluctuations in proteins. A DNN is trained by supervised learning, which requires adequate features and labels (answers). We extracted features from 2792 protein structural clusters to train our model. The features comprise 39 characteristics extracted from protein sequences, structures and ENM-defined vibrational dynamics. The training target is the size of residue fluctuations in water, but given limited time and computing resources we could not obtain it directly from MD for all 2792 clusters. We therefore approximated the MD simulation results with absolute fluctuation sizes calculated by the following three methods: 1. RMSFNE, the root-mean-square fluctuation (RMSF) of the native ensemble formed by different structures of the same protein, taken as the conformational spread of the protein. 2. RMSFB, the RMSF derived from atomic B-factors of structures resolved by X-ray crystallography. 3. RMSFG, the RMSF derived from the Gaussian Network Model (GNM) with a force constant of 0.6 kcal/mol/Å².
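The last two targets in the list above can be sketched in a few lines. This is a minimal illustration, not the thesis pipeline: the B-factor conversion uses the standard isotropic relation B = (8π²/3)⟨Δr²⟩, and the GNM part uses the standard Kirchhoff-matrix formulation ⟨ΔR_i²⟩ = (3·k_B·T/γ)·[Γ⁻¹]_ii with the force constant of 0.6 kcal/mol/Å² quoted above. The function names, the 7 Å contact cutoff and the 298 K temperature are assumptions made for this example.

```python
import numpy as np

# k_B*T in kcal/mol at roughly 298 K (assumed temperature, not from the abstract).
KT_KCAL_PER_MOL = 0.5925


def rmsf_from_bfactor(bfactors):
    """Convert isotropic B-factors (in A^2) to RMSF (in A).

    Uses B = (8*pi^2/3) * <dr^2>, so RMSF = sqrt(3*B / (8*pi^2)).
    """
    b = np.asarray(bfactors, dtype=float)
    return np.sqrt(3.0 * b / (8.0 * np.pi ** 2))


def gnm_rmsf(coords, cutoff=7.0, gamma=0.6, kt=KT_KCAL_PER_MOL):
    """Per-residue RMSF (in A) from a Gaussian Network Model.

    coords: (N, 3) array of C-alpha positions in A.
    cutoff: contact distance in A (assumed value for this sketch).
    gamma:  spring constant in kcal/mol/A^2 (0.6, as stated in the abstract).
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    contact = (dist <= cutoff) & ~np.eye(n, dtype=bool)

    # Kirchhoff matrix: -1 for residues in contact, contact count on the diagonal.
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))

    # Pseudo-inverse diagonal from the non-zero modes only.
    evals, evecs = np.linalg.eigh(kirchhoff)
    keep = evals > 1e-8
    inv_diag = (evecs[:, keep] ** 2 / evals[keep]).sum(axis=1)

    return np.sqrt(3.0 * kt / gamma * inv_diag)


# Toy usage: a straight five-bead chain with 3.8 A spacing.
chain = [[3.8 * i, 0.0, 0.0] for i in range(5)]
print(gnm_rmsf(chain))
print(rmsf_from_bfactor([15.0, 30.0, 60.0]))
```

In the toy chain the terminal beads have fewer contacts than the central one, so the model assigns them larger fluctuations, which is the qualitative behaviour GNM is used for.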
In this study we found that the absolute size of RMSFB is closer to RMSFMD (0.354 percentage difference) than that of the other two. RMSFNE and RMSFG correlate well with RMSFMD (0.632 and 0.673, respectively), but their absolute fluctuation sizes do not match it well (percentage differences of 0.610 and 0.467). We therefore shifted RMSFNE and RMSFG to the absolute level of RMSFB to form two new training targets, “shifted” RMSFNE (RMSFSNE) and “shifted” RMSFG (RMSFSG). The two new targets, RMSFSG and RMSFSNE, approach RMSFMD successfully (0.354 and 0.477 percentage difference, respectively). We then used RMSFB, RMSFSNE and RMSFSG as training targets for deep learning, to predict the sizes of these three RMSFs as well as RMSFMD. The trained RMSFSG and RMSFB models predict the experimental results with correlations of 0.751 and 0.638, and predict RMSFMD with percentage differences of 0.319 and 0.371. These results show that the RMSFSG model, compared with the other models, is a good predictor of fluctuation sizes for solvated proteins. We also examined the weights of the 39 features in the trained models. The data show that the GNM-derived features make an important contribution. Among the features, the Shannon entropy of the low-frequency spectra, the GNM covariance and features related to protein shape carry high weights, whereas residue charges, the number of hydrogen bonds and the secondary-structure content of the protein contribute little. We will further combine ENM (direction of motion) with the RMSFSG model (size of motion) to predict possible protein conformations and thereby facilitate better protein-ligand docking.
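The shifting step and the two comparison metrics can be illustrated with a short sketch. The abstract does not define either operation precisely, so both definitions here are assumptions: "shifting" is taken to be an additive offset that matches the mean of the reference profile, and "percentage difference" is taken to be the mean over residues of |a − b| / b. The function names and the toy numbers are hypothetical.

```python
import numpy as np


def shift_to_level(rmsf, reference):
    """Add a constant so that rmsf has the same mean as reference.

    Assumed reading of "shifted to the absolute level of RMSFB".
    """
    rmsf = np.asarray(rmsf, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return rmsf + (reference.mean() - rmsf.mean())


def percentage_difference(pred, target):
    """Mean relative absolute difference (assumed definition)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target) / target))


def pearson(a, b):
    """Pearson correlation coefficient between two per-residue profiles."""
    return float(np.corrcoef(a, b)[0, 1])


# Toy per-residue profiles in A (made-up numbers, not thesis data).
rmsf_g = np.array([0.2, 0.5, 0.3, 0.9])     # e.g. a GNM-style profile
rmsf_ref = np.array([0.6, 0.8, 0.9, 1.4])   # e.g. a reference profile

rmsf_sg = shift_to_level(rmsf_g, rmsf_ref)

print(pearson(rmsf_g, rmsf_ref), pearson(rmsf_sg, rmsf_ref))
print(percentage_difference(rmsf_g, rmsf_ref),
      percentage_difference(rmsf_sg, rmsf_ref))
```

Under these assumptions a constant offset leaves the correlation with the reference unchanged while changing the absolute level, which is consistent with the abstract's description of profiles that correlate well but sit at the wrong magnitude.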
Table S1. Percentage difference between RMSFB multiplied by a scaling factor and RMSFMD.
