1st International and 10th National Iranian Conference on Bioinformatics
Comparison of Random Forest and Boosted Regression Tree in improving predicted affinity
Paper ID : 1072-ICB10
Authors:
Sara Mohammadi *1, Zahra Narimani2, Mitra Ashouri1, Mohammad Hossein Karimi‐Jafari1
1Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
2Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
Abstract:
One of the challenges in predicting protein-ligand affinity is how target flexibility should be considered in the docking procedure. Recently, ensemble docking has gained increasing attention as a promising solution to this problem [1]. However, it remains unclear how an optimal set of conformations can be chosen in order to reduce both the computational cost and the number of false positives in pose prediction [2]. To generate an efficient ensemble of CDK2 X-ray structures, a robust graph-based selection algorithm is proposed, with which 126 non-redundant CDK2 structures are selected for the ensemble dataset. A diverse set of ligands is extracted from ChEMBL and docked to the non-redundant receptor ensemble. A feature set of 512 features, including energetic features of the docking results as well as eight simple molecular features of the ligands, forms the final dataset for the machine learning (ensemble-based) affinity prediction method. The use of machine learning eliminates the need for classical scoring functions such as force-field, knowledge-based and empirical functions, which show limitations as the training data size increases [3-5]. In this study, the Random Forest (RF) and Boosted Regression Trees (BRT) ensemble learning algorithms are used for the final affinity prediction [6-7]. The impurity importance value of the RF method [8] is then used to identify the CDK2 structures that play a more important role in ensemble docking. Experiments show that docking to only those receptors selected by RF reduces both the error and the error skewness. Using these methods, $\mathrm{MSE}_{\mathrm{RF}} = 1.3$ and $R_{p,\mathrm{RF}} = 0.5$ are obtained for RF, and $\mathrm{MSE}_{\mathrm{BRT}} = 1.37$ and $R_{p,\mathrm{BRT}} = 0.52$ for BRT (hyperparameters set to their default values and models iterated 50 times). By letting machine learning select the important features, an accuracy of 1 kcal/mol is achieved, which is significantly better than methods not based on machine learning.
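The comparison described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor` as stand-ins for RF and BRT (the abstract does not name the software used). The data are synthetic and are not the CDK2 dataset; the sizes, the cutoff `top_k`, and the reading of Rp as the Pearson correlation are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study's dataset): one docking-score-like
# feature per hypothetical receptor structure, plus a toy affinity target
# that depends only on receptors 0 and 3.
n_ligands, n_receptors = 300, 20
X = rng.normal(size=(n_ligands, n_receptors))
y = X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n_ligands)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def evaluate(model, X_tr, X_te):
    """Fit the model and return (MSE, Pearson correlation) on the test split."""
    model.fit(X_tr, y_train)
    pred = model.predict(X_te)
    mse = mean_squared_error(y_test, pred)
    rp = np.corrcoef(y_test, pred)[0, 1]
    return mse, rp


# Default hyperparameters, as stated in the abstract.
rf = RandomForestRegressor(random_state=0)
brt = GradientBoostingRegressor(random_state=0)
results = {
    "RF": evaluate(rf, X_train, X_test),
    "BRT": evaluate(brt, X_train, X_test),
}

# Rank receptors by RF impurity importance and keep the top_k most important
# ones (top_k is a hypothetical cutoff, not a value from the study).
top_k = 5
selected = np.argsort(rf.feature_importances_)[::-1][:top_k]
results["RF (selected receptors)"] = evaluate(
    RandomForestRegressor(random_state=0),
    X_train[:, selected],
    X_test[:, selected],
)

for name, (mse, rp) in results.items():
    print(f"{name}: MSE={mse:.2f}, Rp={rp:.2f}")
```

In this toy setting, refitting on the columns in `selected` mimics docking only to the receptors that RF ranks as most important. Repeating the split with different seeds would correspond to the repeated model runs mentioned in the abstract.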
Keywords:
Ensemble docking; Ensemble learning; Random forest; Boosted Regression Tree; CDK2
Status : Paper Accepted (Oral Presentation)