1st International and 10th National Iranian Conference on Bioinformatics
TM-Bench: A benchmark dataset for thermophilic-mesophilic proteins classification
Paper ID : 1192-ICB10
Authors:
Saber Mohammadi *1, Seyed Shahriar Arab1, Javad Zahiri2, Danial Khadivi1
1Tarbiat Modares University
2Department of Neuroscience, University of California San Diego, La Jolla, CA, USA
Abstract:
Machine learning approaches have recently become standard tools for solving biological problems. The thermal stability of thermophilic and hyperthermophilic proteins makes them suitable candidates for medical and industrial applications [1], [2]. Accordingly, various machine learning methods have been introduced to predict thermophilic proteins and discriminate them from their mesophilic counterparts based on sequence information. Most of these studies report accuracies above 90 percent, which appears optimistic. An inappropriate dataset can be the main source of this overestimation, and the lack of a gold-standard dataset makes it difficult to compare the various approaches. Here we introduce the TM-Bench dataset. Zhang and Fang made the first attempt to discriminate thermophilic and mesophilic proteins using pattern recognition methods [3]. Since then, a variety of approaches, such as SVM, artificial neural networks, decision trees, k-nearest neighbors, genetic algorithms, and Naive Bayes, have been adopted to classify thermophilic and non-thermophilic proteins based solely on protein sequence information [4]–[13]. In this study, we used the BacDive database [14] to extract a list of thermophilic and mesophilic organisms based on their optimum growth temperature. Next, we extracted the corresponding protein sequences from the Swiss-Prot database [15] and removed redundancies in the primary dataset with the CD-HIT tool [16]. Subsequently, balanced and imbalanced datasets were fed to the above-mentioned methods to re-evaluate their performance. Our results indicate that sensitivity, specificity, and accuracy on the balanced dataset were lower than previously reported, and that sensitivity drops dramatically on the imbalanced dataset.
Overall, Multi-Layer Perceptron and Logit Boost outperformed the other methods on the balanced dataset, with accuracies of 81% and 78%, sensitivities of 82% and 79%, and specificities of 80% and 77%, respectively.
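The three measures reported in this abstract can be computed from a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code. It assumes that thermophilic proteins are the positive class (label 1) and mesophilic proteins the negative class (label 0); the function name binary_metrics and the toy labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Return sensitivity, specificity, and accuracy for binary labels.

    Assumption: `positive` marks the thermophilic class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # thermophilic, predicted thermophilic
            else:
                fn += 1  # thermophilic, predicted mesophilic
        else:
            if pred == positive:
                fp += 1  # mesophilic, predicted thermophilic
            else:
                tn += 1  # mesophilic, predicted mesophilic

    total = tp + fn + tn + fp
    return {
        # Fraction of true positives that were recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Fraction of true negatives that were recovered.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # Fraction of all predictions that were correct.
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


# Toy imbalanced example (made-up labels, not TM-Bench data):
# 10 positives, of which 4 are predicted correctly;
# 90 negatives, of which 88 are predicted correctly.
toy_true = [1] * 10 + [0] * 90
toy_pred = [1] * 4 + [0] * 6 + [0] * 88 + [1] * 2
toy_metrics = binary_metrics(toy_true, toy_pred)

for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case most positives are missed, yet accuracy remains high because the negative class dominates. This is one reason why sensitivity and specificity are worth reporting alongside accuracy when a dataset is imbalanced.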
Keywords:
Machine learning, thermophilic protein, mesophilic protein, neural network, SVM
Status : Paper Accepted (Poster Presentation)