1st International and 10th National Iranian Conference on Bioinformatics
Integrative analysis of DNA methylation and gene expression to identify gastric cancer diagnostic biomarkers via machine learning approaches
Paper ID : 1046-ICB10
Authors:
Maryam Sadat Hosseini1, Maryam Lotfi Shahreza2, Parvaneh Nikpour *1
1Department of Genetics and Molecular Biology, Faculty of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
2Department of Computer Engineering, Shahreza Campus, University of Isfahan, Iran
Abstract:
Gastric cancer is the second leading cause of cancer-related deaths. Most patients are diagnosed at a late stage because effective methods for early detection are lacking. Therefore, there is an urgent need for diagnostic biomarkers for this cancer. In the current study, we aimed to integrate DNA methylation and gene expression data from The Cancer Genome Atlas (TCGA) to find potential diagnostic biomarkers for gastric cancer via bioinformatics and machine learning approaches. DNA methylation and gene expression data for gastric cancer were downloaded from TCGA using the TCGAbiolinks R package. First, probes located on sex chromosomes, or containing single nucleotide polymorphisms (SNPs) or missing values, were removed. Then, we used the ChAMP R package to find differentially methylated CpGs (DMCs). CpGs with adjusted p-values < 0.05 and |delta β| > 0.25 were considered DMCs. The gene expression dataset was normalized via the DESeq2 R package. Genes were considered differentially expressed genes (DEGs) if they satisfied the thresholds of |log2 fold change| > 1 and adjusted p-values < 0.05. Since promoter hyper-methylation of tumor suppressor genes is one of the most important observations in cancer, we continued only with hyper-methylated CpGs (38 probes) located in the promoters of downregulated genes. Recursive feature elimination with cross-validation (RFECV) was used to find the features with the highest discriminative power between tumoral and normal samples, resulting in 4 final probes: cg10604646, cg22083047, cg07730329 and cg12741420. These features were then used to construct a logistic regression model. We validated these markers in an independent dataset from the GEO database (GSE30601). The area under the curve (AUC) of the model was 0.904, indicating that the four markers could achieve excellent performance in distinguishing tumoral from normal gastric samples.
Overall, the four high-performance diagnostic markers identified through machine learning approaches may improve precision management of gastric cancer, pending prospective clinical validation.
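The selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the study used the TCGAbiolinks, ChAMP and DESeq2 R packages on TCGA data, whereas this sketch runs on simulated beta values with hypothetical probe names (probe_0 to probe_7) and assumes scikit-learn for RFECV and logistic regression. The functions simulate_cohort, select_hyper_dmcs and fit_rfecv_logreg are illustrative helpers, and a Bonferroni-corrected t-test is swapped in as a simple stand-in for the adjusted p-values that ChAMP would report. Only the cut-offs (adjusted p < 0.05, delta beta > 0.25) are taken from the abstract.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def simulate_cohort(n_normal, n_tumor, seed):
    """Simulated beta values (hypothetical data, not TCGA or GEO)."""
    rng = np.random.default_rng(seed)
    y = np.array([0] * n_normal + [1] * n_tumor)  # 0 = normal, 1 = tumoral
    beta = rng.uniform(0.1, 0.4, size=(y.size, 8))
    beta[y == 1, :3] += 0.4  # first three probes are hyper-methylated in tumors
    X = pd.DataFrame(beta, columns=[f"probe_{i}" for i in range(8)])
    return X, y


def select_hyper_dmcs(X, y, p_cut=0.05, delta_cut=0.25):
    """Keep probes with adjusted p < p_cut and delta beta > delta_cut."""
    tumor, normal = X[y == 1], X[y == 0]
    delta_beta = tumor.mean() - normal.mean()
    p = ttest_ind(tumor.to_numpy(), normal.to_numpy(), axis=0).pvalue
    # Bonferroni correction: an assumption standing in for ChAMP's adjustment.
    adj_p = pd.Series(np.minimum(p * X.shape[1], 1.0), index=X.columns)
    keep = (adj_p < p_cut) & (delta_beta > delta_cut)
    return X.columns[keep].tolist()


def fit_rfecv_logreg(X, y, seed=0):
    """Pick probes by RFECV, then refit logistic regression on them."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=cv,
                     scoring="roc_auc")
    selector.fit(X, y)
    selected = X.columns[selector.support_].tolist()
    model = LogisticRegression(max_iter=1000).fit(X[selected], y)
    return selected, model


# Training cohort: filter candidate probes, then select features and fit.
X_train, y_train = simulate_cohort(40, 80, seed=0)
candidates = select_hyper_dmcs(X_train, y_train)
selected, model = fit_rfecv_logreg(X_train[candidates], y_train)

# Independent cohort: evaluate the fitted model with the AUC.
X_valid, y_valid = simulate_cohort(30, 60, seed=1)
auc = roc_auc_score(y_valid, model.predict_proba(X_valid[selected])[:, 1])
print("candidates:", candidates)
print("selected:", selected)
print("validation AUC:", round(auc, 3))
```

Evaluating on a separately simulated cohort mirrors the abstract's use of an independent GEO dataset: feature selection and model fitting see only the training cohort, so the reported AUC is not inflated by the selection step.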
Keywords:
Machine learning; DNA methylation; Gene expression; Diagnosis; Gastric cancer
Status : Paper Accepted (Oral Presentation)