Publication Type
Journal Article
Version
publishedVersion
Publication Date
12-2018
Abstract
Background: Predicting disease causative genes (or simply, disease genes) has played critical roles in understanding the genetic basis of human diseases and further providing disease treatment guidelines. While various computational methods have been proposed for disease gene prediction, with the recent increasing availability of biological information for genes, it is highly motivated to leverage these valuable data sources and extract useful information for accurately predicting disease genes.
Results: We present an integrative framework called N2VKO to predict disease genes. Firstly, we learn the node embeddings from protein-protein interaction (PPI) network for genes by adapting the well-known representation learning method node2vec. Secondly, we combine the learned node embeddings with various biological annotations as rich feature representation for genes, and subsequently build binary classification models for disease gene prediction. Finally, as the data for disease gene prediction is usually imbalanced (i.e. the number of the causative genes for a specific disease is much less than that of its non-causative genes), we further address this serious data imbalance issue by applying oversampling techniques for imbalance data correction to improve the prediction performance. Comprehensive experiments demonstrate that our proposed N2VKO significantly outperforms four state-of-the-art methods for disease gene prediction across seven diseases.
Conclusions: In this study, we show that node embeddings learned from PPI networks work well for disease gene prediction, while integrating node embeddings with other biological annotations further improves the performance of classification models. Moreover, oversampling techniques for imbalance correction further enhance the prediction performance. In addition, the literature search of predicted disease genes also shows the effectiveness of our proposed N2VKO framework for disease gene prediction.
Keywords
Disease gene prediction, Node embeddings, Feature learning, Oversampling, Protein-protein interaction
Discipline
Databases and Information Systems | Systems Biology
Research Areas
Data Science and Engineering
Publication
BMC Systems Biology
Volume
12
Issue
Supp 9
First Page
31
Last Page
44
ISSN
1752-0509
Identifier
10.1186/s12918-018-0662-y
Publisher
BMC (part of Springer Nature)
Citation
ATA, Sezin Kircali; OU-YANG, Le; FANG, Yuan; KWOH, Chee-Keong; WU, Min; and LI, Xiao-Li.
Integrating node embeddings and biological annotations for genes to predict disease-gene associations. (2018). BMC Systems Biology. 12, (Supp 9), 31-44.
Available at: https://ink.library.smu.edu.sg/sis_research/4281
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1186/s12918-018-0662-y