Identification of Disease Genes Using Gene Expression and Protein–Protein Interaction Data



Fig. 6.1
Schematic flow diagram of the in silico approach for identification of disease genes



Selection of Differentially Expressed Genes

The first step of the integrated method selects a set $${\mathbb {S}}$$ of differentially expressed genes from the whole gene set $${\mathbb {C}}$$ of the given microarray gene expression data set. The gene set $${\mathbb {S}}$$ is selected using the MIMRMS method by maximizing both the relevance and the significance of the genes present in $${\mathbb {S}}$$. In general, microarray data may contain a number of irrelevant and insignificant genes, and the presence of such genes may reduce the useful information. On the other hand, a gene set with high relevance and high significance enhances the predictive capability. The current method uses the maximum relevance-maximum significance criterion, reported in Chap. 4, to select the relevant and significant genes from high dimensional microarray gene expression data sets.

Let $${{\mathbb C}=\{{\fancyscript{A}}_1,\ldots , {\fancyscript{A}}_i,\ldots ,{\fancyscript{A}}_j,\ldots ,{\fancyscript{A}}_m\}}$$ be the set of $$m$$ genes of a given microarray gene expression data set and let $${\mathbb S}$$ be the set of selected genes. Define $${\gamma _{{\fancyscript{A}}_i} ({\mathbb D})}$$ as the relevance of the gene $${{\fancyscript{A}}_i}$$ with respect to the class labels $${\mathbb D}$$, and $${\sigma _{\{{\fancyscript{A}}_i,{\fancyscript{A}}_j\}}({\mathbb D},{\fancyscript{A}}_j)}$$ as the significance of the gene $${{\fancyscript{A}}_j}$$ with respect to the set $${\{{\fancyscript{A}}_i,{\fancyscript{A}}_j\}}$$. The total relevance of all selected genes is $${{\fancyscript{J}}_\mathrm{relev}= \displaystyle {\sum _{{\fancyscript{A}}_i \in {\mathbb S}} \gamma _{{\fancyscript{A}}_i} ({\mathbb D})}}$$, while the total significance among the selected genes is $${{\fancyscript{J}}_\mathrm{signf}= \displaystyle {\sum _{{\fancyscript{A}}_i \ne {\fancyscript{A}}_j \in {\mathbb S}} \sigma _{\{{\fancyscript{A}}_i, {\fancyscript{A}}_j\}}({\mathbb D},{\fancyscript{A}}_j)}}$$. Hence, the problem of selecting a set $${\mathbb S}$$ of relevant and significant genes from the whole set $${\mathbb C}$$ of $$m$$ genes, as reported in Chap. 4, is equivalent to maximizing both $${{\fancyscript{J}}_\mathrm{relev}}$$ and $${{\fancyscript{J}}_\mathrm{signf}}$$, that is, to maximizing the objective function $${{\fancyscript{J}}= {\fancyscript{J}}_\mathrm{relev} + \beta {\fancyscript{J}}_\mathrm{signf}}$$, where $$\beta $$ is a weight parameter. To solve this problem, the greedy algorithm reported in Chap. 4 is used in the current study. Both the relevance and the significance of a gene are calculated based on the theory of mutual information [50], as described in Chap. 5, while the definition of significance is exactly the same as (4.7) of Chap. 4 or (9.7) of Chap. 9.
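The greedy maximization of $${\fancyscript{J}}$$ can be illustrated with a minimal Python sketch. Several points here are assumptions and are not taken from this chapter: the expression values are taken to be already discretized, the significance is assumed to be the gain in mutual information $$I(\{{\fancyscript{A}}_i,{\fancyscript{A}}_j\};{\mathbb D})-I({\fancyscript{A}}_i;{\mathbb D})$$ (the exact definition is the one given in Chap. 4), the significance term is averaged over the genes already selected, and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def significance(gene_i, gene_j, labels):
    """ASSUMED form of sigma: I({A_i, A_j}; D) - I(A_i; D)."""
    joint = list(zip(gene_i, gene_j))
    return mutual_information(joint, labels) - mutual_information(gene_i, labels)


def greedy_select(genes, labels, n_select, beta=1.0):
    """Greedy selection; `genes` maps a gene name to its discretized profile."""
    # Relevance gamma_{A_i}(D) of every gene with respect to the class labels.
    rel = {name: mutual_information(profile, labels)
           for name, profile in genes.items()}
    selected = [max(rel, key=rel.get)]  # start with the most relevant gene
    while len(selected) < min(n_select, len(genes)):
        def score(name):
            # Relevance plus beta times the average significance of the
            # candidate with respect to the genes chosen so far (assumption).
            sig = sum(significance(genes[s], genes[name], labels)
                      for s in selected)
            return rel[name] + beta * sig / len(selected)
        candidates = [g for g in genes if g not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

In this sketch a gene that merely duplicates an already selected gene contributes zero significance, so a less relevant but complementary gene can be preferred over it.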

Selection of Effective Gene Set I

In the second step, a set of effective genes is identified as disease genes. The effective gene set I, as mentioned in Fig. 6.1 and denoted by $${\mathbb {S}_\mathrm{GE}}$$, is a subset of $${\mathbb {S}}$$, and is defined as the gene set for which the prediction model or classifier attains its maximum classification accuracy. The K-nearest neighbor (K-NN) rule [18] is used here for evaluating the effectiveness of the reduced gene set for classification. A brief description of the K-NN rule is reported in Chap. 5. The value of K chosen for the current study is 1, while the dissimilarity between two samples is calculated as follows:


$$\begin{aligned} D(x_i,x_j)=1-\frac{x_i\cdot x_j}{||x_i||\cdot ||x_j||} \end{aligned}$$

(6.1)
where $$x_i$$ and $$x_j$$ are two vectors representing two tissue samples, $$x_i\cdot x_j$$ is their dot product, and $$||x_i||$$ and $$||x_j||$$ are their moduli. The smaller the $$D(x_i,x_j)$$, the more similar the two samples are.
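As a small illustration, the dissimilarity of (6.1) can be computed as follows. This is only a sketch using the Python standard library; the function name is hypothetical.

```python
from math import sqrt


def dissimilarity(x_i, x_j):
    """D(x_i, x_j) = 1 - (x_i . x_j) / (||x_i|| * ||x_j||), as in (6.1)."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm_i = sqrt(sum(a * a for a in x_i))
    norm_j = sqrt(sum(b * b for b in x_j))
    # A zero vector has no direction, so this raises ZeroDivisionError for it.
    return 1.0 - dot / (norm_i * norm_j)
```

Two samples pointing in the same direction give a value close to 0, orthogonal samples give 1, and samples pointing in opposite directions give 2.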

To calculate the classification accuracy of the K-NN rule, the jackknife test [45] is used, although both the independent data set test and the subsampling test can also be used. Jackknife estimators, however, make it possible to correct for bias and to estimate its statistical error. In the jackknife test, each sample in the given data set is singled out in turn and tested by the classifier trained on the remaining samples. During the process of jackknifing, both the training and testing data sets are actually open, and each sample is in turn moved between the two. The jackknife method is recommended as the standard for error bar calculation. In the unbiased situation, the jackknife and the usual error bars agree. Otherwise, the jackknife estimates are improvements, so that one cannot lose. In particular, the jackknife method solves the question of error propagation elegantly and with little effort. It is also well suited to data sets with a small number of training samples and a large number of features or genes. Therefore, in this work, the jackknife test is used to evaluate the prediction capability of the K-NN rule.
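The jackknife test with the 1-NN rule described above can be sketched in a few lines of Python. This is a minimal illustration using the dissimilarity of (6.1); the function names are hypothetical, and ties between equally near neighbors are broken by sample order, which is an assumption.

```python
from math import sqrt


def dissimilarity(x_i, x_j):
    """Dissimilarity of (6.1): one minus the cosine of the angle."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm_i = sqrt(sum(a * a for a in x_i))
    norm_j = sqrt(sum(b * b for b in x_j))
    return 1.0 - dot / (norm_i * norm_j)


def jackknife_accuracy(samples, labels):
    """Fraction of samples whose nearest other sample has the same label."""
    correct = 0
    for i, held_out in enumerate(samples):
        # Single out sample i and find its nearest neighbor among the rest.
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: dissimilarity(held_out, samples[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(samples)
```

Each sample serves once as the test sample and otherwise as a training sample, so no separate test set is needed.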

Selection of Effective Gene Set II

Finally, the effective gene set II, denoted by $${\mathbb {S}_\mathrm{GE+PPI}}$$, is obtained from the PPI data based on the set $${\mathbb {S}_\mathrm{GE}}$$, the effective gene set I. It has been observed that proteins with short distances to each other in the network are more likely to be involved in common biological functions [5, 31, 40, 48], and that interactive neighbors are more likely to have identical biological function than noninteractive ones [27, 32]. This is because the query protein and its interactive proteins may form a protein complex to perform a particular function or may be involved in the same pathway.

The Search Tool for the Retrieval of Interacting Genes (STRING) [49] is an online database resource that provides both experimental and predicted interaction information with a confidence score. In general, a graph is a very useful tool for studying complex biological systems, as it can provide intuitive insights and the overall structural properties, as demonstrated by various studies on a series of important biological topics [1, 3, 9–13, 54, 55]. In this work, after selecting the gene set $${\mathbb {S}_\mathrm{GE}}$$, a graph $$G(V,E)$$ is constructed with the PPI data from STRING using the gene set $${\mathbb {S}_\mathrm{GE}}$$. An edge is assigned in the graph between each pair of genes. The weight of each edge in $$E$$ is derived from the confidence score according to the relation $$\omega ^G=1000\times (1-\omega ^0)$$, where $$\omega ^G$$ is the weight in graph $$G$$ while $$\omega ^0$$ is the confidence score between the two proteins concerned. Accordingly, a functional protein association network with edge weights is generated. Dijkstra’s algorithm [15] is used to identify the shortest path in the graph from each of the selected differentially expressed genes of $${\mathbb {S}_\mathrm{GE}}$$ to the remaining genes of $${\mathbb {S}_\mathrm{GE}}$$. Finally, the genes lying on the shortest paths are collected and ranked according to their betweenness value. Let this set of genes be $${\mathbb {S}_\mathrm{PPI}}$$. The effective gene set II, that is, $${\mathbb {S}_\mathrm{GE+PPI}}$$, is the union of the sets $${\mathbb {S}_\mathrm{GE}}$$ and $${\mathbb {S}_\mathrm{PPI}}$$, that is, $${\mathbb {S}_\mathrm{GE+PPI}=\mathbb {S}_\mathrm{GE} \cup \mathbb {S}_\mathrm{PPI}}$$.
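The shortest path analysis can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure of the study: confidence scores are taken to lie in $$[0,1]$$, the betweenness of a gene is taken to be the number of seed-to-seed shortest paths passing through it, only one shortest path is kept per pair of seed genes, and all function and protein names are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_graph(interactions):
    """interactions: iterable of (protein_a, protein_b, confidence in [0, 1])."""
    graph = defaultdict(dict)
    for a, b, score in interactions:
        weight = 1000 * (1 - score)  # omega^G = 1000 * (1 - omega^0)
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns one shortest path as a node list, or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def ppi_genes(graph, seed_genes):
    """Rank non-seed genes by the number of seed-to-seed paths through them."""
    betweenness = defaultdict(int)
    for a, b in combinations(sorted(seed_genes), 2):
        if a not in graph or b not in graph:
            continue  # seed gene without any interaction record
        path = shortest_path(graph, a, b)
        for node in (path or [])[1:-1]:  # interior nodes only
            if node not in seed_genes:
                betweenness[node] += 1
    return sorted(betweenness.items(), key=lambda kv: (-kv[1], kv[0]))
```

With the ranked list in hand, the union described in the text is simply `set(seed_genes) | {gene for gene, _ in ppi_genes(graph, seed_genes)}`.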



6.3 Experimental Results


In the current integrated method, the disease genes are identified by using both gene expression and PPI data sets. The mutual information-based maximum relevance-maximum significance (MIMRMS) method is used to select differentially expressed genes from microarray data. On the other hand, the method proposed by Li et al. [33] uses the minimum redundancy-maximum relevance (mRMR) framework [16, 17]. However, one may also use the maximum relevance (MR) method. This section presents the comparative performance analysis of the MIMRMS, mRMR, and MR algorithms. The effectiveness of the different algorithms is shown using integrated data consisting of both colorectal gene expression and PPI data.

For the colorectal cancer expression data set, the 50 top-ranked genes are selected by each gene selection algorithm for further analysis. The jackknife test is used to compute the classification accuracy of the K-NN rule. Based on the accuracy, the effective gene set $${\mathbb {S}_\mathrm{GE}}$$ is identified for each gene selection algorithm. Next, the PPI network is constructed using the gene set $${\mathbb {S}_\mathrm{GE}}$$, and the effective gene set $${\mathbb {S}_\mathrm{GE+PPI}}$$ is obtained based on the shortest path analysis of the constructed PPI network. Finally, the statistical significance analysis is performed on each identified gene set with respect to both known cancer and colorectal cancer genes.
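The identification of $${\mathbb {S}_\mathrm{GE}}$$ from the ranked genes can be sketched as below. The sketch assumes that candidate gene sets are the top-$$k$$ prefixes of the ranking and that, among prefixes with equal accuracy, the shortest one is kept; neither detail is stated explicitly in the text. The function name and the `accuracy_of` callback (for example, a jackknife accuracy of the K-NN rule restricted to the given genes) are hypothetical.

```python
def effective_gene_set(ranked_genes, accuracy_of):
    """Return the shortest top-ranked prefix with maximal accuracy."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_genes) + 1):
        acc = accuracy_of(ranked_genes[:k])
        if acc > best_acc:  # strict comparison keeps the smallest k on ties
            best_k, best_acc = k, acc
    return ranked_genes[:best_k], best_acc
```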


6.3.1 Gene Expression Data Set Used


In this study, the gene expression data from the colorectal cancer study of Hinoue et al. [20] is used. The gene expression profiles of 26 colorectal tumors and matched histologically normal adjacent colonic tissue samples were retrieved from the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) with the accession number GSE25070. The numbers of genes and samples in this data set are 24526 and 52, respectively. The data set is preprocessed by standardizing each sample to zero mean and unit variance.
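The per-sample standardization can be written as a short Python sketch. The use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify it, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def standardize_sample(values):
    """Rescale one sample's expression values to zero mean and unit variance."""
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sd for v in values]
```

Applied to each of the 52 samples in turn, this puts all samples on a common scale before gene selection.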


6.3.2 Identification of Differentially Expressed Genes


Figure 6.2 presents the predictive accuracy of the K-NN rule obtained using the MR, mRMR, and MIMRMS algorithms. From the figure, it can be seen that the MR and mRMR methods attain 100 % classification accuracy with 8 and 6 genes, respectively, while the MIMRMS method achieves this accuracy with 20 genes. The statistical significance analysis reported next confirms that both the MR and mRMR methods overestimate the classification accuracy of the K-NN rule compared to the MIMRMS method. In effect, the MIMRMS method is able to find a more significant effective gene set than both the MR and mRMR methods.



Fig. 6.2
Classification accuracy obtained using different gene selection algorithms


6.3.3 Overlap with Known Disease-Related Genes


The gene set $${\mathbb {S}_\mathrm{GE}}$$ selected by the MIMRMS method is compared with the gene sets $${\mathbb {S}_\mathrm{GE}}$$ obtained by both the MR and mRMR methods, in terms of the degree of overlap with three gene lists, namely, LIST-1, LIST-2, and LIST-3. LIST-1 contains 742 cancer-related genes, which are collected from the Cancer Gene Census of the Sanger Centre, the Atlas of Genetics and Cytogenetics in Oncology [25], and the Human Protein Reference Database [29]. On the other hand, both LIST-2 and LIST-3 consist of colorectal cancer-related genes. LIST-2 is retrieved from the study of Sabates-Bellver et al. [47] and contains 438 colorectal cancer genes, while LIST-3 is prepared from the work of Nagaraj and Reverter [38] and consists of 134 colorectal cancer genes.

The MR method attains the highest predictive accuracy with eight genes. Hence, the selected gene set $${\mathbb {S}_\mathrm{GE}}$$ of the MR method contains eight genes, namely, GUCA2B, BEST2, TMIGD, CLDN8, PI16, SCNN1B, CLCA4, and ADH1B. Out of these eight genes, only SCNN1B overlaps with LIST-1. On the other hand, five genes, namely, GUCA2B, CLDN8, SCNN1B, CLCA4, and ADH1B, overlap with LIST-2, while only GUCA2B overlaps with LIST-3. Similarly, the gene set $${\mathbb {S}_\mathrm{GE}}$$ of the mRMR method consists of six genes, namely, CDH3, PI16, GUCA2B, HMGCLL1, BEST2, and SPIB, as the mRMR method achieves the highest predictive accuracy with these genes. However, none of them overlaps with LIST-1. Out of the six genes, three genes, namely, CDH3, GUCA2B, and SPIB, overlap with LIST-2, while two genes, namely, GUCA2B and SPIB, overlap with LIST-3.

On the other hand, the MIMRMS method provides 100 % classification accuracy of the K-NN rule with 20 genes. Hence, the gene set $${\mathbb {S}_\mathrm{GE}}$$ corresponding to the MIMRMS method consists of 20 genes, namely, GUCA2B, PI16, CILP, SCNN1B, IL8, CA4, BCHE, BEST2, CLCA4, PECI, TMEM37, AFF3, CLDN8, ADH1B, CA1, GNG7, NR3C2, SCARA5, WISP2, and TMIGD. Out of these 20 genes, three genes, namely, CA4, AFF3, and NR3C2, overlap with the LIST-1. On the other hand, eleven genes, namely, GUCA2B, SCNN1B, IL8, CA4, BCHE, CLCA4, AFF3, CLDN8, ADH1B, CA1, and SCARA5, overlap with the LIST-2, while GUCA2B, SCNN1B, IL8, and BCHE overlap with the genes of the LIST-3.
May 25, 2017 | Posted in GENERAL & FAMILY MEDICINE
