, where $e_{ij}$ is the measured expression level of gene $i$ in the $j$th sample, and $m$ and $n$ represent the total number of genes and samples, respectively. Each row in the expression table corresponds to one particular gene and each column to a sample [17]. However, for most gene expression data, the number of training samples is still very small compared to the large number of genes involved in the experiments [17]. For example, the colon cancer data set consists of 62 samples and 2,000 genes, and the leukemia data set contains 72 samples and 7,129 genes. The number of samples is likely to remain small for many areas of investigation, especially for human data, due to the difficulty of collecting and processing microarray samples [17]. When the number of genes is significantly greater than the number of samples, it is possible to find biologically relevant correlations of gene behavior with the sample categories [37].
However, among the large number of genes, only a small fraction is effective for performing a certain task. Also, a small subset of genes is desirable in developing gene expression-based diagnostic tools for delivering precise, reliable, and interpretable results. With the gene selection results, the cost of biological experiments and decision making can be greatly reduced by analyzing only the marker genes. Hence, identifying a reduced set of the most relevant genes is the goal of gene selection. The small number of training samples and the large number of genes make gene selection a more relevant and challenging problem in gene expression-based classification. This is an important problem in machine learning and is referred to as feature selection [9, 31].
In this regard, different feature selection methods [4, 9, 10, 31, 32, 34, 39, 55, 60] can be used to select discriminative genes from microarray data sets. A detailed survey of different feature selection algorithms is reported in Chap. 4. There are also many gene selection algorithms developed to select differentially expressed genes [58]. One of the popular gene selection methods is significance analysis of microarrays [64], which assigns a score to each gene on the basis of the change in gene expression relative to the standard deviation of repeated measurements. Other notable gene selection algorithms are reported in [38, 40, 48, 52, 59, 72].
Due to the high dimensionality of microarray data sets, fast, scalable, and efficient feature selection techniques such as univariate filter methods [3, 13, 25, 33, 36, 62] have attracted the most attention. Univariate methods can be both parametric [2, 15, 49, 63] and non-parametric [14, 41, 51, 53, 54, 64]. The simplicity of the univariate techniques has made them dominant in the field of gene selection using microarray data. However, the univariate selection methods have certain restrictions and may lead to less accurate classifiers as they do not take gene-gene interactions into account. Also, the gene sets obtained by these methods contain redundant or similar genes.
The application of multivariate filter methods ranges from simple bivariate interactions [5] to more advanced solutions exploring higher order interactions, such as correlation-based feature selection [20, 61, 68, 73] and several variants of the Markov blanket filter method [16, 45, 70]. There also exist a number of feature selection algorithms that group correlated features to reduce the redundancy among the selected features [7, 8, 20, 21, 26, 30, 47]. The uncorrelated shrunken centroid [74] and minimum redundancy-maximum relevance (mRMR) [10, 55] algorithms are two important multivariate filter procedures, highlighting the advantage of using multivariate methods over univariate procedures in the gene expression domain. The mRMR method selects a subset of genes from the whole gene set by maximizing the relevance and minimizing the redundancy of the selected genes. An f-information measure-based method has been reported in [43] for selection of discriminative genes from microarray data using the mRMR criterion. In this regard, it should be noted that the mRMR criterion is also used in [23] and [44] for gene selection, based on the concepts of neighborhood mutual information and fuzzy-rough sets, respectively.
Gene selection using wrapper or embedded methods offers an alternative way to perform multivariate gene subset selection, incorporating the classifier's bias into the search and thus offering an opportunity to construct more accurate classifiers. In the context of microarray analysis, most wrapper methods use population-based, randomized search heuristics [4, 29, 35, 50], although some methods use sequential search techniques [24, 71]. An interesting hybrid filter-wrapper approach is introduced in [57], integrating a univariately preordered gene ranking with an incrementally augmenting wrapper method. The embedded capacity of several classifiers to discard input features, and thus propose a subset of discriminative genes, has been exploited by several authors. Examples include the use of random forests, a classifier that combines many single decision trees, in an embedded way to calculate the importance of each gene [6, 28, 65]. Another line of embedded feature selection techniques uses the weights of each feature in linear classifiers such as the support vector machine [19] and logistic regression [42]. These weights reflect the relevance of each gene in a multivariate way, and thus allow for the removal of genes with very small weights.
In the gene selection process, an optimal gene subset is always relative to a certain criterion. In general, different criteria may lead to different optimal gene subsets. However, every criterion tries to measure the discriminating ability of a gene or a subset of genes to distinguish different class labels. To measure the gene-class relevance, different statistical and information theoretic measures such as the t-test, F-test [10, 34], entropy, information gain, mutual information [10, 55], normalized mutual information [39], and f-information measures [43] are typically used, and the same or a different metric, such as mutual information, f-information, the $L_1$ distance, Euclidean distance, or Pearson's correlation coefficient [10, 27, 55], is employed to calculate the gene-gene redundancy. However, as the t-test, F-test, Euclidean distance, and Pearson's correlation depend on the actual gene expression values of the microarray data, they are very sensitive to noise or outliers in the data set [10, 22, 27, 55]. On the other hand, as information measures depend only on the probability distribution of a random variable rather than on its actual values, they are more effective in evaluating both gene-class relevance and gene-gene redundancy [18, 39, 55].
Measures of the distance between a joint probability distribution and the product of the corresponding marginal distributions are information measures [43, 56, 66]. Information measures constitute a subclass of the divergence measures, which are measures of the distance between two arbitrary distributions. A specific class of information (respectively, divergence) measures, of which mutual information is a member, is formed by the f-information (respectively, f-divergence) measures [43, 56, 66]. In this chapter, several f-information measures are compared with mutual information by applying them to the selection of genes from microarray data. The performance of different information measures is studied using the predictive accuracy of the naive Bayes classifier, the K-nearest neighbor rule, and the support vector machine. The effectiveness of different f-information measures, along with a comparison with mutual information, is demonstrated on three cancer microarray data sets, namely, the breast cancer, leukemia, and colon cancer data sets.
The structure of the rest of this chapter is as follows: The problem of gene selection from microarray data sets using several information theoretic measures is described in Sect. 5.2, along with a brief description of different f-information measures. A few case studies and a comparison among different f-information measures are reported in Sect. 5.3. Concluding remarks are given in Sect. 5.4.
5.2 Gene Selection Using f-Information Measures
In microarray data analysis, the data set may contain a number of redundant genes with low relevance to the classes. The presence of such redundant and nonrelevant genes leads to a reduction in the useful information. Ideally, the selected genes should have high relevance with the classes, while the redundancy among them should be as low as possible. The genes with high relevance are expected to be able to predict the classes of the samples. However, the prediction capability is reduced if many redundant genes are selected. In contrast, a data set that contains genes not only with high relevance with respect to the classes but also with low mutual redundancy is more effective in its prediction capability. Hence, to assess the effectiveness of the genes, both relevance and redundancy need to be measured quantitatively. In this chapter, the minimum redundancy-maximum relevance framework of Ding and Peng [10, 55] is used to select a set of relevant and nonredundant genes from microarray gene expression data sets.
5.2.1 Minimum Redundancy-Maximum Relevance Criterion
Let $G = \{g_1, \ldots, g_i, \ldots, g_j, \ldots, g_m\}$ be the set of $m$ genes of a given microarray gene expression data set and $S \subseteq G$ is the set of selected genes. Define $\hat{f}(g_i, C)$ as the relevance of the gene $g_i$ with respect to the class label $C$, while $\tilde{f}(g_i, g_j)$ denotes the redundancy between two genes $g_i$ and $g_j$. The total relevance of all selected genes is, therefore, given by
$$J_{\mathrm{relev}} = \sum_{g_i \in S} \hat{f}(g_i, C),$$
while the total redundancy among the selected genes is
$$J_{\mathrm{redun}} = \sum_{g_i, g_j \in S} \tilde{f}(g_i, g_j).$$
Therefore, the problem of selecting a set $S$ of relevant and nonredundant genes from the whole set $G$ of $m$ genes is equivalent to maximizing $J_{\mathrm{relev}}$ and minimizing $J_{\mathrm{redun}}$, that is, to maximizing the objective function $J$, where
$$J = J_{\mathrm{relev}} - J_{\mathrm{redun}} = \sum_{g_i \in S} \hat{f}(g_i, C) - \sum_{g_i, g_j \in S} \tilde{f}(g_i, g_j).$$
To solve the above problem, a greedy algorithm is widely used [10, 55], as outlined in the following steps and sketched in code below:
1. Initialize $G \leftarrow \{g_1, \ldots, g_m\}$ and $S \leftarrow \emptyset$.
2. Calculate the relevance $\hat{f}(g_i, C)$ of each gene $g_i \in G$.
3. Select the gene $g_i$ as the most relevant gene that has the highest relevance $\hat{f}(g_i, C)$. In effect, $S \leftarrow \{g_i\}$ and $G \leftarrow G \setminus \{g_i\}$.
4. Repeat the following two steps until the desired number of genes is selected.
5. Calculate the redundancy between the already selected genes of $S$ and each of the remaining genes of $G$.
6. From the remaining genes of $G$, select the gene $g_j$ that maximizes
$$\hat{f}(g_j, C) - \frac{1}{|S|} \sum_{g_i \in S} \tilde{f}(g_i, g_j).$$
As a result of that, $S \leftarrow S \cup \{g_j\}$ and $G \leftarrow G \setminus \{g_j\}$.
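A minimal sketch of this greedy procedure in Python is given below. It is only an illustration of the selection loop, not the chapter's C implementation; the array layout, the function name, and the use of the mean redundancy with the already selected genes are assumptions.

```python
import numpy as np

def greedy_mrmr(relevance, redundancy, n_select):
    """Greedy minimum redundancy-maximum relevance gene selection.

    relevance  : 1-D array, relevance of each gene with respect to the class label
    redundancy : 2-D symmetric array, pairwise redundancy between genes
    n_select   : number of genes to select
    """
    remaining = set(range(len(relevance)))
    # Step 3: start with the single most relevant gene.
    first = int(np.argmax(relevance))
    selected = [first]
    remaining.discard(first)
    # Steps 4-6: repeatedly add the gene that maximizes
    # relevance minus the mean redundancy with the already selected genes.
    while len(selected) < n_select and remaining:
        scores = {
            j: relevance[j] - np.mean([redundancy[i, j] for i in selected])
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.discard(best)
    return selected

# Toy usage with random relevance/redundancy scores (illustration only).
rng = np.random.default_rng(0)
rel = rng.random(20)
red = rng.random((20, 20))
red = (red + red.T) / 2
print(greedy_mrmr(rel, red, 5))
```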
5.2.2 f-Information Measures for Gene Selection
In this chapter, different f-information measures are reported to compute both gene-class relevance and gene-gene redundancy for selection of genes from microarray data. The f-information measures calculate the distance between a given joint probability distribution of two variables and the joint probability distribution obtained when the variables are independent, that is, the product of their marginal distributions. In the following analysis, it is assumed that all probability distributions are complete, that is, they sum to one.
The extent to which two probability distributions differ can be expressed by a so-called measure of divergence. Such a measure will reach a minimum value when the two probability distributions are identical, and the value increases with increasing disparity between the two distributions. A specific class of divergence measures is the set of f-divergence measures [56, 66]. For two discrete probability distributions $\mathcal{P} = \{p_i \mid i = 1, \ldots, n\}$ and $\mathcal{Q} = \{q_i \mid i = 1, \ldots, n\}$, the f-divergence is defined as
$$f(\mathcal{P} \,\|\, \mathcal{Q}) = \sum_{i} q_i \, f\!\left(\frac{p_i}{q_i}\right). \quad (5.6)$$
The demands on the function $f$ are that
1. $f(1) = 0$;
2. $f$ is continuous and convex on $[0, \infty)$;
3. $f$ is finite on $(0, \infty)$; and
4. $f$ is strictly convex at some point $x \in (0, \infty)$.
The following definition completes the definition of f-divergence for the two cases for which (5.6) is not defined:
$$0 \cdot f\!\left(\frac{0}{0}\right) = 0, \qquad 0 \cdot f\!\left(\frac{p}{0}\right) = p \cdot \lim_{x \to \infty} \frac{f(x)}{x}.$$
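As a check that mutual information is indeed a member of this family (a standard fact from the f-divergence literature, added here only as a worked illustration), choose $f(x) = x \log x$ in (5.6) and let $\mathcal{P}$ be the joint distribution of two variables with $\mathcal{Q}$ the product of their marginals $\mathcal{P}_1$ and $\mathcal{P}_2$:
$$\sum_{x, y} \mathcal{P}_1(x)\mathcal{P}_2(y)\, \frac{\mathcal{P}(x, y)}{\mathcal{P}_1(x)\mathcal{P}_2(y)} \log \frac{\mathcal{P}(x, y)}{\mathcal{P}_1(x)\mathcal{P}_2(y)} = \sum_{x, y} \mathcal{P}(x, y) \log \frac{\mathcal{P}(x, y)}{\mathcal{P}_1(x)\mathcal{P}_2(y)},$$
which is exactly the mutual information between the two variables.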
5.2.2.1 V-Information
One of the simplest measures of dependence can be obtained using the function $f(x) = |x - 1|$, which results in the V-information [56, 66]
$$V(\mathcal{P}) = \sum_{x, y} \left| \mathcal{P}(x, y) - \mathcal{P}_1(x)\,\mathcal{P}_2(y) \right|,$$
where $\mathcal{P}_1$, $\mathcal{P}_2$, and $\mathcal{P}$ represent the two marginal probability distributions and their joint probability distribution, respectively. Hence, the V-information calculates the absolute distance between the joint probability of two variables and the product of their marginal probabilities.
5.2.2.2 $I_\alpha$-Information
The $I_\alpha$-information is defined as [56, 66]
$$I_\alpha(\mathcal{P} \,\|\, \mathcal{Q}) = \frac{1}{\alpha(\alpha - 1)} \left( \sum_{i} \frac{p_i^\alpha}{q_i^{\alpha - 1}} - 1 \right)$$
for $\alpha \neq 0$ and $\alpha \neq 1$. The class of $I_\alpha$-information includes mutual information, which equals $I_\alpha$ for the limit $\alpha \rightarrow 1$. That is,
$$I_1(\mathcal{P} \,\|\, \mathcal{Q}) = \lim_{\alpha \to 1} I_\alpha(\mathcal{P} \,\|\, \mathcal{Q}) = \sum_{i} p_i \log \frac{p_i}{q_i}.$$
5.2.2.3 $M_\alpha$-Information
The $M_\alpha$-information, defined by Matusita [56, 66], is as follows:
$$f(x) = \left| x^\alpha - 1 \right|^{1/\alpha}.$$
When applying this function in the definition of an f-information measure, the resulting $M_\alpha$-information measures are
$$M_\alpha(\mathcal{P} \,\|\, \mathcal{Q}) = \sum_{i} \left| p_i^\alpha - q_i^\alpha \right|^{1/\alpha}$$
for $0 < \alpha \leq 1$. These constitute a generalized version of V-information. That is, the $M_\alpha$-information is identical to V-information for $\alpha = 1$.
5.2.2.4 $\chi^\alpha$-Information
The class of $\chi^\alpha$-information measures, proposed by Liese and Vajda [66], is as follows:
$$\chi^\alpha(\mathcal{P} \,\|\, \mathcal{Q}) = \sum_{i} \frac{|p_i - q_i|^\alpha}{q_i^{\alpha - 1}} \quad \text{for } \alpha \geq 1.$$
5.2.2.5 Renyi Distance
The Renyi distance, a measure of information of order $\alpha$ [56, 66], can be defined as
$$R_\alpha(\mathcal{P} \,\|\, \mathcal{Q}) = \frac{1}{\alpha - 1} \log \sum_{i} \frac{p_i^\alpha}{q_i^{\alpha - 1}}$$
for $\alpha \neq 1$ and $\alpha > 0$. It reaches its minimum value when $\mathcal{P}$ and $\mathcal{Q}$ are identical, in which case the summation reduces to $\sum_i p_i$. As complete probability distributions are assumed, the sum is one and the minimum value of the measure is, therefore, equal to zero. The limit of Renyi's measure for $\alpha$ approaching 1 equals $I_1(\mathcal{P} \,\|\, \mathcal{Q})$, which is the mutual information.
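As a concrete illustration of how such quantities can be computed from discretized data, the following Python sketch evaluates the V-, $I_\alpha$-, and $R_\alpha$-information between two discrete variables from their joint count table. The function name, the handling of zero cells, and the toy counts are assumptions made to keep the example self-contained; it is not the chapter's C implementation.

```python
import numpy as np

def f_information_measures(joint_counts, alpha=2.0):
    """Compute V-, I_alpha-, and R_alpha-information from a joint count table.

    joint_counts : 2-D array of co-occurrence counts of two discrete variables,
                   e.g., discretized gene states (rows) versus class labels (columns).
    """
    p = joint_counts / joint_counts.sum()     # joint distribution P
    p1 = p.sum(axis=1, keepdims=True)         # marginal of the row variable
    p2 = p.sum(axis=0, keepdims=True)         # marginal of the column variable
    q = (p1 * p2).ravel()                     # distribution under independence Q
    p = p.ravel()
    keep = q > 0                              # drop cells where Q is zero (simplification)
    p, q = p[keep], q[keep]

    v = np.abs(p - q).sum()                                              # V-information
    i_alpha = (np.sum(p**alpha / q**(alpha - 1)) - 1) / (alpha * (alpha - 1))
    r_alpha = np.log(np.sum(p**alpha / q**(alpha - 1))) / (alpha - 1)    # Renyi distance
    return v, i_alpha, r_alpha

# Toy usage: counts of three gene states against two classes (illustration only).
counts = np.array([[12.0, 3.0], [5.0, 6.0], [2.0, 14.0]])
print(f_information_measures(counts, alpha=2.0))
```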
5.2.3 Discretization
In microarray gene expression data sets, the class labels of samples are represented by discrete symbols, while the expression values of genes are continuous. Hence, to measure both the gene-class relevance of a gene with respect to the class labels and the gene-gene redundancy between two genes using information theoretic measures such as mutual information [10, 55], normalized mutual information [39], and f-information measures [43], the continuous expression values of a gene are divided into several discrete partitions. The a priori (marginal) probabilities and the joint probabilities are then calculated to compute both gene-class relevance and gene-gene redundancy using the definitions for discrete cases. In this chapter, the discretization method reported in [10, 43, 55] is employed to discretize the continuous gene expression values. The expression values of a gene are discretized using the mean $\mu$ and standard deviation $\sigma$ computed over the $n$ expression values of that gene: any value larger than $(\mu + \sigma/2)$ is transformed to state 1; any value between $(\mu - \sigma/2)$ and $(\mu + \sigma/2)$ is transformed to state 0; and any value smaller than $(\mu - \sigma/2)$ is transformed to state $-1$. These three states correspond to the over-expression, baseline, and under-expression of genes.
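A minimal sketch of this three-state discretization in Python follows. The $\mu \pm \sigma/2$ thresholds mirror the description above (as used in the cited mRMR studies), while the function name and the toy profile are assumptions.

```python
import numpy as np

def discretize_gene(expression, k=0.5):
    """Map the continuous expression values of one gene to states -1, 0, +1.

    Values above  mu + k*sigma -> +1 (over-expression)
    Values below  mu - k*sigma -> -1 (under-expression)
    Values in between          ->  0 (baseline)
    """
    mu, sigma = expression.mean(), expression.std()
    states = np.zeros(expression.shape, dtype=int)
    states[expression > mu + k * sigma] = 1
    states[expression < mu - k * sigma] = -1
    return states

# Toy usage on a single gene profile across samples (illustration only).
profile = np.array([2.1, 0.4, -1.3, 0.0, 3.2, -2.5, 0.7])
print(discretize_gene(profile))
```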
5.3 Experimental Results
The performance of different f-information measures is extensively compared with that of mutual information and normalized mutual information. Based on the argumentation given in Sect. 5.2.2, the following information measures are chosen to include in the study:
1. $I_\alpha$- and $M_\alpha$-information measures for $0 < \alpha < 1$ and $\alpha > 1$;
2. mutual information ($I_1$- and $R_1$-information);
3. $\chi^\alpha$-information measure for $\alpha > 1$;
4. $R_\alpha$-information measure for $\alpha > 0$, $\alpha \neq 1$.
In this chapter, these measures are applied to calculate both gene-class relevance and gene-gene redundancy. The minimum redundancy-maximum relevance (mRMR) criterion [10, 55] is used for gene selection. The source code of the f-information based mRMR (f-mRMR) algorithm [43], written in the C language, is available at http://www.isical.ac.in/~bibl/results/fmRMR/fmRMR.html. All the information measures are implemented in C and run in a LINUX environment on a machine with a Pentium IV 3.2 GHz processor, 1 MB cache, and 1 GB RAM.
To analyze the performance of different f-information measures, experiments are performed on three microarray gene expression data sets. The major metric for evaluating the performance of different measures is the classification accuracy of the support vector machine (SVM) [67], the K-nearest neighbor (K-NN) rule [12], and the naive Bayes (NB) classifier [12].
5.3.1 Gene Expression Data Sets
In this chapter, three public data sets of cancer microarrays are used. Since binary classification is a typical and fundamental issue in diagnostic and prognostic prediction of cancer, different f-information measures are compared using the following binary-class data sets.
5.3.1.1 Breast Cancer Data Set
The breast cancer data set contains expression levels of 7,129 genes in 49 breast tumor samples [69]. The samples are classified according to their estrogen receptor (ER) status: 25 samples are ER positive, while the other 24 samples are ER negative.
5.3.1.2 Leukemia Data Set
The leukemia data set is an Affymetrix high-density oligonucleotide array that contains 7,070 genes and 72 samples from two classes of leukemia: 47 acute lymphoblastic leukemia and 25 acute myeloid leukemia [17]. The data set is publicly available at http://www.broad.mit.edu/cgibin/cancer/datasets.cgi.
5.3.1.3 Colon Cancer Data Set
The colon cancer data set contains expression levels of 2,000 genes in 62 samples, of which 40 are colon tumor samples and 22 are normal colon tissue samples.
5.3.2 Class Prediction Methods
The SVM [67], K-NN rule [12], and NB classifier [12] are used to evaluate the performance of different f-information measures. A brief introduction to the SVM is reported in Chaps. 3 and 4. In this work, linear kernels are used in the SVM to construct the decision boundary. Descriptions of both the K-NN rule and the NB classifier are given next.
5.3.2.1 K-Nearest Neighbor Rule
The K-nearest neighbor (K-NN) rule [12] is used for evaluating the effectiveness of the reduced feature set for classification. It classifies a sample based on its closest training samples in the feature space: the sample is assigned, by a majority vote of its K nearest neighbors, to the class most common amongst them. The value of K chosen for the K-NN rule is the square root of the number of samples in the training set.
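A small sketch of this rule using scikit-learn is given below, with K set to the square root of the training set size as described above; the random toy data and the function name are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_train, y_train, X_test):
    """Predict classes with K-NN, where K is the square root of the training set size."""
    k = max(1, int(round(np.sqrt(len(X_train)))))
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)

# Toy usage with random data (illustration only).
rng = np.random.default_rng(1)
X_tr, y_tr = rng.random((40, 5)), rng.integers(0, 2, 40)
X_te = rng.random((3, 5))
print(knn_predict(X_tr, y_tr, X_te))
```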
5.3.2.2 Naive Bayes Classifier
The naive Bayes (NB) classifier [12] is one of the oldest classifiers. It is obtained by using the Bayes rule and assuming that features or variables are independent of each other given the class. For the $t$th sample $x_t$ with $d$ gene expression levels $(x_{t1}, \ldots, x_{td})$ for the $d$ genes, the posterior probability that $x_t$ belongs to class $c$ is
$$p(c \mid x_t) \propto p(c) \prod_{i=1}^{d} p(x_{ti} \mid c),$$
where $p(x_{ti} \mid c)$ are conditional tables or conditional densities estimated from the training examples.
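The sketch below illustrates this computation for discretized gene states with simple conditional count tables. The Laplace smoothing, the function names, and the toy data are assumptions added to keep the example runnable; they are not part of the original description.

```python
import numpy as np

def train_nb(X, y, states=(-1, 0, 1)):
    """Estimate class priors p(c) and per-gene conditional tables p(state | c)."""
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    cond = {}
    for c in classes:
        Xc = X[y == c]
        tables = []
        for g in range(X.shape[1]):
            counts = {s: np.sum(Xc[:, g] == s) + 1 for s in states}  # Laplace smoothing
            total = sum(counts.values())
            tables.append({s: counts[s] / total for s in states})
        cond[c] = tables
    return priors, cond

def predict_nb(x, priors, cond):
    """Return the class maximizing log p(c) + sum_i log p(x_i | c)."""
    scores = {
        c: np.log(priors[c]) + sum(np.log(cond[c][g][x[g]]) for g in range(len(x)))
        for c in priors
    }
    return max(scores, key=scores.get)

# Toy usage with discretized gene states (illustration only).
X = np.array([[1, 0, -1], [1, 1, 0], [-1, 0, 1], [0, -1, 1]])
y = np.array([0, 0, 1, 1])
priors, cond = train_nb(X, y)
print(predict_nb(np.array([1, 0, 0]), priors, cond))
```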
5.3.3 Performance Analysis
The experimental results on three microarray data sets are presented in Tables 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8 and 5.9. Subsequent discussions analyze the results with respect to the prediction accuracy of the NB, SVM, and K-NN classifiers. Tables 5.1, 5.2, 5.4, 5.5, 5.7, and 5.8 provide the performance of different f-information measures using the NB classifier and the SVM, respectively, while Tables 5.3, 5.6, and 5.9 show the results using the K-NN rule. The values of $\alpha$ for the f-information measures investigated are 0.2, 0.5, 0.8, 1.5, 2.0, 3.0, and 4.0. Some measures resemble mutual information for $\alpha = 1$ ($I_1$ and $R_1$), and some resemble another measure ($\chi^1$ and $M_1$ equal the V-information). To compute the prediction accuracy of the NB, SVM, and K-NN, leave-one-out cross-validation is performed on each gene expression data set. The number of genes selected ranges from 2 to 50, and each data set is preprocessed by standardizing each sample to zero mean and unit variance.
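A compact sketch of this evaluation protocol (per-sample standardization followed by leave-one-out cross-validation) is shown below; the use of scikit-learn and a linear-kernel SVM here is an illustrative choice, not the chapter's actual experimental code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loocv_accuracy(X, y):
    """Leave-one-out accuracy after standardizing each sample to zero mean, unit variance."""
    Xs = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(Xs):
        clf = SVC(kernel="linear")
        clf.fit(Xs[train_idx], y[train_idx])
        correct += int(clf.predict(Xs[test_idx])[0] == y[test_idx][0])
    return correct / len(y)

# Toy usage with random data (illustration only).
rng = np.random.default_rng(2)
X, y = rng.random((30, 10)), rng.integers(0, 2, 30)
print(loocv_accuracy(X, y))
```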
Table 5.1 Performance on breast cancer data set using NB classifier
| 95.9 | 98.0 | 98.0 | 98.0 | 100 | 100 | 100 | 100 | 100 | 100 | 98.0 | 98.0 |
| 95.9 | 98.0 | 98.0 | 98.0 | 100 | 100 | 100 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 |
| 95.9 | 100 | 95.9 | 98.0 | 98.0 | 98.0 | 95.9 | 93.9 | 91.8 | 91.8 | 89.8 | 89.8 |
| 95.9 | 98.0 | 95.9 | 100 | 98.0 | 93.9 | 93.9 | 89.8 | 87.8 | 87.8 | 87.8 | 87.8 |
| 95.9 | 98.0 | 95.9 | 93.9 | 93.9 | 91.8 | 91.8 | 89.8 | 85.7 | 83.7 | 83.7 | 81.6 |
| 95.9 | 95.9 | 95.9 | 93.9 | 91.8 | 91.8 | 91.8 | 87.8 | 87.8 | 83.7 | 83.7 | 81.6 |
| 95.9 | 95.9 | 95.9 | 93.9 | 91.8 | 91.8 | 89.8 | 87.8 | 87.8 | 83.7 | 83.7 | 81.6 |
| 95.9 | 95.9 | 95.9 | 91.8 | 91.8 | 89.8 | 87.8 | 87.8 | 85.7 | 83.7 | 81.6 | 81.6 |
| 85.7 | 95.9 | 95.9 | 98.0 | 100 | 100 | 100 | 100 | 98.0 | 98.0 | 98.0 | 98.0 |
| 95.9 | 98.0 | 98.0 | 98.0 | 100 | 100 | 100 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 |
| 95.9 | 93.9 | 95.9 | 98.0 | 93.9 | 91.8 | 91.8 | 87.8 | 85.7 | 85.7 | 85.7 | 79.6 |
| 87.8 | 89.8 | 83.7 | 85.7 | 89.8 | 87.8 | 87.8 | 83.7 | 85.7 | 85.7 | 83.7 | 83.7 |
| 95.9 | 98.0 | 95.9 | 98.0 | 93.9 | 89.8 | 89.8 | 85.7 | 83.7 | 81.6 | 79.6 | 79.6 |
| 95.9 | 95.9 | 95.9 | 93.9 | 91.8 | 91.8 | 91.8 | 87.8 | 87.8 | 83.7 | 83.7 | 81.6 |
| 95.9 | 95.9 | 95.9 | 93.9 | 93.9 | 93.9 | 93.9 | 89.8 | 87.8 | 85.7 | 83.7 | 83.7 |
| 95.9 | 98.0 | 100 | 95.9 | 95.9 | 93.9 | 93.9 | 89.8 | 85.7 | 85.7 | 85.7 | 85.7 |
| 95.9 | 98.0 | 98.0 | 98.0 | 100 | 100 | 100 | 100 | 100 | 100 | 98.0 | 98.0 |
| 95.9 | 98.0 | 98.0 | 98.0 | 100 | 100 | 100 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 |
| 95.9 | 100 | 95.9 | 95.9 | 98.0 | 98.0 | 95.9 | 93.9 | 91.8 | 91.8 | 89.8 | 89.8 |
| 95.9 | 98.0 | 95.9 | 100 | 98.0 | 93.9 | 93.9 | 89.8 | 87.8 | 87.8 | 87.8 | 87.8 |
| 95.9 | 98.0 | 95.9 | 93.9 | 91.8 | 91.8 | 91.8 | 89.8 | 87.8 | 83.7 | 83.7 | 83.7 |
| 95.9 | 91.8 | 95.9 | 95.9 | 91.8 | 91.8 | 91.8 | 89.8 | 85.7 | 83.7 | 83.7 | 81.6 |
| 93.9 | 89.8 | 93.9 | 93.9 | 93.9 | 91.8 | 91.8 | 91.8 | 89.8 | 85.7 | 83.7 | 79.6 |
| 93.9 | 93.9 | 91.8 | 91.8 | 91.8 | 91.8 | 89.8 | 89.8 | 87.8 | 83.7 | 83.7 | 81.6 |
| 95.9 | 98.0 | 98.0 | 100 | 95.9 | 93.9 | 93.9 | 91.8 | 91.8 | 89.8 | 89.8 | 89.8 |
Table 5.2 Performance on breast cancer data set using SVM
| 81.6 | 100 | 95.9 | 98.0 | 98.0 | 100 | 95.9 | 95.9 | 98.0 | 98.0 | 98.0 | 95.9 |
| 81.6 | 100 | 100 | 100 | 95.9 | 95.9 | 100 | 95.9 | 95.9 | 95.9 | 98.0 | 98.0 |
| 81.6 | 98.0 | 100 | 100 | 98.0 | 95.9 | 95.9 | 98.0 | 98.0 | 95.9 | 98.0 | 95.9 |
| 81.6 | 98.0 | 100 | 100 | 98.0 | 95.9 | 95.9 | 93.9 | 93.9 | 93.9 | 95.9 | 95.9 |
| 85.7 | 91.8 | 98.0 | 100 | 98.0 | 100 | 95.9 | 95.9 | 95.9 | 95.9 | 93.9 | 93.9 |
| 85.7 | 95.9 | 98.0 | 100 | 100 | 100 | 95.9 | 95.9 | 95.9 | 93.9 | 93.9 | 93.9 |
| 85.7 | 95.9 | 98.0 | 100 | 100 | 95.9 | 95.9 | 95.9 | 95.9 | 95.9 | 93.9 | 93.9 |
| 85.7 | 89.8 | 100 | 98.0 | 100 | 95.9 | 95.9 | 95.9 | 95.9 | 95.9 | 95.9 | 95.9 |
| 77.6 | 95.9 | 91.8 | 89.8 | 87.8 | 93.9 | 93.9 | 95.9 | 95.9 | 95.9 | 95.9 | 98.0 |
| 81.6 | 100 | 100 | 100 | 95.9 | 95.9 | 100 | 95.9 | 95.9 | 95.9 | 98.0 | 98.0 |
| 85.7 | 89.8 | 93.9 | 89.8 | 93.9 | 95.9 | 93.9 | 93.9 | 93.9 | 91.8 | 93.9 | 93.9 |
| 83.7 | 81.6 | 87.8 | 91.8 | 87.8 | 83.7 | 83.7 | 83.7 | 85.7 | 83.7 | 87.8 | 85.7 |
| 85.7 | 87.8 | 91.8 | 89.8 | 93.9 | 91.8 | 95.9 | 95.9 | 93.9 | 93.9 | 93.9 | 93.9 |
| 85.7 | 95.9 | 98.0 | 100 | 100 | 100 | 95.9 | 95.9 | 95.9 | 93.9 | 93.9 | 93.9 |
| 85.7 | 89.8 | 100 | 95.9 | 98.0 | 95.9 | 98.0 | 93.9 | 93.9 | 93.9 | 93.9 | 93.9 |
| 85.7 | 91.8 | 100 | 100 | 98.0 | 95.9 | 95.9 | 95.9 | 95.9 | 95.9 | 95.9 | 95.9 |
| 81.6 | 100 | 95.9 | 98.0 | 98.0 | 98.0 | 95.9 | 95.9 | 95.9 | 98.0 | 98.0 | 98.0 |
| 81.6 | 100 | 100 | 100 | 95.9 | 95.9 | 100 | 95.9 | 95.9 | 93.9 | 98.0 | 98.0 |
| 81.6 | 98.0 | 100 | 100 | 98.0 | 95.9 | 95.9 | 98.0 | 98.0 | 95.9 | 98.0 | 95.9 |
| 81.6 | 98.0 | 100 | 100 | 98.0 | 95.9 | 95.9 | 93.9 | 93.9 | 93.9 | 95.9 | 95.9 |
| 85.7 | 91.8 | 98.0 | 100 | 98.0 | 100 | 95.9 | 95.9 | 95.9 | 95.9 | 93.9 | 93.9 |
| 85.7 | 89.8 | 95.9 | 95.9 | 98.0 | 100 | 95.9 | 95.9 | 95.9 | 95.9 | 93.9 | 93.9 |
| 87.8 | 87.8 | 100 | 100 | 93.9 | 95.9 | 93.9 | 95.9 | 95.9 | 95.9 | 95.9 | 95.9 |
| 87.8 | 89.8 | 89.8 | 93.9 | 98.0 | 100 | 100 | 98.0 | 98.0 | 95.9 | 95.9 | 93.9 |
| 81.6 | 98.0 | 100 | 100 | 98.0 | 95.9 | 98.0 | 95.9 | 95.9 | 95.9 | 95.9 | 93.9 |
Table 5.3 Performance on breast cancer data set using K-NN rule