Clustering techniques such as hierarchical clustering [14, 17], k-means algorithm [18], self-organizing map [14, 40], and principal component analysis [33, 45] group a subset of genes that are interdependent or correlated with each other. In other words, genes or attributes in a cluster are more correlated with each other, whereas genes in different clusters are less correlated [2, 21, 41]. Attribute clustering is able to reduce the search dimension of a classification algorithm and constructs the model using a tightly correlated subset of genes rather than using the entire gene space. After clustering genes, a reduced set of genes can be selected for further analysis [2, 21, 41]. A brief survey on various gene clustering algorithms is reported in Chap. 8.
However, all these algorithms group genes according to unsupervised similarity measures computed from the gene expression values, without using any information about the sample categories or response variables. The information of response variables should be incorporated in attribute clustering to find groups of co-regulated genes with strong association to the sample categories [5]. In this background, several supervised attribute clustering algorithms, such as supervised gene clustering [5], gene shaving [16], tree harvesting [15], and the partial least squares procedure [35], have been reported to reveal groups of co-regulated genes with strong association to the sample categories. Supervised attribute clustering is defined as the grouping of genes or attributes, controlled by the information of sample categories or response variables.
In general, the quality of generated clusters is always relative to a certain criterion, and different criteria may lead to different clustering results. However, every criterion tries to measure the similarity among the subset of genes present in a cluster. While tree harvesting [15] uses an unsupervised similarity measure to group a set of co-regulated genes, other supervised algorithms such as supervised gene clustering [5], gene shaving [16], and the partial least squares procedure [35] do not use any similarity measure to cluster genes; rather, they use different predictive scores such as the Wilcoxon test [5] and the Cox model score test [16] to measure gene-class relevance. Moreover, all these measures depend on the actual values of the training data. Hence, they may be sensitive to noise or outliers in the data set [8, 18, 21, 36]. On the other hand, as mutual information [3, 8, 19, 36] depends only on the probability distribution of a random variable, it has been widely used for computing both gene-class relevance and gene-gene redundancy or similarity [2, 3, 7, 8, 19, 28, 36].
This chapter presents a mutual information-based supervised attribute clustering (MISAC) algorithm [31] to find co-regulated clusters of genes whose collective expression is strongly associated with the sample categories or class labels. A new quantitative measure, based on mutual information, is introduced to compute the similarity between attributes. The new measure incorporates the information of sample categories while measuring the similarity between attributes. In effect, it helps to identify functional groups of genes that are of special interest in sample classification. The new supervised attribute clustering method uses this measure to reduce the redundancy among genes. It involves partitioning of the original gene set into some distinct subsets or clusters so that the genes within a cluster are highly co-regulated with strong association to the sample categories while those in different clusters are as dissimilar as possible. A single gene from each cluster having the highest gene-class relevance value is first selected as the initial representative of that cluster. The representative of each cluster is then modified by averaging the initial representative with other genes of that cluster whose collective expression is strongly associated with the sample categories. Finally, the modified representative of each cluster is selected to constitute the resulting reduced feature set. In effect, the MISAC algorithm yields biologically significant gene clusters, whose coherent average expression levels allow perfect discrimination of sample categories. Also, the MISAC algorithm avoids the noise sensitivity problem of existing supervised gene clustering algorithms. The performance of the MISAC algorithm, along with a comparison with existing algorithms, is studied both qualitatively and quantitatively on three cancer and two arthritis data sets using the class separability index and the predictive accuracy of naive Bayes classifier, K-nearest neighbor rule, and support vector machine.
The structure of the rest of this chapter is as follows: Sect. 9.2 briefly introduces existing supervised and unsupervised gene clustering algorithms, along with different existing criteria used for computing the relevance and redundancy. The new supervised attribute clustering algorithm is presented in Sect. 9.3. A few case studies and a comparison with existing algorithms are presented in Sect. 9.4. Concluding remarks are given in Sect. 9.5.
9.2 Clustering Genes for Sample Classification
In this section, some existing supervised and unsupervised gene clustering algorithms are reported, along with different widely used criteria for computing gene-class relevance and gene-gene redundancy.
9.2.1 Gene Clustering: Supervised Versus Unsupervised
Clustering is one of the major tasks in gene expression data analysis. To find groups of co-regulated genes from microarray data, different unsupervised clustering techniques such as hierarchical clustering [14, 17], k-means algorithm [18], self-organizing map [14, 40], and principal component analysis [33, 45] have been widely used. Hierarchical clustering identifies sets of correlated genes with similar behavior across the samples, but yields thousands of clusters in a tree-like structure, which makes the identification of functional groups very difficult [14, 17]. In contrast, the self-organizing map [14, 40] and k-means algorithm [18] require a prespecified number and an initial spatial structure of clusters, which may be hard to come up with in real problems. Moreover, these algorithms usually fail to reveal functional groups of genes that are of special interest in sample classification, as the genes are clustered by similarity only, without using any information about the sample categories or class labels [5].


To reveal groups of co-regulated genes with strong association to the sample categories, different supervised attribute clustering algorithms have been proposed recently [5, 15, 16, 35]. One notable work in this field is tree harvesting [15], a two-step method that first generates numerous candidate groups by unsupervised hierarchical clustering. The average expression profile of each cluster is then considered as a potential input variable for a response model, and the few gene groups that contain the most useful information for tissue discrimination are identified. Only this second step makes the clustering supervised, as the selection process relies on external information about the tissue types. Another supervised clustering method, called gene shaving, identifies subsets of genes with coherent expression patterns and large variation across the conditions [16]. The technique can be unsupervised, where the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings.
An interesting supervised clustering approach that directly incorporates the response variables in the grouping process is the partial least squares procedure [35], which constructs, in a supervised manner, weighted linear combinations of genes that have maximal covariance with the outcome. However, it has the drawback that the fitted components involve all (usually thousands of) genes, which makes them very difficult to interpret. Moreover, for every component, partial least squares yields a linear combination of gene expressions that completely lacks the biological interpretation of having a cluster of genes acting similarly in the same pathway.
A direct approach that combines gene selection, clustering, and supervision in one single step is reported in [5]. A similar single-step approach is also pursued by Jornsten and Yu [23]. The supervised attribute clustering algorithm proposed in [5] is a combination of gene selection for cluster membership and formation of a new predictor by possible sign flipping and averaging of the gene expressions within a cluster. The cluster membership is determined with a forward and backward searching technique that optimizes the Wilcoxon test-based predictive score and margin criterion defined in [5], both of which involve the supervised response variables from the data. However, as both the predictive score and the margin criterion depend on the actual gene expression values, they are highly sensitive to noise and outliers in the data set.
9.2.2 Criteria for Gene Selection and Clustering
As reported in Chap. 5, the t-test, F-test [8, 26], information gain, mutual information [8, 36], normalized mutual information [28], and f-information [29] are typically used to measure the relevance of a gene with respect to the class labels or sample categories, while the same or a different metric such as mutual information, the $L_1$ distance, Euclidean distance, or Pearson's correlation coefficient [8, 21, 36] is employed to calculate the similarity or redundancy between genes.




To measure the relevance of a gene, the t-test is widely used, assuming that there are two classes of samples in a gene expression data set. When there are multiple classes of samples, the t-test is typically computed for one class versus all the other classes. Alternatively, for multiple classes of samples, an F-test between a gene and the class label can be used to calculate the relevance score of that gene. The F-test reduces to the t-test for the two-class problem, with the relation $F = t^2$. In [5], the Wilcoxon test statistic is used to compute the relevance of a gene, assuming two classes of samples in the microarray data set.
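The relation between the two tests is easy to verify numerically. The following minimal check (an illustration added here, using SciPy's standard test functions on synthetic data; it is not part of the original text) confirms that the one-way F statistic equals the squared t statistic for two classes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic two-class gene: class 1 shifted by one unit relative to class 0.
gene = np.concatenate([rng.normal(0, 1, 40), rng.normal(1, 1, 40)])
labels = np.repeat([0, 1], 40)

t, _ = stats.ttest_ind(gene[labels == 0], gene[labels == 1])
f, _ = stats.f_oneway(gene[labels == 0], gene[labels == 1])
print(np.isclose(t ** 2, f))   # True: F = t^2 in the two-class case
```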






On the other hand, the Euclidean distance measures the difference in the individual magnitudes of each gene. However, genes regarded as similar by the Euclidean distance may be very dissimilar in terms of their shapes, and the Euclidean distance between two genes having an identical shape may be large if they differ from each other by a large scaling factor. Since the overall shapes of genes are of primary interest for gene expression data [21], the Euclidean distance may not yield a good proximity measurement of genes [21]. Pearson's correlation coefficient considers each gene as a random variable and measures the similarity between two genes by calculating the linear relationship between the distributions of the two corresponding random variables. An empirical study has shown that Pearson's correlation coefficient is not robust to outliers and may assign a high similarity score to a pair of dissimilar genes [18].
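Both caveats are easy to reproduce. The following toy example (synthetic numbers, added here for illustration only) shows a pair of profiles with identical shape that the Euclidean distance judges far apart, and an outlier-driven pair that Pearson's correlation judges similar:

```python
import numpy as np

# Identical shape, large scaling factor: huge Euclidean distance, correlation 1.
profile = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
scaled = 10.0 * profile
print(np.linalg.norm(profile - scaled))        # large distance
print(np.corrcoef(profile, scaled)[0, 1])      # exactly 1.0

# A single outlier can make two otherwise unrelated genes look correlated.
a = np.array([0.1, -0.2, 0.0, 0.2, 9.0])
b = np.array([-0.1, 0.2, 0.1, -0.2, 8.0])
print(np.corrcoef(a, b)[0, 1])                 # close to 1 despite disagreement
```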
However, as the t-test, F-test, Wilcoxon test, Euclidean distance, and Pearson's correlation depend on the actual gene expression values of microarray data, they are highly sensitive to noise and outliers in the data set. On the other hand, as information-theoretic measures such as entropy, mutual information, and f-information depend only on the probability distribution of a random variable rather than on its actual values, they are more effective for evaluating both gene-class relevance and gene-gene redundancy [8, 29, 36].



In principle, the mutual information is used to quantify the information shared by two objects. If two independent objects do not share much information, the mutual information value between them is small, while two highly correlated objects demonstrate a high mutual information value [39]. The objects can be the class label and the genes. Whether a gene is independent and informative can, therefore, be determined by the shared information between the gene and the rest of the genes as well as the shared information between the gene and the class label [8, 36]. If a gene has expression values randomly or uniformly distributed in different classes, its mutual information with these classes is zero. If a gene is strongly differentially expressed for different classes, it should have large mutual information. Thus, the mutual information can be used as a measure of the relevance of genes. Similarly, the mutual information may be used to measure the level of similarity or redundancy between two genes.
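This idea can be made concrete with a small sketch. The following Python fragment (an illustration added here, not taken from the original text) estimates the mutual information between a single gene and the class labels by discretizing expression values into equal-width bins; the bin count and the synthetic data are arbitrary choices:

```python
import numpy as np

def mutual_information(gene, labels, bins=10):
    """Estimate I(gene; class) as H(A) - H(A|D) from empirical frequencies."""
    # Discretize continuous expression values into equal-width bins.
    edges = np.histogram_bin_edges(gene, bins=bins)[1:-1]
    binned = np.digitize(gene, edges)

    def entropy(x):
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    # H(A|D): class-probability-weighted entropy of the gene within each class.
    h_cond = sum(
        (labels == c).mean() * entropy(binned[labels == c])
        for c in np.unique(labels)
    )
    return entropy(binned) - h_cond

# A differentially expressed gene scores high; a random gene scores near zero.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
informative = rng.normal(labels * 2.0, 1.0)    # mean shifts with the class
irrelevant = rng.normal(0.0, 1.0, size=100)    # same distribution in both classes
print(mutual_information(informative, labels), mutual_information(irrelevant, labels))
```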
9.3 Supervised Gene Clustering Algorithm
In this section, a mutual information-based supervised attribute clustering (MISAC) algorithm [31] is presented for grouping co-regulated genes with strong association to the class labels. It is built on the supervised similarity measure introduced next.
9.3.1 Supervised Similarity Measure
In real-data analysis, one of the important issues is computing both relevance and redundancy of attributes by discovering dependencies among them. Intuitively, a set of attributes $\mathbb{Q}$ depends totally on a set of attributes $\mathbb{P}$ if all attribute values from $\mathbb{Q}$ are uniquely determined by the values of the attributes from $\mathbb{P}$. If there exists a functional dependency between the values of $\mathbb{Q}$ and $\mathbb{P}$, then $\mathbb{Q}$ depends totally on $\mathbb{P}$.








Let $\mathbb{U} = \{x_1, \ldots, x_m\}$ be the set of $m$ samples and $\mathbb{C} = \{\mathscr{A}_1, \ldots, \mathscr{A}_n\}$ denote the set of $n$ attributes of a given data set $\mathscr{T} = \{w_{ij}\}$, where $w_{ij}$ is the measured value of the attribute $\mathscr{A}_j$ in the sample $x_i$. Let $\mathbb{D}$ be the set of class labels or sample categories of the $m$ samples. Define $\gamma_{\mathbb{D}}(\mathscr{A}_i)$ as the relevance of the attribute $\mathscr{A}_i$ with respect to the class label $\mathbb{D}$, and $\mathrm{S}(\mathscr{A}_i, \mathscr{A}_j)$ as the redundancy or similarity between two attributes $\mathscr{A}_i$ and $\mathscr{A}_j$. The mutual information can be used to calculate both relevance and redundancy among attributes.
















The relevance $\gamma_{\mathbb{D}}(\mathscr{A}_i)$ of the attribute $\mathscr{A}_i$ with respect to the class label $\mathbb{D}$ using mutual information can be calculated as follows:

$$\gamma_{\mathbb{D}}(\mathscr{A}_i) = I(\mathscr{A}_i, \mathbb{D}), \qquad (9.1)$$

where $I(\mathscr{A}_i, \mathbb{D})$ represents the mutual information between attribute $\mathscr{A}_i$ and class label $\mathbb{D}$ that is given by

$$I(\mathscr{A}_i, \mathbb{D}) = H(\mathscr{A}_i) - H(\mathscr{A}_i \mid \mathbb{D}). \qquad (9.2)$$

Here, $H(\mathscr{A}_i)$ and $H(\mathscr{A}_i \mid \mathbb{D})$ represent the entropy of attribute $\mathscr{A}_i$ and the conditional entropy of $\mathscr{A}_i$ given class label $\mathbb{D}$, respectively. The entropy $H(\mathscr{A}_i)$ is known to be a measure of the amount of uncertainty about the attribute $\mathscr{A}_i$, while $H(\mathscr{A}_i \mid \mathbb{D})$ is the amount of uncertainty left in $\mathscr{A}_i$ when knowing $\mathbb{D}$. Hence, the quantity $I(\mathscr{A}_i, \mathbb{D})$ is the reduction in the uncertainty of the attribute $\mathscr{A}_i$ by the knowledge of class label $\mathbb{D}$. In other words, it represents the amount of information that the class label $\mathbb{D}$ contains about the attribute $\mathscr{A}_i$.
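As a quick worked example (added here for illustration; the numbers are hypothetical): consider a gene discretized into two bins, with two equally likely classes, where the gene falls in bin 0 for 90% of class-1 samples and 10% of class-2 samples. Then $P(\mathscr{A}_i = 0) = 0.5$, so

$$H(\mathscr{A}_i) = -\tfrac{1}{2}\log_2 \tfrac{1}{2} - \tfrac{1}{2}\log_2 \tfrac{1}{2} = 1 \text{ bit}, \qquad H(\mathscr{A}_i \mid \mathbb{D}) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.47 \text{ bit},$$

and hence $I(\mathscr{A}_i, \mathbb{D}) \approx 0.53$ bit: knowing the class label removes roughly half of the uncertainty about the gene.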















Definition 9.1
For continuous random variables such as gene expression values, the entropy, conditional entropy, and mutual information can be defined as follows:

$$H(\mathscr{A}_i) = -\int p(a_i) \log p(a_i)\, da_i; \qquad (9.3)$$

$$H(\mathscr{A}_i \mid \mathbb{D}) = -\int\!\!\int p(a_i, d) \log p(a_i \mid d)\, da_i\, dd; \qquad (9.4)$$

$$I(\mathscr{A}_i, \mathbb{D}) = \int\!\!\int p(a_i, d) \log \frac{p(a_i, d)}{p(a_i)\, p(d)}\, da_i\, dd; \qquad (9.5)$$

where $p(a_i)$ is the true probability density function of the attribute or variable $\mathscr{A}_i$, while $p(a_i \mid d)$ and $p(a_i, d)$ represent the conditional probability density function of $\mathscr{A}_i$ given the variable $\mathbb{D}$ and the joint probability density function of $\mathscr{A}_i$ and $\mathbb{D}$, respectively. Usually, the Gaussian function is used to approximate the true density function [12].
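As a minimal sketch of how (9.3)–(9.5) can be operationalized under the Gaussian approximation mentioned above (the function names are hypothetical, and this is an illustration rather than the reference implementation): fitting a Gaussian to a variable gives the closed-form differential entropy $\frac{1}{2}\log(2\pi e\,\sigma^2)$, from which the relevance (9.1) follows directly.

```python
import numpy as np

def gaussian_entropy(x):
    """Differential entropy of x under a Gaussian density fit:
    H(A) = 0.5 * log(2 * pi * e * var(A)). Assumes nonzero variance."""
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(x))

def gaussian_mutual_information(gene, labels):
    """I(A, D) = H(A) - H(A|D), with each density replaced by a Gaussian fit;
    H(A|D) is the class-probability-weighted entropy within each class."""
    h_cond = sum(
        (labels == d).mean() * gaussian_entropy(gene[labels == d])
        for d in np.unique(labels)
    )
    return gaussian_entropy(gene) - h_cond
```

This closed form is fast but assumes each class-conditional density is roughly a unimodal Gaussian; kernel density estimates are the usual refinement when that assumption is doubtful.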








The redundancy or similarity between two attributes $\mathscr{A}_i$ and $\mathscr{A}_j$, in terms of mutual information, can also be calculated as follows:

$$\mathrm{S}(\mathscr{A}_i, \mathscr{A}_j) = I(\mathscr{A}_i, \mathscr{A}_j). \qquad (9.6)$$

However, the term $\mathrm{S}(\mathscr{A}_i, \mathscr{A}_j)$ does not incorporate the information of sample categories or class labels $\mathbb{D}$ while measuring the similarity, and it is therefore considered an unsupervised similarity measure. Hence, a new quantitative measure, called the supervised similarity measure, is reported here based on mutual information for measuring the similarity between two random variables. It incorporates the information of sample categories or class labels while measuring the similarity between attributes.


Definition 9.2
The significance of an attribute $\mathscr{A}_j$ with respect to another attribute $\mathscr{A}_i$ can be defined as follows:

$$\sigma_{\{\mathscr{A}_i, \mathscr{A}_j\}}(\mathbb{D}, \mathscr{A}_j) = I(\{\mathscr{A}_i, \mathscr{A}_j\}, \mathbb{D}) - I(\mathscr{A}_i, \mathbb{D}). \qquad (9.7)$$

That is, the significance of an attribute $\mathscr{A}_j$ is the change in dependency when the attribute $\mathscr{A}_j$ is removed from the set $\{\mathscr{A}_i, \mathscr{A}_j\}$. The higher the change in dependency, the more significant the attribute $\mathscr{A}_j$ is. If the significance is 0, then the attribute $\mathscr{A}_j$ is dispensable.





Based on the concept of significance of an attribute, the supervised similarity measure between two attributes is defined next.
Definition 9.3
The supervised similarity between two attributes $\mathscr{A}_i$ and $\mathscr{A}_j$ is defined as

$$\Psi(\mathscr{A}_i, \mathscr{A}_j) = \left[1 + \left[\kappa(\mathscr{A}_i, \mathscr{A}_j)\right]^2\right]^{-1}, \qquad (9.8)$$

where

$$\kappa(\mathscr{A}_i, \mathscr{A}_j) = \frac{1}{2}\left[\sigma_{\{\mathscr{A}_i, \mathscr{A}_j\}}(\mathbb{D}, \mathscr{A}_i) + \sigma_{\{\mathscr{A}_i, \mathscr{A}_j\}}(\mathbb{D}, \mathscr{A}_j)\right]. \qquad (9.9)$$

Hence, the supervised similarity measure $\Psi(\mathscr{A}_i, \mathscr{A}_j)$ directly takes into account the information of sample categories or class labels $\mathbb{D}$ while computing the similarity between two attributes $\mathscr{A}_i$ and $\mathscr{A}_j$. If attributes $\mathscr{A}_i$ and $\mathscr{A}_j$ are completely correlated with respect to class labels $\mathbb{D}$, then $\kappa(\mathscr{A}_i, \mathscr{A}_j) = 0$ and so $\Psi(\mathscr{A}_i, \mathscr{A}_j)$ is 1. If $\mathscr{A}_i$ and $\mathscr{A}_j$ are totally uncorrelated, $\Psi(\mathscr{A}_i, \mathscr{A}_j)$ becomes small. Hence, $\Psi(\mathscr{A}_i, \mathscr{A}_j)$ can be used as a measure of supervised similarity between two attributes $\mathscr{A}_i$ and $\mathscr{A}_j$.

Expanding the significance terms of (9.7) in terms of entropy, (9.9) can be written as

$$\kappa(\mathscr{A}_i, \mathscr{A}_j) = \frac{1}{2}\left\{H(\mathscr{A}_i|\mathscr{A}_j) + H(\mathscr{A}_j|\mathscr{A}_i) + H(\mathscr{A}_i|\mathbb{D}) + H(\mathscr{A}_j|\mathbb{D})\right\} - H(\mathscr{A}_i\mathscr{A}_j|\mathbb{D}). \qquad (9.10)$$

The supervised similarity between two attributes $\mathscr{A}_i$ and $\mathscr{A}_j$, in terms of entropy, is therefore given by

$$\Psi(\mathscr{A}_i, \mathscr{A}_j) = \Big[1 + \Big[H(\mathscr{A}_i\mathscr{A}_j|\mathbb{D}) - \frac{1}{2}\big\{H(\mathscr{A}_i|\mathscr{A}_j) + H(\mathscr{A}_j|\mathscr{A}_i) + H(\mathscr{A}_i|\mathbb{D}) + H(\mathscr{A}_j|\mathbb{D})\big\}\Big]^2\Big]^{-1}. \qquad (9.11)$$

Combining (9.6) and (9.11), the term $\Psi(\mathscr{A}_i, \mathscr{A}_j)$ can be expressed as follows:

$$\Psi(\mathscr{A}_i, \mathscr{A}_j) = \Big[1 + \Big[\mathrm{S}(\mathscr{A}_i, \mathscr{A}_j) + H(\mathscr{A}_i\mathscr{A}_j|\mathbb{D}) - \frac{1}{2}\big\{H(\mathscr{A}_i) + H(\mathscr{A}_j) + H(\mathscr{A}_i|\mathbb{D}) + H(\mathscr{A}_j|\mathbb{D})\big\}\Big]^2\Big]^{-1}. \qquad (9.12)$$

Hence, the supervised similarity measure $\Psi(\mathscr{A}_i, \mathscr{A}_j)$ not only considers the information of sample categories or class labels $\mathbb{D}$, it also takes into account the unsupervised similarity $\mathrm{S}(\mathscr{A}_i, \mathscr{A}_j)$ between the two attributes. The following properties can be stated about the measure:
1. $0 < \Psi(\mathscr{A}_i, \mathscr{A}_j) \le 1$.
2. $\Psi(\mathscr{A}_i, \mathscr{A}_j) = 1$ if and only if $\mathscr{A}_i$ and $\mathscr{A}_j$ are completely correlated.
3. $\Psi(\mathscr{A}_i, \mathscr{A}_j) \approx 0$ if and only if $\mathscr{A}_i$ and $\mathscr{A}_j$ are totally uncorrelated.
4. $\Psi(\mathscr{A}_i, \mathscr{A}_j) = \Psi(\mathscr{A}_j, \mathscr{A}_i)$ (symmetric).
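A small sketch of Definition 9.3 in Python may be helpful. It assumes the attributes have already been discretized (e.g., binned expression values), estimates all entropies from empirical frequencies, and computes $\Psi$ via the entropy form (9.11); the names and binning choice are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def entropy(*vars_):
    """Joint Shannon entropy of one or more discrete (binned) variables."""
    keys = np.stack(vars_, axis=1)
    _, counts = np.unique(keys, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def supervised_similarity(ai, aj, d):
    """Psi(Ai, Aj) = 1 / (1 + kappa^2), following (9.8)-(9.11);
    kappa averages the two significance terms of Definition 9.2."""
    h_ij_d = entropy(ai, aj, d) - entropy(d)     # H(Ai Aj | D)
    h_i_j = entropy(ai, aj) - entropy(aj)        # H(Ai | Aj)
    h_j_i = entropy(ai, aj) - entropy(ai)        # H(Aj | Ai)
    h_i_d = entropy(ai, d) - entropy(d)          # H(Ai | D)
    h_j_d = entropy(aj, d) - entropy(d)          # H(Aj | D)
    kappa = 0.5 * (h_i_j + h_j_i + h_i_d + h_j_d) - h_ij_d
    return 1.0 / (1.0 + kappa ** 2)
```

As a quick check of property 2, `supervised_similarity(a, a, d)` returns 1 for any discretized attribute `a`, since both significance terms vanish for completely correlated attributes.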



9.3.2 Gene Clustering Algorithm
The mutual information-based supervised attribute clustering algorithm [31], termed as MISAC, relies mainly on two factors, namely, determining the relevance of each attribute and growing the cluster around each relevant attribute incrementally by adding one attribute after another. An important property of this clustering approach is that the cluster is augmented only by those attributes that satisfy the following two conditions:
1. they suit best into the current cluster in terms of the supervised similarity measure defined above; and
2. they improve the differential expression of the current cluster most, according to the relevance of the cluster representative or prototype.
The growth of a cluster is repeated until the cluster stabilizes, and then the MISAC algorithm starts to generate a new cluster.
Let $\gamma_{\mathbb{D}}(\mathscr{A}_i)$ be the relevance of the attribute $\mathscr{A}_i$ with respect to the class label $\mathbb{D}$. The relevance uses information about the class labels and is thus a criterion for supervised clustering. The MISAC algorithm starts with the single attribute $\mathscr{A}_i$ that has the highest relevance value with respect to the class labels. An initial cluster $V_i$ is formed by selecting a set of attributes from the whole set $\mathbb{C}$, considering the attribute $\mathscr{A}_i$ as the representative of cluster $V_i$, where

$$V_i = \{\mathscr{A}_j \mid \Psi(\mathscr{A}_i, \mathscr{A}_j) \ge \delta;\ \mathscr{A}_j \ne \mathscr{A}_i \in \mathbb{C}\}. \qquad (9.13)$$

Hence, the cluster $V_i$ represents the set of attributes of $\mathbb{C}$ that have supervised similarity values with the attribute $\mathscr{A}_i$ greater than a predefined threshold value $\delta$. The cluster $V_i$ is the coarse cluster corresponding to the attribute $\mathscr{A}_i$, while the threshold $\delta$ is termed the radius of cluster $V_i$ (Fig. 9.1).
Fig. 9.1 Representation of a supervised attribute cluster
After forming the initial coarse cluster $V_i$, the cluster representative is refined incrementally. By searching among the attributes of cluster $V_i$, the current cluster representative is merged and averaged with one single attribute such that the augmented cluster representative $\bar{\mathscr{A}}_i$ increases the relevance value. The merging process is repeated until the relevance value can no longer be improved. Instead of averaging all attributes of $V_i$, the augmented attribute $\bar{\mathscr{A}}_i$ is computed by considering only the subset of attributes $\bar{V}_i \subseteq V_i$ that increase the relevance value of the cluster representative. The set of attributes $\bar{V}_i$ represents the finer cluster of the attribute $\mathscr{A}_i$ (Fig. 9.1). While the generation of the coarse cluster reduces the redundancy among attributes of the set $\mathbb{C}$, that of the finer cluster increases the relevance with respect to class labels. After generating the augmented cluster representative $\bar{\mathscr{A}}_i$ from the finer cluster $\bar{V}_i$, the process is repeated to find more clusters and augmented cluster representatives by discarding the set of attributes $V_i$ from the whole set $\mathbb{C}$.














To compute the coarse cluster $V_i$ corresponding to the attribute $\mathscr{A}_i$, one may consider the conventional unsupervised similarity measure $\mathrm{S}(\mathscr{A}_i, \mathscr{A}_j)$ defined in (9.6). However, as it does not take into account the information of sample categories or class labels, the attributes are clustered by similarity only, without using any information about the sample categories. In effect, it fails to reveal functional groups of attributes that are of special interest in classification. On the other hand, as the supervised similarity measure $\Psi(\mathscr{A}_i, \mathscr{A}_j)$ defined in (9.8) incorporates the class information directly while computing the similarity between two attributes $\mathscr{A}_i$ and $\mathscr{A}_j$, it can identify the functional groups present in the attribute set.






The main steps of the mutual information-based supervised attribute clustering (MISAC) algorithm are reported next.
Let $\mathbb{C}$ be the set of attributes of the original data set, while $\mathbb{S}$ and $\bar{\mathbb{S}}$ are the sets of actual and augmented attributes, respectively, selected by the MISAC algorithm. Let $V_i$ be the coarse cluster associated with the attribute $\mathscr{A}_i$, and let $\bar{V}_i$, the finer cluster of $\mathscr{A}_i$ (Fig. 9.1), represent the set of attributes of $V_i$ that are merged and averaged with the attribute $\mathscr{A}_i$ to generate the augmented cluster representative $\bar{\mathscr{A}}_i$.
1. Initialize $\mathbb{C} \leftarrow \{\mathscr{A}_1, \ldots, \mathscr{A}_n\}$, $\mathbb{S} \leftarrow \emptyset$, and $\bar{\mathbb{S}} \leftarrow \emptyset$.
2. Calculate the relevance value $\gamma_{\mathbb{D}}(\mathscr{A}_i)$ of each attribute $\mathscr{A}_i \in \mathbb{C}$.
3. Repeat the following nine steps (steps 4–12) until $\mathbb{C} = \emptyset$ or the desired number of attributes is selected.
4. Select the attribute $\mathscr{A}_i$ from $\mathbb{C}$ that has the highest relevance value as the representative of cluster $V_i$. In effect, $\mathbb{S} = \mathbb{S} \cup \{\mathscr{A}_i\}$, $\bar{\mathscr{A}}_i = \mathscr{A}_i$, $\bar{V}_i = \{\mathscr{A}_i\}$, and $\mathbb{C} = \mathbb{C} \setminus \{\mathscr{A}_i\}$.
5. Generate the coarse cluster $V_i$ from the set of existing attributes of $\mathbb{C}$ satisfying the following condition: $V_i = \{\mathscr{A}_j \mid \Psi(\mathscr{A}_i, \mathscr{A}_j) \ge \delta;\ \mathscr{A}_j \in \mathbb{C}\}$.
6. Initialize $\gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i) = \gamma_{\mathbb{D}}(\mathscr{A}_i)$.
7. Repeat the following four steps (steps 8–11) for each attribute $\mathscr{A}_j \in V_i$.
8. Compute two augmented cluster representatives by averaging $\mathscr{A}_j$ and its complement, respectively, with the attributes of $\bar{V}_i$ as follows:
$$\bar{\mathscr{A}}_i^{+} = \frac{1}{|\bar{V}_i| + 1}\Big\{\sum_{\mathscr{A}_k \in \bar{V}_i} \mathscr{A}_k + \mathscr{A}_j\Big\}; \qquad (9.14)$$
$$\bar{\mathscr{A}}_i^{-} = \frac{1}{|\bar{V}_i| + 1}\Big\{\sum_{\mathscr{A}_k \in \bar{V}_i} \mathscr{A}_k - \mathscr{A}_j\Big\}. \qquad (9.15)$$
9. The augmented cluster representative $\bar{\mathscr{A}}_i^{*}$ after averaging $\mathscr{A}_j$ or its complement with $\bar{V}_i$ is as follows:
$$\bar{\mathscr{A}}_i^{*} = \begin{cases} \bar{\mathscr{A}}_i^{+} & \text{if } \gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i^{+}) \ge \gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i^{-}); \\ \bar{\mathscr{A}}_i^{-} & \text{otherwise.} \end{cases} \qquad (9.16)$$
10. The augmented cluster representative $\bar{\mathscr{A}}_i$ of cluster $V_i$ is $\bar{\mathscr{A}}_i^{*}$ if $\gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i^{*}) > \gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i)$; otherwise, $\bar{\mathscr{A}}_i$ remains unchanged.
11. Select the attribute $\mathscr{A}_j$ or its complement as a member of the finer cluster $\bar{V}_i$ of attribute $\mathscr{A}_i$ if $\gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i^{*}) > \gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i)$.
12. In effect, $\bar{\mathbb{S}} = \bar{\mathbb{S}} \cup \{\bar{\mathscr{A}}_i\}$ and $\mathbb{C} = \mathbb{C} \setminus V_i$.
13. Sort the set of augmented cluster representatives $\bar{\mathbb{S}}$ according to their relevance values $\gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i)$ with respect to the class labels $\mathbb{D}$.
14. Stop.
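Putting the steps together, the following self-contained Python sketch mirrors the MISAC loop above. It is an illustrative reconstruction, not the reference implementation: entropies are estimated by simple equal-width binning, the "complement" of a gene is taken to be its sign-flipped profile (in the spirit of the sign flipping in [5]), and `delta`, `bins`, and `n_clusters` are arbitrary defaults.

```python
import numpy as np

def entropy(*vars_):
    """Joint Shannon entropy of discrete (binned) variables."""
    keys = np.stack(vars_, axis=1)
    _, counts = np.unique(keys, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def discretize(x, bins=10):
    return np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])

def relevance(gene, d, bins=10):
    a = discretize(gene, bins)
    return entropy(a) - (entropy(a, d) - entropy(d))   # I(A, D) = H(A) - H(A|D)

def supervised_similarity(x, y, d, bins=10):
    ai, aj = discretize(x, bins), discretize(y, bins)
    kappa = 0.5 * ((entropy(ai, aj) - entropy(aj)) + (entropy(ai, aj) - entropy(ai))
                   + (entropy(ai, d) - entropy(d)) + (entropy(aj, d) - entropy(d))) \
            - (entropy(ai, aj, d) - entropy(d))
    return 1.0 / (1.0 + kappa ** 2)

def misac(X, d, n_clusters=5, delta=0.85, bins=10):
    """Sketch of the MISAC loop: pick the most relevant gene, form a coarse
    cluster by supervised similarity (step 5), then grow the representative by
    averaging in genes or their complements only when relevance improves."""
    remaining = set(range(X.shape[1]))
    representatives = []
    while remaining and len(representatives) < n_clusters:
        # Step 4: most relevant remaining gene becomes the initial representative.
        i = max(remaining, key=lambda g: relevance(X[:, g], d, bins))
        remaining.discard(i)
        # Step 5: coarse cluster = remaining genes similar to gene i.
        coarse = [j for j in remaining
                  if supervised_similarity(X[:, i], X[:, j], d, bins) >= delta]
        rep, members = X[:, i].copy(), [X[:, i]]
        for j in coarse:
            # Steps 8-9: try the gene and its sign-flipped complement, keep the better.
            best = max((X[:, j], -X[:, j]),
                       key=lambda g: relevance(np.mean(members + [g], axis=0), d, bins))
            trial = np.mean(members + [best], axis=0)
            # Steps 10-11: accept only if the averaged relevance improves.
            if relevance(trial, d, bins) > relevance(rep, d, bins):
                rep, members = trial, members + [best]
        representatives.append(rep)
        remaining -= set(coarse)   # step 12: discard the whole coarse cluster
    # Step 13: order representatives by decreasing relevance.
    representatives.sort(key=lambda r: relevance(r, d, bins), reverse=True)
    return np.column_stack(representatives)
```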
9.3.3 Fundamental Property
From the above discussions, the following properties corresponding to each cluster $V_i$ can be derived:
1. If $\mathscr{A}_j \in V_i$, then $\Psi(\mathscr{A}_i, \mathscr{A}_j) \ge \delta$.
2. If $\mathscr{A}_j \in V_i$, then $\gamma_{\mathbb{D}}(\mathscr{A}_j) \le \gamma_{\mathbb{D}}(\mathscr{A}_i)$.
3. If $\mathscr{A}_j \in \bar{V}_i$, then $\gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i) > \gamma_{\mathbb{D}}(\mathscr{A}_i)$.
4. If $\mathscr{A}_j \in \{V_i \setminus \bar{V}_i\}$, then averaging $\mathscr{A}_j$ with the representative does not increase $\gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i)$.
5. If $\mathscr{A}_j \in V_i$, then $\mathscr{A}_j \notin V_k$ for all $k \ne i$.

Property 1 says that if an attribute $\mathscr{A}_j \in V_i$, then $\Psi(\mathscr{A}_i, \mathscr{A}_j) \ge \delta$; that is, the supervised similarity between an attribute $\mathscr{A}_j$ of the coarse cluster $V_i$ and the initial cluster representative $\mathscr{A}_i$ is greater than a predefined threshold value $\delta$. Property 2 establishes the fact that if $\mathscr{A}_j \in V_i$, then $\gamma_{\mathbb{D}}(\mathscr{A}_j) \le \gamma_{\mathbb{D}}(\mathscr{A}_i)$; that is, the relevance of the cluster representative $\mathscr{A}_i$ is the maximum among that of all attributes of the cluster $V_i$. Properties 3 and 4 are of great importance in increasing the relevance of the augmented cluster representative with respect to the class labels and in reducing the redundancy among the attribute set. Property 3 says that if $\mathscr{A}_j \in \bar{V}_i$, then $\gamma_{\mathbb{D}}(\bar{\mathscr{A}}_i) > \gamma_{\mathbb{D}}(\mathscr{A}_i)$. It means an attribute $\mathscr{A}_j$ belongs to the finer cluster $\bar{V}_i$ if and only if it increases the relevance value of the augmented cluster representative $\bar{\mathscr{A}}_i$. On the other hand, property 4 says that the attributes that belong only to the coarse cluster $V_i$, and not to the finer cluster $\bar{V}_i$, are not responsible for increasing the relevance of the augmented cluster representative. Hence, the set of attributes $\bar{V}_i$ increases the relevance value of the attribute $\mathscr{A}_i$ as well as reduces the redundancy of the whole set, while the set of attributes $\{V_i \setminus \bar{V}_i\}$ is only responsible for reducing the redundancy. Finally, property 5 says that if an attribute $\mathscr{A}_j \in V_i$, then $\mathscr{A}_j \notin V_k$ for $k \ne i$; that is, the attribute $\mathscr{A}_j$ is contained in $V_i$ only. Hence, the MISAC algorithm generates nonoverlapping attribute clusters.
9.3.4 Computational Complexity
The computation of the relevance of $n$ attributes is carried out in step 2 of the MISAC algorithm, which has $\mathcal{O}(n)$ time complexity. The cluster generation steps, that is, steps 4–12, are executed $c$ times to generate $c$ clusters and the corresponding augmented cluster representatives. There are three loops in the cluster generation steps, which are executed $n$, $n$, and $|V_i|$ times, respectively, where $|V_i|$ represents the cardinality of the coarse cluster $V_i$. Each iteration of the loops takes only a constant amount of time. Hence, the complexity to generate $c$ clusters using steps 4–12 is $\mathcal{O}(c(n + |V_i|))$. The computing time of $|V_i|$ becomes $\mathcal{O}(n)$ in the worst case, as $|V_i| \le n$. Finally, step 13 performs the sorting of the $c$ augmented cluster representatives according to their relevance values, which has a computational complexity of $\mathcal{O}(c \log c)$. Hence, the overall time complexity of the MISAC algorithm is $\mathcal{O}(n + cn + c \log c)$, that is, $\mathcal{O}(cn)$. However, as the number of desired clusters $c$ is constant and sufficiently small compared to the total number of attributes $n$, the MISAC algorithm has an overall $\mathcal{O}(n)$ time complexity.





















