Mutual Information Based Supervised Attribute Clustering for Microarray Sample Classification

Different unsupervised clustering techniques such as hierarchical clustering [14, 17], $$k$$-means algorithm [18], self-organizing map [14, 40], and principal component analysis [33, 45] group a subset of genes that are interdependent or correlated with each other. In other words, genes or attributes in a cluster are more correlated with each other, whereas genes in different clusters are less correlated [2, 21, 41]. Attribute clustering reduces the search dimension of a classification algorithm and constructs the model using a tightly correlated subset of genes rather than the entire gene space. After clustering genes, a reduced set of genes can be selected for further analysis [2, 21, 41]. A brief survey on various gene clustering algorithms is reported in Chap. 8.


However, all these algorithms group genes according to unsupervised similarity measures computed from the gene expression values, without using any information about the sample categories or response variables. The information of response variables should be incorporated in attribute clustering to find groups of co-regulated genes with strong association to the sample categories [5]. Against this background, some supervised attribute clustering algorithms, such as supervised gene clustering [5], gene shaving [16], tree harvesting [15], and the partial least squares procedure [35], have been reported to reveal groups of co-regulated genes with strong association to the sample categories. Supervised attribute clustering is defined as the grouping of genes or attributes, controlled by the information of sample categories or response variables.

In general, the quality of generated clusters is always relative to a certain criterion, and different criteria may lead to different clustering results. However, every criterion tries to measure the similarity among the subset of genes present in a cluster. While tree harvesting [15] uses an unsupervised similarity measure to group a set of co-regulated genes, other supervised algorithms such as supervised gene clustering [5], gene shaving [16], and the partial least squares procedure [35] do not use any similarity measure to cluster genes; rather, they use predictive scores such as the Wilcoxon test [5] and the Cox model score test [16] to measure gene-class relevance. Moreover, all these measures depend on the actual values of the training data; hence, they may be sensitive to noise or outliers in the data set [8, 18, 21, 36]. On the other hand, as mutual information [3, 8, 19, 36] depends only on the probability distribution of a random variable, it has been widely used for computing both gene-class relevance and gene-gene redundancy or similarity [2, 3, 7, 8, 19, 28, 36].

This chapter presents a mutual information-based supervised attribute clustering (MISAC) algorithm [31] to find co-regulated clusters of genes whose collective expression is strongly associated with the sample categories or class labels. A new quantitative measure, based on mutual information, is introduced to compute the similarity between attributes. The new measure incorporates the information of sample categories while measuring the similarity between attributes; in effect, it helps to identify functional groups of genes that are of special interest in sample classification. The new supervised attribute clustering method uses this measure to reduce the redundancy among genes. It involves partitioning the original gene set into distinct subsets or clusters so that the genes within a cluster are highly co-regulated, with strong association to the sample categories, while those in different clusters are as dissimilar as possible. A single gene from each cluster having the highest gene-class relevance value is first selected as the initial representative of that cluster. The representative of each cluster is then modified by averaging the initial representative with the other genes of that cluster whose collective expression is strongly associated with the sample categories. Finally, the modified representative of each cluster is selected to constitute the resulting reduced feature set. In effect, the MISAC algorithm yields biologically significant gene clusters whose coherent average expression levels allow perfect discrimination of sample categories. The MISAC algorithm also avoids the noise-sensitivity problem of existing supervised gene clustering algorithms. The performance of the MISAC algorithm, along with a comparison with existing algorithms, is studied both qualitatively and quantitatively on three cancer and two arthritis data sets using the class separability index and the predictive accuracy of the naive Bayes classifier, the K-nearest neighbor rule, and the support vector machine.

The structure of the rest of this chapter is as follows: Sect. 9.2 briefly introduces existing supervised and unsupervised gene clustering algorithms, along with different existing criteria used for computing the relevance and redundancy. The new supervised attribute clustering algorithm is presented in Sect. 9.3. A few case studies and a comparison with existing algorithms are presented in Sect. 9.4. Concluding remarks are given in Sect. 9.5.



9.2 Clustering Genes for Sample Classification


In this section, some existing supervised and unsupervised gene clustering algorithms are reported, along with different widely used criteria for computing gene-class relevance and gene-gene redundancy.


9.2.1 Gene Clustering: Supervised Versus Unsupervised


Clustering is one of the major tasks in gene expression data analysis. To find groups of co-regulated genes from microarray data, different unsupervised clustering techniques such as hierarchical clustering [14, 17], $$k$$-means algorithm [18], self-organizing map [14, 40], and principal component analysis [33, 45] have been widely used. Hierarchical clustering identifies sets of correlated genes with similar behavior across the samples, but yields thousands of clusters in a tree-like structure, which makes the identification of functional groups very difficult [14, 17]. In contrast, the self-organizing map [14, 40] and the $$k$$-means algorithm [18] require a prespecified number of clusters and an initial spatial structure, which may be hard to come up with in real problems. However, all these algorithms usually fail to reveal functional groups of genes that are of special interest in sample classification, as the genes are clustered by similarity only, without using any information about the sample categories or class labels [5].

To reveal groups of co-regulated genes with strong association to the sample categories, different supervised attribute clustering algorithms have been proposed recently [5, 15, 16, 35]. One notable work in this field is tree harvesting [15], a two-step method that first generates numerous candidate groups by unsupervised hierarchical clustering. The average expression profile of each cluster is then considered as a potential input variable for a response model, and the few gene groups that contain the most useful information for tissue discrimination are identified. Only this second step makes the clustering supervised, as the selection process relies on external information about the tissue types. Another supervised clustering method, called gene shaving, identifies subsets of genes with coherent expression patterns and large variation across the conditions [16]. The technique can be unsupervised, where the genes and samples are treated as unlabeled, or partially or fully supervised, using known properties of the genes or samples to assist in finding meaningful groupings.

An interesting supervised clustering approach that directly incorporates the response variables in the grouping process is the partial least squares procedure [35], which constructs, in a supervised manner, weighted linear combinations of genes that have maximal covariance with the outcome. However, it has the drawback that the fitted components involve all (usually thousands of) genes, which makes them very difficult to interpret. Moreover, for every component, partial least squares yields a linear combination of gene expressions that completely lacks the biological interpretation of a cluster of genes acting similarly in the same pathway.
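To make this interpretability drawback concrete, the following minimal sketch fits a two-component PLS model with scikit-learn's `PLSRegression` on synthetic data; the matrix sizes and the outcome construction are illustrative assumptions, not taken from the chapter.

```python
# A minimal sketch of the partial least squares idea, assuming a gene
# expression matrix X (samples x genes) and a numeric response y.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))                 # 40 samples, 500 genes
y = (X[:, :5].mean(axis=1) > 0).astype(float)  # outcome driven by 5 genes

pls = PLSRegression(n_components=2)
pls.fit(X, y)

# Each component weights ALL genes, which is the interpretability
# drawback noted above: no compact gene cluster is isolated.
print(pls.x_weights_.shape)                    # (500, 2)
```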

A direct approach that combines gene selection, clustering, and supervision in one single step is reported in [5]. A similar single-step approach is also pursued by Jornsten and Yu [23]. The supervised attribute clustering algorithm proposed in [5] combines gene selection for cluster membership with the formation of a new predictor by possible sign flipping and averaging of the gene expressions within a cluster. The cluster membership is determined with a forward and backward searching technique that optimizes the Wilcoxon test-based predictive score and the margin criterion defined in [5], both of which involve the supervised response variables from the data. However, as both the predictive score and the margin criterion depend on the actual gene expression values, they are highly sensitive to noise and outliers in the data set.


9.2.2 Criteria for Gene Selection and Clustering


As reported in Chap. 5, the $$t$$-test, $$F$$-test [8, 26], information gain, mutual information [8, 36], normalized mutual information [28], and $$f$$-information [29] are typically used to measure the relevance of a gene with respect to the class labels or sample categories. The same or a different metric, such as mutual information, the $$L_1$$ distance, Euclidean distance, or Pearson's correlation coefficient [8, 21, 36], is employed to calculate the similarity or redundancy between genes.

To measure the relevance of a gene, the $$t$$-test is widely used, assuming that there are two classes of samples in a gene expression data set. When there are multiple classes of samples, the $$t$$-test is typically computed for one class versus all the other classes. Alternatively, for multiple classes of samples, an $$F$$-test between a gene and the class label can be used to calculate the relevance score of that gene. The $$F$$-test reduces to the $$t$$-test for the two-class problem, with the relation $$F=t^2$$. In [5], Wilcoxon's test statistic is used to compute the relevance of a gene, assuming two classes of samples in the microarray data set.
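As a quick numerical check of the relation $$F=t^2$$, the following sketch compares SciPy's pooled-variance two-sample $$t$$-test with a one-way ANOVA $$F$$-test on a toy two-class gene; the data are synthetic.

```python
# Verify F = t^2 for the two-class case on a synthetic gene.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
class0 = rng.normal(0.0, 1.0, size=15)  # expression values in class 0
class1 = rng.normal(1.0, 1.0, size=20)  # expression values in class 1

t, _ = stats.ttest_ind(class0, class1)  # pooled-variance t-test
F, _ = stats.f_oneway(class0, class1)   # one-way ANOVA F-test

print(np.isclose(t ** 2, F))            # True: F = t^2 for two classes
```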

On the other hand, the Euclidean distance measures the difference in the individual magnitudes of each gene. However, genes regarded as similar by the Euclidean distance may be very dissimilar in terms of their shapes; conversely, the Euclidean distance between two genes having an identical shape may be large if they differ from each other by a large scaling factor. Since the overall shapes of genes are of primary interest for gene expression data [21], the Euclidean distance may not yield a good proximity measurement of genes [21]. Pearson's correlation coefficient considers each gene as a random variable and measures the similarity between two genes by calculating the linear relationship between the distributions of the two corresponding random variables. An empirical study has shown that Pearson's correlation coefficient is not robust to outliers and may assign a high similarity score to a pair of dissimilar genes [18].
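A toy illustration of the shape argument follows: two genes with identical profiles up to a scaling factor are far apart under the Euclidean distance yet perfectly similar under Pearson's correlation (the values are invented for illustration only).

```python
# Two genes with the same shape but different magnitudes.
import numpy as np

g1 = np.array([1.0, 2.0, 3.0, 2.0, 1.0])  # baseline expression profile
g2 = 10.0 * g1                            # identical shape, scaled by 10

euclidean = np.linalg.norm(g1 - g2)       # large despite identical shapes
pearson = np.corrcoef(g1, g2)[0, 1]       # exactly 1.0: shapes coincide

print(euclidean, pearson)
```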

However, as the $$t$$-test, $$F$$-test, Wilcoxon's test, Euclidean distance, and Pearson's correlation depend on the actual gene expression values of microarray data, they are very sensitive to noise and outliers in the data set. On the other hand, as information-theoretic measures such as entropy, mutual information, and $$f$$-information depend only on the probability distribution of a random variable rather than on its actual values, they are more effective for evaluating both gene-class relevance and gene-gene redundancy [8, 29, 36].

In principle, mutual information quantifies the information shared by two objects. If two objects are independent and thus share little information, the mutual information between them is small, whereas two highly correlated objects demonstrate a high mutual information value [39]. The objects can be the class label and the genes. Whether a gene is independent and informative can, therefore, be determined from the information shared between the gene and the rest of the genes, as well as between the gene and the class label [8, 36]. If a gene has expression values randomly or uniformly distributed across the different classes, its mutual information with these classes is zero. If a gene is strongly differentially expressed for different classes, it should have large mutual information. Thus, mutual information can be used as a measure of the relevance of genes. Similarly, mutual information may be used to measure the level of similarity or redundancy between two genes.
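The following sketch illustrates mutual information as a gene-class relevance measure on synthetic data; the equal-width binning used to discretize expression values is our assumption, since the chapter does not prescribe a discretization scheme at this point.

```python
# Mutual information of a differential vs. an uninformative gene.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(2)
labels = np.array([0] * 20 + [1] * 20)
differential = np.where(labels == 0,
                        rng.normal(0, 1, 40), rng.normal(3, 1, 40))
uninformative = rng.normal(0, 1, 40)     # unrelated to the class label

def relevance(gene, labels, bins=5):
    """I(gene; labels) after equal-width discretization of the gene."""
    binned = np.digitize(gene, np.histogram_bin_edges(gene, bins=bins))
    return mutual_info_score(labels, binned)

print(relevance(differential, labels))   # large: strongly differential
print(relevance(uninformative, labels))  # near zero: random w.r.t. class
```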


9.3 Supervised Gene Clustering Algorithm


In this section, a mutual information-based supervised attribute clustering (MISAC) algorithm [31] is presented for grouping co-regulated genes with strong association to the class labels. It is based on a supervised similarity measure, which is introduced first.


9.3.1 Supervised Similarity Measure


In real-data analysis, one of the important issues is computing both relevance and redundancy of attributes by discovering dependencies among them. Intuitively, a set of attributes $${\mathbb {Q}}$$ depends totally on a set of attributes $${\mathbb {P}}$$, if all attribute values from $${\mathbb {Q}}$$ are uniquely determined by values of attributes from $${\mathbb {P}}$$. If there exists a functional dependency between values of $${\mathbb {Q}}$$ and $${\mathbb {P}}$$, then $${\mathbb {Q}}$$ depends totally on $${\mathbb {P}}$$.

Let $${\mathbb {U}}=\{x_1,\ldots ,x_i,\ldots ,x_n\}$$ be the set of $$n$$ samples and let $${\mathbb {A}}=\{{\fancyscript{A}}_1,\ldots , {\fancyscript{A}}_i,\ldots ,{\fancyscript{A}}_{m}\}$$ denote the set of $$m$$ attributes of a given data set $${\fancyscript{T}}=\{{w}_{ij}|i=1,\ldots ,m,j=1,\ldots ,n\}$$, where $$w_{ij} \in \mathfrak {R}$$ is the measured value of the attribute $${\fancyscript{A}}_i$$ in the sample $$x_j$$. Let $${\mathbb {D}}= \{{ {D}}_1,\ldots ,{{D}}_i,\ldots ,{{D}}_n\}$$ be the set of class labels or sample categories of the $$n$$ samples. Define $${\mathrm {R}}_{{\fancyscript{A}}_i}({\mathbb {D}})$$ as the relevance of the attribute $${\fancyscript{A}}_i$$ with respect to the class label $${\mathbb {D}}$$, and $${\mathrm {S}}({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ as the redundancy or similarity between two attributes $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$. The mutual information can be used to calculate both the relevance and the redundancy among attributes.

The relevance $${\mathrm {R}}_{{\fancyscript{A}}_i}({\mathbb {D}})$$ of the attribute $${\fancyscript{A}}_i$$ with respect to the class label $${\mathbb {D}}$$ using mutual information can be calculated as follows:


$$\begin{aligned} {\mathrm {R}}_{{\fancyscript{A}}_i}({\mathbb {D}})=I({\fancyscript{A}}_i,\,{\mathbb {D}}) \end{aligned}$$

(9.1)
where $$I({\fancyscript{A}}_i,\,{\mathbb {D}})$$ represents the mutual information between attribute $${\fancyscript{A}}_i$$ and class label $${\mathbb {D}}$$, which is given by


$$\begin{aligned} I({\fancyscript{A}}_i,{\mathbb {D}})=H({\fancyscript{A}}_i)-H({\fancyscript{A}}_i|{\mathbb {D}}). \end{aligned}$$

(9.2)
Here, $$H({\fancyscript{A}}_i)$$ and $$H({\fancyscript{A}}_i|{\mathbb {D}})$$ represent the entropy of attribute $${\fancyscript{ A}}_i$$ and the conditional entropy of $${\fancyscript{ A}}_i$$ given class label $${\mathbb {D}}$$, respectively. The entropy $$H({\fancyscript{ A}}_i)$$ is known to be a measure of the amount of uncertainty about the attribute $${\fancyscript{ A}}_i$$, while $$H({\fancyscript{ A}}_i|{\mathbb {D}})$$ is the amount of uncertainty left in $${\fancyscript{A}}_i$$ when knowing $${\mathbb {D}}$$. Hence, the quantity $$I({\fancyscript{A}}_i,{\mathbb {D}})$$ is the reduction in the uncertainty of the attribute $${\fancyscript{ A}}_i$$ by the knowledge of class label $${\mathbb {D}}$$. In other words, it represents the amount of information that the class label $${\mathbb {D}}$$ contains about the attribute $${\fancyscript{A}}_i$$.


Definition 9.1

For continuous random variables such as gene expression values, the entropy, conditional entropy, and mutual information can be defined as follows:


$$\begin{aligned} H({\fancyscript{Y}})=- \int p(y)\log p(y)\mathrm{d}y; \end{aligned}$$

(9.3)



$$\begin{aligned} H({\fancyscript{Y}}|{\fancyscript{Z}})=- \int p(y,z)\log p(y|z)\mathrm{d}y\mathrm{d}z; \end{aligned}$$

(9.4)



$$\begin{aligned} I({\fancyscript{Y}},{\fancyscript{Z}})=\int \int p(y,z)\log \frac{p(y,z)}{p(y)p(z)}\mathrm{d}y\mathrm{d}z \end{aligned}$$

(9.5)
where $$p(y)$$ is the true probability density function of the attribute or variable $${\fancyscript{Y}}$$, while $$p(y|z)$$ and $$p(y,z)$$ represent the conditional probability density function of $${\fancyscript{Y}}$$ given the variable $${\fancyscript{Z}}$$ and the joint probability density function of $${\fancyscript{Y}}$$ and $${\fancyscript{Z}}$$, respectively. Usually, the Gaussian function is used to approximate the true density function [12].
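Under the Gaussian approximation mentioned above, the differential entropy has the closed form $$H=\frac{1}{2}\log (2\pi e\sigma ^2)$$, and for a discrete class label $${\mathbb {D}}$$ the relevance becomes $$I({\fancyscript{Y}},{\mathbb {D}})=H({\fancyscript{Y}})-\sum _d p(d)H({\fancyscript{Y}}|{\mathbb {D}}=d)$$. A minimal sketch, with function names of our own choosing:

```python
# Gaussian-approximated differential entropy and mutual information.
import numpy as np

def gaussian_entropy(values):
    """(1/2) log(2*pi*e*var): entropy of a Gaussian fit to the values."""
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(values))

def gaussian_mi(gene, labels):
    """I(gene; labels) = H(gene) - sum_d p(d) H(gene | D = d)."""
    h_cond = 0.0
    for d in np.unique(labels):
        mask = labels == d
        h_cond += mask.mean() * gaussian_entropy(gene[mask])
    return gaussian_entropy(gene) - h_cond

rng = np.random.default_rng(3)
labels = np.array([0] * 25 + [1] * 25)
gene = np.where(labels == 0, rng.normal(0, 1, 50), rng.normal(2, 1, 50))
print(gaussian_mi(gene, labels))  # clearly positive: gene tracks the class
```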

The redundancy or similarity between two attributes $${\fancyscript{ A}}_i$$ and $${\fancyscript{ A}}_j$$, in terms of mutual information, can also be calculated as follows:


$$\begin{aligned} {\mathrm {S}}({\fancyscript{A}}_i,{\fancyscript{ A}}_j)= I({\fancyscript{ A}}_i,{\fancyscript{A}}_j). \end{aligned}$$

(9.6)
However, the term $${\mathrm {S}}({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ does not incorporate the information of sample categories or class labels $${\mathbb {D}}$$ while measuring the similarity, and it is therefore considered an unsupervised similarity measure. Hence, a new quantitative measure, called the supervised similarity measure, is reported here, based on mutual information, for measuring the similarity between two random variables. It incorporates the information of sample categories or class labels while measuring the similarity between attributes.


Definition 9.2

The significance of an attribute $${\fancyscript{ A}}_j$$ with respect to another attribute $${\fancyscript{A}}_i$$ can be defined as follows:


$$\begin{aligned} \sigma _{\{{\fancyscript{A}}_i,{\fancyscript{ A}}_j\}}({\mathbb {D}},{\fancyscript{A}}_j)={\mathrm {R}}_{\{{\fancyscript{ A}}_i,{\fancyscript{A}}_j\}}({\mathbb {D}})-{\mathrm {R}}_{{\fancyscript{ A}}_i}({\mathbb {D}}). \end{aligned}$$

(9.7)

That is, the significance of an attribute $${\fancyscript{A}}_j$$ is the change in dependency when the attribute $${\fancyscript{A}}_j$$ is removed from the set $$\{{\fancyscript{A}}_i,{\fancyscript{A}}_j\}$$. The higher the change in dependency, the more significant the attribute $${\fancyscript{A}}_j$$ is. If the significance is 0, then the attribute $${\fancyscript{A}}_j$$ is dispensable.

Based on the concept of significance of an attribute, the supervised similarity measure between two attributes is defined next.


Definition 9.3

The supervised similarity between two attributes $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$ is defined as follows [31]:


$$\begin{aligned} \varPsi ({\fancyscript{ A}}_i,{\fancyscript{ A}}_j)=\frac{1}{1+{\kappa }^2}, \end{aligned}$$

(9.8)



$$\begin{aligned} \mathrm{where}~~\kappa =\left\{ \frac{\sigma _{\{{\fancyscript{ A}}_i,{\fancyscript{ A}}_j\}}({\mathbb {D}},{\fancyscript{ A}}_j)+\sigma _{\{{\fancyscript{ A}}_i,{\fancyscript{ A}}_j\}}({\mathbb {D}},{\fancyscript{ A}}_i)}{2} \right\} \end{aligned}$$

(9.9)



$$\begin{aligned} \mathrm{that~is,}~~\kappa ={\mathrm {R}}_{\{{\fancyscript{ A}}_i,{\fancyscript{ A}}_j\}}({\mathbb {D}}) -\left\{ \frac{{\mathrm {R}}_{{\fancyscript{ A}}_i}({\mathbb {D}})+{\mathrm {R}}_{{\fancyscript{ A}}_j}({\mathbb {D}})}{2}\right\} \!. \end{aligned}$$

(9.10)

Hence, the supervised similarity measure $$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ directly takes into account the information of sample categories or class labels $${\mathbb {D}}$$ while computing the similarity between two attributes $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$. If attributes $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$ are completely correlated with respect to class labels $${\mathbb {D}}$$, then $$\kappa =0$$ and so $$\varPsi ({\fancyscript{ A}}_i,{\fancyscript{ A}}_j)$$ is 1. If $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$ are totally uncorrelated, $$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j) \rightarrow 0$$. Hence, $$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ can be used as a measure of supervised similarity between two attributes $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$. The following properties can be stated about the measure:

1.

$$0 < \varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j) \le 1$$.

 

2.

$$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)=1$$ if and only if $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$ are completely correlated.

 

3.

$$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j) \rightarrow 0$$ if and only if $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$ are totally uncorrelated.

 

4.

$$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)=\varPsi ({\fancyscript{A}}_j,{\fancyscript{A}}_i)$$ (symmetric).

 
The supervised similarity between two attributes $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$, in terms of entropy, is given by


$$\begin{aligned} \varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)&= \big [1+\big [H({\fancyscript{A}}_i{\fancyscript{A}}_j|{\mathbb {D}})-\frac{1}{2} \left\{ H({\fancyscript{A}}_i|{\fancyscript{A}}_j)\,\right. \nonumber \\&\quad +\left. H({\fancyscript{A}}_j|{\fancyscript{A}}_i)+ H({\fancyscript{ A}}_i|{\mathbb {D}})+H({\fancyscript{A}}_j|{\mathbb {D}}) \right\} \big ]^2\big ]^{-1}. \end{aligned}$$

(9.11)
Combining (9.6) and (9.11), the term $$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ can be expressed as follows:


$$\begin{aligned} \varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)&=\big [1+\big [{\mathrm {S}} ({\fancyscript{A}}_i,{\fancyscript{A}}_j)\,+\, H({\fancyscript{ A}}_i{\fancyscript{A}}_j|{\mathbb {D}})\nonumber \\&\quad -\frac{1}{2}\left\{ H({\fancyscript{A}}_i)+H({\fancyscript{A}}_j)\,+\, H({\fancyscript{A}}_i|{\mathbb {D}})+H({\fancyscript{A}}_j|{\mathbb {D}}) \right\} \big ]^2\big ]^{-1}. \end{aligned}$$

(9.12)
Hence, the supervised similarity measure $$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ not only considers the information of sample categories or class labels $${\mathbb {D}}$$, but also takes into account the unsupervised similarity $${\mathrm {S}}({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ between the two attributes.
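Putting (9.1) and (9.8)–(9.10) together, the following sketch computes $$\varPsi $$ from discretized mutual information estimates; the equal-width binning and the integer pair-encoding of $$({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ are our assumptions rather than the chapter's prescription.

```python
# Supervised similarity Psi = 1 / (1 + kappa^2), with
# kappa = R_{ij}(D) - (R_i(D) + R_j(D)) / 2, estimated from binned MI.
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(gene, bins=5):
    return np.digitize(gene, np.histogram_bin_edges(gene, bins=bins))

def relevance(gene, labels, bins=5):
    return mutual_info_score(labels, discretize(gene, bins))

def joint_relevance(g1, g2, labels, bins=5):
    # Encode the pair (g1, g2) as one discrete variable for I((g1,g2); D).
    pair = discretize(g1, bins) * (bins + 2) + discretize(g2, bins)
    return mutual_info_score(labels, pair)

def supervised_similarity(g1, g2, labels):
    kappa = joint_relevance(g1, g2, labels) - 0.5 * (
        relevance(g1, labels) + relevance(g2, labels))
    return 1.0 / (1.0 + kappa ** 2)

rng = np.random.default_rng(4)
labels = np.array([0] * 30 + [1] * 30)
g1 = np.where(labels == 0, rng.normal(0, 1, 60), rng.normal(2, 1, 60))
g2 = g1 + rng.normal(0, 0.1, 60)              # nearly a copy of g1
print(supervised_similarity(g1, g2, labels))  # typically close to 1
```

For a gene and a near copy of it, the joint relevance barely exceeds either individual relevance, so $$\kappa \approx 0$$ and $$\varPsi $$ approaches 1, in line with the completely correlated case discussed above.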


9.3.2 Gene Clustering Algorithm


The mutual information-based supervised attribute clustering algorithm [31], termed MISAC, relies mainly on two factors: determining the relevance of each attribute, and growing a cluster around each relevant attribute incrementally by adding one attribute after another. One of the important properties of this clustering approach is that a cluster is augmented only by attributes that satisfy the following two conditions:

1.

fit best into the current cluster in terms of the supervised similarity measure defined above; and

 

2.

improve the differential expression of the current cluster most, according to the relevance of the cluster representative or prototype.

 
The growth of a cluster is repeated until the cluster stabilizes, and then the MISAC algorithm starts to generate a new cluster.

Let $${\mathrm {R}}_{{\fancyscript{A}}_i}({\mathbb {D}})$$ be the relevance of attribute $${\fancyscript{A}}_i \in {\mathbb {A}}$$ with respect to class label $${\mathbb {D}}$$. The relevance uses information about the class labels and is thus a criterion for supervised clustering. The MISAC algorithm starts with a single attribute $${\fancyscript{A}}_i$$ that has the highest relevance value with respect to class labels. An initial cluster $${\mathbb {V}}_i$$ is formed by selecting the set of attributes $$\{{\fancyscript{A}}_j\}$$ from the whole set $${\mathbb {A}}$$ considering the attribute $${\fancyscript{A}}_i$$ as the representative of cluster $${\mathbb {V}}_i$$, where


$$\begin{aligned} {\mathbb {V}}_i=\{{\fancyscript{A}}_j|\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)\ge \delta ; {\fancyscript{A}}_j \ne {\fancyscript{A}}_i \in {\mathbb {A}}\}. \end{aligned}$$

(9.13)
Hence, the cluster $${\mathbb {V}}_i$$ represents the set of attributes of $${\mathbb {A}}$$ whose supervised similarity with the attribute $${\fancyscript{A}}_i$$ is greater than a predefined threshold value $$\delta $$. The cluster $${\mathbb {V}}_i$$ is the coarse cluster corresponding to the attribute $${\fancyscript{A}}_i$$, while the threshold $$\delta $$ is termed the radius of cluster $${\mathbb {V}}_i$$ (Fig. 9.1).



Fig. 9.1
Representation of a supervised attribute cluster

After forming the initial coarse cluster $${\mathbb {V}}_i$$, the cluster representative is refined incrementally. By searching among the attributes of cluster $${\mathbb {V}}_i$$, the current cluster representative is merged and averaged with one single attribute at a time, such that the augmented cluster representative $${\bar{\fancyscript{A}}}_i$$ has an increased relevance value. The merging process is repeated until the relevance value can no longer be improved. Instead of averaging all attributes of $${\mathbb {V}}_i$$, the augmented attribute $${\bar{\fancyscript{A}}}_i$$ is computed by considering a subset of attributes $${\bar{\mathbb {V}}}_i \subset {\mathbb {V}}_i$$ that increase the relevance value of the cluster representative $${\bar{\fancyscript{A}}}_i$$. The set of attributes $${\bar{\mathbb {V}}}_i$$ represents the finer cluster of the attribute $${\fancyscript{A}}_i$$ (Fig. 9.1). While the generation of the coarse cluster reduces the redundancy among attributes of the set $${\mathbb {A}}$$, that of the finer cluster increases the relevance with respect to the class labels. After generating the augmented cluster representative $${\bar{\fancyscript{A}}}_i$$ from the finer cluster $${\bar{\mathbb {V}}}_i$$, the process is repeated to find more clusters and augmented cluster representatives by discarding the set of attributes $${\mathbb {V}}_i$$ from the whole set $${\mathbb {A}}$$.

To compute the set $${\mathbb {V}}_i$$ corresponding to the attribute $${\fancyscript{A}}_i$$, one may consider the conventional unsupervised similarity measure $${\mathrm {S}}({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ as defined in (9.6). However, as it does not take into account the information of sample categories or class labels, the attributes are clustered by similarity only, without using any information about the sample categories. In effect, it fails to reveal functional groups of attributes that are of special interest in classification. On the other hand, as the supervised similarity measure $$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)$$ defined in (9.8) incorporates the class information directly while computing the similarity between two attributes $${\fancyscript{A}}_i$$ and $${\fancyscript{A}}_j$$, it can identify functional groups present in the attribute set.

The main steps of the mutual information-based supervised attribute clustering (MISAC) algorithm are reported next; a code sketch of the main loop follows the list.



  • Let $${\mathbb {C}}$$ be the set of attributes of the original data set, while $${\mathbb S}$$ and $${\bar{\mathbb S}}$$ are the sets of actual and augmented attributes, respectively, selected by the MISAC algorithm.


  • Let $${\mathbb {V}}_i$$ be the coarse cluster associated with the attribute $${\fancyscript{ A}}_i$$, and let $${\bar{\mathbb {V}}}_i$$, the finer cluster of $${\fancyscript{ A}}_i$$ (Fig. 9.1), represent the set of attributes of $${\mathbb {V}}_i$$ that are merged and averaged with the attribute $${\fancyscript{ A}}_i$$ to generate the augmented cluster representative $${\bar{\fancyscript{ A}}}_i$$.


1.

Initialize $${\mathbb {C}} \leftarrow {\mathbb A}=\{{\fancyscript{A}}_1,\ldots , {\fancyscript{ A}}_i,\ldots ,{\fancyscript{ A}}_j,\ldots ,{\fancyscript{A}}_m\}$$, $${\mathbb S} \leftarrow \emptyset $$, and $${\bar{\mathbb S}} \leftarrow \emptyset $$.

 

2.

Calculate the relevance value $${\mathrm {R}}_{{\fancyscript{ A}}_i}({\mathbb {D}})$$ of each attribute $${\fancyscript{ A}}_i \in {\mathbb {C}}$$.

 

3.

Repeat the following nine steps (steps 4–12) until $${\mathbb {C}}=\emptyset $$ or the desired number of attributes is selected.

 

4.

Select the attribute $${\fancyscript{ A}}_i$$ from $${\mathbb {C}}$$ that has the highest relevance value as the representative of cluster $${\mathbb {V}}_i$$. In effect, $${\fancyscript{ A}}_i \in {\mathbb S}$$, $${\fancyscript{ A}}_i \in {\mathbb {V}}_i$$, $${\fancyscript{ A}}_i \in {\bar{\mathbb {V}}}_i$$, and $${\mathbb {C}}={\mathbb {C}} \setminus {\fancyscript{ A}}_i$$.

 

5.

Generate the coarse cluster $${\mathbb {V}}_i$$ from the remaining attributes of $${\mathbb {C}}$$ satisfying the following condition:


$$\begin{aligned} {\mathbb {V}}_i=\{{\fancyscript{A}}_j|\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)\ge \delta ; {\fancyscript{A}}_j \ne {\fancyscript{A}}_i \in {\mathbb {C}}\}. \end{aligned}$$

 

6.

Initialize $${\bar{\fancyscript{ A}}}_i \leftarrow {\fancyscript{ A}}_i$$.

 

7.

Repeat the following four steps (steps 8–11) for each attribute $${\fancyscript{A}}_j \in {\mathbb {V}}_i$$.

 

8.

Compute two augmented cluster representatives by averaging $${\fancyscript{A}}_j$$ and its complement with the attributes of $${\bar{\mathbb {V}}}_i$$ as follows:


$$\begin{aligned} {\bar{\fancyscript{A}}}_{i+j}^{+}=\frac{1}{|{\bar{\mathbb {V}}}_i|+1} \left\{ \sum _{\fancyscript{A}_k \in {\bar{\mathbb {V}}}_i} {\fancyscript{A}}_k+{\fancyscript{A}}_j \right\} ; \end{aligned}$$

(9.14)



$$\begin{aligned} {\bar{\fancyscript{A}}}_{i+j}^{-}=\frac{1}{|{\bar{\mathbb {V}}}_i|+1} \left\{ \sum _{\fancyscript{A}_k \in {\bar{\mathbb {V}}}_i} {\fancyscript{A}}_k-{\fancyscript{A}}_j \right\} . \end{aligned}$$

(9.15)

 

9.

The augmented cluster representative $${\bar{\fancyscript{A}}}_{i+j}$$ after averaging $${\fancyscript{A}}_j$$ or its complement with $${\bar{\mathbb {V}}}_i$$ is as follows:


$$\begin{aligned} {\bar{\fancyscript{A}}}_{i+j} = \left\{ \begin{array}{ll} {\bar{\fancyscript{A}}}_{i+j}^{+} &{}\quad \text{ if } {\mathrm R}_{{\bar{\fancyscript{A}}}_{i+j}^{+}}({\mathbb {D}}) \ge {\mathrm R}_{{\bar{\fancyscript{A}}}_{i+j}^{-}}({\mathbb {D}})\\ {\bar{\fancyscript{A}}}_{i+j}^{-} &{}\quad \text{ otherwise. }\\ \end{array} \right. \end{aligned}$$

(9.16)

 

10.

The augmented cluster representative $${\bar{\fancyscript{A}}}_i$$ of cluster $${\mathbb {V}}_i$$ is $${\bar{\fancyscript{A}}}_{i+j}$$ if $${\mathrm R}_{{\bar{\fancyscript{A}}}_{i+j}}({\mathbb {D}}) \ge {\mathrm R}_{{\bar{\fancyscript{A}}}_i}({\mathbb {D}})$$, otherwise $${\bar{\fancyscript{A}}}_i$$ remains unchanged.

 

11.

Select attribute $${\fancyscript{ A}}_j$$ or its complement as a member of the finer cluster $${\bar{\mathbb {V}}}_i$$ of attribute $${\fancyscript{A}}_i$$ if $${\mathrm R}_{{\bar{\fancyscript{A}}}_{i+j}}({\mathbb {D}}) \ge {\mathrm R}_{{\bar{\fancyscript{A}}}_i}({\mathbb {D}})$$.

 

12.

In effect, $${\bar{\fancyscript{ A}}}_i \in {\bar{\mathbb S}}$$ and $${\mathbb {C}}={\mathbb {C}} \setminus {\mathbb {V}}_i$$.

 

13.

Sort the set of augmented cluster representatives $${\bar{\mathbb S}}=\{{\bar{\fancyscript{ A}}}_i\}$$ according to their relevance value $${\mathrm R}_{{\bar{\fancyscript{ A}}}_i}({\mathbb {D}})$$ with respect to the class labels $${\mathbb {D}}$$.

 

14.

Stop.
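The following compact sketch is our reading of steps 1–14, not the authors' reference implementation; it assumes the `relevance` and `supervised_similarity` helpers from the earlier sketches are in scope, and `delta` and `n_clusters` are illustrative parameters.

```python
# A sketch of the MISAC main loop over a samples-x-genes matrix X.
# Assumes relevance() and supervised_similarity() from the earlier
# sketches are already defined in scope.
import numpy as np

def misac(X, labels, delta=0.8, n_clusters=5):
    """Return augmented cluster representatives, sorted by relevance."""
    remaining = list(range(X.shape[1]))
    representatives = []
    while remaining and len(representatives) < n_clusters:
        # Steps 2 and 4: the most relevant remaining gene seeds a cluster.
        rel = {i: relevance(X[:, i], labels) for i in remaining}
        seed = max(rel, key=rel.get)
        remaining.remove(seed)
        # Step 5: coarse cluster = genes with supervised similarity >= delta.
        coarse = [j for j in remaining if
                  supervised_similarity(X[:, seed], X[:, j], labels) >= delta]
        # Steps 6-11: average in a gene or its sign-flipped version,
        # eqs. (9.14)-(9.16), whenever it improves the representative.
        members, rep, best = [X[:, seed]], X[:, seed], rel[seed]
        for j in coarse:
            total = np.sum(members, axis=0)
            plus = (total + X[:, j]) / (len(members) + 1)
            minus = (total - X[:, j]) / (len(members) + 1)
            cand, signed = max([(plus, X[:, j]), (minus, -X[:, j])],
                               key=lambda t: relevance(t[0], labels))
            if relevance(cand, labels) >= best:   # steps 10-11
                members.append(signed)
                rep, best = cand, relevance(cand, labels)
        # Step 12: discard the whole coarse cluster from further search.
        remaining = [j for j in remaining if j not in coarse]
        representatives.append(rep)
    # Step 13: sort representatives by decreasing relevance.
    representatives.sort(key=lambda r: relevance(r, labels), reverse=True)
    return representatives
```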

 


9.3.3 Fundamental Property


From the above discussions, the following properties corresponding to each cluster $${\mathbb {V}}_i$$ can be derived:

1.

$$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)\ge \delta ; \forall {\fancyscript{A}}_j \in {\mathbb {V}_i}$$.

 

2.

$${\mathrm {R}}_{{\fancyscript{ A}}_i}({\mathbb {D}}) \ge {\mathrm {R}}_{{\fancyscript{ A}}_j}({\mathbb {D}}); \forall {\fancyscript{A}}_j \in {\mathbb {V}_i}$$.

 

3.

$${\mathrm R}_{{\bar{\fancyscript{ A}}}_{i+j}}({\mathbb {D}}) \ge {\mathrm R}_{{\bar{\fancyscript{ A}}}_i}({\mathbb {D}}); \forall {\fancyscript{A}}_j \in {\bar{\mathbb {V}}_i}$$.

 

4.

$${\mathrm R}_{{\bar{\fancyscript{ A}}}_{i+j}}({\mathbb {D}}) < {\mathrm R}_{{\bar{\fancyscript{ A}}}_i}({\mathbb {D}}); \forall {\fancyscript{A}}_j \in {\mathbb {V}}_i \setminus {\bar{\mathbb {V}}_i}$$.

 

5.

$${\mathbb {V}_i} \cap {\mathbb {V}_k} = \emptyset , \forall i \ne k$$.

 
Property 1 says that if an attribute $${\fancyscript{A}}_j \in {\mathbb {V}_i}$$, then $$\varPsi ({\fancyscript{A}}_i,{\fancyscript{A}}_j)\ge \delta $$. That is, the supervised similarity between any attribute $${\fancyscript{ A}}_j$$ of the coarse cluster $${\mathbb {V}}_i$$ and the initial cluster representative $${\fancyscript{ A}}_i$$ is at least the predefined threshold $$\delta $$. Property 2 establishes that if $${\fancyscript{A}}_j \in {\mathbb {V}_i}$$, then $${\mathrm {R}}_{{\fancyscript{ A}}_i}({\mathbb {D}}) \ge {\mathrm {R}}_{{\fancyscript{ A}}_j}({\mathbb {D}})$$; that is, the relevance of the cluster representative $${\fancyscript{ A}}_i$$ is the maximum among all attributes of the cluster $${\mathbb {V}_i}$$. Properties 3 and 4 are of great importance in increasing the relevance of the augmented cluster representative with respect to the class labels and in reducing the redundancy among the attribute set. Property 3 says that $${\fancyscript{A}}_j \in {\bar{\mathbb {V}}_i}$$ implies $${\mathrm R}_{{\bar{\fancyscript{ A}}}_{i+j}}({\mathbb {D}}) \ge {\mathrm R}_{{\bar{\fancyscript{ A}}}_i}({\mathbb {D}})$$; that is, an attribute $${\fancyscript{A}}_j$$ belongs to the finer cluster $${\bar{\mathbb {V}}_i}$$ if and only if it increases the relevance value of the augmented cluster representative $${\bar{\fancyscript{A}}}_i$$. Property 4, on the other hand, says that attributes that belong only to the coarse cluster $${\mathbb {V}_i}$$, and not to the finer cluster $${\bar{\mathbb {V}}_i}$$, do not increase the relevance of the augmented cluster representative. Hence, the set of attributes $${\bar{\mathbb {V}}_i}$$ both increases the relevance value of the attribute $${\fancyscript{ A}}_i$$ and reduces the redundancy of the whole set, while the set of attributes $${\mathbb {V}}_i \setminus {\bar{\mathbb {V}}_i}$$ is responsible only for reducing the redundancy. Finally, property 5 says that if an attribute $${\fancyscript{ A}}_i \in {\mathbb {V}}_i$$, then $${\fancyscript{A}}_i \notin {\mathbb {V}}_k$$ for all $$k \ne i$$; that is, the attribute $${\fancyscript{A}}_i$$ is contained in $${\mathbb {V}}_i$$ only. Hence, the MISAC algorithm generates nonoverlapping attribute clusters.


9.3.4 Computational Complexity


The computation of the relevance of $$m$$ attributes is carried out in step 2 of the MISAC algorithm, which has $${\fancyscript{O}}(m)$$ time complexity. The cluster generation steps, that is, steps 4–12, are executed $$c$$ times to generate $$c$$ clusters and the corresponding augmented cluster representatives. There are three loops in the cluster generation steps, which are executed $$m$$, $$m$$, and $$m_i$$ times, respectively, where $$m_i < m$$ is the cardinality of the cluster $${\mathbb {V}}_i$$. Each iteration of these loops takes only a constant amount of time. Hence, the complexity of generating $$c$$ clusters using steps 4–12 is $${\fancyscript{O}}(c(m+m_i))$$, which reduces to $${\fancyscript{O}}(cm)$$ since $$m_i < m$$. Finally, step 13 sorts the $$c$$ augmented cluster representatives according to their relevance values, which has a computational complexity of $${\fancyscript{O}}(c^2)$$. Hence, the overall time complexity of the MISAC algorithm is $${\fancyscript{O}}(m+cm+c^2)$$, that is, $${\fancyscript{O}}(cm+c^2)$$. However, as the number of desired clusters $$c$$ is constant and sufficiently small compared to the total number of attributes $$m$$, the MISAC algorithm has an overall $${\fancyscript{O}}(m)$$ time complexity.
