Indian Statistical Institute, Kolkata, West Bengal, India
Abstract
The advent of DNA microarray technologies has revolutionized the experimental study of gene expression. Microarrays have been used to study different kinds of biological processes.
10.1 Introduction
The advent of DNA microarray technologies has revolutionized the experimental study of gene expression. Microarrays have been used to study different kinds of biological processes. This technology enables monitoring the transcription levels of many thousands of genes while the cell undergoes specific conditions or processes [20]. The applications of such technology range from gene functional annotation and genetic network reconstruction to diagnosis of disease conditions and characterization of the effects of medical treatment.
Clustering is one of the most popular approaches for analyzing gene expression data [15, 25] and has proved successful in many applications such as gene pathway discovery, gene classification, and function prediction [15, 20, 25]. Traditional clustering methods such as hierarchical clustering [23], the $k$-means algorithm [22], and the self-organizing map [40] assume that genes in a cluster behave similarly over all the conditions. These methods produce reliable results for microarray experiments performed on homogeneous conditions. However, when the conditions of an experiment vary greatly, this assumption is no longer appropriate. In this regard, it is desirable to develop approaches that can detect the relevant conditions under which the behavioral similarity between genes of a potential group exists. This leads to a promising paradigm of clustering, called biclustering.
The term biclustering [21] refers to a distinct class of clustering algorithms that perform simultaneous row–column clustering. The difference between clustering and biclustering is that clustering is applied to either the rows or the columns of the data matrix separately, whereas biclustering performs clustering in these two dimensions simultaneously. This means that clustering derives a global model, while biclustering produces a local model. When clustering algorithms are used, each gene in a given gene cluster is defined using all conditions. Similarly, each condition in a condition cluster is characterized by the activity of all the genes that belong to it. However, each gene in a bicluster is selected using only a subset of the conditions, and each condition in a bicluster is selected using only a subset of the genes. The goal of biclustering techniques is thus to identify subgroups of genes and subgroups of conditions by performing simultaneous clustering of both rows and columns of the gene expression matrix, instead of clustering these two dimensions separately.
Hence, unlike clustering algorithms, biclustering algorithms identify groups of genes that show similar activity patterns under a specific subset of the experimental conditions. Therefore, the biclustering approach is the key technique to use when one or more of the following situations apply: (i) only a small set of the genes participates in a cellular process of interest; (ii) an interesting cellular process is active only in a subset of conditions; and (iii) a single gene may participate in multiple pathways that may or may not be coactive under all conditions. For these reasons, biclustering should identify groups of genes and conditions obeying the following restrictions: (i) a cluster of genes should be defined with respect to only a subset of conditions; (ii) a cluster of conditions should be defined with respect to only a subset of genes; and (iii) the clusters should not be exclusive and/or exhaustive: a gene or condition should be able to belong to more than one cluster or to no cluster at all, and be grouped using a subset of conditions or genes. Additionally, robustness in biclustering algorithms is especially relevant because of two additional characteristics of the systems under study. The first characteristic is the sheer complexity of gene regulation processes, which requires powerful analysis tools. The second characteristic is the level of noise in actual gene expression experiments, which makes the use of intelligent statistical tools indispensable.
The term biclustering was initially introduced by Hartigan [21] as a way to cluster rows and columns of a matrix simultaneously for finding biclusters with minimum row variance. Based on the row variance, Tibshirani et al. [43] proposed a permutation-based method to induce the optimal number of biclusters. In addition to using the row variance defined by Hartigan [21], Cho et al. [12] also used the total squared residue to quantify the homogeneity of a bicluster. Their framework is, therefore, applicable to finding both constant biclusters and value-coherent biclusters. However, the term biclustering was first used in gene expression data analysis by Cheng and Church [11], who proposed an additive biclustering model by introducing the residue of an element in the bicluster and the mean squared residue of a submatrix. In addition, their method rejects trivial biclusters by means of the row variance. Yang et al. [51] generalized this additive biclustering model to incorporate null values and proposed a probabilistic algorithm, called the flexible overlapped biclustering algorithm, that can discover a set of possibly overlapping biclusters simultaneously. Getz et al. [19] presented a coupled two-way clustering algorithm that applies hierarchical clustering separately to each dimension and then combines both results to obtain the final biclusters. Tang et al. [42] presented an interrelated two-way clustering algorithm that combines the results of one-way clustering on both gene and sample dimensions to produce biclusters. Obviously, the quality of the biclusters produced by these methods depends on the clusters generated at each dimension.
To discover row-constant or column-constant biclusters hidden in noisy data, several approaches have been proposed, such as the model proposed by Califano et al. [7], the Bayesian-based approach of Sheng et al. [37], the probabilistic model of Segal et al. [36], the Gibbs sampling-based scheme of Wu et al. [49], the plaid model of Lazzeroni and Owen [28], the order-preserving submatrix model of Ben-Dor et al. [1] and Liu and Wang [30], the model of Murali and Kasif [33], and the bipartite graph-based model of Tanay et al. [41]. To address the biclustering problem, several stochastic search techniques such as simulated annealing [6], evolutionary algorithms [14], and genetic algorithms [9] have also recently been employed. In [14], Divina and Aguilar-Ruiz proposed a biclustering method based on evolutionary algorithms that searches for biclusters following a sequential covering strategy. The main objective of this approach is to find biclusters of maximum size with mean squared residue lower than a given threshold. It also looks for biclusters with a relatively high row variance and with a low level of overlapping among biclusters. Lee et al. [29] and Sill et al. [38] used the sparse singular value decomposition method to address the biclustering problem, while Sutheeworapong et al. [39] proposed a biclustering approach to analyze gene expression data with an iterative optimization technique. Some other notable works are reported in [10, 35]. A survey of different biclustering algorithms for biological data analysis is reported in [31]. Also, a comparative analysis of different biclustering algorithms can be found in [17].
However, most of the algorithms reported earlier find exclusive biclusters, which is inappropriate in the biological context. Since biological processes are not independent of each other, many genes participate in multiple different processes. Each gene, therefore, should be assigned to multiple biclusters whenever biclusters are identified with processes. Hence, one of the main problems with the biclustering technique is uncertainty. Some of the sources of this uncertainty include incompleteness and vagueness in bicluster definitions of microarray data. In this background, two major soft computing techniques, namely, fuzzy set theory [52] and rough set theory [34], have gained popularity in modeling and propagating uncertainty. Recently, using rough sets and fuzzy sets, some biclustering algorithms have been proposed to discover value-coherent overlapping biclusters [8, 18, 44–48]. In these methods, the memberships of an entire row or column in different biclusters are computed directly to find overlapping biclusters, without considering the membership of each element or point of the gene expression data. The membership of an element is obtained either by multiplying or by averaging the memberships of the corresponding row and column of that element. In effect, there is no direct relation between the membership of an element and its residue value. However, the membership of each element must depend directly on its residue value in order to generate highly coherent biclusters.
In this chapter, a novel possibilistic biclustering (PBC) algorithm [13] is presented. The PBC is based on the concept of the possibilistic clustering algorithm of Krishnapuram and Keller [27]. The PBC algorithm employs an iterative process to find biclusters of larger volume with stronger coherence and, particularly, with a high degree of overlapping. During each iteration of the PBC algorithm, the possibilistic memberships of all elements or points with respect to every bicluster are calculated, and finally, the membership of an entire row or column is calculated from the memberships of the elements present in that row or column. Depending on that membership value, the biclusters are enlarged or degenerated. The main difference between the PBC method and other existing overlapping methods is that the PBC method offers an efficient way to calculate the membership of every point rather than the membership of an entire row or column, which is desirable for generating value-coherent biclusters. Also, instead of taking the contribution of all rows and columns, the PBC method considers only those rows and columns whose membership values are greater than a given threshold. Based on this criterion, a novel possibilistic biclustering algorithm is presented in this chapter to discover a set of overlapping biclusters with mean squared residue less than a predefined threshold $\delta$. A mathematical analysis of the convergence property of the PBC algorithm is also reported. Some quantitative measures are presented for evaluating the quality of selected overlapping biclusters. The effectiveness of the PBC algorithm, along with a comparison with the existing algorithms, is demonstrated on yeast microarray data.
The structure of the rest of this chapter is as follows: Sect. 10.2 briefly introduces necessary definitions related to the biclustering method. In Sect. 10.3, a new possibilistic biclustering algorithm, termed the PBC, is presented. Some quantitative performance measures are introduced in Sect. 10.4, along with some existing measures, to evaluate the quality of selected biclusters. In Sect. 10.5, extensive experimental results of the PBC algorithm are discussed and compared to those generated by the algorithms of Cheng and Church [11], Divina and Aguilar-Ruiz [14], and Yang et al. [51]. Concluding remarks are given in Sect. 10.6.
10.2 Biclustering and Possibilistic Clustering
This section presents a brief introduction to the basic notions of possibilistic clustering and biclustering method.
10.2.1 Basics of Biclustering
Let $G$ and $C$ be the set of genes and the set of experimental conditions involved in gene expression data measurement, respectively. The result can be represented by a matrix $A$, with set of rows $G$ and set of columns $C$. Each element $a_{ij}$ of $A$ corresponds to the expression information of gene $i$ in experiment $j$. A bicluster of a gene expression data set is defined to be a subset of genes that exhibit similar behavior under a subset of experimental conditions, and vice versa. Thus, in the gene expression data matrix $A$, a bicluster appears as a submatrix of $A$. This submatrix is denoted by the pair $(I, J)$, where $I \subseteq G$ and $J \subseteq C$. The volume of a bicluster $(I, J)$ is defined as the number of elements $a_{ij}$ such that $i \in I$ and $j \in J$.
Fig. 10.1
Expression matrix: row and column represent gene and condition, respectively
Example 10.1
Suppose the expression matrix $A$ consists of a set of genes and a set of conditions as shown in Fig. 10.1, where the rows of the matrix represent the genes and the columns represent the conditions. Then, a bicluster defined over the matrix $A$ could be $(\{1, 3, 5\}, \{2, 4, 7\})$, thus consisting of genes $\{1, 3, 5\}$ and conditions $\{2, 4, 7\}$. The volume of this bicluster is $3 \times 3 = 9$. In Fig. 10.1, the elements belonging to the bicluster are highlighted.
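To make Example 10.1 concrete, the following sketch extracts the bicluster submatrix and computes its volume with NumPy. The matrix values are illustrative placeholders, not the actual entries of Fig. 10.1.

```python
import numpy as np

# Hypothetical 6x7 expression matrix (genes x conditions); values are illustrative.
A = np.arange(42, dtype=float).reshape(6, 7)

# Bicluster ({1, 3, 5}, {2, 4, 7}) from Example 10.1, converted to 0-based indices.
I = [0, 2, 4]   # genes 1, 3, 5
J = [1, 3, 6]   # conditions 2, 4, 7

# The bicluster appears as the submatrix of A restricted to rows I and columns J.
sub = A[np.ix_(I, J)]

# The volume is the number of elements, |I| * |J|.
volume = sub.size
print(volume)  # 9
```

Note that `np.ix_` is used so that the row and column index lists select a full submatrix rather than individual elements.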
Definition 10.1
Given a bicluster $(I, J)$ of the expression matrix $A$, the gene base $a_{iJ}$ of gene $i$, the condition base $a_{Ij}$ of condition $j$, and the bicluster base $a_{IJ}$ are defined as follows:
$$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}; \qquad (10.1)$$
$$a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}; \qquad (10.2)$$
$$a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} a_{ij}. \qquad (10.3)$$
Note that in the above definitions, $a_{iJ}$ and $a_{Ij}$ correspond to the means of the $i$th row and $j$th column of the bicluster $(I, J)$, respectively.
Definition 10.2
The residue $r_{ij}$ of an entry $a_{ij}$ of a bicluster $(I, J)$ is given by
$$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}. \qquad (10.4)$$
The concept of residue is used to quantify the difference between the actual value of an entry $a_{ij}$ and its expected value predicted from the corresponding gene base, condition base, and bicluster base. The residue is an indicator of the degree of coherence of an element with respect to the remaining ones in the bicluster, given the tendency of the relevant gene and the relevant condition. The lower the residue, the stronger the coherence.
Definition 10.3
The mean squared residue of a bicluster $(I, J)$ is defined as follows [11]:
$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} r_{ij}^2. \qquad (10.5)$$
The mean squared residue is the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance. The lower the mean squared residue, the stronger the coherence exhibited by the bicluster, and the better the quality of the bicluster. If a bicluster has a mean squared residue lower than a given threshold $\delta$, then the bicluster is termed a $\delta$-bicluster.
Definition 10.4
Let $(I, J)$ be a bicluster; then the row variance of $(I, J)$ is defined as follows:
$$\mathrm{Var}(I, J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} \left( a_{ij} - a_{iJ} \right)^2. \qquad (10.6)$$
The row variance is required to be relatively large to reject trivial biclusters. By using the row variance as an accompanying score, one wants to guarantee that the bicluster captures genes exhibiting fluctuating yet coherent trends under some set of conditions.
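The quantities of Definitions 10.2–10.4 can be computed directly from a bicluster submatrix. The following sketch (the function names are ours) implements the residue, the mean squared residue, and the row variance for a bicluster given as a NumPy array.

```python
import numpy as np

def residue_matrix(sub):
    """Residue r_ij = a_ij - a_iJ - a_Ij + a_IJ for every entry of a bicluster."""
    row_mean = sub.mean(axis=1, keepdims=True)   # gene bases a_iJ
    col_mean = sub.mean(axis=0, keepdims=True)   # condition bases a_Ij
    all_mean = sub.mean()                        # bicluster base a_IJ
    return sub - row_mean - col_mean + all_mean

def mean_squared_residue(sub):
    """H(I, J): mean of the squared residues; lower means stronger coherence."""
    return float((residue_matrix(sub) ** 2).mean())

def row_variance(sub):
    """Mean squared deviation of entries from their own row means;
    large values reject trivial (flat) biclusters."""
    row_mean = sub.mean(axis=1, keepdims=True)
    return float(((sub - row_mean) ** 2).mean())

# A perfectly additive bicluster (row effect + column effect) has zero MSR
# but a nonzero row variance, i.e., coherent yet fluctuating trends.
sub = np.array([[0.0, 1.0, 2.0],
                [2.0, 3.0, 4.0],
                [4.0, 5.0, 6.0]])
print(mean_squared_residue(sub))  # 0.0
```

The example matrix follows an additive model, so its mean squared residue is exactly zero while its row variance stays positive.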
10.2.2 Possibilistic Clustering
One of the most widely used prototype-based fuzzy partitional clustering algorithms is fuzzy $c$-means [4]. It offers the opportunity to deal with data that belong to more than one cluster at the same time. It assigns to each object memberships that are inversely related to the relative distance of the object to the cluster prototypes. Also, it can deal with the uncertainties arising from overlapping cluster boundaries.
However, fuzzy $c$-means may be inaccurate in a noisy environment [27], as the resulting membership values do not always correspond well to the degrees of belonging of the data. In real data analysis, noise and outliers are unavoidable. Hence, to reduce this weakness, and to produce memberships that better explain the degrees of belonging for the data, Krishnapuram and Keller [27] proposed the possibilistic $c$-means algorithm, which uses a possibilistic type of membership function to describe the degree of belonging. It partitions a set of $n$ objects $\{x_1, \ldots, x_n\}$ into $c$ clusters with centroids $\{v_1, \ldots, v_c\}$ by minimizing the objective function
$$J = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, d^2(x_k, v_i) + \sum_{i=1}^{c} \eta_i \sum_{k=1}^{n} \left( 1 - u_{ik} \right)^{m}, \qquad (10.7)$$
where $\eta_i$ represents the scale parameter, $m$ is a weighting exponent called the fuzzifier, $U = [u_{ik}]$ is the membership matrix, and the value of $u_{ik}$ depends only on the similarity between the object $x_k$ and the centroid $v_i$. The resulting partition of the data can be interpreted as a possibilistic partition, and the membership values may be interpreted as degrees of possibility of the objects belonging to the classes, that is, the compatibilities of the objects with the centroids. A brief description of both the fuzzy $c$-means and possibilistic $c$-means algorithms is available in Chap. 8.
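As a minimal sketch of possibilistic $c$-means, the following computes the standard Krishnapuram–Keller membership update $u_{ik} = 1 / (1 + (d_{ik}^2 / \eta_i)^{1/(m-1)})$, assuming Euclidean distances. Unlike fuzzy $c$-means, the memberships of an object across clusters need not sum to one.

```python
import numpy as np

def pcm_memberships(X, V, eta, m=2.0):
    """Possibilistic c-means memberships: u_ik = 1 / (1 + (d_ik^2 / eta_i)^(1/(m-1))).
    X: (n, p) objects, V: (c, p) centroids, eta: (c,) scale parameters."""
    # Squared Euclidean distances between every object and every centroid: shape (n, c).
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + (d2 / eta[None, :]) ** (1.0 / (m - 1.0)))

# Three 1-D objects, two centroids; eta = 1 for both clusters.
X = np.array([[0.0], [1.0], [10.0]])
V = np.array([[0.0], [10.0]])
u = pcm_memberships(X, V, eta=np.array([1.0, 1.0]))
print(u[0, 0])  # 1.0 : zero distance gives full possibility
```

An object coinciding with a centroid receives membership 1 in that cluster, and the membership decays with distance independently of the other clusters, which is what makes the scheme robust to noise points.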
10.3 Possibilistic Biclustering Algorithm
In this section, a new possibilistic biclustering (PBC) algorithm, proposed by Das and Maji [13], is presented, which incorporates the concept of the possibilistic clustering algorithm of Krishnapuram and Keller [27] into the biclustering framework. The integration of the possibilistic membership of fuzzy sets and the biclustering algorithm makes the PBC algorithm effective in generating value-coherent overlapping biclusters with mean squared residue lower than a predefined threshold.
10.3.1 Objective Function
The PBC algorithm partitions the set of elements or points of a gene expression data matrix into $c$ biclusters by minimizing the following objective function:
$$J = \sum_{k=1}^{c} \sum_{i,j} u_{kij}^{m} \, r_{kij}^2 + \sum_{k=1}^{c} \eta_k \sum_{i,j} \left( 1 - u_{kij} \right)^{m}, \qquad (10.8)$$
where $u_{kij}$ represents the possibilistic membership of the element or point $a_{ij}$ in the $k$th bicluster, represented as $(I_k, J_k)$, $m$ is the fuzzifier, and $c$ is the total number of biclusters. The term $r_{kij}$ is the residue of the element $a_{ij}$ in the $k$th bicluster, which has an expression similar to that in (10.4) and is given by
$$r_{kij} = a_{ij} - a_{iJ_k} - a_{I_k j} + a_{I_k J_k}, \qquad (10.9)$$
where $a_{iJ_k}$, $a_{I_k j}$, and $a_{I_k J_k}$ represent the means or bases of the $i$th gene, $j$th condition, and $k$th bicluster, respectively. Hence, the first term of (10.8) is the fuzzy squared residue of the element $a_{ij}$ in the $k$th bicluster. Note that the second term in (10.8) is a monotone decreasing function of $u_{kij}$. This forces $u_{kij}$ to be as large as possible, to avoid the trivial solution.
The parameter $\eta_k$ determines the residue value for which the membership value of a point or element becomes 0.5. Hence, it needs to be chosen depending on the desired bandwidth of the possibility (membership) distribution for each bicluster. This value could be the same for all biclusters if all biclusters are expected to be similar. In general, it is desirable that $\eta_k$ relates to the overall size and shape of bicluster $(I_k, J_k)$. Also, it is to be noted that $\eta_k$ determines the relative degree to which the second term in the objective function of (10.8) is important compared with the first. If the two terms are to be weighted roughly equally, then $\eta_k$ should be of the order of $r_{kij}^2$. In this work, the following definition is used:
$$\eta_k = K \cdot \frac{\sum_{i,j} u_{kij}^{m} \, r_{kij}^2}{\sum_{i,j} u_{kij}^{m}}. \qquad (10.10)$$
This choice makes $\eta_k$ proportional to the fuzzy mean squared residue of bicluster $(I_k, J_k)$. The denominator of (10.10) represents the fuzzy cardinality of bicluster $(I_k, J_k)$. Typically, $K$ is chosen to be 1.
Solving the objective function of (10.8) with respect to $u_{kij}$, we get the possibilistic membership of an element $a_{ij}$ in the $k$th bicluster as follows:
$$u_{kij} = \frac{1}{1 + \left( \frac{r_{kij}^2}{\eta_k} \right)^{\frac{1}{m-1}}}. \qquad (10.11)$$
Hence, in each iteration, the updated value of $u_{kij}$ depends only on the residue of the element $a_{ij}$ with respect to the $k$th bicluster $(I_k, J_k)$. If the value of $r_{kij}^2$ is zero, then the membership of that point with respect to the $k$th bicluster is 1.0. If the value of $r_{kij}^2$ is greater than zero, then the membership value of that point is less than 1.0, and it approaches 0.0 as $r_{kij}^2 \to \infty$. Hence, $u_{kij} \in [0, 1]$.
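The membership computation described above can be sketched as follows, assuming the scale parameter is the fuzzy mean squared residue of the bicluster scaled by $K$; the function names are hypothetical.

```python
import numpy as np

def eta_k(u, r, m=2.0, K=1.0):
    """Scale parameter: K times the fuzzy mean squared residue of bicluster k.
    u, r: membership and residue matrices of the same shape for one bicluster."""
    return K * float((u ** m * r ** 2).sum() / (u ** m).sum())

def pbc_memberships(r, eta, m=2.0):
    """u_kij = 1 / (1 + (r_kij^2 / eta_k)^(1/(m-1))):
    r = 0 gives membership 1, and membership falls toward 0 as the residue grows."""
    return 1.0 / (1.0 + (r ** 2 / eta) ** (1.0 / (m - 1.0)))

# Residues of four points of one bicluster; start from full membership.
r = np.array([[0.0, 1.0],
              [2.0, 3.0]])
u0 = np.ones_like(r)
eta = eta_k(u0, r)        # fuzzy mean squared residue = (0+1+4+9)/4 = 3.5
u = pbc_memberships(r, eta)
print(u[0, 0])  # 1.0
```

A point with zero residue gets full possibility, while points with residue well above $\eta_k$ are effectively suppressed, which is how noisy elements lose influence over the bicluster.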
After computing the membership of each point with respect to all biclusters, the memberships of all rows and columns are computed with respect to all biclusters. If the $k$th bicluster $(I_k, J_k)$ has $|I_k|$ rows and $|J_k|$ columns, then the membership of the $i$th row in the $k$th bicluster is defined as follows:
$$u_{ki}^{\mathrm{row}} = \frac{1}{|J_k|} \sum_{j \in J_k} u_{kij}, \qquad (10.12)$$
and the membership of the $j$th column in the $k$th bicluster is
$$u_{kj}^{\mathrm{col}} = \frac{1}{|I_k|} \sum_{i \in I_k} u_{kij}. \qquad (10.13)$$
In each iteration, if the maximum membership value of the $i$th row (respectively, $j$th column) over all biclusters is greater than a threshold $t_1$, and if the difference between this maximum membership and the membership with respect to a particular bicluster of that row (respectively, column), which is greater than $t_1$, is less than a threshold $t_2$, then this row (respectively, column) is selected for insertion into that particular bicluster. In this way, all rows and columns are examined with respect to all biclusters and are inserted accordingly into the biclusters. Hence, the rows and columns inserted in this process have membership values greater than $t_1$. Here, the threshold $t_1$ is used to generate highly coherent biclusters. On the other hand, if the membership of a row or column is very high only in some biclusters, then the row or column is inserted into only those biclusters. This is adjusted using the predefined threshold $t_2$.
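The row/column membership computation and the insertion rule described above can be sketched as follows. The threshold names `t1` and `t2` and the default values are our assumptions for illustration.

```python
import numpy as np

def row_col_memberships(U):
    """U: (|I_k|, |J_k|) element memberships of one bicluster.
    Row (gene) membership is the mean over its columns;
    column (condition) membership is the mean over its rows."""
    return U.mean(axis=1), U.mean(axis=0)

def select_biclusters(memberships, t1=0.5, t2=0.1):
    """Insertion rule sketch: insert a row (or column) into every bicluster
    whose membership exceeds t1 and lies within t2 of the maximum membership."""
    memberships = np.asarray(memberships, dtype=float)
    best = memberships.max()
    if best <= t1:
        return []                     # belongs to no bicluster at all
    keep = (memberships > t1) & (best - memberships < t2)
    return [int(i) for i in np.flatnonzero(keep)]

# Memberships of one row in three biclusters: inserted into biclusters 0 and 1.
print(select_biclusters([0.9, 0.85, 0.3]))  # [0, 1]
```

Because the rule can return several bicluster indices, rows and columns are naturally shared between overlapping biclusters, which is the behavior the PBC method is designed to produce.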
After all insertions, the mean squared residue of each bicluster is computed and compared with a given threshold $\delta$. If the mean squared residue value of the $k$th bicluster is greater than $\delta$, then the row or column with the least membership value in that bicluster is deleted, and the process is repeated until the value becomes less than $\delta$. In this way, the PBC algorithm generates highly coherent overlapping biclusters with low mean squared residue values.
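The degeneration step can be sketched as follows, assuming the row or column with the smaller membership is removed first; the guard on the minimum bicluster size is our addition to keep the sketch well-defined.

```python
import numpy as np

def msr(sub):
    """Mean squared residue of a submatrix."""
    r = sub - sub.mean(axis=1, keepdims=True) - sub.mean(axis=0, keepdims=True) + sub.mean()
    return float((r ** 2).mean())

def prune(sub, row_u, col_u, delta):
    """While the bicluster's MSR exceeds delta, delete the row or column
    with the smallest membership value (a sketch of PBC's degeneration step)."""
    sub, row_u, col_u = sub.copy(), list(row_u), list(col_u)
    while msr(sub) > delta and min(sub.shape) > 2:
        i, j = int(np.argmin(row_u)), int(np.argmin(col_u))
        if row_u[i] <= col_u[j]:      # remove the weaker of the two candidates
            sub = np.delete(sub, i, axis=0)
            row_u.pop(i)
        else:
            sub = np.delete(sub, j, axis=1)
            col_u.pop(j)
    return sub

noisy = np.array([[1.0, 2.0, 3.0],
                  [2.0, 3.0, 4.0],
                  [9.0, 0.0, 5.0]])   # last row breaks the additive pattern
pruned = prune(noisy, row_u=[0.9, 0.9, 0.2], col_u=[0.8, 0.8, 0.8], delta=0.1)
print(pruned.shape)  # (2, 3)
```

Deleting the low-membership third row restores a perfectly additive bicluster, so the loop terminates with an MSR below the threshold.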
10.3.2 Bicluster Means
In each iteration, the means or bases of the $i$th gene, $j$th condition, and $k$th bicluster are calculated to compute the mean squared residue of each bicluster, based on the possibilistic membership values of all rows and columns in different biclusters. The means or bases of the $i$th gene, $j$th condition, and $k$th bicluster for the PBC algorithm are obtained by solving (10.8) with respect to $a_{iJ_k}$, $a_{I_k j}$, and $a_{I_k J_k}$, respectively:
$$a_{iJ_k} = \frac{\sum_{j \in J_k} u_{kij}^{m} \left( a_{ij} - a_{I_k j} + a_{I_k J_k} \right)}{\sum_{j \in J_k} u_{kij}^{m}}; \qquad (10.14)$$
$$a_{I_k j} = \frac{\sum_{i \in I_k} u_{kij}^{m} \left( a_{ij} - a_{iJ_k} + a_{I_k J_k} \right)}{\sum_{i \in I_k} u_{kij}^{m}}; \qquad (10.15)$$
$$a_{I_k J_k} = \frac{\sum_{i \in I_k} \sum_{j \in J_k} u_{kij}^{m} \left( a_{iJ_k} + a_{I_k j} - a_{ij} \right)}{\sum_{i \in I_k} \sum_{j \in J_k} u_{kij}^{m}}. \qquad (10.16)$$
Hence, the base of the $i$th gene in the $k$th bicluster depends on the means of all conditions present in that bicluster as well as the base of that bicluster. Similarly, the base of the $j$th condition in the bicluster $(I_k, J_k)$ depends on the means of all genes present in $(I_k, J_k)$ and the base of $(I_k, J_k)$.
10.3.3 Convergence Condition
In this subsection, a mathematical analysis of the convergence property of the PBC algorithm is presented. In the PBC algorithm, the means or bases of the $i$th gene, $j$th condition, and $k$th bicluster are calculated using (10.14), (10.15), and (10.16), respectively. However, these three equations can be written as
$$\sum_{j \in J_k} u_{kij}^{m} \left( a_{iJ_k} + a_{I_k j} - a_{I_k J_k} \right) = \sum_{j \in J_k} u_{kij}^{m} \, a_{ij}; \qquad (10.17)$$
$$\sum_{i \in I_k} u_{kij}^{m} \left( a_{iJ_k} + a_{I_k j} - a_{I_k J_k} \right) = \sum_{i \in I_k} u_{kij}^{m} \, a_{ij}; \qquad (10.18)$$
$$\sum_{i \in I_k} \sum_{j \in J_k} u_{kij}^{m} \left( a_{iJ_k} + a_{I_k j} - a_{I_k J_k} \right) = \sum_{i \in I_k} \sum_{j \in J_k} u_{kij}^{m} \, a_{ij}. \qquad (10.19)$$
Hence, (10.17) represents a set of linear equations in terms of the gene bases $a_{iJ_k}$ if the condition bases $a_{I_k j}$ and the bicluster base $a_{I_k J_k}$ are kept constant. Similarly, (10.18) and (10.19) represent sets of linear equations in terms of $a_{I_k j}$ and $a_{I_k J_k}$, respectively. A simple way to analyze the convergence property of the algorithm is to view (10.14), (10.15), and (10.16) as Gauss-Seidel iterations for solving this set of linear equations. The Gauss-Seidel algorithm is guaranteed to converge if the matrix representing each equation is diagonally dominant [24]. This is a sufficient condition, not a necessary one; the iteration may or may not converge if the matrix is not diagonally dominant [24]. The matrix $A_1$ corresponding to (10.14) is given by
(10.20)
Similarly, the matrices $A_2$ and $A_3$ corresponding to (10.15) and (10.16) are given by
(10.21)
(10.22)
For $A_1$, $A_2$, and $A_3$ to be diagonally dominant, we must have
(10.23)
This is the sufficient condition for the matrices $A_1$, $A_2$, and $A_3$ to be diagonally dominant. Under this condition, the iteration converges if (10.14)–(10.16) are applied repeatedly with $u_{kij}$ kept constant. In practice, (10.11)–(10.13) and (10.14)–(10.16) are applied alternately in the iterations. The condition in (10.23) remains valid according to Bezdek's convergence theorem for the fuzzy $c$-means algorithm [2, 3] and Yan's convergence analysis of the fuzzy curve-tracing algorithm [50]. The three matrices $A_1$, $A_2$, and $A_3$ are also the Hessians (second-order derivatives) of $J$ with respect to the gene bases, the condition bases, and the bicluster base, respectively. As all three matrices are diagonally dominant, all their eigenvalues are positive. Also, the Hessian of $J$ with respect to $u_{kij}$ can easily be shown to be a diagonal, positive definite matrix. Thus, according to the theorem derived by Bezdek [2, 3] and the analysis done by Yan [50], it can be concluded that the PBC algorithm converges, at least along a subsequence, to a local optimum solution as long as the condition in (10.23) is satisfied. Intuitively, the objective function decreases in all steps corresponding to (10.11)–(10.13) and (10.14)–(10.16), so the compound procedure makes the objective function descend strictly.
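The diagonal dominance condition used above as the sufficient Gauss-Seidel convergence test can be checked numerically for any square matrix; the helper below is a generic sketch, not tied to the specific matrices of the analysis.

```python
import numpy as np

def is_diagonally_dominant(M):
    """Row diagonal dominance: |M_ii| >= sum of |M_ij| over j != i, for every row.
    This is the sufficient (not necessary) condition for Gauss-Seidel convergence."""
    M = np.abs(np.asarray(M, dtype=float))
    diag = np.diag(M)
    off_diag = M.sum(axis=1) - diag
    return bool((diag >= off_diag).all())

print(is_diagonally_dominant([[4.0, 1.0, 1.0],
                              [1.0, 3.0, 1.0],
                              [0.0, 2.0, 5.0]]))  # True
```

For a matrix failing the test, Gauss-Seidel iteration may still converge, but no guarantee is available from this criterion alone.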