-medoids, to select a set of bio-basis strings for amino acid sequence analysis. The comparative performance analysis of different relational clustering algorithms on bio-basis string selection problem is also reported in [19, 23].
In this chapter, a new string kernel function, termed as novel BBF [21, 22], is reported. It modifies existing BBF and is developed based on the principle of asymmetricity of biological dissimilarity. The dissimilarity is measured using an amino acid mutation matrix. The concept of zone of influence of the bio-basis string is introduced in the novel kernel function to normalize the asymmetric dissimilarity. It takes into account the influence or impact of each bio-basis string in nonnumerical sequence space. An efficient method, introduced in [21] is presented, which integrates the Fisher ratio and the concept of degree of resemblance to select most relevant and distinct bio-basis strings for the novel string kernel function. Instead of using symmetric similarity measure, the asymmetric biological dissimilarity is used to calculate the Fisher ratio, which is more effective for selection of most relevant bio-basis strings. The degree of resemblance enables efficient selection of a set of distinct bio-basis strings. In effect, it reduces the redundant features in numerical feature space. Some quantitative measures are presented to evaluate the quality of selected bio-basis strings. The effectiveness of the novel string kernel function and the new bio-basis string selection method, along with a comparison with existing BBF and related bio-basis string selection methods, is demonstrated on different protein data sets.
The structure of the rest of this chapter is as follows: Sect. 3.2 briefly introduces necessary notions of BBF, and the bio-basis string selection methods proposed by Berry et al. [8] and Yang and Thomson [41]. In Sect. 3.3, a novel string kernel function is presented, integrating the concepts of asymmetricity of biological dissimilarity and the zone of influence of bio-basis string. In Sect. 3.4, an efficient bio-basis string selection method is reported based on the Fisher ratio and the degree of resemblance. Some quantitative performance measures are presented in Sect. 3.5 to evaluate the quality of selected bio-basis strings. A few case studies and a comparison with existing string kernel function and related methods are presented in Sect. 3.6. Concluding remarks are given in Sect. 3.7.
3.2 String Kernel for Protein Functional Site Identification
In this section, the basic notion in the theory of bio-basis function is reported, along with the bio-basis string selection methods proposed by Yang and Thomson [41] and Berry et al. [8].
3.2.1 Bio-Basis Function
A widely used method in sequence analysis is the sequence alignment [2, 3]. In this method, the function of a sequence is annotated through aligning a novel sequence with known sequences. If the alignment between a novel sequence and a known sequence gives a very high similarity (homology) score, the novel sequence is believed to have the same or similar function as the known sequence. In this method, an amino acid mutation matrix is commonly used. Each mutation matrix has 20 columns and 20 rows. A value at the th row and th column is a probability or a likelihood value that the th amino acid mutates to the th amino acid after a particular evolutionary time [15, 17]. The mutation probabilities as similarities among amino acids are, therefore, metrics. The Dayhoff matrix (Table 3.1) was the first mutation matrix developed in 1978 [13] and many new mutation matrices were developed later on, for instance, the Blosum62 matrix [15]. However, the above method may not be useful directly for subsequence analysis. Because, a subsequence may not contain enough information for conventional alignment.
Table 3.1
Dayhoff mutation matrix: 1 point mutation is accepted per 100 residues (PAM1)
A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 40 | 24 | 32 | 32 | 16 | 36 | 28 | 28 | 28 | 24 | 28 | 32 | 36 | 32 | 24 | 36 | 36 | 32 | 8 | 20 |
C | 24 | 80 | 12 | 12 | 16 | 20 | 20 | 24 | 12 | 8 | 12 | 16 | 20 | 12 | 16 | 32 | 24 | 24 | 0 | 32 |
D | 32 | 12 | 48 | 44 | 8 | 36 | 36 | 24 | 32 | 16 | 20 | 40 | 28 | 40 | 28 | 32 | 32 | 24 | 4 | 16 |
E | 32 | 12 | 44 | 48 | 12 | 32 | 36 | 24 | 32 | 20 | 24 | 36 | 28 | 40 | 28 | 32 | 32 | 24 | 4 | 16 |
F | 16 | 16 | 8 | 12 | 68 | 12 | 24 | 36 | 12 | 40 | 32 | 16 | 12 | 12 | 16 | 20 | 20 | 28 | 32 | 60 |
G | 36 | 20 | 36 | 32 | 12 | 52 | 24 | 20 | 24 | 16 | 20 | 32 | 28 | 28 | 20 | 36 | 32 | 28 | 4 | 12 |
H | 28 | 20 | 36 | 36 | 24 | 24 | 56 | 24 | 32 | 24 | 24 | 40 | 32 | 44 | 40 | 28 | 28 | 24 | 20 | 32 |
I | 28 | 24 | 24 | 24 | 36 | 20 | 24 | 52 | 24 | 40 | 40 | 24 | 24 | 24 | 24 | 28 | 32 | 48 | 12 | 28 |
K | 28 | 12 | 32 | 32 | 12 | 24 | 32 | 24 | 52 | 20 | 32 | 36 | 28 | 36 | 44 | 32 | 32 | 24 | 20 | 16 |
L | 24 | 8 | 16 | 20 | 40 | 16 | 24 | 40 | 20 | 56 | 48 | 20 | 20 | 24 | 20 | 20 | 24 | 40 | 24 | 28 |
M | 28 | 12 | 20 | 24 | 32 | 20 | 24 | 40 | 32 | 48 | 56 | 24 | 24 | 28 | 32 | 24 | 28 | 40 | 16 | 24 |
N | 32 | 16 | 40 | 36 | 16 | 32 | 40 | 24 | 36 | 20 | 24 | 40 | 28 | 36 | 32 | 36 | 32 | 24 | 16 | 24 |
P | 36 | 20 | 28 | 28 | 12 | 28 | 32 | 24 | 28 | 20 | 24 | 28 | 56 | 32 | 32 | 36 | 32 | 28 | 8 | 12 |
Q | 32 | 12 | 40 | 40 | 12 | 28 | 44 | 24 | 36 | 24 | 28 | 36 | 32 | 48 | 36 | 28 | 28 | 24 | 12 | 16 |
R | 24 | 16 | 28 | 28 | 16 | 20 | 40 | 24 | 44 | 20 | 32 | 32 | 32 | 36 | 56 | 32 | 28 | 24 | 40 | 16 |
S | 36 | 32 | 32 | 32 | 20 | 36 | 28 | 28 | 32 | 20 | 24 | 36 | 36 | 28 | 32 | 40 | 36 | 28 | 24 | 20 |
T | 36 | 24 | 32 | 32 | 20 | 32 | 28 | 32 | 32 | 24 | 28 | 32 | 32 | 28 | 28 | 36 | 44 | 32 | 12 | 20 |
V | 32 | 24 | 24 | 24 | 28 | 28 | 24 | 48 | 24 | 40 | 40 | 24 | 28 | 24 | 24 | 28 | 32 | 48 | 8 | 24 |
W | 8 | 0 | 4 | 4 | 32 | 4 | 20 | 12 | 20 | 24 | 16 | 16 | 8 | 12 | 40 | 24 | 12 | 8 | 100 | 32 |
Y | 20 | 32 | 16 | 16 | 60 | 12 | 32 | 28 | 16 | 28 | 24 | 24 | 12 | 16 | 16 | 20 | 20 | 24 | 32 | 72 |
To alleviate this problem, the concept of BBF is introduced in [8, 35, 41] for subsequence analysis, which is based on the principle of conventional alignment technique. Using a table look-up technique, a homology score as a similarity value can be obtained for a pair of subsequences. The nongapped pairwise alignment technique is used to calculate this similarity value, where no deletion or insertion is used to align two subsequences [8, 35, 41]. For ease of discussions, in rest of the chapter, the following terminology is used.
The definition of BBF is as follows [35, 41]:
Suppose both and have residues, the homology score between and is then defined as
where can be obtained from an amino acid mutation matrix through a table look-up method. Note that and is a set of 20 amino acids (Table 3.1).
be the set of 20 amino acids.
represents the total number of subsequences.
be the set of subsequences with residues, .
represents the total number of bio-basis strings.
be the set of bio-basis strings and .
, , .
(3.1)
(3.2)
Consider two bio-basis strings and , and a subsequence having residues. The nongapped pairwise homology score is calculated between the subsequence and each bio-basis string considering the mutation probabilities as in Table 3.1. For the first bio-basis string , four mutation probabilities are
Hence, the homology score between the subsequence and the bio-basis string is given by
Similarly, for the second bio-basis string , four mutation probabilities are 28, 28, 24, and 32. Thus, the value of between the subsequence and the bio-basis string is as follows:
The maximum homology scores of two bio-basis strings and are given by
Considering the value of
Hence, the value of BBF is high if two subsequences and are similar or close to each other. The function value is one if two subsequences are identical, and small if they are distinct. The function needs a subsequence as a support (bio-basis string). Each bio-basis string is a feature dimension in a numerical feature space. If is used to denote a collection of 20 amino acids, an input space of all potential subsequences with residues is . Then, a collection of bio-basis strings formulates a numerical feature space , to which a nonnumerical sequence space is mapped for analysis. More importantly, the BBF can transform various homology scores to a real number as a similarity within the interval , that is,
After the mapping using BBF, a nonnumerical subsequence space will be mapped to a -dimensional numerical feature space , that is, .
(3.3)
3.2.2 Selection of Bio-Basis Strings Using Mutual Information
In [41], Yang and Thomson proposed a method for bio-basis string selection using mutual information [32]. The necessity for a bio-basis string to be an independent and informative feature can be determined by the shared information between the bio-basis string and the rest as well as the shared information between the bio-basis string and class label [41].
The mutual information is quantified as the difference between the initial uncertainty and the conditional uncertainty. Let be a set of selected bio-basis strings, a set of candidate bio-basis strings. (empty) at the beginning. A prior probability of a bio-basis string is referred as . The initial uncertainty of is defined as
Similarly, the joint entropy of two bio-basis strings and is given by
where and . The mutual information between and is, therefore, given by
However, the mutual information of with respect to all the bio-basis strings in is
Combining (3.6) and (3.7), we get [41]
Replacing with the class label , the mutual information
measures the mutual relationship between and . A bio-basis string whose value is the largest will be selected as and will make the largest contribution to modeling (discrimination using ) among all the remaining bio-basis strings in . Therefore, there are two mutual information measurements for , the mutual information between and () and the mutual information between and (). In this method, the following criterion is used for the selection of bio-basis strings [38, 41]
where is a constant. In the current study, the value of is set at 0.7 to give more weightage in discrimination [38, 41]. The major drawback of the method proposed by Yang and Thomson [41] is a huge number of prior and joint probabilities are to be calculated, which makes the method computationally expensive.
(3.4)
(3.5)
(3.6)
(3.7)
(3.8)
(3.9)
(3.10)
3.2.3 Selection of Bio-Basis Strings Using Fisher Ratio
In [8], Berry et al. proposed a method to select a set of bio-basis strings from the whole set of subsequences based on their discriminant capability. The discriminant capability of each subsequence is calculated using the Fisher ratio [14]. The Fisher ratio is used to maximize the discriminant capability of a subsequence in terms of interclass separation (as large as possible) and intraclass spread between subsequences (as small as possible). The larger the Fisher ratio value, the larger the discriminant capability of the subsequence. Based on the values of the Fisher ratio, subsequences of can be ranked from the strongest discriminant capability to the weakest one. The method yields a set of subsequences from as the bio-basis strings which possess good discriminant capability between two classes, having evolved from original data set.
However, the subsequences of would have different compositions of amino acids. Hence, they should have different pairwise alignment scores with the other subsequences of . As the class properties of these training subsequences are known, these similarity values can be partitioned into two groups or classes (functional and nonfunctional), which are denoted as and , respectively. Denoting the similarity between two subsequences and as , the mean and standard deviation values for these two groups with respect to the subsequence are as follows:
where and are the number of similarity values in and , respectively. and represent the zero-mean, first and second order moment of similarity, that is, expectation of and , respectively. Based on these four quantities, the discriminant capability of each subsequence can be measured using the Fisher ratio
where
and
The basic steps of this method follows next:
However, the bio-basis strings in nonnumerical sequence space should be such that the similarity between pairs of bio-basis strings would be as minimum as possible. Each of them would then represent a unique feature in numerical feature space. The methods proposed in [8, 41] have not adequately addressed this problem. Also, not much attention has been paid to it earlier.
(3.11)
(3.12)
(3.13)
(3.14)
(3.15)
(3.16)
(3.17)
1.
Calculate the discriminant capabilities of all subsequences of using the Fisher ratio as in (3.15).
2.
Rank all subsequences of based on the values of Fisher ratio in descending order.
3.
Select first subsequences from as the set of bio-basis strings.
3.3 Novel String Kernel Function
In this section, a novel string kernel function is presented [21] based on the concepts of biological dissimilarity and zone of influence of bio-basis string. Next, an efficient method is reported for selection of bio-basis strings integrating the Fisher ratio and the principle of degree of resemblance.
3.3.1 Asymmetricity of Biological Dissimilarity
Here, we define two asymmetric dissimilarities between two subsequences and as follows [21]:
where denotes the dissimilarity of subsequence from the subsequence and is the nongapped homology score between two subsequences and .
(3.18)
Consider two subsequences and with 4 residues. According to the Dayhoff mutation matrix (Table 3.1), the nongapped pairwise homology score between two subsequences and is, therefore, = = 100, while the maximum homology scores of two subsequences and are given by and , respectively. Hence, the dissimilarity of subsequence from subsequence is given by
whereas the dissimilarity of from is as follows:
Thus, the dissimilarity is asymmetric in nature, that is,
The asymmetricity reflects domain organizations of two subsequences and . When two subsequences and consist of the same single domain, and will be similar small values. However, suppose that has one extra domain, then becomes large even if is small. These dissimilarities may be used for clustering of protein sequences or subsequences so that domain organizations are well reflected. The asymmetric property of the biological dissimilarity was also observed by Stojmirovic [33] and Itoh et al. [16]. The asymmetric dissimilarity might be a powerful tool to cluster sequences or subsequences and to explore gene/protein universe.
(3.19)
3.3.2 Novel Bio-Basis Function
The design of novel string kernel function is based on the principle of asymmetric biological dissimilarity [21]. Using a table look-up technique, a biological dissimilarity is calculated for a pair of subsequences based on an amino acid mutation matrix. The nongapped pairwise alignment method is used to calculate this dissimilarity, where no deletion or insertion is used to align two subsequences. The definition of the novel bio-basis function (nBBF) is as follows [21]:
The parameter in (3.20) represents the zone of influence of the th bio-basis string . It represents the variance of the bio-basis string with respect to the subsequences nearest to it. In other words, if each bio-basis string is considered as a cluster prototype, then the zone of influence of it represents the radius of that cluster. The value of could be the same for all bio-basis strings if all of them are expected to form similar clusters in nonnumerical sequence space. In general, it is desirable that should relate to the overall size and shape of the cluster associated with the bio-basis string . In the present research work, the following definition is used:
where is the total number of subsequences having minimum dissimilarity from the th bio-basis string among all the bio-basis strings and is the dissimilarity of the subsequence from the th bio-basis string . In other words, the value of represents the average dissimilarity of the input subsequences from their nearest bio-basis string .
(3.20)
(3.21)
Hence, the novel string kernel function normalizes the asymmetric dissimilarity using the zone of influence or variance of the bio-basis string, rather than using maximum homology score of the bio-basis string as in (3.1).
3.4 Biological Dissimilarity Based String Selection Method
One of the main problem in BBF is how to select a reduced set of most relevant bio-basis strings. The bio-basis strings are the sections of biological sequences that are used for the transformation of biological data into a numerical feature space. Hence, the problem of selecting a set of subsequences as the bio-basis strings from the whole set of subsequences, where , is a feature selection problem.
In real biological data analysis, the data set may contain a number of similar or redundant subsequences with low discriminant capability or relevance to the classes. The selection of such similar and nonrelevant subsequences as the bio-basis strings may lead to a reduction in the useful information in numerical feature space. Ideally, the selected bio-basis strings should have high discriminant capability with the classes while the similarity among them would be as low as possible. The subsequences with high discriminant capability are expected to be able to predict the classes of the subsequences. However, the prediction capability may be reduced if many similar subsequences are selected as the bio-basis strings. In contrast, a data set that contains subsequences not only with high relevance with respect to the classes but with low mutual redundancy is more effective in its prediction capability. Hence, to assess the effectiveness of the subsequences as the bio-basis strings, both the relevance and the redundancy or similarity need to be measured quantitatively. The bio-basis string selection method, reported in [21], addresses the above issues through following three phases:
An asymmetric biological dissimilarity based Fisher ratio is chosen here to compute the discriminant capability or relevance of each subsequence to the classes, while a novel concept of the degree of resemblance is used to calculate the mutual redundancy or similarity among subsequences. The nonrelevant subsequences are discarded using a nearest mean classifier [14]. Next the calculation of the Fisher ratio using asymmetric biological dissimilarity is provided along with the concept of degree of resemblance and the principle of nearest mean classifier.
1.
computation of the discriminant capability or relevance of each subsequence;
2.
determination of the nonrelevant subsequences; and
3.
computation of the similarity or redundancy among subsequences.
3.4.1 Fisher Ratio Using Biological Dissimilarity
In the method reported in [21], the Fisher ratio [14] is used to measure the discriminant capability or relevance of each subsequence . The Fisher ratio is calculated based on the asymmetric biological dissimilarity. As the class labels of all training subsequences are known, the set can be partitioned into two groups or classes and , respectively, where
Hence, each subsequence should have and dissimilarity values with the subsequences of and , respectively. Denoting the dissimilarity of the subsequence from the subsequence as , the mean and standard deviation values for the two classes and with respect to the subsequence are as follows:
where , , , and represent the mean and standard deviation values of the subsequence for two groups and , respectively. These four quantities are calculated based on the square of biological dissimilarity with respect to the subsequence . Based on these four quantities, the discriminant capability of each subsequence is computed using the Fisher ratio that is as follows:
Let be the maximum homology score of the subsequence . The above four quantities can then be written using as
and
where represents the zero-mean, th order moment of similarity between two subsequences and . Now, the numerator of (3.28) is given by
Hence, the numerator of (3.28) not only depends on the difference of zero-mean, first-order moment of similarity of two groups as in (3.15), it also takes into account the difference of zero-mean, second-order moment of similarity as well as the maximum homology score of the subsequence . That is, the numerator of (3.28) depends on following three factors:
Similarly, the denominator of (3.28) contains the following terms:
Hence, the denominator of (3.28) considers the zero-mean, higher order (upto fourth order) moment of similarity of two groups as well as the maximum homology score of the subsequence , while that of (3.15) only takes into account the zero-mean, first- and second-order moment of similarity of two groups and does not consider the maximum homology score of the subsequence . In effect, (3.28) calculates the discriminant capability of each subsequence more accurately.
(3.22)
(3.23)
(3.24)
(3.25)
(3.26)
(3.27)
(3.28)
(3.29)
(3.30)
(3.31)
(3.32)
(3.33)
difference of zero-mean, first order moment of similarity of two groups, ;
difference of zero-mean, second-order moment of similarity of two groups, ;
maximum homology score of the subsequence , that is, .
(3.34)
3.4.2 Nearest Mean Classifier
After computing the discriminant capability or relevance of each subsequence using the Fisher ratio according to (3.28), the nonrelevant subsequences are discarded based on a threshold value . The subsequences those have the Fisher ratio values larger than or equal to the threshold value are considered as the candidate bio-basis strings. The value of is obtained using the concept of nearest mean classifier.