-medoids, to select a set of bio-basis strings for amino acid sequence analysis. The comparative performance analysis of different relational clustering algorithms on bio-basis string selection problem is also reported in [19, 23].
In this chapter, a new string kernel function, termed as novel BBF [21, 22], is reported. It modifies existing BBF and is developed based on the principle of asymmetricity of biological dissimilarity. The dissimilarity is measured using an amino acid mutation matrix. The concept of zone of influence of the bio-basis string is introduced in the novel kernel function to normalize the asymmetric dissimilarity. It takes into account the influence or impact of each bio-basis string in nonnumerical sequence space. An efficient method, introduced in [21] is presented, which integrates the Fisher ratio and the concept of degree of resemblance to select most relevant and distinct bio-basis strings for the novel string kernel function. Instead of using symmetric similarity measure, the asymmetric biological dissimilarity is used to calculate the Fisher ratio, which is more effective for selection of most relevant bio-basis strings. The degree of resemblance enables efficient selection of a set of distinct bio-basis strings. In effect, it reduces the redundant features in numerical feature space. Some quantitative measures are presented to evaluate the quality of selected bio-basis strings. The effectiveness of the novel string kernel function and the new bio-basis string selection method, along with a comparison with existing BBF and related bio-basis string selection methods, is demonstrated on different protein data sets.
The structure of the rest of this chapter is as follows: Sect. 3.2 briefly introduces necessary notions of BBF, and the bio-basis string selection methods proposed by Berry et al. [8] and Yang and Thomson [41]. In Sect. 3.3, a novel string kernel function is presented, integrating the concepts of asymmetricity of biological dissimilarity and the zone of influence of bio-basis string. In Sect. 3.4, an efficient bio-basis string selection method is reported based on the Fisher ratio and the degree of resemblance. Some quantitative performance measures are presented in Sect. 3.5 to evaluate the quality of selected bio-basis strings. A few case studies and a comparison with existing string kernel function and related methods are presented in Sect. 3.6. Concluding remarks are given in Sect. 3.7.
3.2 String Kernel for Protein Functional Site Identification
In this section, the basic notion in the theory of bio-basis function is reported, along with the bio-basis string selection methods proposed by Yang and Thomson [41] and Berry et al. [8].
3.2.1 Bio-Basis Function
A widely used method in sequence analysis is the sequence alignment [2, 3]. In this method, the function of a sequence is annotated through aligning a novel sequence with known sequences. If the alignment between a novel sequence and a known sequence gives a very high similarity (homology) score, the novel sequence is believed to have the same or similar function as the known sequence. In this method, an amino acid mutation matrix is commonly used. Each mutation matrix has 20 columns and 20 rows. A value at the
th row and
th column is a probability or a likelihood value that the
th amino acid mutates to the
th amino acid after a particular evolutionary time [15, 17]. The mutation probabilities as similarities among amino acids are, therefore, metrics. The Dayhoff matrix (Table 3.1) was the first mutation matrix developed in 1978 [13] and many new mutation matrices were developed later on, for instance, the Blosum62 matrix [15]. However, the above method may not be useful directly for subsequence analysis. Because, a subsequence may not contain enough information for conventional alignment.




Table 3.1
Dayhoff mutation matrix: 1 point mutation is accepted per 100 residues (PAM1)
A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 40 | 24 | 32 | 32 | 16 | 36 | 28 | 28 | 28 | 24 | 28 | 32 | 36 | 32 | 24 | 36 | 36 | 32 | 8 | 20 |
C | 24 | 80 | 12 | 12 | 16 | 20 | 20 | 24 | 12 | 8 | 12 | 16 | 20 | 12 | 16 | 32 | 24 | 24 | 0 | 32 |
D | 32 | 12 | 48 | 44 | 8 | 36 | 36 | 24 | 32 | 16 | 20 | 40 | 28 | 40 | 28 | 32 | 32 | 24 | 4 | 16 |
E | 32 | 12 | 44 | 48 | 12 | 32 | 36 | 24 | 32 | 20 | 24 | 36 | 28 | 40 | 28 | 32 | 32 | 24 | 4 | 16 |
F | 16 | 16 | 8 | 12 | 68 | 12 | 24 | 36 | 12 | 40 | 32 | 16 | 12 | 12 | 16 | 20 | 20 | 28 | 32 | 60 |
G | 36 | 20 | 36 | 32 | 12 | 52 | 24 | 20 | 24 | 16 | 20 | 32 | 28 | 28 | 20 | 36 | 32 | 28 | 4 | 12 |
H | 28 | 20 | 36 | 36 | 24 | 24 | 56 | 24 | 32 | 24 | 24 | 40 | 32 | 44 | 40 | 28 | 28 | 24 | 20 | 32 |
I | 28 | 24 | 24 | 24 | 36 | 20 | 24 | 52 | 24 | 40 | 40 | 24 | 24 | 24 | 24 | 28 | 32 | 48 | 12 | 28 |
K | 28 | 12 | 32 | 32 | 12 | 24 | 32 | 24 | 52 | 20 | 32 | 36 | 28 | 36 | 44 | 32 | 32 | 24 | 20 | 16 |
L | 24 | 8 | 16 | 20 | 40 | 16 | 24 | 40 | 20 | 56 | 48 | 20 | 20 | 24 | 20 | 20 | 24 | 40 | 24 | 28 |
M | 28 | 12 | 20 | 24 | 32 | 20 | 24 | 40 | 32 | 48 | 56 | 24 | 24 | 28 | 32 | 24 | 28 | 40 | 16 | 24 |
N | 32 | 16 | 40 | 36 | 16 | 32 | 40 | 24 | 36 | 20 | 24 | 40 | 28 | 36 | 32 | 36 | 32 | 24 | 16 | 24 |
P | 36 | 20 | 28 | 28 | 12 | 28 | 32 | 24 | 28 | 20 | 24 | 28 | 56 | 32 | 32 | 36 | 32 | 28 | 8 | 12 |
Q | 32 | 12 | 40 | 40 | 12 | 28 | 44 | 24 | 36 | 24 | 28 | 36 | 32 | 48 | 36 | 28 | 28 | 24 | 12 | 16 |
R | 24 | 16 | 28 | 28 | 16 | 20 | 40 | 24 | 44 | 20 | 32 | 32 | 32 | 36 | 56 | 32 | 28 | 24 | 40 | 16 |
S | 36 | 32 | 32 | 32 | 20 | 36 | 28 | 28 | 32 | 20 | 24 | 36 | 36 | 28 | 32 | 40 | 36 | 28 | 24 | 20 |
T | 36 | 24 | 32 | 32 | 20 | 32 | 28 | 32 | 32 | 24 | 28 | 32 | 32 | 28 | 28 | 36 | 44 | 32 | 12 | 20 |
V | 32 | 24 | 24 | 24 | 28 | 28 | 24 | 48 | 24 | 40 | 40 | 24 | 28 | 24 | 24 | 28 | 32 | 48 | 8 | 24 |
W | 8 | 0 | 4 | 4 | 32 | 4 | 20 | 12 | 20 | 24 | 16 | 16 | 8 | 12 | 40 | 24 | 12 | 8 | 100 | 32 |
Y | 20 | 32 | 16 | 16 | 60 | 12 | 32 | 28 | 16 | 28 | 24 | 24 | 12 | 16 | 16 | 20 | 20 | 24 | 32 | 72 |
To alleviate this problem, the concept of BBF is introduced in [8, 35, 41] for subsequence analysis, which is based on the principle of conventional alignment technique. Using a table look-up technique, a homology score as a similarity value can be obtained for a pair of subsequences. The nongapped pairwise alignment technique is used to calculate this similarity value, where no deletion or insertion is used to align two subsequences [8, 35, 41]. For ease of discussions, in rest of the chapter, the following terminology is used.
The definition of BBF is as follows [35, 41]:

Suppose both
and
have
residues, the homology score between
and
is then defined as
![$$\begin{aligned} h(x_{j},v_i)=\sum _{k=1}^{m} {\mathbb M}(x_{j}[k],v_{i}[k]) \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ2.gif?w=960)
where
can be obtained from an amino acid mutation matrix through a table look-up method. Note that
and
is a set of 20 amino acids (Table 3.1).
be the set of 20 amino acids.
represents the total number of subsequences.
be the set of
subsequences with
residues,
.
represents the total number of bio-basis strings.
be the set of
bio-basis strings and
.
,
,
.

(3.1)





![$$\begin{aligned} h(x_{j},v_i)=\sum _{k=1}^{m} {\mathbb M}(x_{j}[k],v_{i}[k]) \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ2.gif?w=960)
(3.2)
![$${\mathbb M}(x_{j}[k],v_{i}[k])$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_IEq31.gif?w=960)
![$$x_{j}[k],v_{i}[k] \in {\mathbb A}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_IEq32.gif?w=960)

Consider two bio-basis strings
and
, and a subsequence
having
residues. The nongapped pairwise homology score is calculated between the subsequence
and each bio-basis string considering the mutation probabilities as in Table 3.1. For the first bio-basis string
, four mutation probabilities are
Hence, the homology score between the subsequence
and the bio-basis string
is given by
Similarly, for the second bio-basis string
, four mutation probabilities are 28, 28, 24, and 32. Thus, the value of
between the subsequence
and the bio-basis string
is as follows:
The maximum homology scores of two bio-basis strings
and
are given by
Considering the value of 
Hence, the value of BBF
is high if two subsequences
and
are similar or close to each other. The function value is one if two subsequences are identical, and small if they are distinct. The function needs a subsequence as a support (bio-basis string). Each bio-basis string is a feature dimension in a numerical feature space. If
is used to denote a collection of 20 amino acids, an input space of all potential subsequences with
residues is
. Then, a collection of
bio-basis strings formulates a numerical feature space
, to which a nonnumerical sequence space
is mapped for analysis. More importantly, the BBF can transform various homology scores to a real number as a similarity within the interval
, that is,

After the mapping using BBF, a nonnumerical subsequence space
will be mapped to a
-dimensional numerical feature space
, that is,
.






![$$\begin{aligned}&{\mathbb M}(x_{1}[1],v_{1}[1])={\mathbb M}(\text{ I, } \text{ K })=24;~~ {\mathbb M}(x_{1}[2],v_{1}[2])={\mathbb M}(\text{ P, } \text{ P })=56;\\&{\mathbb M}(x_{1}[3],v_{1}[3])={\mathbb M}(\text{ R, } \text{ R })=56;~~ {\mathbb M}(x_{1}[4],v_{1}[4])={\mathbb M}(\text{ S, } \text{ T })=36. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ49.gif?w=960)


![$$\begin{aligned} h(x_1,v_1)=\sum _{k=1}^{4} {\mathbb M}(x_{1}[k],v_{1}[k])=172. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ50.gif?w=960)




![$$\begin{aligned} h(x_1,v_2)=\sum _{k=1}^{4} {\mathbb M}(x_{1}[k],v_{2}[k])=112. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ51.gif?w=960)














![$$[0,1]$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_IEq58.gif?w=960)

(3.3)




3.2.2 Selection of Bio-Basis Strings Using Mutual Information
In [41], Yang and Thomson proposed a method for bio-basis string selection using mutual information [32]. The necessity for a bio-basis string to be an independent and informative feature can be determined by the shared information between the bio-basis string and the rest as well as the shared information between the bio-basis string and class label [41].
The mutual information is quantified as the difference between the initial uncertainty and the conditional uncertainty. Let
be a set of selected bio-basis strings,
a set of candidate bio-basis strings.
(empty) at the beginning. A prior probability of a bio-basis string
is referred as
. The initial uncertainty of
is defined as

Similarly, the joint entropy of two bio-basis strings
and
is given by

where
and
. The mutual information between
and
is, therefore, given by

However, the mutual information of
with respect to all the bio-basis strings in
is

Combining (3.6) and (3.7), we get [41]

Replacing
with the class label
, the mutual information

measures the mutual relationship between
and
. A bio-basis string whose
value is the largest will be selected as
and will make the largest contribution to modeling (discrimination using
) among all the remaining bio-basis strings in
. Therefore, there are two mutual information measurements for
, the mutual information between
and
(
) and the mutual information between
and
(
). In this method, the following criterion is used for the selection of bio-basis strings [38, 41]

where
is a constant. In the current study, the value of
is set at 0.7 to give more weightage in discrimination [38, 41]. The major drawback of the method proposed by Yang and Thomson [41] is a huge number of prior and joint probabilities are to be calculated, which makes the method computationally expensive.







(3.4)



(3.5)





(3.6)



(3.7)

(3.8)



(3.9)














(3.10)


3.2.3 Selection of Bio-Basis Strings Using Fisher Ratio
In [8], Berry et al. proposed a method to select a set
of
bio-basis strings from the whole set
of
subsequences based on their discriminant capability. The discriminant capability of each subsequence is calculated using the Fisher ratio [14]. The Fisher ratio is used to maximize the discriminant capability of a subsequence in terms of interclass separation (as large as possible) and intraclass spread between subsequences (as small as possible). The larger the Fisher ratio value, the larger the discriminant capability of the subsequence. Based on the values of the Fisher ratio,
subsequences of
can be ranked from the strongest discriminant capability to the weakest one. The method yields a set
of
subsequences from
as the bio-basis strings which possess good discriminant capability between two classes, having evolved from original data set.









However, the
subsequences of
would have different compositions of amino acids. Hence, they should have different pairwise alignment scores with the other subsequences of
. As the class properties of these training subsequences are known, these similarity values can be partitioned into two groups or classes (functional and nonfunctional), which are denoted as
and
, respectively. Denoting the similarity between two subsequences
and
as
, the mean and standard deviation values for these two groups with respect to the subsequence
are as follows:
![$$\begin{aligned} U_{A_i}&= E_A[h(x_j,x_i)]=\frac{1}{n_A} \sum h(x_j,x_i); \qquad \forall x_j \in X_A\end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ11.gif?w=960)
![$$\begin{aligned} U_{B_i}&= E_B[h(x_k,x_i)]=\frac{1}{n_B} \sum h(x_k,x_i); \qquad \forall x_k \in X_B \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ12.gif?w=960)
![$$\begin{aligned} \sigma _{A_i}^2&= E_A[h^2(x_j,x_i)]-[E_A[h(x_j,x_i)]]^2 \nonumber \\&=\frac{1}{n_A} \sum \{h(x_j,x_i)-U_{A_i}\}^2;~~\forall x_j \in X_A \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ13.gif?w=960)
![$$\begin{aligned} \sigma _{B_i}^2&= E_B[h^2(x_k,x_i)]-[E_B[h(x_k,x_i)]]^2 \nonumber \\&=\frac{1}{n_B} \sum \{h(x_k,x_i)-U_{B_i}\}^2;~~\forall x_k \in X_B \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ14.gif?w=960)
where
and
are the number of similarity values in
and
, respectively.
and
represent the zero-mean, first and second order moment of similarity, that is, expectation of
and
, respectively. Based on these four quantities, the discriminant capability of each subsequence can be measured using the Fisher ratio

where
![$$\begin{aligned} |U_{A_i}-U_{B_i}|= |E_A[h(x_j,x_i)]-E_B[h(x_k,x_i)]|; \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ16.gif?w=960)
and
![$$\begin{aligned} \sigma _{A_i}^2+\sigma _{B_i}^2 = \{E_A[h^2(x_j,x_i)]+E_B[h^2(x_k,x_i)]\} -\{[E_A[h(x_j,x_i)]]^2+[E_B[h(x_k,x_i)]]^2\}. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ17.gif?w=960)
The basic steps of this method follows next:
However, the bio-basis strings in nonnumerical sequence space should be such that the similarity between pairs of bio-basis strings would be as minimum as possible. Each of them would then represent a unique feature in numerical feature space. The methods proposed in [8, 41] have not adequately addressed this problem. Also, not much attention has been paid to it earlier.









![$$\begin{aligned} U_{A_i}&= E_A[h(x_j,x_i)]=\frac{1}{n_A} \sum h(x_j,x_i); \qquad \forall x_j \in X_A\end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ11.gif?w=960)
(3.11)
![$$\begin{aligned} U_{B_i}&= E_B[h(x_k,x_i)]=\frac{1}{n_B} \sum h(x_k,x_i); \qquad \forall x_k \in X_B \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ12.gif?w=960)
(3.12)
![$$\begin{aligned} \sigma _{A_i}^2&= E_A[h^2(x_j,x_i)]-[E_A[h(x_j,x_i)]]^2 \nonumber \\&=\frac{1}{n_A} \sum \{h(x_j,x_i)-U_{A_i}\}^2;~~\forall x_j \in X_A \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ13.gif?w=960)
(3.13)
![$$\begin{aligned} \sigma _{B_i}^2&= E_B[h^2(x_k,x_i)]-[E_B[h(x_k,x_i)]]^2 \nonumber \\&=\frac{1}{n_B} \sum \{h(x_k,x_i)-U_{B_i}\}^2;~~\forall x_k \in X_B \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ14.gif?w=960)
(3.14)




![$$E[h(x_j,x_i)]$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_IEq116.gif?w=960)
![$$E[h^2(x_k,x_i)]$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_IEq117.gif?w=960)



(3.15)
![$$\begin{aligned} |U_{A_i}-U_{B_i}|= |E_A[h(x_j,x_i)]-E_B[h(x_k,x_i)]|; \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ16.gif?w=960)
(3.16)
![$$\begin{aligned} \sigma _{A_i}^2+\sigma _{B_i}^2 = \{E_A[h^2(x_j,x_i)]+E_B[h^2(x_k,x_i)]\} -\{[E_A[h(x_j,x_i)]]^2+[E_B[h(x_k,x_i)]]^2\}. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ17.gif?w=960)
(3.17)
1.
Calculate the discriminant capabilities of all subsequences of
using the Fisher ratio as in (3.15).

2.
Rank all subsequences of
based on the values of Fisher ratio in descending order.

3.
Select first
subsequences from
as the set
of bio-basis strings.



3.3 Novel String Kernel Function
In this section, a novel string kernel function is presented [21] based on the concepts of biological dissimilarity and zone of influence of bio-basis string. Next, an efficient method is reported for selection of bio-basis strings integrating the Fisher ratio and the principle of degree of resemblance.
3.3.1 Asymmetricity of Biological Dissimilarity
Here, we define two asymmetric dissimilarities between two subsequences
and
as follows [21]:

where
denotes the dissimilarity of subsequence
from the subsequence
and
is the nongapped homology score between two subsequences
and
.



(3.18)






Consider two subsequences
and
with 4 residues. According to the Dayhoff mutation matrix (Table 3.1), the nongapped pairwise homology score between two subsequences
and
is, therefore,
=
= 100, while the maximum homology scores of two subsequences
and
are given by
and
, respectively. Hence, the dissimilarity of subsequence
from subsequence
is given by
whereas the dissimilarity of
from
is as follows:
Thus, the dissimilarity is asymmetric in nature, that is,

The asymmetricity reflects domain organizations of two subsequences
and
. When two subsequences
and
consist of the same single domain,
and
will be similar small values. However, suppose that
has one extra domain, then
becomes large even if
is small. These dissimilarities may be used for clustering of protein sequences or subsequences so that domain organizations are well reflected. The asymmetric property of the biological dissimilarity was also observed by Stojmirovic [33] and Itoh et al. [16]. The asymmetric dissimilarity might be a powerful tool to cluster sequences or subsequences and to explore gene/protein universe.

















(3.19)









3.3.2 Novel Bio-Basis Function
The design of novel string kernel function is based on the principle of asymmetric biological dissimilarity [21]. Using a table look-up technique, a biological dissimilarity is calculated for a pair of subsequences based on an amino acid mutation matrix. The nongapped pairwise alignment method is used to calculate this dissimilarity, where no deletion or insertion is used to align two subsequences. The definition of the novel bio-basis function (nBBF) is as follows [21]:

The parameter
in (3.20) represents the zone of influence of the
th bio-basis string
. It represents the variance of the bio-basis string
with respect to the subsequences nearest to it. In other words, if each bio-basis string is considered as a cluster prototype, then the zone of influence of it represents the radius of that cluster. The value of
could be the same for all bio-basis strings if all of them are expected to form similar clusters in nonnumerical sequence space. In general, it is desirable that
should relate to the overall size and shape of the cluster associated with the bio-basis string
. In the present research work, the following definition is used:

where
is the total number of subsequences having minimum dissimilarity from the
th bio-basis string
among all the bio-basis strings and
is the dissimilarity of the subsequence
from the
th bio-basis string
. In other words, the value of
represents the average dissimilarity of the input subsequences from their nearest bio-basis string
.

(3.20)








(3.21)









Hence, the novel string kernel function normalizes the asymmetric dissimilarity using the zone of influence or variance of the bio-basis string, rather than using maximum homology score of the bio-basis string as in (3.1).
3.4 Biological Dissimilarity Based String Selection Method
One of the main problem in BBF is how to select a reduced set of most relevant bio-basis strings. The bio-basis strings are the sections of biological sequences that are used for the transformation of biological data into a numerical feature space. Hence, the problem of selecting a set
of
subsequences as the bio-basis strings from the whole set
of
subsequences, where
, is a feature selection problem.





In real biological data analysis, the data set may contain a number of similar or redundant subsequences with low discriminant capability or relevance to the classes. The selection of such similar and nonrelevant subsequences as the bio-basis strings may lead to a reduction in the useful information in numerical feature space. Ideally, the selected bio-basis strings should have high discriminant capability with the classes while the similarity among them would be as low as possible. The subsequences with high discriminant capability are expected to be able to predict the classes of the subsequences. However, the prediction capability may be reduced if many similar subsequences are selected as the bio-basis strings. In contrast, a data set that contains subsequences not only with high relevance with respect to the classes but with low mutual redundancy is more effective in its prediction capability. Hence, to assess the effectiveness of the subsequences as the bio-basis strings, both the relevance and the redundancy or similarity need to be measured quantitatively. The bio-basis string selection method, reported in [21], addresses the above issues through following three phases:
An asymmetric biological dissimilarity based Fisher ratio is chosen here to compute the discriminant capability or relevance of each subsequence to the classes, while a novel concept of the degree of resemblance is used to calculate the mutual redundancy or similarity among subsequences. The nonrelevant subsequences are discarded using a nearest mean classifier [14]. Next the calculation of the Fisher ratio using asymmetric biological dissimilarity is provided along with the concept of degree of resemblance and the principle of nearest mean classifier.
1.
computation of the discriminant capability or relevance of each subsequence;
2.
determination of the nonrelevant subsequences; and
3.
computation of the similarity or redundancy among subsequences.
3.4.1 Fisher Ratio Using Biological Dissimilarity
In the method reported in [21], the Fisher ratio [14] is used to measure the discriminant capability or relevance of each subsequence
. The Fisher ratio is calculated based on the asymmetric biological dissimilarity. As the class labels of all training subsequences are known, the set
can be partitioned into two groups or classes
and
, respectively, where


Hence, each subsequence
should have
and
dissimilarity values with the subsequences of
and
, respectively. Denoting the dissimilarity of the subsequence
from the subsequence
as
, the mean and standard deviation values for the two classes
and
with respect to the subsequence
are as follows:




where
,
,
, and
represent the mean and standard deviation values of the subsequence
for two groups
and
, respectively. These four quantities are calculated based on the square of biological dissimilarity with respect to the subsequence
. Based on these four quantities, the discriminant capability of each subsequence
is computed using the Fisher ratio that is as follows:

Let
be the maximum homology score of the subsequence
. The above four quantities can then be written using
as
![$$\begin{aligned} \bar{U}_{A_i}&=\{\kappa _i^2 + E_A[h^2(x_j,x_i)]- 2\kappa _i E_A[h(x_j,x_i)]\};\end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ29.gif?w=960)
![$$\begin{aligned} \bar{U}_{B_i}&=\{\kappa _i^2 + E_B[h^2(x_k,x_i)]- 2\kappa _i E_B[h(x_k,x_i)]\}; \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ30.gif?w=960)
![$$\begin{aligned} \bar{\sigma }_{A_i}^2&= \{4\kappa _i^2(E_A[h^2(x_j,x_i)]-[E_A[h(x_j,x_i)]]^2)\nonumber \\&\quad -\,4\kappa _i(E_A[h^3(x_j,x_i)]-E_A[h(x_j,x_i)]E_A[h^2(x_j,x_i)])\nonumber \\&\quad -\,[E_A[h^2(x_j,x_i)]]^2+E_A[h^4(x_j,x_i)] \}; \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ31.gif?w=960)
and
![$$\begin{aligned} \bar{\sigma }_{B_i}^2&= \{4\kappa _i^2(E_B[h^2(x_k,x_i)]-[E_B[h(x_k,x_i)]]^2)\nonumber \\&\quad -\,4\kappa _i(E_B[h^3(x_k,x_i)]-E_B[h(x_k,x_i)]E_B[h^2(x_k,x_i)])\nonumber \\&\quad -\,[E_B[h^2(x_k,x_i)]]^2+E_B[h^4(x_k,x_i)] \}; \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ32.gif?w=960)
where
represents the zero-mean,
th order moment of similarity
between two subsequences
and
. Now, the numerator of (3.28) is given by
![$$\begin{aligned} |\bar{U}_{A_i}-\bar{U}_{B_i}|&= |\{E_A[h^2(x_j,x_i)]-E_B[h^2(x_k,x_i)]\} \nonumber \\&\quad -2 \kappa _i \{E_A[h(x_j,x_i)]-E_B[h(x_k,x_i)]\}|. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ33.gif?w=960)
Hence, the numerator of (3.28) not only depends on the difference of zero-mean, first-order moment of similarity of two groups as in (3.15), it also takes into account the difference of zero-mean, second-order moment of similarity as well as the maximum homology score of the subsequence
. That is, the numerator of (3.28) depends on following three factors:
Similarly, the denominator of (3.28) contains the following terms:
![$$\begin{aligned} \bar{\sigma }_{A_i}^2+\bar{\sigma }_{B_i}^2&= [4\kappa _i^2\{E_A[h^2(x_j,x_i)]+E_B[h^2(x_k,x_i)]\}\nonumber \\&\quad -4\kappa _i^2\{[E_A[h(x_j,x_i)]]^2+[E_B[h(x_k,x_i)]]^2\}\nonumber \\&\quad -4\kappa _i\{E_A[h^3(x_j,x_i)]+E_B[h^3(x_k,x_i)]\}\nonumber \\&\quad +4\kappa _i\{E_A[h(x_j,x_i)]E_A[h^2(x_j,x_i)]\nonumber \\&\quad +E_B[h(x_k,x_i)]E_B[h^2(x_k,x_i)]\}\nonumber \\&\quad -\{[E_A[h^2(x_j,x_i)]]^2+[E_B[h^2(x_k,x_i)]]^2\}\nonumber \\&\quad +\{E_A[h^4(x_j,x_i)]+E_B[h^4(x_k,x_i)]\}]. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ34.gif?w=960)
Hence, the denominator of (3.28) considers the zero-mean, higher order (upto fourth order) moment of similarity of two groups as well as the maximum homology score of the subsequence
, while that of (3.15) only takes into account the zero-mean, first- and second-order moment of similarity of two groups and does not consider the maximum homology score of the subsequence
. In effect, (3.28) calculates the discriminant capability of each subsequence
more accurately.





(3.22)

(3.23)












(3.24)

(3.25)

(3.26)

(3.27)










(3.28)



![$$\begin{aligned} \bar{U}_{A_i}&=\{\kappa _i^2 + E_A[h^2(x_j,x_i)]- 2\kappa _i E_A[h(x_j,x_i)]\};\end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ29.gif?w=960)
(3.29)
![$$\begin{aligned} \bar{U}_{B_i}&=\{\kappa _i^2 + E_B[h^2(x_k,x_i)]- 2\kappa _i E_B[h(x_k,x_i)]\}; \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ30.gif?w=960)
(3.30)
![$$\begin{aligned} \bar{\sigma }_{A_i}^2&= \{4\kappa _i^2(E_A[h^2(x_j,x_i)]-[E_A[h(x_j,x_i)]]^2)\nonumber \\&\quad -\,4\kappa _i(E_A[h^3(x_j,x_i)]-E_A[h(x_j,x_i)]E_A[h^2(x_j,x_i)])\nonumber \\&\quad -\,[E_A[h^2(x_j,x_i)]]^2+E_A[h^4(x_j,x_i)] \}; \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ31.gif?w=960)
(3.31)
![$$\begin{aligned} \bar{\sigma }_{B_i}^2&= \{4\kappa _i^2(E_B[h^2(x_k,x_i)]-[E_B[h(x_k,x_i)]]^2)\nonumber \\&\quad -\,4\kappa _i(E_B[h^3(x_k,x_i)]-E_B[h(x_k,x_i)]E_B[h^2(x_k,x_i)])\nonumber \\&\quad -\,[E_B[h^2(x_k,x_i)]]^2+E_B[h^4(x_k,x_i)] \}; \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ32.gif?w=960)
(3.32)
![$$E[h^r(x_j,x_i)]$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_IEq204.gif?w=960)




![$$\begin{aligned} |\bar{U}_{A_i}-\bar{U}_{B_i}|&= |\{E_A[h^2(x_j,x_i)]-E_B[h^2(x_k,x_i)]\} \nonumber \\&\quad -2 \kappa _i \{E_A[h(x_j,x_i)]-E_B[h(x_k,x_i)]\}|. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ33.gif?w=960)
(3.33)

difference of zero-mean, first order moment of similarity of two groups,;
difference of zero-mean, second-order moment of similarity of two groups,;
maximum homology score of the subsequence, that is,
.
![$$\begin{aligned} \bar{\sigma }_{A_i}^2+\bar{\sigma }_{B_i}^2&= [4\kappa _i^2\{E_A[h^2(x_j,x_i)]+E_B[h^2(x_k,x_i)]\}\nonumber \\&\quad -4\kappa _i^2\{[E_A[h(x_j,x_i)]]^2+[E_B[h(x_k,x_i)]]^2\}\nonumber \\&\quad -4\kappa _i\{E_A[h^3(x_j,x_i)]+E_B[h^3(x_k,x_i)]\}\nonumber \\&\quad +4\kappa _i\{E_A[h(x_j,x_i)]E_A[h^2(x_j,x_i)]\nonumber \\&\quad +E_B[h(x_k,x_i)]E_B[h^2(x_k,x_i)]\}\nonumber \\&\quad -\{[E_A[h^2(x_j,x_i)]]^2+[E_B[h^2(x_k,x_i)]]^2\}\nonumber \\&\quad +\{E_A[h^4(x_j,x_i)]+E_B[h^4(x_k,x_i)]\}]. \end{aligned}$$](https://i0.wp.com/basicmedicalkey.com/wp-content/uploads/2017/05/A319338_1_En_3_Chapter_Equ34.gif?w=960)
(3.34)



3.4.2 Nearest Mean Classifier
After computing the discriminant capability or relevance
of each subsequence
using the Fisher ratio according to (3.28), the nonrelevant subsequences are discarded based on a threshold value
. The subsequences those have the Fisher ratio values larger than or equal to the threshold value
are considered as the candidate bio-basis strings. The value of
is obtained using the concept of nearest mean classifier.






Stay updated, free articles. Join our Telegram channel

Full access? Get Clinical Tree

