Design of String Kernel to Predict Protein Functional Sites Using Kernel-Based Classifiers

-medoids, to select a set of bio-basis strings for amino acid sequence analysis. The comparative performance analysis of different relational clustering algorithms on bio-basis string selection problem is also reported in [19, 23].


In this chapter, a new string kernel function, termed as novel BBF [21, 22], is reported. It modifies existing BBF and is developed based on the principle of asymmetricity of biological dissimilarity. The dissimilarity is measured using an amino acid mutation matrix. The concept of zone of influence of the bio-basis string is introduced in the novel kernel function to normalize the asymmetric dissimilarity. It takes into account the influence or impact of each bio-basis string in nonnumerical sequence space. An efficient method, introduced in [21] is presented, which integrates the Fisher ratio and the concept of degree of resemblance to select most relevant and distinct bio-basis strings for the novel string kernel function. Instead of using symmetric similarity measure, the asymmetric biological dissimilarity is used to calculate the Fisher ratio, which is more effective for selection of most relevant bio-basis strings. The degree of resemblance enables efficient selection of a set of distinct bio-basis strings. In effect, it reduces the redundant features in numerical feature space. Some quantitative measures are presented to evaluate the quality of selected bio-basis strings. The effectiveness of the novel string kernel function and the new bio-basis string selection method, along with a comparison with existing BBF and related bio-basis string selection methods, is demonstrated on different protein data sets.

The structure of the rest of this chapter is as follows: Sect. 3.2 briefly introduces necessary notions of BBF, and the bio-basis string selection methods proposed by Berry et al. [8] and Yang and Thomson [41]. In Sect. 3.3, a novel string kernel function is presented, integrating the concepts of asymmetricity of biological dissimilarity and the zone of influence of bio-basis string. In Sect. 3.4, an efficient bio-basis string selection method is reported based on the Fisher ratio and the degree of resemblance. Some quantitative performance measures are presented in Sect. 3.5 to evaluate the quality of selected bio-basis strings. A few case studies and a comparison with existing string kernel function and related methods are presented in Sect. 3.6. Concluding remarks are given in Sect. 3.7.



3.2 String Kernel for Protein Functional Site Identification


In this section, the basic notion in the theory of bio-basis function is reported, along with the bio-basis string selection methods proposed by Yang and Thomson [41] and Berry et al. [8].


3.2.1 Bio-Basis Function


A widely used method in sequence analysis is the sequence alignment [2, 3]. In this method, the function of a sequence is annotated through aligning a novel sequence with known sequences. If the alignment between a novel sequence and a known sequence gives a very high similarity (homology) score, the novel sequence is believed to have the same or similar function as the known sequence. In this method, an amino acid mutation matrix is commonly used. Each mutation matrix has 20 columns and 20 rows. A value at the $$n$$th row and $$m$$th column is a probability or a likelihood value that the $$n$$th amino acid mutates to the $$m$$th amino acid after a particular evolutionary time [15, 17]. The mutation probabilities as similarities among amino acids are, therefore, metrics. The Dayhoff matrix (Table 3.1) was the first mutation matrix developed in 1978 [13] and many new mutation matrices were developed later on, for instance, the Blosum62 matrix [15]. However, the above method may not be useful directly for subsequence analysis. Because, a subsequence may not contain enough information for conventional alignment.


Table 3.1
Dayhoff mutation matrix: 1 point mutation is accepted per 100 residues (PAM1)







































































































































































































































































































































































































































































































 
A

C

D

E

F

G

H

I

K

L

M

N

P

Q

R

S

T

V

W

Y

A

40

24

32

32

16

36

28

28

28

24

28

32

36

32

24

36

36

32

8

20

C

24

80

12

12

16

20

20

24

12

8

12

16

20

12

16

32

24

24

0

32

D

32

12

48

44

8

36

36

24

32

16

20

40

28

40

28

32

32

24

4

16

E

32

12

44

48

12

32

36

24

32

20

24

36

28

40

28

32

32

24

4

16

F

16

16

8

12

68

12

24

36

12

40

32

16

12

12

16

20

20

28

32

60

G

36

20

36

32

12

52

24

20

24

16

20

32

28

28

20

36

32

28

4

12

H

28

20

36

36

24

24

56

24

32

24

24

40

32

44

40

28

28

24

20

32

I

28

24

24

24

36

20

24

52

24

40

40

24

24

24

24

28

32

48

12

28

K

28

12

32

32

12

24

32

24

52

20

32

36

28

36

44

32

32

24

20

16

L

24

8

16

20

40

16

24

40

20

56

48

20

20

24

20

20

24

40

24

28

M

28

12

20

24

32

20

24

40

32

48

56

24

24

28

32

24

28

40

16

24

N

32

16

40

36

16

32

40

24

36

20

24

40

28

36

32

36

32

24

16

24

P

36

20

28

28

12

28

32

24

28

20

24

28

56

32

32

36

32

28

8

12

Q

32

12

40

40

12

28

44

24

36

24

28

36

32

48

36

28

28

24

12

16

R

24

16

28

28

16

20

40

24

44

20

32

32

32

36

56

32

28

24

40

16

S

36

32

32

32

20

36

28

28

32

20

24

36

36

28

32

40

36

28

24

20

T

36

24

32

32

20

32

28

32

32

24

28

32

32

28

28

36

44

32

12

20

V

32

24

24

24

28

28

24

48

24

40

40

24

28

24

24

28

32

48

8

24

W

8

0

4

4

32

4

20

12

20

24

16

16

8

12

40

24

12

8

100

32

Y

20

32

16

16

60

12

32

28

16

28

24

24

12

16

16

20

20

24

32

72

To alleviate this problem, the concept of BBF is introduced in [8, 35, 41] for subsequence analysis, which is based on the principle of conventional alignment technique. Using a table look-up technique, a homology score as a similarity value can be obtained for a pair of subsequences. The nongapped pairwise alignment technique is used to calculate this similarity value, where no deletion or insertion is used to align two subsequences [8, 35, 41]. For ease of discussions, in rest of the chapter, the following terminology is used.



  • $${\mathbb A}=\{\text{ A }, \text{ C }, \ldots , \text{ W }, \text{ Y }\}$$ be the set of 20 amino acids.


  • $$n$$ represents the total number of subsequences.


  • $$X=\{x_1,\ldots ,x_j,\ldots ,x_n\}$$ be the set of $$n$$ subsequences with $$m$$ residues, $$\forall x_j \in {\mathbb A}^m$$.


  • $$c$$ represents the total number of bio-basis strings.


  • $$V=\{v_1,\ldots ,v_i,\ldots ,v_c\}$$ be the set of $$c$$ bio-basis strings and $$\forall v_i \in X$$.


  • $$x_{j}[k] \in {\mathbb A}$$, $$v_{i}[k] \in {\mathbb A}$$, $$\forall _{k=1}^{m}$$.
The definition of BBF is as follows [35, 41]:


$$\begin{aligned} f(x_j,v_i)=\mathrm{{exp}} \left\{ \gamma _b \frac{h(x_j,v_i)-h(v_i,v_i)}{h(v_i,v_i)} \right\} \end{aligned}$$

(3.1)




  • $$h(x_j,v_i)$$ is the homology score between a subsequence $$x_j$$ and a bio-basis string $$v_i$$ calculated using an amino acid mutation matrix [8, 35, 41];


  • $$h(v_i,v_i)$$ denotes the maximum homology score of the $$i$$th bio-basis string $$v_i$$; and


  • $$\gamma _b$$ is a constant and typically chosen to be 1 [8, 35].
Suppose both $$x_{j}$$ and $$v_{i}$$ have $$m$$ residues, the homology score between $$x_j$$ and $$v_i$$ is then defined as


$$\begin{aligned} h(x_{j},v_i)=\sum _{k=1}^{m} {\mathbb M}(x_{j}[k],v_{i}[k]) \end{aligned}$$

(3.2)
where $${\mathbb M}(x_{j}[k],v_{i}[k])$$ can be obtained from an amino acid mutation matrix through a table look-up method. Note that $$x_{j}[k],v_{i}[k] \in {\mathbb A}$$ and $${\mathbb A}$$ is a set of 20 amino acids (Table 3.1).

Consider two bio-basis strings $$v_1=\text{ KPRT }$$ and $$v_2=\text{ YKAE }$$, and a subsequence $$x_1=\text{ IPRS }$$ having $$m=4$$ residues. The nongapped pairwise homology score is calculated between the subsequence $$x_1$$ and each bio-basis string considering the mutation probabilities as in Table 3.1. For the first bio-basis string $$v_1$$, four mutation probabilities are


$$\begin{aligned}&{\mathbb M}(x_{1}[1],v_{1}[1])={\mathbb M}(\text{ I, } \text{ K })=24;~~ {\mathbb M}(x_{1}[2],v_{1}[2])={\mathbb M}(\text{ P, } \text{ P })=56;\\&{\mathbb M}(x_{1}[3],v_{1}[3])={\mathbb M}(\text{ R, } \text{ R })=56;~~ {\mathbb M}(x_{1}[4],v_{1}[4])={\mathbb M}(\text{ S, } \text{ T })=36. \end{aligned}$$
Hence, the homology score between the subsequence $$x_1$$ and the bio-basis string $$v_1$$ is given by


$$\begin{aligned} h(x_1,v_1)=\sum _{k=1}^{4} {\mathbb M}(x_{1}[k],v_{1}[k])=172. \end{aligned}$$
Similarly, for the second bio-basis string $$v_2$$, four mutation probabilities are 28, 28, 24, and 32. Thus, the value of $$h(x_1,v_2)$$ between the subsequence $$x_1$$ and the bio-basis string $$v_2$$ is as follows:


$$\begin{aligned} h(x_1,v_2)=\sum _{k=1}^{4} {\mathbb M}(x_{1}[k],v_{2}[k])=112. \end{aligned}$$
The maximum homology scores of two bio-basis strings $$v_1$$ and $$v_2$$ are given by


$$\begin{aligned} h(v_1,v_1)=208~~~\mathrm{{and}}~~~h(v_2,v_2)=212. \end{aligned}$$
Considering the value of $$\gamma _b=1$$


$$\begin{aligned} f(x_1,v_1)&=\mathrm{{exp}} \left\{ \gamma _b \frac{h(x_1,v_1)-h(v_1,v_1)}{h(v_1,v_1)} \right\} =0.633334;\\ f(v_1,v_1)&=\mathrm{{exp}} \left\{ \gamma _b \frac{h(v_1,v_1)-h(v_1,v_1)}{h(v_1,v_1)} \right\} =1.000000;\\ f(x_1,v_2)&=\mathrm{{exp}} \left\{ \gamma _b \frac{h(x_1,v_2)-h(v_2,v_2)}{h(v_2,v_2)} \right\} =0.287988;\\ f(v_2,v_2)&=\mathrm{{exp}} \left\{ \gamma _b \frac{h(v_2,v_2)-h(v_2,v_2)}{h(v_2,v_2)} \right\} =1.000000. \end{aligned}$$
Hence, the value of BBF $$f(x_i,x_j)$$ is high if two subsequences $$x_i$$ and $$x_j$$ are similar or close to each other. The function value is one if two subsequences are identical, and small if they are distinct. The function needs a subsequence as a support (bio-basis string). Each bio-basis string is a feature dimension in a numerical feature space. If $${\mathbb A}$$ is used to denote a collection of 20 amino acids, an input space of all potential subsequences with $$m$$ residues is $${\mathbb A}^m$$. Then, a collection of $$c$$ bio-basis strings formulates a numerical feature space $${\mathbb R}^c$$, to which a nonnumerical sequence space $${\mathbb A}^m$$ is mapped for analysis. More importantly, the BBF can transform various homology scores to a real number as a similarity within the interval $$[0,1]$$, that is,


$$\begin{aligned} 0 \le f(x_j,v_i) \le 1. \end{aligned}$$

(3.3)
After the mapping using BBF, a nonnumerical subsequence space $${\mathbb A}^m$$ will be mapped to a $$c$$-dimensional numerical feature space $${\mathbb R}^c$$, that is, $${\mathbb A}^m \rightarrow {\mathbb R}^c$$.


3.2.2 Selection of Bio-Basis Strings Using Mutual Information


In [41], Yang and Thomson proposed a method for bio-basis string selection using mutual information [32]. The necessity for a bio-basis string to be an independent and informative feature can be determined by the shared information between the bio-basis string and the rest as well as the shared information between the bio-basis string and class label [41].

The mutual information is quantified as the difference between the initial uncertainty and the conditional uncertainty. Let $$\Phi = \{v_i\}$$ be a set of selected bio-basis strings, $$\Theta = \{v_k\}$$ a set of candidate bio-basis strings. $$\Phi = \phi $$ (empty) at the beginning. A prior probability of a bio-basis string $$v_k$$ is referred as $$p(v_k)$$. The initial uncertainty of $$v_k$$ is defined as


$$\begin{aligned} H(v_k) = -p(v_k)\ln p(v_k). \end{aligned}$$

(3.4)
Similarly, the joint entropy of two bio-basis strings $$v_k$$ and $$v_i$$ is given by


$$\begin{aligned} H(v_k,v_i)=-p(v_k,v_i)\ln p(v_k,v_i) \end{aligned}$$

(3.5)
where $$v_i \in \Phi $$ and $$v_k\in \Theta $$. The mutual information between $$v_k$$ and $$v_i$$ is, therefore, given by


$$\begin{aligned} I(v_k,v_i) =&H(v_k) + H(v_i) - H(v_k,v_i)\nonumber \\ =&\{-p(v_k)\ln p(v_k)-p(v_i)\ln p(v_i) \nonumber \\&+ p(v_k,v_i)\ln p(v_k,v_i)\}. \end{aligned}$$

(3.6)
However, the mutual information of $$v_k$$ with respect to all the bio-basis strings in $$\Phi $$ is


$$\begin{aligned} I(v_k,\Phi )=\sum _{v_i \in \Phi } I(v_k,v_i). \end{aligned}$$

(3.7)
Combining (3.6) and (3.7), we get [41]


$$\begin{aligned} I(v_k,\Phi )=\sum _{v_i \in \Phi } p(v_k,v_i)\ln \left\{ \frac{p(v_k,v_i)}{p(v_k)p(v_i)} \right\} . \end{aligned}$$

(3.8)
Replacing $$\Phi $$ with the class label $$\Omega =\{\Omega _1,\ldots , \Omega _j, \ldots , \Omega _M\}$$, the mutual information


$$\begin{aligned} I(v_k,\Omega )=\sum _{\Omega _j \in \Omega } p(v_k,\Omega _j)\ln \left\{ \frac{p(v_k,\Omega _j)}{p(v_k)p(\Omega _j)}\right\} \end{aligned}$$

(3.9)
measures the mutual relationship between $$v_k$$ and $$\Omega $$. A bio-basis string whose $$I(v_k,\Omega )$$ value is the largest will be selected as $$v_k$$ and will make the largest contribution to modeling (discrimination using $$\Omega $$) among all the remaining bio-basis strings in $$\Theta $$. Therefore, there are two mutual information measurements for $$v_k$$, the mutual information between $$v_k$$ and $$\Omega $$ ($$I(v_k,\Omega )$$) and the mutual information between $$v_k$$ and $$\Phi $$ ($$I(v_k,\Phi )$$). In this method, the following criterion is used for the selection of bio-basis strings [38, 41]


$$\begin{aligned} J(v_k)=\alpha _{\mathrm{{YT}}} I(v_k,\Omega )-(1-\alpha _{\mathrm{{YT}}})I(v_k,\Phi ) \end{aligned}$$

(3.10)
where $$\alpha _{\mathrm{{YT}}}$$ is a constant. In the current study, the value of $$\alpha _{\mathrm{{YT}}}$$ is set at 0.7 to give more weightage in discrimination [38, 41]. The major drawback of the method proposed by Yang and Thomson [41] is a huge number of prior and joint probabilities are to be calculated, which makes the method computationally expensive.


3.2.3 Selection of Bio-Basis Strings Using Fisher Ratio


In [8], Berry et al. proposed a method to select a set $$V=\{v_1,\ldots , v_i, \ldots , v_c\}$$ of $$c$$ bio-basis strings from the whole set $$X=\{x_1, \ldots , x_j, \ldots , x_n\}$$ of $$n$$ subsequences based on their discriminant capability. The discriminant capability of each subsequence is calculated using the Fisher ratio [14]. The Fisher ratio is used to maximize the discriminant capability of a subsequence in terms of interclass separation (as large as possible) and intraclass spread between subsequences (as small as possible). The larger the Fisher ratio value, the larger the discriminant capability of the subsequence. Based on the values of the Fisher ratio, $$n$$ subsequences of $$X$$ can be ranked from the strongest discriminant capability to the weakest one. The method yields a set $$V$$ of $$c$$ subsequences from $$X$$ as the bio-basis strings which possess good discriminant capability between two classes, having evolved from original data set.

However, the $$n$$ subsequences of $$X$$ would have different compositions of amino acids. Hence, they should have different pairwise alignment scores with the other subsequences of $$X$$. As the class properties of these training subsequences are known, these similarity values can be partitioned into two groups or classes (functional and nonfunctional), which are denoted as $$X_A \subset X$$ and $$X_B \subset X$$, respectively. Denoting the similarity between two subsequences $$x_i$$ and $$x_j$$ as $$h(x_j,x_i)$$, the mean and standard deviation values for these two groups with respect to the subsequence $$x_i$$ are as follows:


$$\begin{aligned} U_{A_i}&= E_A[h(x_j,x_i)]=\frac{1}{n_A} \sum h(x_j,x_i); \qquad \forall x_j \in X_A\end{aligned}$$

(3.11)



$$\begin{aligned} U_{B_i}&= E_B[h(x_k,x_i)]=\frac{1}{n_B} \sum h(x_k,x_i); \qquad \forall x_k \in X_B \end{aligned}$$

(3.12)



$$\begin{aligned} \sigma _{A_i}^2&= E_A[h^2(x_j,x_i)]-[E_A[h(x_j,x_i)]]^2 \nonumber \\&=\frac{1}{n_A} \sum \{h(x_j,x_i)-U_{A_i}\}^2;~~\forall x_j \in X_A \end{aligned}$$

(3.13)



$$\begin{aligned} \sigma _{B_i}^2&= E_B[h^2(x_k,x_i)]-[E_B[h(x_k,x_i)]]^2 \nonumber \\&=\frac{1}{n_B} \sum \{h(x_k,x_i)-U_{B_i}\}^2;~~\forall x_k \in X_B \end{aligned}$$

(3.14)
where $$n_A$$ and $$n_B$$ are the number of similarity values in $$X_A$$ and $$X_B$$, respectively. $$E[h(x_j,x_i)]$$ and $$E[h^2(x_k,x_i)]$$ represent the zero-mean, first and second order moment of similarity, that is, expectation of $$h(x_j,x_i)$$ and $$h^2(x_j,x_i)$$, respectively. Based on these four quantities, the discriminant capability of each subsequence can be measured using the Fisher ratio


$$\begin{aligned} F(x_i)= \frac{|U_{A_i}-U_{B_i}|}{\sqrt{\sigma _{A_i}^2+\sigma _{B_i}^2}} \end{aligned}$$

(3.15)
where


$$\begin{aligned} |U_{A_i}-U_{B_i}|= |E_A[h(x_j,x_i)]-E_B[h(x_k,x_i)]|; \end{aligned}$$

(3.16)
and


$$\begin{aligned} \sigma _{A_i}^2+\sigma _{B_i}^2 = \{E_A[h^2(x_j,x_i)]+E_B[h^2(x_k,x_i)]\} -\{[E_A[h(x_j,x_i)]]^2+[E_B[h(x_k,x_i)]]^2\}. \end{aligned}$$

(3.17)
The basic steps of this method follows next:

1.

Calculate the discriminant capabilities of all subsequences of $$X$$ using the Fisher ratio as in (3.15).

 

2.

Rank all subsequences of $$X$$ based on the values of Fisher ratio in descending order.

 

3.

Select first $$c$$ subsequences from $$X$$ as the set $$V$$ of bio-basis strings.

 
However, the bio-basis strings in nonnumerical sequence space should be such that the similarity between pairs of bio-basis strings would be as minimum as possible. Each of them would then represent a unique feature in numerical feature space. The methods proposed in [8, 41] have not adequately addressed this problem. Also, not much attention has been paid to it earlier.


3.3 Novel String Kernel Function


In this section, a novel string kernel function is presented [21] based on the concepts of biological dissimilarity and zone of influence of bio-basis string. Next, an efficient method is reported for selection of bio-basis strings integrating the Fisher ratio and the principle of degree of resemblance.


3.3.1 Asymmetricity of Biological Dissimilarity


Here, we define two asymmetric dissimilarities between two subsequences $$x_i$$ and $$x_j$$ as follows [21]:


$$\begin{aligned}&d_{x_i \rightarrow x_j}=d(x_j,x_i)=\{h(x_i,x_i)-h(x_j,x_i)\}\nonumber \\&d_{x_j \rightarrow x_i}=d(x_i,x_j)=\{h(x_j,x_j)-h(x_i,x_j)\} \end{aligned}$$

(3.18)
where $$d_{x_i \rightarrow x_j}$$ denotes the dissimilarity of subsequence $$x_j$$ from the subsequence $$x_i$$ and $$h(x_i,x_j)=h(x_j,x_i)$$ is the nongapped homology score between two subsequences $$x_i$$ and $$x_{j}$$.

Consider two subsequences $$x_i=\text{ KPRT }$$ and $$x_j=\text{ YKAE }$$ with 4 residues. According to the Dayhoff mutation matrix (Table 3.1), the nongapped pairwise homology score between two subsequences $$x_i$$ and $$x_{j}$$ is, therefore, $$h(x_i,x_j)$$ = $$h(x_j,x_i)$$ = 100, while the maximum homology scores of two subsequences $$x_i$$ and $$x_j$$ are given by $$h(x_i,x_i)=208$$ and $$h(x_j,x_j)=212$$, respectively. Hence, the dissimilarity of subsequence $$x_j$$ from subsequence $$x_i$$ is given by


$$\begin{aligned} d(x_j,x_i)=\{h(x_i,x_i)-h(x_j,x_i)\}=208-100=108, \end{aligned}$$
whereas the dissimilarity of $$x_i$$ from $$x_j$$ is as follows:


$$\begin{aligned} d(x_i,x_j)=\{h(x_j,x_j)-h(x_i,x_j)\}=212-100=112. \end{aligned}$$
Thus, the dissimilarity is asymmetric in nature, that is,


$$\begin{aligned} d(x_j,x_i) \ne d(x_i,x_j). \end{aligned}$$

(3.19)
The asymmetricity reflects domain organizations of two subsequences $$x_i$$ and $$x_j$$. When two subsequences $$x_i$$ and $$x_j$$ consist of the same single domain, $$d(x_j,x_i)$$ and $$d(x_i,x_j)$$ will be similar small values. However, suppose that $$x_i$$ has one extra domain, then $$d(x_j,x_i)$$ becomes large even if $$d(x_i,x_j)$$ is small. These dissimilarities may be used for clustering of protein sequences or subsequences so that domain organizations are well reflected. The asymmetric property of the biological dissimilarity was also observed by Stojmirovic [33] and Itoh et al. [16]. The asymmetric dissimilarity might be a powerful tool to cluster sequences or subsequences and to explore gene/protein universe.


3.3.2 Novel Bio-Basis Function


The design of novel string kernel function is based on the principle of asymmetric biological dissimilarity [21]. Using a table look-up technique, a biological dissimilarity is calculated for a pair of subsequences based on an amino acid mutation matrix. The nongapped pairwise alignment method is used to calculate this dissimilarity, where no deletion or insertion is used to align two subsequences. The definition of the novel bio-basis function (nBBF) is as follows [21]:


$$\begin{aligned} f_{\text {novel}}(x_j,v_i)&={\text {exp}}\left\{ \frac{\{h(x_j,v_i) - h(v_i,v_i)\}}{\eta _i}\right\} \nonumber \\ {\text {that is}},\qquad \qquad \qquad \qquad ~~f_{\text {novel}}(x_j,v_i)&=\mathrm{{exp}}\left\{ \frac{-d(x_j,v_i)}{\eta _i}\right\} \end{aligned}$$

(3.20)
The parameter $$\eta _i$$ in (3.20) represents the zone of influence of the $$i$$th bio-basis string $$v_i$$. It represents the variance of the bio-basis string $$v_i$$ with respect to the subsequences nearest to it. In other words, if each bio-basis string is considered as a cluster prototype, then the zone of influence of it represents the radius of that cluster. The value of $$\eta _i$$ could be the same for all bio-basis strings if all of them are expected to form similar clusters in nonnumerical sequence space. In general, it is desirable that $$\eta _i$$ should relate to the overall size and shape of the cluster associated with the bio-basis string $$v_i$$. In the present research work, the following definition is used:


$$\begin{aligned} \eta _i = \frac{1}{n_i}\sum _{x_j} d(x_j,v_i)= \frac{1}{n_i}\sum _{x_j} {\{h(v_i,v_i)-h(x_j,v_i)\}} \end{aligned}$$

(3.21)
where $$n_i$$ is the total number of subsequences having minimum dissimilarity from the $$i$$th bio-basis string $$v_i$$ among all the bio-basis strings and $$\{h(v_i,v_i)-h(x_j,v_i)\}$$ is the dissimilarity of the subsequence $$x_j$$ from the $$i$$th bio-basis string $$v_i$$. In other words, the value of $$\eta _i$$ represents the average dissimilarity of the input subsequences from their nearest bio-basis string $$v_i$$.

Hence, the novel string kernel function normalizes the asymmetric dissimilarity using the zone of influence or variance of the bio-basis string, rather than using maximum homology score of the bio-basis string as in (3.1).


3.4 Biological Dissimilarity Based String Selection Method


One of the main problem in BBF is how to select a reduced set of most relevant bio-basis strings. The bio-basis strings are the sections of biological sequences that are used for the transformation of biological data into a numerical feature space. Hence, the problem of selecting a set $$V=\{v_1,\ldots , v_i, \ldots , v_c\}$$ of $$c$$ subsequences as the bio-basis strings from the whole set $$X=\{x_1,\ldots , x_j,\ldots , x_n\}$$ of $$n$$ subsequences, where $$V \subset X$$, is a feature selection problem.

In real biological data analysis, the data set may contain a number of similar or redundant subsequences with low discriminant capability or relevance to the classes. The selection of such similar and nonrelevant subsequences as the bio-basis strings may lead to a reduction in the useful information in numerical feature space. Ideally, the selected bio-basis strings should have high discriminant capability with the classes while the similarity among them would be as low as possible. The subsequences with high discriminant capability are expected to be able to predict the classes of the subsequences. However, the prediction capability may be reduced if many similar subsequences are selected as the bio-basis strings. In contrast, a data set that contains subsequences not only with high relevance with respect to the classes but with low mutual redundancy is more effective in its prediction capability. Hence, to assess the effectiveness of the subsequences as the bio-basis strings, both the relevance and the redundancy or similarity need to be measured quantitatively. The bio-basis string selection method, reported in [21], addresses the above issues through following three phases:

1.

computation of the discriminant capability or relevance of each subsequence;

 

2.

determination of the nonrelevant subsequences; and

 

3.

computation of the similarity or redundancy among subsequences.

 
An asymmetric biological dissimilarity based Fisher ratio is chosen here to compute the discriminant capability or relevance of each subsequence to the classes, while a novel concept of the degree of resemblance is used to calculate the mutual redundancy or similarity among subsequences. The nonrelevant subsequences are discarded using a nearest mean classifier [14]. Next the calculation of the Fisher ratio using asymmetric biological dissimilarity is provided along with the concept of degree of resemblance and the principle of nearest mean classifier.


3.4.1 Fisher Ratio Using Biological Dissimilarity


In the method reported in [21], the Fisher ratio [14] is used to measure the discriminant capability or relevance of each subsequence $$x_i \in X$$. The Fisher ratio is calculated based on the asymmetric biological dissimilarity. As the class labels of all training subsequences are known, the set $$X$$ can be partitioned into two groups or classes $$X_A$$ and $$X_B$$, respectively, where


$$\begin{aligned}&X_A \cap X_B = \emptyset ; \qquad X_A \cup X_B = X;\end{aligned}$$

(3.22)



$$\begin{aligned}&|X_A|=n_A; \qquad \,\, |X_B|=n_B; \qquad n_A + n_B = n. \end{aligned}$$

(3.23)
Hence, each subsequence $$x_i \in X$$ should have $$n_A$$ and $$n_B$$ dissimilarity values with the subsequences of $$X_A$$ and $$X_B$$, respectively. Denoting the dissimilarity of the subsequence $$x_j$$ from the subsequence $$x_i$$ as $$d(x_j,x_i)$$, the mean and standard deviation values for the two classes $$X_A$$ and $$X_B$$ with respect to the subsequence $$x_i$$ are as follows:


$$\begin{aligned} \bar{U}_{A_i}&=\frac{1}{n_A}\sum d^2(x_j,x_i); \qquad \forall x_j \in X_A\end{aligned}$$

(3.24)



$$\begin{aligned} \bar{U}_{B_i}&=\frac{1}{n_B}\sum d^2(x_k,x_i); \qquad \forall x_k \in X_B \end{aligned}$$

(3.25)



$$\begin{aligned} \bar{\sigma }_{A_i}^2&=\frac{1}{n_A}\sum \{d^2(x_j,x_i)-\bar{U}_{A_i}\}^2; \qquad \forall x_j \in X_A\end{aligned}$$

(3.26)



$$\begin{aligned} \bar{\sigma }_{B_i}^2&= \frac{1}{n_B}\sum \{d^2(x_k,x_i)- \bar{U}_{B_i}\}^2; \qquad \forall x_k \in X_B \end{aligned}$$

(3.27)
where $$\bar{U}_{A_i}$$, $$\bar{U}_{B_i}$$, $$\bar{\sigma }_{A_i}$$, and $$\bar{\sigma }_{B_i}$$ represent the mean and standard deviation values of the subsequence $$x_i$$ for two groups $$X_A$$ and $$X_B$$, respectively. These four quantities are calculated based on the square of biological dissimilarity with respect to the subsequence $$x_i$$. Based on these four quantities, the discriminant capability of each subsequence $$x_i$$ is computed using the Fisher ratio that is as follows:


$$\begin{aligned} \bar{\mathbb F}(x_i)=\frac{|\bar{U}_{A_i}-\bar{U}_{B_i}|}{\sqrt{\bar{\sigma }_{A_i}^2+\bar{\sigma }_{B_i}^2}}. \end{aligned}$$

(3.28)
Let $$\kappa _i=h(x_i,x_i)$$ be the maximum homology score of the subsequence $$x_i$$. The above four quantities can then be written using $$\kappa _i$$ as


$$\begin{aligned} \bar{U}_{A_i}&=\{\kappa _i^2 + E_A[h^2(x_j,x_i)]- 2\kappa _i E_A[h(x_j,x_i)]\};\end{aligned}$$

(3.29)



$$\begin{aligned} \bar{U}_{B_i}&=\{\kappa _i^2 + E_B[h^2(x_k,x_i)]- 2\kappa _i E_B[h(x_k,x_i)]\}; \end{aligned}$$

(3.30)



$$\begin{aligned} \bar{\sigma }_{A_i}^2&= \{4\kappa _i^2(E_A[h^2(x_j,x_i)]-[E_A[h(x_j,x_i)]]^2)\nonumber \\&\quad -\,4\kappa _i(E_A[h^3(x_j,x_i)]-E_A[h(x_j,x_i)]E_A[h^2(x_j,x_i)])\nonumber \\&\quad -\,[E_A[h^2(x_j,x_i)]]^2+E_A[h^4(x_j,x_i)] \}; \end{aligned}$$

(3.31)
and


$$\begin{aligned} \bar{\sigma }_{B_i}^2&= \{4\kappa _i^2(E_B[h^2(x_k,x_i)]-[E_B[h(x_k,x_i)]]^2)\nonumber \\&\quad -\,4\kappa _i(E_B[h^3(x_k,x_i)]-E_B[h(x_k,x_i)]E_B[h^2(x_k,x_i)])\nonumber \\&\quad -\,[E_B[h^2(x_k,x_i)]]^2+E_B[h^4(x_k,x_i)] \}; \end{aligned}$$

(3.32)
where $$E[h^r(x_j,x_i)]$$ represents the zero-mean, $$r$$th order moment of similarity $$h(x_j,x_i)$$ between two subsequences $$x_i$$ and $$x_j$$. Now, the numerator of (3.28) is given by


$$\begin{aligned} |\bar{U}_{A_i}-\bar{U}_{B_i}|&= |\{E_A[h^2(x_j,x_i)]-E_B[h^2(x_k,x_i)]\} \nonumber \\&\quad -2 \kappa _i \{E_A[h(x_j,x_i)]-E_B[h(x_k,x_i)]\}|. \end{aligned}$$

(3.33)
Hence, the numerator of (3.28) not only depends on the difference of zero-mean, first-order moment of similarity of two groups as in (3.15), it also takes into account the difference of zero-mean, second-order moment of similarity as well as the maximum homology score of the subsequence $$x_i$$. That is, the numerator of (3.28) depends on following three factors:



  • difference of zero-mean, first order moment of similarity of two groups, $$\{E_A[h(x_j,x_i)]-E_B[h(x_k,x_i)]\}$$;


  • difference of zero-mean, second-order moment of similarity of two groups, $$\{E_A[h^2(x_j,x_i)]-E_B[h^2(x_k,x_i)]\}$$;


  • maximum homology score of the subsequence $$x_i$$, that is, $$\kappa _i=h(x_i,x_i)$$.
Similarly, the denominator of (3.28) contains the following terms:


$$\begin{aligned} \bar{\sigma }_{A_i}^2+\bar{\sigma }_{B_i}^2&= [4\kappa _i^2\{E_A[h^2(x_j,x_i)]+E_B[h^2(x_k,x_i)]\}\nonumber \\&\quad -4\kappa _i^2\{[E_A[h(x_j,x_i)]]^2+[E_B[h(x_k,x_i)]]^2\}\nonumber \\&\quad -4\kappa _i\{E_A[h^3(x_j,x_i)]+E_B[h^3(x_k,x_i)]\}\nonumber \\&\quad +4\kappa _i\{E_A[h(x_j,x_i)]E_A[h^2(x_j,x_i)]\nonumber \\&\quad +E_B[h(x_k,x_i)]E_B[h^2(x_k,x_i)]\}\nonumber \\&\quad -\{[E_A[h^2(x_j,x_i)]]^2+[E_B[h^2(x_k,x_i)]]^2\}\nonumber \\&\quad +\{E_A[h^4(x_j,x_i)]+E_B[h^4(x_k,x_i)]\}]. \end{aligned}$$

(3.34)
Hence, the denominator of (3.28) considers the zero-mean, higher order (upto fourth order) moment of similarity of two groups as well as the maximum homology score of the subsequence $$x_i$$, while that of (3.15) only takes into account the zero-mean, first- and second-order moment of similarity of two groups and does not consider the maximum homology score of the subsequence $$x_i$$. In effect, (3.28) calculates the discriminant capability of each subsequence $$x_i$$ more accurately.


3.4.2 Nearest Mean Classifier


After computing the discriminant capability or relevance $$\bar{\mathbb {F}}(x_i)$$ of each subsequence $$x_i \in X$$ using the Fisher ratio according to (3.28), the nonrelevant subsequences are discarded based on a threshold value $$\delta $$. The subsequences those have the Fisher ratio values larger than or equal to the threshold value $$\delta $$ are considered as the candidate bio-basis strings. The value of $$\delta $$ is obtained using the concept of nearest mean classifier.

Only gold members can continue reading. Log In or Register to continue

Stay updated, free articles. Join our Telegram channel

May 25, 2017 | Posted by in GENERAL & FAMILY MEDICINE | Comments Off on Design of String Kernel to Predict Protein Functional Sites Using Kernel-Based Classifiers

Full access? Get Clinical Tree

Get Clinical Tree app for offline access