Rough Sets for Insilico Identification of Differentially Expressed miRNAs

-information [17] based minimum redundancy-maximum relevance framework, reported in Chap. 5, can also be used to select a set of nonredundant and relevant miRNAs for sample classification. A detailed survey of several feature selection algorithms is reported in Chap. 4.


One of the main problems in miRNA expression data analysis is uncertainty. Sources of this uncertainty include imprecision in computations and vagueness in class definitions. Against this background, rough set theory has gained popularity for modeling and propagating uncertainty. It deals with vagueness and incompleteness, and was proposed to handle indiscernibility in classification according to some similarity [35]. A brief survey of different rough set-based feature selection algorithms is reported in Chap. 4. The theory of rough sets has also been successfully applied to microarray data analysis in [8, 18, 21, 23–25, 31, 32, 39, 40].

In general, the performance of the prediction rule generated by a classifier for a subset of selected miRNAs is evaluated by the leave-one-out cross-validation (LOOCV) error. Given that the entire set of available samples is relatively small, in practice one would like to make full use of all available samples in both the miRNA selection and the training of the prediction rule. However, if the LOOCV error is calculated within the miRNA selection process, it carries a selection bias when used as an estimate of the prediction error: the LOOCV error of the prediction rule obtained during the selection of the miRNAs provides an overly optimistic estimate of the prediction error rate. Hence, an external cross-validation should be undertaken subsequent to the miRNA selection process to correct for this selection bias. Alternatively, the bootstrap procedure can be used [7, 19].
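The external cross-validation described above can be sketched as follows. This is a minimal illustration, not the procedure used in [34]: the names `external_loocv_error`, `select`, and `train` are hypothetical, and both the selector and the classifier are passed in as callables so that any method can be plugged in. The key point is that miRNA selection is repeated inside every fold, so the held-out sample never influences which miRNAs are chosen.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]
# Hypothetical interfaces (assumptions of this sketch):
#   select(X_train, y_train) -> indices of the chosen features
#   train(X_train_reduced, y_train) -> a predict(row) function
Selector = Callable[[List[Row], List[int]], List[int]]
Trainer = Callable[[List[List[float]], List[int]], Callable[[List[float]], int]]


def external_loocv_error(X: List[Row], y: List[int],
                         select: Selector, train: Trainer) -> float:
    """Leave-one-out error with feature selection redone in every fold."""
    n = len(X)
    errors = 0
    for i in range(n):
        # Training fold: every sample except the i-th one.
        X_tr = [row for j, row in enumerate(X) if j != i]
        y_tr = [label for j, label in enumerate(y) if j != i]

        # Selection sees only the training fold, never the held-out sample.
        cols = select(X_tr, y_tr)

        def project(row: Row) -> List[float]:
            return [row[c] for c in cols]

        predict = train([project(r) for r in X_tr], y_tr)
        if predict(project(X[i])) != y[i]:
            errors += 1
    return errors / n
```

Calling `select` once on all samples and then running LOOCV on the reduced data would reproduce the selection bias discussed above; the sketch avoids it only because `select` is called inside the loop.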

Although the LOOCV error with external cross-validation is nearly unbiased, it can be highly variable, in the sense that there is no guarantee that the same subset of miRNAs will be obtained as during the original training of the rule on all the training samples. Indeed, with the huge number of miRNAs available, it will generally yield a subset of miRNAs that has at most a few miRNAs in common with the subset selected during the original training of the rule. Suitably defined bootstrap procedures can reduce the variability of the LOOCV error, in addition to providing a direct assessment of variability for estimated parameters in the prediction rule. However, the bootstrap approach overestimates the error. To offset the weaknesses of both approaches, Efron and Tibshirani introduced the $$B.632+$$ error, which corrects the upward bias of the bootstrap error using the downwardly biased apparent error [7]; it is well suited to data sets with a small number of training samples and a large number of miRNAs.

In this regard, this chapter presents a novel approach, proposed by Paul and Maji in [34], for insilico identification of differentially expressed miRNAs from expression data sets. It integrates the merits of the rough set-based feature selection algorithm using the maximum relevance-maximum significance criterion (RSMRMS), reported in Chap. 4, and the concept of the so-called $$B.632+$$ error rate [7]. The RSMRMS algorithm selects a subset of miRNAs from a data set by maximizing both the relevance and the significance of the selected miRNAs. It employs rough set theory to compute both quantities. Hence, the only information required by the feature selection method is the set of equivalence partitions for each miRNA, which can be derived automatically from the given microarray data set. A fuzzy set-based discretization method is presented to generate the equivalence classes required to compute both relevance and significance of miRNAs using rough set theory. This avoids the need for domain experts to provide information on the data involved, and ties in with the advantage of rough sets that they require no information other than the data set itself. On the other hand, the $$B.632+$$ error rate minimizes the variability and bias of the derived results. The support vector machine is used to compute the $$B.632+$$ error rate, as well as several other types of error rates, as it maximizes the margin between data samples in different classes. The effectiveness of the new approach, along with a comparison with other related approaches, is demonstrated on a set of miRNA expression data sets.

The chapter is organized as follows: Sect. 7.2 presents the miRNA selection method reported in [34], which covers the basics of the RSMRMS algorithm, and the concepts of fuzzy discretization and $$B.632+$$ error rate. Implementation details, a brief description of several miRNA data sets used in this study, experimental results, and a comparison among different algorithms are presented in Sect. 7.3. Concluding remarks are given in Sect. 7.4.

Fig. 7.1 Schematic flow diagram of the insilico approach for identification of differentially expressed miRNAs



7.2 Selection of Differentially Expressed miRNAs


The rough set-based insilico approach is illustrated in Fig. 7.1. It mainly consists of the rough set-based feature selection method (RSMRMS) described in Chap. 4, the support vector machine (SVM) [41], and several error analysis components, namely, apparent error ($$AE$$), bootstrap error ($$B1$$), no-information error ($$\gamma $$), and $$B.632+$$ error. The RSMRMS algorithm selects a set of miRNAs from a given miRNA expression data set. The selected miRNAs are then used to design the SVM classifier, and the effectiveness of the resulting classifier is tested on unseen data. To calculate the $$B.632+$$ error, the apparent error ($$AE$$) is calculated first. This is the error obtained when the same data set is used to both train and test a classifier. Next, the $$B1$$ error is calculated from $$k$$ bootstrap samples. Finally, the no-information error ($$\gamma $$) is calculated by randomly perturbing the class labels of the given data set: the perturbed data set is used for miRNA selection, the selected miRNAs are used to build the SVM, and the trained SVM is then tested on the original data set. The $$B.632+$$ error is then calculated from the apparent error ($$AE$$), the $$B1$$ error, and the $$\gamma $$ error. The RSMRMS method is discussed in Chap. 4, while a brief introduction to the SVM is given in Chaps. 3 and 4. Hence, this section presents only the concept of fuzzy equivalence classes, used to generate equivalence classes for rough sets, and the different types of errors, along with a brief overview of the RSMRMS algorithm.
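As a rough illustration of the final step, the sketch below combines the three error values into a $$B.632+$$ estimate. It follows the general form given by Efron and Tibshirani [7], in which the estimate is a weighted average of $$AE$$ and $$B1$$ and the weight grows with the relative overfitting rate $$R=(B1-AE)/(\gamma -AE)$$. The function name `b632_plus` and the exact clipping rules are assumptions of this sketch, and $$AE$$, $$B1$$, and $$\gamma $$ are assumed to have been computed already as described above.

```python
def b632_plus(apparent: float, b1: float, gamma: float) -> float:
    """Combine AE, B1, and the no-information error into a B.632+ estimate.

    All three inputs are error rates in [0, 1].
    """
    # Cap the bootstrap error at the no-information level (assumed guard).
    b1_capped = min(b1, gamma)

    # Relative overfitting rate R, kept in [0, 1]. It is set to 0 when the
    # bootstrap error does not exceed the apparent error, or when the
    # no-information error does not exceed the apparent error.
    if b1_capped > apparent and gamma > apparent:
        rate = (b1_capped - apparent) / (gamma - apparent)
    else:
        rate = 0.0

    # The weight moves from 0.632 (no overfitting) to 1 (maximal overfitting).
    weight = 0.632 / (1.0 - 0.368 * rate)
    return (1.0 - weight) * apparent + weight * b1_capped


# Example call with made-up error rates (illustrative values only).
estimate = b632_plus(apparent=0.0, b1=0.2, gamma=0.5)
```

With no overfitting the weight is 0.632 and the estimate reduces to the ordinary .632 combination; as $$B1$$ approaches $$\gamma $$ the weight approaches 1 and the estimate approaches $$B1$$ itself.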


7.2.1 RSMRMS Algorithm


In real data analysis, such as that of microarray data, the data set may contain a number of insignificant features. The presence of such irrelevant and insignificant features may reduce the useful information. Ideally, the selected features should have high relevance with the classes and high significance in the feature set. Features with high relevance are expected to be able to predict the classes of the samples. However, if insignificant features are present in the subset, they may reduce the prediction capability and may contain similar biological information. A feature set with high relevance and high significance enhances the predictive capability. Accordingly, a measure is required that can enhance the effectiveness of the feature set. In this work, rough set theory is used to select the relevant and significant miRNAs from high-dimensional microarray data sets.

Let $${\mathbb {C}}=\{{\fancyscript{A}}_1,\cdots , {\fancyscript{A}}_i,\cdots ,{\fancyscript{A}}_j, \cdots ,{\fancyscript{A}}_m\}$$ be the set of $$m$$ miRNAs of a given microarray data set and let $${\mathbb {S}}$$ be the set of selected miRNAs. Define $$\gamma _{{\fancyscript{A}}_i} ({\mathbb {D}})$$ as the relevance of the miRNA $${\fancyscript{A}}_i$$ with respect to the class labels $${\mathbb {D}}$$, and $$\sigma _{\{{\fancyscript{A}}_i,{\fancyscript{A}}_j\}}({\mathbb {D}},{\fancyscript{A}}_j)$$ as the significance of the miRNA $$\fancyscript{A}_j$$ with respect to the set $$\{{\fancyscript{A}}_i,{\fancyscript{A}}_j\}$$. The total relevance of all selected miRNAs is as follows:


$$\begin{aligned} {\fancyscript{J}}_\mathrm{relev}= \sum _{{\fancyscript{A}}_i \in {\mathbb {S}}} \gamma _{{\fancyscript{A}}_i} ({\mathbb {D}}) \end{aligned}$$

(7.1)
while the total significance among the selected miRNAs is


$$\begin{aligned} {\fancyscript{J}}_\mathrm{signf}= \sum _{{\fancyscript{A}}_i \ne {\fancyscript{A}}_j \in {\mathbb {S}}} \sigma _{\{{\fancyscript{A}}_i, {\fancyscript{A}}_j\}}({\mathbb {D}},{\fancyscript{A}}_j). \end{aligned}$$

(7.2)
Therefore, the problem of selecting a set $${\mathbb {S}}$$ of relevant and significant miRNAs from the whole set $${\mathbb {C}}$$ of $$m$$ miRNAs is equivalent to maximizing both $${\fancyscript{J}}_\mathrm{relev}$$ and $${\fancyscript{J}}_\mathrm{signf}$$, that is, to maximizing the objective function $${\fancyscript{J}}$$, where


$$\begin{aligned} {\fancyscript{J}}= {\fancyscript{J}}_\mathrm{relev}+ \beta {\fancyscript{J}}_\mathrm{signf} \end{aligned}$$

(7.3)
that is,


$$\begin{aligned} {\fancyscript{J}}=\sum _{{\fancyscript{A}}_i \in {\mathbb {S}}} \gamma _{{\fancyscript{A}}_i} ({\mathbb {D}})+ \beta \sum _{{{\fancyscript{A}}_i \ne {\fancyscript{A}}_j \in {\mathbb {S}}}} \sigma _{\{{\fancyscript{A}}_i,{\fancyscript{A}}_j\}}({\mathbb {D}},{\fancyscript{A}}_j) \end{aligned}$$

(7.4)
where $$\beta $$ is a weight parameter. To solve the above problem, a greedy algorithm is used in [24]. The relevance and significance of a miRNA are calculated based on the theory of rough sets using (4.6) and (4.7), respectively. The weight parameter $$\beta $$ in the rough set-based MRMS (RSMRMS) algorithm regulates the relative importance of the significance of the candidate miRNA with respect to the already-selected miRNAs and its relevance with the output class. If $$\beta $$ is zero, only the relevance with the output class is considered for each miRNA selection. As $$\beta $$ increases, this measure is incremented by a quantity proportional to the total significance with respect to the already-selected miRNAs. A $$\beta $$ value larger than zero is crucial for obtaining good results: if the significance between miRNAs is not taken into account, selecting the miRNAs with the highest relevance with respect to the output class may produce a set of redundant miRNAs that leaves out useful complementary information. Details of the RSMRMS algorithm are available in Chap. 4.
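A greedy search for the objective in (7.4) can be sketched as follows. This is an illustrative outline under stated assumptions, not the RSMRMS implementation of Chap. 4: the function name `greedy_mrms` is hypothetical, the rough set measures of (4.6) and (4.7) are not reproduced here and are instead passed in as the callables `relevance` and `significance`, and the stopping rule (a fixed number `d` of miRNAs) is assumed.

```python
from typing import Callable, Hashable, List, Sequence


def greedy_mrms(candidates: Sequence[Hashable],
                relevance: Callable[[Hashable], float],
                significance: Callable[[Hashable, Hashable], float],
                beta: float,
                d: int) -> List[Hashable]:
    """Greedily pick up to d features, trading relevance against significance.

    relevance(a)        -> relevance of feature a to the class labels
    significance(s, a)  -> significance of candidate a with respect to {s, a}
    """
    remaining = list(candidates)
    if d <= 0 or not remaining:
        return []

    # Start from the single most relevant feature.
    first = max(remaining, key=relevance)
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < d:
        def gain(a: Hashable) -> float:
            # Relevance plus beta times the total significance of the
            # candidate with respect to the already-selected features.
            total = sum(significance(s, a) for s in selected)
            return relevance(a) + beta * total

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `beta = 0` the loop simply ranks miRNAs by relevance; a positive `beta` lets a less relevant but more complementary miRNA overtake a redundant one, which is the behaviour the paragraph above describes.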


7.2.2 Fuzzy Discretization


In miRNA expression data, the class labels of samples are represented by discrete symbols, while the expression values of miRNAs are continuous. Hence, to measure both relevance and significance of miRNAs using rough set theory, the continuous expression values of a miRNA have to be divided into several discrete partitions to generate equivalence classes. In this regard, a fuzzy set-based discretization method is used to generate equivalence classes required to compute both relevance and significance of the miRNAs.

The fuzzy set was introduced by Zadeh [44] as a generalization of the classical set. A fuzzy set $$A$$ in a space of objects $${\mathbb {U}}=\{x_i\}$$ is a class of events with a continuum of grades of membership. It is characterized by a membership function $$\mu _A(x_i)$$ that associates with each element in $${\mathbb {U}}$$ a real number in the interval [0, 1], with the value of $$\mu _A(x_i)$$ at $$x_i$$ representing the grade of membership of $$x_i$$ in $$A$$. Formally, a fuzzy set $$A$$ with its finite number of supports $$x_1,\cdots ,x_i,\cdots ,x_n$$ is defined as a collection of ordered pairs $$A = \{\mu _A(x_i)/x_i, i=1,\cdots ,n\}$$, where the support of $$A$$ is an ordinary subset of $${\mathbb {U}}$$ and is defined as


$$\begin{aligned} S(A)=\{x_i|x_i \in {\mathbb {U}}~~\mathrm{and}~~\mu _A(x_i) > 0\}. \end{aligned}$$

(7.5)
Here $$\mu _A(x_i)$$ represents the degree to which an object $$x_i$$ may be a member of $$A$$ or belong to $$A$$. If the support of a fuzzy set is only a single object $$x_1 \in {\mathbb {U}}$$, then $$A = \mu _A(x_1)/x_1$$ is called a fuzzy singleton. Hence, if $$\mu _A(x_1)=1$$, $$A=\frac{1}{x_1}$$ denotes a nonfuzzy singleton. In terms of the constituent singletons, the fuzzy set $$A$$ with its finite number of supports $$x_1,\cdots ,x_i,\cdots ,x_n$$ can also be expressed in union form as


$$\begin{aligned} A =\{\mu _A(x_1)/x_1+\cdots +\mu _A(x_i)/x_i+\cdots +\mu _A(x_n)/x_n\} \end{aligned}$$

(7.6)
where the sign + denotes the union [13]. Assignment of membership functions of a fuzzy subset is subjective in nature, and reflects the context in which the problem is viewed.

The family of normal fuzzy sets produced by a fuzzy partitioning of the universe of discourse can play the role of fuzzy equivalence classes. Given a finite set $${\mathbb {U}}$$, $${{\mathbb {C}}}$$ is a fuzzy condition attribute set in $${\mathbb {U}}$$, which generates a fuzzy equivalence partition on $${\mathbb {U}}$$. If $$c$$ denotes the number of fuzzy equivalence classes generated by the fuzzy equivalence relation and $$n$$ is the number of objects in $${\mathbb {U}}$$, then $$c$$-partitions of $${\mathbb {U}}$$ are sets of ($$cn$$) values $$\{\mu _{ij}^{\mathbb {C}}\}$$ that can be conveniently arrayed as a ($$c \times n$$) matrix $${\mathbb {M}}_{{\mathbb {C}}} =[\mu _{ij}^{\mathbb {C}}]$$, which is denoted by


$$\begin{aligned} {\mathbb {M}}_{{\mathbb {C}}}= \left( \begin{array}{llll} \mu _{11}^{\mathbb {C}} &{} \mu _{12}^{\mathbb {C}} &{} \cdots &{} \mu _{1n}^{\mathbb {C}} \\ \mu _{21}^{\mathbb {C}} &{} \mu _{22}^{\mathbb {C}} &{} \cdots &{} \mu _{2n}^{\mathbb {C}} \\ \cdots &{} \cdots &{} \cdots &{} \cdots \\ \mu _{c1}^{\mathbb {C}} &{} \mu _{c2}^{\mathbb {C}} &{} \cdots &{} \mu _{cn}^{\mathbb {C}} \\ \end{array} \right) \end{aligned}$$

(7.7)
where $$\mu _{ij}^{\mathbb {C}} \in [0,1]$$ represents the membership of object $$x_j$$ in the $$i$$th fuzzy equivalence partition or class $$F_i$$ [20, 21].

Each row of the matrix $${\mathbb {M}}_{{\mathbb {C}}}$$ is a fuzzy equivalence partition or class. In the rough set-based feature selection method, the $$\pi $$ function in one-dimensional form is used to assign membership values to different fuzzy equivalence classes for the input miRNAs. A fuzzy set with membership function $$\pi (x;\bar{c},\sigma )$$ represents a set of points clustered around $$\bar{c}$$, where


$$\begin{aligned} \pi (x;\bar{c},\sigma ) = \left\{ \begin{array}{ll} 2(1-\frac{||x-\bar{c}||}{\sigma })^2 &{} \, \text {for}\, \frac{\sigma }{2} \le ||x-\bar{c}|| \le \sigma \\ 1-2(\frac{||x-\bar{c}||}{\sigma })^2 &{} \, \text {for} \, 0 \le ||x-\bar{c}|| \le \frac{\sigma }{2} \\ 0 &{}\,\text {otherwise} \end{array} \right. \end{aligned}$$

(7.8)
where $$\sigma > 0$$ is the radius of the $$\pi $$ function with $$\bar{c}$$ as the central point and $$||\cdot ||$$ denotes the Euclidean norm. When the pattern $$x$$ lies at the central point $$\bar{c}$$ of a class, then $$||x-\bar{c}||=0$$ and its membership value is maximum, that is, $$\pi (\bar{c};\bar{c},\sigma )=1$$. The membership value of a point decreases as its distance from the central point $$\bar{c}$$, that is, $$||x-\bar{c}||$$, increases. When $$||x-\bar{c}||=(\frac{\sigma }{2})$$, the membership value of $$x$$ is 0.5 and this is called a crossover point [30]. The $$(c \times n)$$ matrix $${\mathbb {M}}_{{\fancyscript{A}}_i}$$, corresponding to the $$i$$th miRNA $${\fancyscript{A}}_i$$, can be calculated from the $$c$$-fuzzy equivalence classes of the objects $$x=\{x_1,\cdots ,x_j,\cdots ,x_n\}$$, where


$$\begin{aligned} \mu _{kj}^{{\fancyscript{A}}_i}=\frac{\pi (x_j;\bar{c}_k,\sigma _k)}{\displaystyle {\sum _{l=1}^c} \pi (x_j;\bar{c}_l,\sigma _l)}. \end{aligned}$$

(7.9)
In effect, each position $$\mu _{kj}^{{\fancyscript{A}}_i}$$ of the matrix $${\mathbb {M}}_{{\fancyscript{A}}_i}$$ must satisfy the following conditions:


$$\begin{aligned}&\mu _{kj}^{{\fancyscript{A}}_i} \in [0,1];~ \sum _{k=1}^c \mu _{kj}^{{\fancyscript{A}}_i}=1,\forall j~\text {and for any value of}\;k,\\&\mathrm{{if}}\;s=\mathrm{arg}~\max _j\{\mu _{kj}^{{\fancyscript{A}}_i}\},~ \mathrm{then}~\max _j\{\mu _{kj}^{{\fancyscript{A}}_i}\}= \max _l\{\mu _{ls}^{{\fancyscript{A}}_i}\} > 0. \end{aligned}$$

After the generation of the matrix $${\mathbb {M}}_{{\fancyscript{A}}_i}$$ corresponding to the miRNA $${\fancyscript{A}}_i$$, the object $$x_j$$ is assigned to one of the $$c$$ equivalence classes based on the maximum value of the memberships of the object in the different equivalence classes, as follows:


$$\begin{aligned} x_j \in F_p,\qquad \text {where}~p=\mathrm{arg}~\max _k\{\mu _{kj}^{{\fancyscript{A}}_i}\}. \end{aligned}$$

(7.10)
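The discretization steps in (7.8), (7.9), and (7.10) can be sketched for a single miRNA as follows. This is a minimal illustration in one dimension, where the Euclidean norm reduces to an absolute difference. The function names are hypothetical, and the centers and radii are supplied by the caller; any concrete values passed in are assumptions of the example rather than the values used by RSMRMS.

```python
from typing import List, Sequence


def pi_membership(x: float, center: float, radius: float) -> float:
    """One-dimensional pi function of Eq. (7.8); radius must be positive."""
    dist = abs(x - center)
    if dist <= radius / 2:
        return 1.0 - 2.0 * (dist / radius) ** 2
    if dist <= radius:
        return 2.0 * (1.0 - dist / radius) ** 2
    return 0.0


def membership_matrix(values: Sequence[float],
                      centers: Sequence[float],
                      radii: Sequence[float]) -> List[List[float]]:
    """Normalized (c x n) membership matrix of Eq. (7.9) for one miRNA."""
    c, n = len(centers), len(values)
    matrix = [[0.0] * n for _ in range(c)]
    for j, x in enumerate(values):
        raw = [pi_membership(x, ck, sk) for ck, sk in zip(centers, radii)]
        total = sum(raw)
        if total == 0.0:
            # The value lies outside every pi function; Eq. (7.9) is undefined.
            raise ValueError(f"value {x} is not covered by any fuzzy set")
        for k in range(c):
            matrix[k][j] = raw[k] / total
    return matrix


def assign_classes(matrix: List[List[float]]) -> List[int]:
    """Index of the class with maximum membership per object, Eq. (7.10)."""
    c, n = len(matrix), len(matrix[0])
    return [max(range(c), key=lambda k: matrix[k][j]) for j in range(n)]
```

Because each column of the matrix is normalized by its own sum, the memberships of every object add up to one across the $$c$$ classes, which is the first condition stated above.
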
Each input real-valued miRNA in quantitative form can be assigned to different fuzzy equivalence classes in terms of membership values using the $$\pi $$ fuzzy set with appropriate $$\bar{c}$$ and $$\sigma $$. The centers and radii of the $$\pi $$ functions along each miRNA axis are determined automatically from the distribution of the training patterns. In the RSMRMS algorithm, three fuzzy equivalence classes ($$c=3$$), namely, low, medium, and high are considered. These three equivalence classes correspond to under-expression, baseline, and over-expression of continuous valued miRNAs, respectively. Corresponding to three fuzzy sets low, medium, and high, the following relations hold: