-information [17] based minimum redundancy-maximum relevance framework, reported in Chap. 5, can also be used to select a set of nonredundant and relevant miRNAs for sample classification. A detailed survey on several feature selection algorithms is reported in Chap. 4.
One of the main problems in miRNA expression data analysis is uncertainty. Sources of this uncertainty include imprecision in computations and vagueness in class definition. Against this background, rough set theory has gained popularity for modeling and propagating uncertainty. It deals with vagueness and incompleteness, and was proposed to handle indiscernibility in classification according to some similarity [35]. A brief survey on different rough set-based feature selection algorithms is reported in Chap. 4. The theory of rough sets has also been successfully applied to microarray data analysis in [8, 18, 21, 23–25, 31, 32, 39, 40].
In general, the performance of the prediction rule generated by a classifier for a subset of selected miRNAs is evaluated by the leave-one-out cross-validation (LOOCV) error. Given that the entire set of available samples is relatively small, in practice one would like to make full use of all available samples in the miRNA selection and in the training of the prediction rule. However, if the LOOCV error is calculated within the miRNA selection process, it carries a selection bias when it is used as an estimate of the prediction error. The LOOCV error of the prediction rule obtained during the selection of the miRNAs provides an overly optimistic estimate of the prediction error rate. Hence, an external cross-validation should be undertaken subsequent to the miRNA selection process to correct for this selection bias. Alternatively, the bootstrap procedure can be used [7, 19].
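The difference between internal and external cross-validation can be made concrete with a short sketch. The function below repeats the feature selection inside every leave-one-out fold, so the held-out sample never influences which features are chosen. The names `external_loocv_error`, `top_mean_gap`, and `nearest_mean` are hypothetical and introduced only for illustration; the toy selector and classifier merely stand in for an actual miRNA selector and prediction rule.

```python
from typing import Callable, List, Sequence

Sample = Sequence[float]


def external_loocv_error(
    X: List[Sample],
    y: List[int],
    select_features: Callable[[List[Sample], List[int]], List[int]],
    fit_predict: Callable[[List[Sample], List[int], Sample], int],
) -> float:
    """LOOCV error with feature selection redone inside every fold."""
    n = len(X)
    errors = 0
    for i in range(n):
        # Hold out sample i; neither the selector nor the classifier sees it.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        # Selection uses the training fold only (the "external" part).
        chosen = select_features(X_train, y_train)
        X_train_sel = [[row[j] for j in chosen] for row in X_train]
        x_test_sel = [X[i][j] for j in chosen]
        if fit_predict(X_train_sel, y_train, x_test_sel) != y[i]:
            errors += 1
    return errors / n


def top_mean_gap(X: List[Sample], y: List[int]) -> List[int]:
    """Toy selector (assumes labels 0/1): feature with the largest class-mean gap."""
    def mean(label: int, j: int) -> float:
        vals = [row[j] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)
    gaps = [abs(mean(0, j) - mean(1, j)) for j in range(len(X[0]))]
    return [gaps.index(max(gaps))]


def nearest_mean(X: List[Sample], y: List[int], x: Sample) -> int:
    """Toy classifier: predict the label whose class centroid is closest to x."""
    best_label, best_dist = y[0], float("inf")
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Calling `external_loocv_error(X, y, top_mean_gap, nearest_mean)` on a list of samples `X` with 0/1 labels `y` returns the fraction of held-out samples that are misclassified. Running the selector once on all samples and then cross-validating only the classifier would give the biased internal estimate discussed above.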
Although the LOOCV error with external cross-validation is nearly unbiased, it can be highly variable, in the sense that there is no guarantee that the same subset of miRNAs will be obtained as during the original training of the rule on all the training samples. Indeed, with the huge number of miRNAs available, it will generally yield a subset of miRNAs that has at most a few miRNAs in common with the subset selected during the original training of the rule. Suitably defined bootstrap procedures can reduce the variability of the LOOCV error, in addition to providing a direct assessment of variability for estimated parameters in the prediction rule. However, the bootstrap approach overestimates the error. To reduce the weakness of both these approaches, Efron and Tibshirani introduced the concept of $B.632+$ error, which corrects the upward bias in the bootstrap error using the downwardly biased apparent error [7], and which is very much applicable to data sets with a small number of training samples and a large number of miRNAs.

In this regard, this chapter presents a novel approach, proposed by Paul and Maji in [34], for insilico identification of differentially expressed miRNAs from expression data sets. It integrates the merits of the rough set-based feature selection algorithm using the maximum relevance-maximum significance criterion (RSMRMS), reported in Chap. 4, and the concept of the so-called $B.632+$ error rate [7]. The RSMRMS algorithm selects a subset of miRNAs from a data set by maximizing both relevance and significance of the selected miRNAs. It employs rough set theory to compute both relevance and significance of the miRNAs. Hence, the only information required in the feature selection method is in the form of equivalence partitions for each miRNA, which can be automatically derived from the given microarray data set. A fuzzy set-based discretization method is presented to generate the equivalence classes required to compute both relevance and significance of miRNAs using rough set theory. This avoids the need for domain experts to provide information on the data involved, and ties in with the advantage of rough sets that they require no information other than the data set itself. On the other hand, the $B.632+$ error rate minimizes the variability and bias of the derived results. The support vector machine is used to compute the $B.632+$ error rate, as well as several other types of error rates, as it maximizes the margin between data samples in different classes. The effectiveness of the new approach, along with a comparison with other related approaches, is demonstrated on a set of miRNA expression data sets.



The chapter is organized as follows: Sect. 7.2 presents the miRNA selection method reported in [34], which covers the basics of the RSMRMS algorithm and the concepts of fuzzy discretization and $B.632+$ error rate. Implementation details, a brief description of several miRNA data sets used in this study, experimental results, and a comparison among different algorithms are presented in Sect. 7.3. Concluding remarks are given in Sect. 7.4.



Fig. 7.1 Schematic flow diagram of the insilico approach for identification of differentially expressed miRNAs
7.2 Selection of Differentially Expressed miRNAs
The rough set-based insilico approach is illustrated in Fig. 7.1. It mainly consists of the rough set-based feature selection method (RSMRMS) described in Chap. 4, the support vector machine (SVM) [41], and several types of error analysis, namely, apparent error ($AE$), bootstrap error ($B1$), no-information error ($\gamma$), and $B.632+$ error. The RSMRMS algorithm selects a set of miRNAs from a given miRNA expression data set. The selected set of miRNAs is then used to design the SVM classifier, and the effectiveness of the resulting SVM classifier is further tested using unseen data. In order to calculate the $B.632+$ error, the apparent error ($AE$) is calculated first. This error is generated when the same data set is used to train and test a classifier. Next, the $B1$ error is calculated from the bootstrap samples. Finally, the no-information error ($\gamma$) is calculated by randomly perturbing the class labels of a given data set: the mutated data set is used for miRNA selection, the generated set of miRNAs is used to build the SVM, and the trained SVM is then tested using the original data set. The error generated by this procedure is known as the no-information error ($\gamma$). Lastly, the $B.632+$ error is calculated using the apparent error ($AE$), the $B1$ error, and the $\gamma$ error. The RSMRMS method is discussed in Chap. 4, while a brief introduction of the SVM is reported in Chaps. 3 and 4. Hence, this section presents only the concepts of fuzzy equivalence classes used to generate equivalence classes for rough sets and different types of errors, along with a brief overview of the RSMRMS algorithm.
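As a rough illustration of the last step, the sketch below combines the apparent error, the bootstrap error, and the no-information error into a single estimate. It uses the commonly stated form of the Efron and Tibshirani estimator; that the chapter uses exactly this form is an assumption, since its defining equations are not reproduced in this section. The function name `b632_plus` and the clipping of the relative overfitting rate to [0, 1] are also illustrative choices.

```python
def b632_plus(apparent: float, bootstrap: float, no_info: float) -> float:
    """Combine three error rates into a .632+ style estimate.

    apparent  : error when the same data are used for training and testing
    bootstrap : error estimated from the bootstrap samples
    no_info   : error obtained when class labels carry no information
    """
    # Relative overfitting rate R, clipped to [0, 1] (assumed safeguard).
    denom = no_info - apparent
    if denom <= 0 or bootstrap <= apparent:
        r = 0.0
    else:
        r = min((bootstrap - apparent) / denom, 1.0)
    # Weight on the bootstrap error: 0.632 when R = 0, rising to 1 when R = 1.
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * apparent + w * bootstrap
```

With no overfitting (R = 0) the estimate reduces to 0.368 times the apparent error plus 0.632 times the bootstrap error; as the bootstrap error approaches the no-information error, the weight shifts entirely onto the bootstrap error.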














7.2.1 RSMRMS Algorithm
In real data analysis, such as the analysis of microarray data, the data set may contain a number of insignificant features. The presence of such irrelevant and insignificant features may lead to a reduction in the useful information. Ideally, the selected features should have high relevance with the classes and high significance in the feature set. The features with high relevance are expected to be able to predict the classes of the samples. However, if insignificant features are present in the subset, they may reduce the prediction capability and may carry similar biological information. A feature set with high relevance and high significance enhances the predictive capability. Accordingly, a measure is required that can enhance the effectiveness of the feature set. In this work, rough set theory is used to select the relevant and significant miRNAs from high-dimensional microarray data sets.
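Let $\mathbb{C}=\{\mathcal{A}_1,\ldots,\mathcal{A}_i,\ldots,\mathcal{A}_j,\ldots,\mathcal{A}_m\}$ be the set of $m$ miRNAs of a given microarray data set and $\mathbb{S}$ be the set of selected miRNAs. Define $\gamma_{\mathcal{A}_i}(\mathbb{D})$ as the relevance of the miRNA $\mathcal{A}_i$ with respect to the class labels $\mathbb{D}$, and $\sigma_{\{\mathcal{A}_i,\mathcal{A}_j\}}(\mathbb{D},\mathcal{A}_j)$ as the significance of the miRNA $\mathcal{A}_j$ with respect to the set $\{\mathcal{A}_i,\mathcal{A}_j\}$. The total relevance of all selected miRNAs is as follows:

$$\mathcal{J}_{\mathrm{relev}}=\sum_{\mathcal{A}_i\in\mathbb{S}}\gamma_{\mathcal{A}_i}(\mathbb{D}), \tag{7.1}$$

while the total significance among the selected miRNAs is

$$\mathcal{J}_{\mathrm{signf}}=\sum_{\mathcal{A}_i\ne\mathcal{A}_j\in\mathbb{S}}\sigma_{\{\mathcal{A}_i,\mathcal{A}_j\}}(\mathbb{D},\mathcal{A}_j). \tag{7.2}$$

Therefore, the problem of selecting a set $\mathbb{S}$ of relevant and significant miRNAs from the whole set $\mathbb{C}$ of $m$ miRNAs is equivalent to maximizing both $\mathcal{J}_{\mathrm{relev}}$ and $\mathcal{J}_{\mathrm{signf}}$, that is, to maximizing the objective function $\mathcal{J}$, where

$$\mathcal{J}=\mathcal{J}_{\mathrm{relev}}+\beta\,\mathcal{J}_{\mathrm{signf}}, \tag{7.3}$$

that is,

$$\mathcal{J}=\sum_{i}\gamma_{\mathcal{A}_i}(\mathbb{D})+\beta\sum_{i\ne j}\sigma_{\{\mathcal{A}_i,\mathcal{A}_j\}}(\mathbb{D},\mathcal{A}_j), \tag{7.4}$$

where $\beta$ is a weight parameter. To solve the above problem, a greedy algorithm is used in [24]. The relevance and significance of a miRNA are calculated based on the theory of rough sets using (4.6) and (4.7), respectively. The weight parameter $\beta$ in the rough set-based MRMS (RSMRMS) algorithm regulates the relative importance of the significance of the candidate miRNA with respect to the already-selected miRNAs and the relevance with the output class. If $\beta$ is zero, only the relevance with the output class is considered for each miRNA selection. If $\beta$ increases, this measure is incremented by a quantity proportional to the total significance with respect to the already-selected miRNAs. The presence of a $\beta$ value larger than zero is crucial in order to obtain good results. If the significance between miRNAs is not taken into account, selecting the miRNAs with the highest relevance with respect to the output class may tend to produce a set of redundant miRNAs that may leave out useful complementary information. Details of the RSMRMS algorithm are available in Chap. 4.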
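The greedy maximization of the relevance-plus-weighted-significance objective can be sketched as follows. Several points are assumptions made for illustration, because (4.6), (4.7), and the algorithm of Chap. 4 are not reproduced here: relevance is taken to be the standard rough-set dependency degree (the fraction of objects lying in equivalence classes that are pure with respect to the class label), the significance of a candidate with respect to a selected feature is taken to be the gain in dependency when the candidate is added, and the significance term is averaged over the already-selected features. The names `dependency` and `greedy_mrms` are hypothetical, and the input is assumed to be already discretized.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def dependency(columns: Sequence[Sequence[int]], labels: Sequence[int]) -> float:
    """Fraction of objects in equivalence classes that contain a single label."""
    groups: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    # Objects with identical values on all given columns are indiscernible.
    for obj, key in enumerate(zip(*columns)):
        groups[key].append(obj)
    positive = sum(
        len(members)
        for members in groups.values()
        if len({labels[o] for o in members}) == 1
    )
    return positive / len(labels)


def greedy_mrms(data: Sequence[Sequence[int]], labels: Sequence[int],
                n_select: int, beta: float) -> List[int]:
    """data[f] is the discretized column of feature f; returns chosen indices."""
    relevance = [dependency([col], labels) for col in data]
    remaining = set(range(len(data)))
    # Start with the most relevant feature (lowest index wins ties).
    first = max(remaining, key=lambda f: (relevance[f], -f))
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < n_select:
        def score(f: int) -> float:
            # Assumed significance: dependency gained by adding f to s.
            signif = sum(
                dependency([data[s], data[f]], labels) - relevance[s]
                for s in selected
            )
            return relevance[f] + beta * signif / len(selected)
        best = max(remaining, key=lambda f: (score(f), -f))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `beta = 0` the procedure ranks candidates by relevance alone and can pick an exact duplicate of an already-selected feature; a larger `beta` favors a less relevant candidate that resolves objects the selected features cannot separate.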





7.2.2 Fuzzy Discretization
In miRNA expression data, the class labels of samples are represented by discrete symbols, while the expression values of miRNAs are continuous. Hence, to measure both relevance and significance of miRNAs using rough set theory, the continuous expression values of a miRNA have to be divided into several discrete partitions to generate equivalence classes. In this regard, a fuzzy set-based discretization method is used to generate equivalence classes required to compute both relevance and significance of the miRNAs.
Fuzzy sets were introduced by Zadeh [44] as a generalization of classical set theory. A fuzzy set $A$ in a space of objects $\mathbb{U}=\{x_i\}$ is a class of events with a continuum of grades of membership and is characterized by a membership function $\mu_A(x_i)$ that associates with each element in $\mathbb{U}$ a real number in the interval [0, 1], with the value of $\mu_A(x_i)$ at $x_i$ representing the grade of membership of $x_i$ in $A$. Formally, a fuzzy set $A$ with its finite number of supports $x_1,x_2,\ldots,x_n$ is defined as a collection of ordered pairs $A=\{(\mu_A(x_i),x_i),\ i=1,\ldots,n\}$, where the support of $A$ is an ordinary subset of $\mathbb{U}$ and is defined as

$$S(A)=\{x_i \mid x_i\in\mathbb{U}\ \text{and}\ \mu_A(x_i)>0\}.$$

Here, $\mu_A(x_i)$ represents the degree to which an object $x_i$ may be a member of $A$ or belong to $A$. If the support of a fuzzy set is only a single object $x_1\in\mathbb{U}$, then $A=\mu_A(x_1)/x_1$ is called a fuzzy singleton. Hence, if $\mu_A(x_1)=1$, $A=1/x_1$ denotes a nonfuzzy singleton. In terms of the constituent singletons, the fuzzy set $A$ with its finite number of supports $x_1,x_2,\ldots,x_n$ can also be expressed in union form as

$$A=\mu_A(x_1)/x_1+\mu_A(x_2)/x_2+\cdots+\mu_A(x_n)/x_n, \tag{7.6}$$

where the sign + denotes the union [13]. Assignment of membership functions of a fuzzy subset is subjective in nature, and reflects the context in which the problem is viewed.
The family of normal fuzzy sets produced by a fuzzy partitioning of the universe of discourse can play the role of fuzzy equivalence classes. Given a finite set $\mathbb{U}$, $\mathbb{C}$ is a fuzzy condition attribute set in $\mathbb{U}$, which generates a fuzzy equivalence partition on $\mathbb{U}$. If $c$ denotes the number of fuzzy equivalence classes generated by the fuzzy equivalence relation and $n$ is the number of objects in $\mathbb{U}$, then $c$-partitions of $\mathbb{U}$ are sets of ($cn$) values $\{\mu_{ij}^{\mathbb{C}}\}$ that can be conveniently arrayed as a ($c\times n$) matrix $\mathbb{M}_{\mathbb{C}}=[\mu_{ij}^{\mathbb{C}}]$, which is denoted by

$$\mathbb{M}_{\mathbb{C}}=\begin{pmatrix}\mu_{11}^{\mathbb{C}} & \mu_{12}^{\mathbb{C}} & \cdots & \mu_{1n}^{\mathbb{C}}\\ \mu_{21}^{\mathbb{C}} & \mu_{22}^{\mathbb{C}} & \cdots & \mu_{2n}^{\mathbb{C}}\\ \vdots & \vdots & \ddots & \vdots\\ \mu_{c1}^{\mathbb{C}} & \mu_{c2}^{\mathbb{C}} & \cdots & \mu_{cn}^{\mathbb{C}}\end{pmatrix} \tag{7.7}$$

where $\mu_{ij}^{\mathbb{C}}\in[0,1]$ represents the membership of object $x_j$ in the $i$th fuzzy equivalence partition or class $F_i$ [20, 21].



Each row of the matrix $\mathbb{M}_{\mathbb{C}}$ is a fuzzy equivalence partition or class. In the rough set-based feature selection method, the $\pi$ function in one dimensional form is used to assign membership values to different fuzzy equivalence classes for the input miRNAs. A fuzzy set with membership function $\pi(x;\bar{c},\sigma)$ represents a set of points clustered around $\bar{c}$, where

$$\pi(x;\bar{c},\sigma)=\begin{cases}2\left(1-\dfrac{\Vert x-\bar{c}\Vert}{\sigma}\right)^{2} & \text{for}\ \dfrac{\sigma}{2}\le\Vert x-\bar{c}\Vert\le\sigma\\[2mm] 1-2\left(\dfrac{\Vert x-\bar{c}\Vert}{\sigma}\right)^{2} & \text{for}\ 0\le\Vert x-\bar{c}\Vert\le\dfrac{\sigma}{2}\\[2mm] 0 & \text{otherwise}\end{cases} \tag{7.8}$$

where $\sigma>0$ is the radius of the $\pi$ function with $\bar{c}$ as the central point and $\Vert\cdot\Vert$ denotes the Euclidean norm. When the pattern $x$ lies at the central point $\bar{c}$ of a class, then $\Vert x-\bar{c}\Vert=0$ and its membership value is maximum, that is, $\pi(\bar{c};\bar{c},\sigma)=1$. The membership value of a point decreases as its distance from the central point $\bar{c}$, that is, $\Vert x-\bar{c}\Vert$, increases. When $\Vert x-\bar{c}\Vert=\sigma/2$, the membership value of $x$ is 0.5 and this is called a crossover point [30]. The ($c\times n$) matrix $\mathbb{M}_{\mathcal{A}_i}$, corresponding to the $i$th miRNA $\mathcal{A}_i$, can be calculated from the $c$-fuzzy equivalence classes of the objects $x=\{x_1,\ldots,x_j,\ldots,x_n\}$, where

$$\mu_{kj}^{\mathcal{A}_i}=\frac{\pi(x_j;\bar{c}_k,\sigma_k)}{\sum_{l=1}^{c}\pi(x_j;\bar{c}_l,\sigma_l)}. \tag{7.9}$$

In effect, each position $\mu_{kj}^{\mathcal{A}_i}$ of the matrix $\mathbb{M}_{\mathcal{A}_i}$ must satisfy the following conditions:

$$\begin{aligned}&\mu_{kj}^{\mathcal{A}_i}\in[0,1];~\sum_{k=1}^{c}\mu_{kj}^{\mathcal{A}_i}=1,\ \forall j~\text{and for any value of}\;k,\\&\mathrm{if}\;s=\arg\max_j\{\mu_{kj}^{\mathcal{A}_i}\},~\mathrm{then}~\max_j\{\mu_{kj}^{\mathcal{A}_i}\}=\max_l\{\mu_{ls}^{\mathcal{A}_i}\}>0.\end{aligned}$$

After the generation of the matrix $\mathbb{M}_{\mathcal{A}_i}$ corresponding to the miRNA $\mathcal{A}_i$, the object $x_j$ is assigned to one of the $c$ equivalence classes based on the maximum value of memberships of the object in different equivalence classes, as follows:

$$x_j\in F_p,\quad\text{where}\quad p=\arg\max_k\{\mu_{kj}^{\mathcal{A}_i}\}. \tag{7.10}$$

Each input real-valued miRNA in quantitative form can be assigned to different fuzzy equivalence classes in terms of membership values using the $\pi$ fuzzy set with appropriate $\bar{c}$ and $\sigma$. The centers and radii of the $\pi$ functions along each miRNA axis are determined automatically from the distribution of the training patterns. In the RSMRMS algorithm, three fuzzy equivalence classes ($c=3$), namely, low, medium, and high are considered. These three equivalence classes correspond to under-expression, baseline, and over-expression of continuous valued miRNAs, respectively. Corresponding to three fuzzy sets low, medium, and high, the following relations hold:
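As a separate illustration of the membership computation in (7.8) to (7.10), the sketch below evaluates the one-dimensional $\pi$ function, normalizes the memberships over three classes, and assigns each expression value to the class with the largest membership. The rule used here for the centers and the radius (centers at the minimum, midpoint, and maximum of the values, with a common radius equal to the spacing between adjacent centers) is an assumption chosen for simplicity; it is not the chapter's rule, which derives them from the distribution of the training patterns. The names `pi_membership` and `fuzzy_discretize` are hypothetical.

```python
from typing import List, Sequence


def pi_membership(x: float, center: float, radius: float) -> float:
    """One-dimensional pi function: 1 at the center, 0.5 at radius/2, 0 beyond radius."""
    d = abs(x - center)
    if d <= radius / 2:
        return 1.0 - 2.0 * (d / radius) ** 2
    if d <= radius:
        return 2.0 * (1.0 - d / radius) ** 2
    return 0.0


def fuzzy_discretize(values: Sequence[float]) -> List[int]:
    """Assign each value to class 0 (low), 1 (medium), or 2 (high)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("expression values are constant; nothing to discretize")
    # ASSUMPTION (not from the chapter): evenly spaced centers, common radius.
    centers = [lo, (lo + hi) / 2.0, hi]
    radius = (hi - lo) / 2.0
    classes = []
    for x in values:
        raw = [pi_membership(x, c, radius) for c in centers]
        total = sum(raw)  # at least 0.5 here, since some center is within radius/2
        mu = [r / total for r in raw]  # normalized memberships sum to 1
        # Class with the largest membership; the lowest index wins ties.
        classes.append(max(range(len(mu)), key=lambda k: mu[k]))
    return classes
```

Applied to one miRNA's expression values across all samples, `fuzzy_discretize` returns one class index per sample, which is the kind of equivalence partition that the rough set-based relevance and significance computations take as input.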





