characters such that r_i ∈ {A, C, G, U}. Typically, n is in the hundreds, but could also be in the thousands. The secondary structure of the RNA molecule is a collection of stems, each stem consisting of a set of consecutive base pairs (for example, GU, GC, AU). Here, a base pair is denoted by (r_i, r_j), i < j, where r_i and r_j are connected through hydrogen bonds. If (r_i, r_j) is a base pair, in principle, we should require that r_i be a complement to r_j and that j − i be greater than a small threshold t (typically three), as it is known that an RNA molecule does not fold too sharply on itself.
Attempts to automatically predict the RNA secondary structure can be divided into essentially two general approaches. The first involves overall free energy minimization by adding contributions from each base pair, bulged base, loop, and other elements [1]. The second type of approach [360] is more empirical; it involves searching for the combination of nonexclusive helices with a maximum number of base pairings, satisfying the condition of a tree-like structure for the biomolecule. Within the latter, methods using dynamic programming are the most common [360, 395]. Methods for simulating the folding pathway of an RNA molecule [312, 313, 366] and locating significant intermediate states are important for the prediction of RNA structure [29, 127, 311] and its associated function.
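The core dynamic programming recurrence behind the base-pair-maximization family can be illustrated with a Nussinov-style sketch. This is a hypothetical minimal example, not one of the cited methods: it simply counts complementary pairs (allowing GU wobble pairs) and enforces a minimum loop length so the molecule does not fold too sharply on itself, whereas practical predictors minimize free energy.

```python
# Nussinov-style dynamic program: maximize the number of complementary
# base pairs in an RNA sequence.  Illustrative sketch only.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # forbid sharp folds: a pair (i, j) requires j - i > MIN_LOOP

def max_pairs(seq):
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]                         # r_i left unpaired
            if (seq[i], seq[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)  # pair (r_i, r_j)
            for k in range(i + 1, j):                   # bifurcation
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_pairs("GGGAAAUCC"))
```

On this toy sequence the recurrence recovers three nested pairs (two GC pairs and one GU wobble pair).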
1.3.5 Protein Structure Prediction and Classification
Identical protein sequences result in identical 3D structures. So, it follows that similar sequences may result in similar structures, and this is usually the case. However, identical 3D structures do not necessarily indicate identical sequences as there is a distinction between homology and similarity. There are a few examples of proteins in the databases that have nearly identical 3D structures, and are therefore homologous, but do not exhibit significant or detectable sequence similarity. Pairwise comparisons do not readily show positions that are conserved among a whole set of sequences and tend to miss subtle similarities that become visible when observed simultaneously among many sequences. Hence, one wants to simultaneously compare several sequences. Structural genomics is the prediction of the 3D structure of a protein from the primary amino acid sequence [21, 60, 70, 73, 112, 128, 150, 166, 175, 219, 220, 245, 268, 280, 287, 294, 295, 297, 329]. This is one of the most challenging tasks in bioinformatics as a protein’s function is a consequence of its structure.
There are five levels of protein structure. While the primary structure is the sequence of amino acids that compose the protein, the secondary structure of a protein is the spatial arrangement of the atoms constituting the main protein backbone. The supersecondary structure or motif is the local folding pattern built up from particular secondary structures. On the other hand, tertiary structure is formed by packing secondary structural elements linked by loops and turns into one or several compact globular units called domains, that is, the folding of the entire protein chain. A final protein may contain several protein subunits arranged in a quaternary structure.
Protein sequences almost always fold into the same structure in the same environment. Hydrophobic interactions, hydrogen bonding, electrostatic, and other van der Waals-type interactions also contribute to determining the structure of the protein. Many efforts are underway to predict the structure of a protein, given its primary sequence. A typical computation of protein folding would require computing all the spatial coordinates of atoms in a protein molecule, starting with an initial configuration and working up to a final minimum-energy folding configuration [31, 74, 174, 176, 273, 284, 303, 349]. Sequence similarity methods can predict the secondary and tertiary structures based on homology to known proteins. Secondary structure prediction methods include those proposed by Chou and Fasman [70] and Garnier et al. [112]. Artificial neural networks [280, 287] and nearest neighbor methods [294, 295] are also used for this purpose. Tertiary structure prediction methods [349] are based on energy minimization, molecular dynamics, and stochastic searches of conformational space.
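The flavor of propensity-based secondary structure prediction, in the spirit of the Chou and Fasman approach, can be illustrated with the following sketch: a per-residue helix propensity is averaged over a sliding window, and a window is called helical when the average exceeds 1. The propensity values and window size below are invented for the example and are not the published Chou–Fasman parameters.

```python
# Illustrative propensity values (NOT the real Chou-Fasman parameters).
HELIX_PROPENSITY = {"A": 1.4, "L": 1.2, "E": 1.5, "G": 0.5, "P": 0.3, "S": 0.8}

def predict_helix(seq, window=4):
    """Call each window 'H' (helix-like) if its mean propensity exceeds 1."""
    calls = []
    for i in range(len(seq) - window + 1):
        avg = sum(HELIX_PROPENSITY[c] for c in seq[i:i + window]) / window
        calls.append("H" if avg > 1.0 else "-")
    return "".join(calls)

print(predict_helix("AALEGPSG"))
```

The helix-favoring residues at the start of the toy sequence produce helical calls, while the glycine- and proline-rich tail does not.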
1.3.6 Molecular Design and Molecular Docking
When two molecules are in close proximity, it can be energetically favorable for them to bind together tightly. The molecular docking problem is the prediction of energy and physical configuration of binding between two molecules. A typical application is in drug design, in which one might dock a small molecule that is a potential drug to an enzyme one wishes to target. For example, HIV protease is an enzyme in the AIDS virus that is essential to its replication. The chemical action of the protease takes place at a localized active site on its surface. HIV protease inhibitor drugs are small molecules that bind to the active site in HIV protease and stay there, so that the normal functioning of the enzyme is prevented. Docking software allows us to evaluate a drug design by predicting whether it will be successful in binding tightly to the active site in the enzyme. Based on the success of docking, and the resulting docked configuration, designers can refine the drug molecule [63, 188, 232, 374].
On the other hand, quantitative structure–activity relationship (QSAR) analysis deals with establishing a mathematical correlation between calculated properties of molecules and their experimentally determined biological activity. These relationships may further help in predicting the activity of new analogs that can be used as drugs for a specific target [124, 154, 332].
1.3.7 Phylogenetic Trees for Studying Evolutionary Relationship
All species on earth undergo a slow transformation process called evolution. To explain the evolutionary history of today's species and how species relate to one another in terms of common ancestors, trees are constructed whose leaves represent the present-day species and whose intermediate nodes represent hypothesized ancestors. These kinds of labeled binary trees are called phylogenetic trees [163, 186, 191, 218, 308, 315]. Phylogenetic analysis is used to study evolutionary relationships. Phylogenies are reconstructed based on comparisons between present-day objects. Given data for a set of objects, the phylogenetic tree reconstruction problem is to find the particular arrangement of the objects in a tree that optimizes the given criterion. A number of algorithms are available to solve this problem [218, 308, 315].
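As a concrete illustration of distance-based tree reconstruction, the following is a minimal UPGMA sketch: the two closest clusters are merged repeatedly, and distances from the merged cluster to the rest are size-weighted averages. The three-species distance matrix is illustrative only, and real implementations use richer tree data structures than nested strings.

```python
def upgma(names, d):
    """names: leaf labels; d: dict mapping frozenset({x, y}) -> distance."""
    size = {n: 1 for n in names}       # number of leaves per cluster
    clusters = list(names)
    while len(clusters) > 1:
        # find the closest pair of clusters
        a, b = min(((x, y) for i, x in enumerate(clusters)
                    for y in clusters[i + 1:]),
                   key=lambda p: d[frozenset(p)])
        merged = "(%s,%s)" % (a, b)
        for c in clusters:
            if c not in (a, b):
                # size-weighted average distance to the merged cluster
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0]

d = {frozenset(("A", "B")): 2.0, frozenset(("A", "C")): 6.0,
     frozenset(("B", "C")): 6.0}
tree = upgma(["A", "B", "C"], d)
print(tree)   # A and B, the closest pair, merge first
```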
1.3.8 Analysis of Microarray Expression Data
Microarray is one of the high-throughput screening methods [281]. It measures the amount of mRNA in a sample that corresponds to a given gene or probe DNA sequence. Probe sequences are immobilized on a solid surface and allowed to hybridize with fluorescently labeled target mRNA. The intensity of fluorescence of a spot is proportional to the amount of target sequence that has hybridized to that spot, and therefore, to the abundance of that mRNA sequence in the sample. Microarrays allow for identification of candidate genes involved in a given process based on variation between transcript levels for different conditions and shared expression patterns with genes of known function. Microarray data can be represented by a real-valued expression table [275]. A large amount of mRNA and miRNA profiling has been done and deposited in databases like Gene Expression Omnibus [94] and ArrayExpress [267]. Much work has been done using expression data to understand the activity of genes or nongenic elements like miRNAs in several important cellular functions.
1.3.8.1 Clustering Genes or mRNAs
Clustering is one of the major tasks in gene expression data analysis [17, 157]. To understand gene function, gene regulation, cellular processes, and subtypes of cells, clustering techniques have proven to be helpful. The genes with similar expression patterns are likely to be involved in the same cellular processes, and a strong correlation of expression patterns between those genes indicates coregulation [95, 335]. Searching for common DNA sequences at the promoter regions of genes within the same cluster allows regulatory motifs specific to each gene cluster to be identified and cis-regulatory elements to be proposed [48, 335]. The inference of regulation through gene expression data clustering also gives rise to hypotheses regarding the mechanism of transcriptional regulatory network [87].
Conventional clustering methods such as hierarchical clustering [139], the k-means algorithm [141], the self-organizing map [334], principal component analysis [223], graph-theoretical approaches [30, 36, 135, 310, 372, 373, 381, 384], model-based clustering [66, 103, 114, 145, 222, 224, 341, 352, 371, 382, 383], density-based approaches [156], fuzzy clustering algorithms [50, 83, 113], and rough–fuzzy clustering algorithms [212, 213] group coexpressed genes from microarray data. Different supervised gene clustering algorithms have also been developed [84, 136, 204, 205] to find coregulated gene clusters by incorporating the information of sample categories into the gene clustering process. After clustering genes, a reduced set of genes can be selected for further analysis.
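As an illustration of partitive clustering of expression profiles, the following is a minimal k-means sketch in which each gene is a vector of expression values measured across conditions. The toy profiles and starting centroids are invented for the example; production implementations add smarter initialization and convergence checks.

```python
# Minimal k-means sketch for grouping coexpressed genes.

def kmeans(points, centroids, iters=20):
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            # assign each gene to the nearest centroid (squared Euclidean)
            i = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            groups[i].append(p)
        # recompute each centroid as the mean of its group
        centroids = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

genes = [(0.1, 0.2, 0.1), (0.0, 0.3, 0.2),    # low-expression pattern
         (2.0, 2.1, 1.9), (2.2, 1.8, 2.0)]    # high-expression pattern
cents, groups = kmeans(genes, [genes[0], genes[2]])
print([len(g) for g in groups])
```

The two invented expression patterns separate cleanly into two clusters of two genes each.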
1.3.8.2 Clustering miRNAs
Recent genome-wide surveys on noncoding RNAs have revealed that a substantial fraction of miRNAs is likely to form clusters. The genes of miRNAs are often organized in clusters in the genome. Expression analyses have shown strong positive correlations among closely located miRNAs, indicating that they may be controlled by common regulatory elements. In fact, experimental evidence has demonstrated that clustered miRNA loci form an operon-like gene structure and that they are transcribed from a common promoter [9]. The existence of coexpressed miRNAs has also been demonstrated using expression profiling analysis in [28]. Several miRNA clusters have been experimentally verified by RT-PCR or Northern blotting [54, 183]. These findings suggest that members of a miRNA cluster, which are in close proximity on a chromosome, are highly likely to be processed as cotranscribed units.
Expression data of miRNAs can be used to detect clusters of miRNAs: as coexpressed miRNAs are believed to be cotranscribed, they should have similar expression patterns. The complex miRNA–mRNA networks greatly increase the challenges of comprehending and interpreting the resulting mass of data. A first step toward addressing this challenge is the use of clustering techniques, which are essential in the pattern recognition process to reveal natural structures and identify interesting patterns in the underlying data [157]. Applying clustering techniques, miRNAs having similar cellular activities can be grouped together. In this background, several authors used hierarchical clustering in order to group miRNAs having similar function [96, 201, 354]. In [25], the self-organizing map has been used to cluster miRNA expression profiles. Recently, Maji and Paul introduced rough–fuzzy clustering for identification of coexpressed miRNA clusters [212].
1.3.8.3 Selection of Genes or mRNAs
An important application of gene expression data in functional genomics is to classify samples according to their gene expression profiles, such as to classify cancer versus normal samples or to classify different types or subtypes of cancer [119]. However, among the large number of genes present in a microarray data set, only a small fraction of them is effective for classifying samples into different classes. Hence, one of the relevant and challenging problems in gene expression-based disease classification is the selection of effective genes from microarray data [326]. This is an important problem in machine learning, referred to as feature selection. In this regard, different feature selection methods have been used to select differentially expressed genes [6, 35, 119, 288, 291]. A small subset of genes is desirable in developing gene expression-based diagnostic tools for delivering precise, reliable, and interpretable results. With the gene selection results, the cost of biological experiments and decisions can also be greatly reduced by analyzing only the marker genes.
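A minimal filter-style illustration of this idea is sketched below: genes are ranked by a two-sample t-like statistic between two phenotype classes, and the top-ranked ones are kept as candidate markers. The gene names and expression values are invented for the example.

```python
import math

def t_score(xs, ys):
    """Absolute Welch-style t statistic between two sample groups."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((v - mx) ** 2 for v in xs) / (len(xs) - 1)
    vy = sum((v - my) ** 2 for v in ys) / (len(ys) - 1)
    return abs(mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))

# rows: genes; per gene, three tumour samples then three normal samples
data = {"g1": ([5.1, 5.3, 5.2], [1.0, 1.2, 0.9]),   # differentially expressed
        "g2": ([2.0, 2.1, 1.9], [2.0, 1.9, 2.1])}   # flat across classes
ranked = sorted(data, key=lambda g: t_score(*data[g]), reverse=True)
print(ranked[0])
```

The strongly differential gene tops the ranking; in practice the score would be combined with multiple-testing correction.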
1.3.8.4 Selection of miRNAs
Multiple reports have noted the utility of miRNAs for the diagnosis of cancer and other diseases [201]. The functions of miRNAs appear to differ across cellular contexts. Just as miRNA is involved in the normal functioning of eukaryotic cells, so has dysregulation of miRNA been associated with disease [184]. Different methods have been developed to identify potential miRNAs involved in a particular disease. In this background, the method called significance analysis of microarrays has been used to select potential miRNAs from expression data [151, 192, 238, 247, 274, 285]. Studies have also been conducted using statistical tests like the t-test and F-test for identifying differentially expressed miRNAs [129, 301, 355, 392]. Pattern recognition and machine learning techniques have been successfully used in [269, 270] to select differentially expressed miRNAs. The miRNA selection helps to infer the endogenous miRNA–mRNA correlation associated with a disease.
1.3.8.5 Integration of mRNA and miRNA Expression Data
High-throughput techniques are used to generate a huge amount of mRNA and miRNA expression data. Individually, gene or mRNA expression data have been used to identify potential biomarkers for a wide range of diseases. However, gene or mRNA expression data alone often do not reflect robust molecular subtypes in many diseases. The small sample size and high number of features in the expression data pose a challenge to extracting meaningful information [275]. In order to generate a robust predictive model, a few studies have been conducted by integrating different kinds of omics data [216, 221, 339]. While carrying out these types of integrated analyses, the properties and scales of the different types of features have to be taken into account, as well as the relations between them.
The miRNA regulates gene expression at the posttranscriptional level. Hence, expression data of miRNA can provide information that is complementary to mRNA expression data. In this regard, data fusion could lead to improved classification accuracy [359]. It also helps to infer the endogenous miRNA–mRNA correlation associated with a disease. Hence, combining miRNA and mRNA expression data, several methods have been developed that can clearly differentiate expression data into different types or subtypes and can reveal the correlation between miRNA and mRNA in a particular disease [90, 347].
1.3.8.6 Inference of Gene Regulatory Networks
One of the most challenging problems in the field of bioinformatics is inferring a gene regulatory network (GRN) from gene expression data [5]. An important and interesting question in biology, regarding the variation of gene expression levels, is how genes are regulated. Since almost all cells in a particular organism have an identical genome, differences in gene expression, and not the genome content, are responsible for cell differentiation during the life of the organism. For gene regulation, an important role is played by a type of proteins called transcription factors [308]. The transcription factors bind to specific parts of the DNA, called transcription factor binding sites (TFBSs), which are specific, relatively short combinations of A, T, C, or G located in promoter regions. Transcription factors control gene expression by binding to the gene's promoter and either activating the gene or repressing it. They are gene products and therefore, in turn, can be controlled by other transcription factors. Transcription factors can control many genes, and some genes are controlled by combinations of transcription factors. Feedback loops are also possible.
Gene expression data can be used to infer regulatory relationships. This approach is known as reverse engineering of regulatory networks. Segal et al. [306] and Timothy et al. [340] highlighted that expression data can be used to make predictions about the transcriptional regulators for a given gene or sets of genes. Segal et al. [306] have developed a probabilistic model to identify modules of coregulated genes, their transcriptional regulators, and conditions that influence regulation. Timothy et al. [340] described a method to infer regulatory relationships which uses nonlinear differential equations to model regulatory networks. In this method, a model of connections between genes in a network is inferred from measurements of system dynamics, for example, response of genes and proteins to perturbations. Greenfield et al. [121] developed a hybrid method for incorporating structure priors into global regulatory networks inference. A new model for the GRNs has been developed in [346], which builds upon existing models by adding an epigenetic control layer. An approach for network inference by integrating expression plasticity into Shannon’s mutual information is described in [356] for reconstruction of the GRNs. An integrated method has been developed for reconstructing the GRNs [88], utilizing both temporal information arriving from time-series gene expression profiles and topological properties of protein networks.
Reverse engineering based on differential equations and Bayesian networks was used by Cantone et al. [57] to identify regulatory interactions from time-series and steady-state expression data. A GRN inference algorithm from gene expression data based on a differential equation model has been developed in [393]. A path consistency algorithm based on conditional mutual information has been employed for inferring the GRNs from gene expression data, considering the nonlinear dependence and topological structure of the GRNs [390]. Liang and Wang [193] proposed a relevance network model for GRN inference, in which both mutual information and conditional mutual information are used to determine the interactions between genes. Liang et al. [194] also developed a mutual information-based reverse engineering algorithm, called REVEAL, for understanding the interaction of genes. Later, Baladoni et al. [18] provided a new definition of mutual information using concepts from fuzzy set theory, extended the model on which the REVEAL algorithm for reverse engineering of the GRNs is based, and designed a new flexible version of it, called FuzzyREVEAL.
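The mutual-information criterion underlying REVEAL-style reverse engineering can be sketched as follows, assuming, as REVEAL does, that expression profiles are discretized to binary on/off levels. The binary profiles below are illustrative only; this is the scoring step, not the full network-search algorithm.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired discrete observations."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

g1 = [0, 0, 1, 1, 0, 1, 1, 0]
g2 = [1, 1, 0, 0, 1, 0, 0, 1]   # exact complement of g1: I = 1 bit
g3 = [0, 1, 0, 1, 0, 1, 0, 1]   # statistically unrelated to g1: I = 0
print(mutual_information(g1, g2), mutual_information(g1, g3))
```

A high score between two profiles flags a candidate regulatory link, regardless of whether the relation is activating or repressing.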
Microarrays have also been used in drug discovery [53], and its applications include basic research and target discovery, biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests, and disease-subclass determination. It is also used for gene set enrichment analysis [328]. Other potential bioinformatics tasks for biological problems are as follows: characterization of protein content and metabolic pathways between different genomes; identification and analysis of interacting proteins; characterization of repeats from genomes; gene mapping on chromosomes; analysis of genomic-scale censuses; assignment and prediction of gene products; large-scale analysis of gene expression levels; mapping expression data to sequence, structural, and biochemical data; development of digital libraries for automated bibliographical searches; development of knowledge bases of biological information from the literature; and development of DNA analysis methods in forensics [12, 77, 100, 137, 331, 353].
1.4 Pattern Recognition Perspective
Pattern recognition and machine learning tools and techniques have been widely used for analysis of biological data, as classification, clustering, and feature selection are needed to analyze large biological data sets. Pattern recognition is the multidisciplinary research area that is concerned with the classification or description of objects. It aims to classify data or patterns based on either a priori knowledge or statistical information extracted from the data. Hence, in pattern recognition, mathematical, statistical, heuristic, and inductive techniques are utilized to perform, on computers, tasks that human beings carry out naturally [22, 85, 92, 108, 155, 209, 258, 260, 261, 263, 343].
At present, pattern recognition and machine learning provide the most fruitful framework for bioinformatics [20, 22, 209, 302, 377, 391]. They provide a wide range of linear and nonlinear, comprehensible and complex, predictive and descriptive, instance and rule-based models for different data mining tasks such as dimensionality reduction, clustering, classification, and rule discovery. Also, the methods for modeling probabilistic and fuzzy uncertainties, vagueness, and incompleteness in the discovered patterns form a part of pattern recognition research. Another aspect that makes pattern recognition algorithms attractive for computational biology and bioinformatics is their capability of learning or induction. As opposed to many statistical techniques that require the user to have a hypothesis in mind first, pattern recognition algorithms and machine learning techniques automatically analyze the data and identify relationships among attributes and entities in the data to build models that allow domain experts to understand the relationship between the attributes and the class. Several data preprocessing tasks such as instance selection, data cleaning, dimensionality reduction, and handling missing data are also extensively studied in pattern recognition framework. Besides these, other data mining issues addressed by pattern recognition methodologies include handling of relational, sequential, and symbolic biological data, knowledge encoding and extraction, knowledge evaluation, and visualization.
Pattern recognition is at the core of data mining systems. However, pattern recognition and data mining are not equivalent considering their original definitions. There exists a gap between the requirements of a data mining system and the goals achieved by present-day pattern recognition algorithms. Development of new generation pattern recognition algorithms is expected to encompass more massive biological data sets involving diverse sources and types of data that will support mixed initiative data mining, where human experts collaborate with the computer to form hypotheses and test them.
1.4.1 Pattern Recognition
Pattern recognition is a two-step procedure. The first step consists of learning the invariant and common properties of a set of samples characterizing a class. In the second step, it is decided whether a new sample is a possible member of the class, by noting whether it has properties common to those of the set of samples. The task of pattern recognition can be described as a transformation from the measurement space M to the feature space F and finally to the decision space D; that is,

M → F → D,

where the mapping f : F → D is the decision function, and the elements d ∈ D are termed decisions [92, 209, 336].
A pattern recognition process can be decomposed into a series of steps: data acquisition; data preprocessing; feature selection; and classification or clustering. The data acquisition phase includes gathering of data via a set of sensors depending on the environment within which the objects are to be classified. Raw data contain noise, so preprocessing tasks such as noise reduction, filtering, encoding, and enhancement are applied to the collected data for extracting pattern vectors. The dimension of the preprocessed data is then reduced by retaining or measuring only some characteristic features or properties. In a broader perspective, this stage significantly influences the entire recognition process. The last phase comprises the construction of a classifier, in which a transformation relationship is established between features and classes [22, 92, 155, 209, 260, 336].
1.4.1.1 Data Acquisition and Preprocessing
Data acquisition is the process of gathering data via a set of sensors depending on the environment within which the objects are to be classified. Pattern recognition techniques are applicable in a wide domain, where the data may be qualitative, quantitative, or both; they may be numerical, linguistic, pictorial, or any combination thereof. Generally, the data structures that are used in bioinformatics are of two types: object data vectors such as microarray expression data and relational data such as DNA or protein sequences. Object data, sets of numerical vectors of m features, are represented as X = {x_1, …, x_n}, a set of n feature vectors in the m-dimensional measurement space Ω. The ith object observed in the process has vector x_i as its numerical representation; x_ij is the jth (j = 1, …, m) feature associated with the ith object. On the other hand, relational data are a set of numerical relationships, say {r_ij}, between pairs of objects. In other words, r_ij represents the extent to which objects x_i and x_j are related in the sense of some binary relationship ρ. If the objects that are pairwise related by ρ are denoted O = {o_1, …, o_n}, then r_ij = ρ(o_i, o_j).
After data acquisition, a number of data preprocessing techniques [134] are applied to the collected data for extracting pattern vectors. Today's real-world biological databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources. Low-quality data may lead to low-quality mining results. The methods for data preprocessing are organized into the following three categories: data cleaning, data integration, and data transformation. While data cleaning can be applied to remove noise and correct inconsistencies in the data, data integration merges data from multiple sources into a coherent data store. In data transformation, the data are transformed into forms appropriate for mining. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements.
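As a small example of such a transformation, min-max normalization rescales each feature to [0, 1] so that distance-based mining algorithms are not dominated by features with large raw ranges:

```python
def min_max(column):
    """Rescale one feature column to the unit interval."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

print(min_max([10.0, 20.0, 15.0, 30.0]))
```

Note that a constant column (hi == lo) would need special handling in practice.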
1.4.1.2 Feature Selection and Extraction
The process of feature selection or extraction includes selection of a map by which a sample in an m-dimensional measurement space is transformed into a point in a d-dimensional feature space, where d < m [85, 343]. Mathematically, it finds a mapping of the form y = f(x), by which a sample x in the m-dimensional measurement space is transformed into an object y in the d-dimensional feature space. The objective of feature selection or extraction is twofold: to retain or generate the optimum salient characteristics necessary for the recognition process and to reduce the dimensionality of the measurement space so that effective and easily computable algorithms can be devised for efficient classification.
In feature selection or extraction, a suitable criterion is formulated first to evaluate the goodness of a feature set, and then an optimal set is searched for in terms of that criterion. Features having the potential to maximize interclass distances and minimize intraclass distances are considered to have optimal saliency. A good feature should be invariant to possible variation within a class, while emphasizing differences that are important in discriminating between patterns of different types. The major mathematical measures so far devised for the estimation of feature quality are mostly statistical in nature, and can be broadly classified into two categories, namely, feature selection in the measurement space [44, 130, 159, 172] and feature selection in a transformed space [34, 58, 86, 149]. The techniques in the first category generally reduce the dimensionality of the measurement space by discarding redundant or least informative features. On the other hand, those in the second category utilize all the information contained in the measurement space to obtain a new transformed space, thereby mapping a higher dimensional pattern to a lower dimensional one. This is referred to as feature extraction [85, 92, 343].
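A simple statistical measure of this kind is a Fisher-like score, which rewards a large interclass distance relative to the intraclass spread. The sketch below computes it per feature from labeled samples; the feature values are illustrative.

```python
def fisher_score(values_a, values_b):
    """Squared distance between class means over the sum of class variances."""
    ma = sum(values_a) / len(values_a)
    mb = sum(values_b) / len(values_b)
    va = sum((v - ma) ** 2 for v in values_a) / len(values_a)
    vb = sum((v - mb) ** 2 for v in values_b) / len(values_b)
    return (ma - mb) ** 2 / (va + vb)

f1_a, f1_b = [1.0, 1.2, 0.8], [5.0, 5.2, 4.8]   # classes well separated
f2_a, f2_b = [2.0, 3.0, 4.0], [2.5, 3.5, 4.5]   # classes heavily overlapping
print(fisher_score(f1_a, f1_b) > fisher_score(f2_a, f2_b))
```

The well-separated feature scores far higher, so it would be retained in preference to the overlapping one.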
The relationship between a feature selection algorithm and the inducer or classifier chosen to evaluate the usefulness of the feature selection process can take three main forms, namely, embedded, filter, and wrapper. In the embedded scheme, the inducer or classifier has its own feature selection algorithm, either explicit or implicit. Methods to induce logical conjunctions [367] and traditional machine learning tools such as decision trees and artificial neural networks [227] are a few examples of the embedded technique. The filter schemes are independent of the induction algorithm. If the feature selection process takes place before the induction step, the former can be seen as a filter of nonrelevant features prior to induction. The filter methods evaluate the goodness of the feature subset by looking only at the intrinsic characteristics of the data, based on the relationship of each single feature with the class label, using simple statistics computed from the empirical distribution [79, 80, 133]. In the wrapper approach [172], a search is conducted in the feature space, evaluating the goodness of each feature subset by estimating the accuracy of the specific classifier to be used [159].
The generation procedure of candidate feature subsets can be categorized into individual feature ranking [79, 80, 133] and feature subset selection [44, 130]. The former measures the relevance of each feature to the class and selects the top-ranked ones, and is commonly used due to its simplicity, scalability, and good empirical success [130]. Recently, different deterministic heuristic search methods, such as sequential forward selection, sequential backward selection, sequential floating forward selection, and sequential floating backward selection [279], and nondeterministic heuristic search methods, such as simulated annealing [169], genetic algorithms [143], and tabu search [117], have also been used in feature selection.
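Sequential forward selection, the simplest of these heuristics, can be sketched as follows: starting from the empty set, the feature that most improves a scoring function is added greedily until the requested subset size is reached. The scoring function here is a stand-in for estimated classifier accuracy, with invented relevance and redundancy values.

```python
def sfs(features, score, k):
    """Greedy sequential forward selection of k features."""
    selected = []
    while len(selected) < k:
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# illustrative score: summed relevance minus a penalty for redundant pairs
relevance = {"f1": 0.9, "f2": 0.8, "f3": 0.3}
redundant = {frozenset(("f1", "f2"))}

def score(subset):
    s = sum(relevance[f] for f in subset)
    s -= 0.7 * sum(1 for i, a in enumerate(subset)
                   for b in subset[i + 1:] if frozenset((a, b)) in redundant)
    return s

print(sfs(["f1", "f2", "f3"], score, 2))
```

Although f2 is individually stronger than f3, the greedy search skips it because of its redundancy with the already-selected f1.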
Feature extraction methods determine an appropriate subspace of dimension d from the original feature space of dimension m (d < m), either in a linear or a nonlinear way. Linear transforms such as principal component analysis (PCA) [92], independent component analysis [34, 58, 72, 182], linear discriminant analysis [109], and projection pursuit [104] have been widely used in pattern recognition for feature extraction and dimensionality reduction. On the other hand, kernel PCA [138, 300] and multidimensional scaling [45, 241, 296] are two examples of nonlinear feature extraction techniques. Artificial neural networks have also been used for feature extraction [78, 110, 173, 200].
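A minimal PCA sketch illustrates linear feature extraction: the data are centered and the leading eigenvector of the covariance matrix is found by power iteration; projecting onto it yields a one-dimensional feature. The data points are illustrative, and real implementations use robust eigensolvers rather than plain power iteration.

```python
import math

def first_principal_component(data, iters=200):
    """Leading eigenvector of the sample covariance matrix, via power iteration."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(m)]
           for i in range(m)]
    v = [1.0] * m
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# points lying close to the line y = x: the top component is near (1,1)/sqrt(2)
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
v = first_principal_component(data)
print(v)
```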
1.4.1.3 Classification and Clustering
In classification and clustering, a feature space is partitioned into regions, where each region represents a category of input. Accordingly, it attempts to assign every data object in the entire feature space to one of the possible classes or clusters. In real life, the classes of samples are not properly described. Instead, a finite and usually smaller number of samples are available, which often provide partial information for optimal design of feature selection algorithm or classification or clustering system. Under such circumstances, it is assumed that these samples are representative of the classes or clusters. Such a set of typical patterns is called a training set. On the basis of the information gathered from the samples in the training set, the pattern recognition systems are designed, that is, the values of the parameters of various pattern recognition methods are decided.
A classification (respectively, clustering) scheme is designed using labeled (respectively, unlabeled) data. In supervised learning, an algorithm is developed using objects with known classifications, and later it is asked to classify an unknown object based on the information acquired during training. Supervised learning is used for classifying different objects. Some of the well-known classifiers are the Bayesian classifier [92, 343], naive Bayesian classifier [92, 343], decision tree [49, 62, 69, 225, 231, 234, 282, 292, 309, 333], multilayer perceptron [138, 140, 197], radial basis function network [138, 299], support vector machine [52, 298, 299, 351], and k-nearest neighbor method [92, 343].
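The k-nearest neighbor method, the simplest of these classifiers, can be sketched as follows: an unknown sample takes the majority label among its k closest training samples. The two-feature training set and labels are invented for the example.

```python
from collections import Counter

def knn(train, query, k=3):
    """train: list of (feature_vector, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda t: sum((a - b) ** 2
                                              for a, b in zip(t[0], query)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0.0, 0.0), "normal"), ((0.2, 0.1), "normal"),
         ((1.0, 1.0), "tumour"), ((0.9, 1.1), "tumour"), ((1.1, 0.9), "tumour")]
print(knn(train, (0.95, 1.0)))
```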
On the other hand, clustering is performed through unsupervised learning. In cluster analysis, a given data set is divided into a set of clusters in such a way that objects in the same cluster are as similar as possible, while objects from different clusters are as dissimilar as possible. In effect, it tries to mimic the human ability to group similar objects into classes and categories. A number of clustering algorithms have been proposed to suit different requirements [41, 59, 92, 106, 152, 153, 160]. There are mainly two types of clustering approaches, namely, partitive algorithms such as k-means [199, 202], k-modes [146], PAM, CLARA [164], and CLARANS [97, 240], and hierarchical methods such as AGNES [81, 164], DIANA [164], Chameleon [162], ROCK [126], CURE [125], and BIRCH [389]. Aside from these two categories, two classes of clustering tasks require special attention, namely, clustering high-dimensional data (for example, CLIQUE [3] and PROCLUS [2]) and constraint-based clustering [344, 345]. Many tools and concepts of statistical physics have also been used in pattern recognition, such as fractals [91, 168, 170], renormalization group [115], Ising models [144, 327], bond percolation [148], and Gibbs–Markov random fields [147].
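The partitive approach can be illustrated with a bare-bones k-means: alternately assign each point to its nearest centroid and move each centroid to the mean of its assigned points. The toy data and iteration count below are hypothetical; production variants add convergence checks and smarter initialization.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Partitive clustering: alternate nearest-centroid assignment
    and centroid update until the iteration budget is spent."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random starts
    for _ in range(iters):
        # distance of every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(k):                    # move centroids to cluster means
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# two hypothetical well-separated groups of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, _ = kmeans(X, 2)
print(labels)
```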
1.4.2 Relevance of Soft Computing
One of the important problems in bioinformatics is uncertainty. Imprecision in computations and vagueness in class definitions are two of its sources; the uncertainty may also be probabilistic or fuzzy in nature. Pattern recognition, by its nature, admits many approaches to providing an appropriate solution to a given problem. Any pattern recognition system needs to be robust with respect to random noise and component failure, and to produce output in real time. It is also desirable for the system to adapt to changes in its environment. Moreover, a system can be made artificially intelligent if it is able to emulate some aspects of human reasoning. The system should also be able to handle nonlinear and/or overlapping classes to tackle real-life problems, and to generate both soft and hard decisions so as to remain flexible.
Soft computing and machine learning approaches to pattern recognition are attempts to achieve these goals. Artificial neural networks, decision trees, genetic algorithms, fuzzy sets, and rough sets are used as tools in these approaches. The challenge is, therefore, to devise powerful pattern recognition methodologies by symbiotically combining these tools for analyzing biological data in more efficient ways. The systems should have the capability of flexible information processing to deal with real-life ambiguous situations and to achieve tractability, robustness, and low-cost solutions [209]. Connectionist or artificial neural network based approaches to pattern recognition attempt to achieve some of these goals through their major characteristics of adaptivity, robustness or ruggedness, speed, and optimality [42, 65, 138, 197, 276]. They are also suitable in data-rich environments and are typically used for extracting embedded knowledge in the form of rules, quantitative evaluation of these rules, clustering, self-organization, classification, and regression. They also have an advantage over other types of machine learning algorithms in scaling [37, 56, 89, 107, 131, 198, 210, 357, 375, 376, 380]. Investigations have also been made in the area of pattern recognition using genetic algorithms [22, 266]. Like neural networks, genetic algorithms [118] are based on powerful metaphors from the natural world. They mimic some of the processes observed in natural evolution, including crossover, selection, and mutation, leading to a stepwise optimization of organisms.
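The interplay of selection, crossover, and mutation can be sketched on the classic "one-max" toy problem, where the fitness of a bit string is simply its number of ones. All parameters below (population size, mutation rate, and so on) are illustrative choices, not prescriptions.

```python
import random

def genetic_max_ones(length=20, pop_size=30, generations=60, seed=1):
    """Toy genetic algorithm: evolve bit strings toward all ones via
    tournament selection, one-point crossover, and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    fitness = sum                                  # number of ones

    def pick():                                    # binary tournament
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length)         # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(length):                # bit-flip mutation
                if rng.random() < 0.02:
                    child[i] ^= 1
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

best = genetic_max_ones()
print(sum(best))                                   # near-maximal fitness
```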
The fuzzy set theoretic classification approach is based on the realization that a pattern may belong to more than one class, with varying degrees of membership. Accordingly, fuzzy decision theoretic, fuzzy syntactic, and fuzzy neural approaches have been developed [40, 51, 258, 261]. These approaches can handle uncertainties arising from vague, incomplete, linguistic, and overlapping patterns at various stages of pattern recognition systems [40, 161, 258, 385]. As a generalization of classical set theory, fuzzy set theory has greater flexibility to capture various aspects of incompleteness, impreciseness, or imperfection in information about a situation [385]. The relevance of fuzzy set theory in the realm of pattern recognition is adequately justified in [39, 40, 161, 258, 386]. Fuzzy sets have been successfully applied in pattern classification [179], clustering [39, 40, 93, 161, 178, 217, 248, 258], and image processing [23, 105, 165, 206, 252, 253, 255, 257, 342, 370]. In addition to pattern recognition, fuzzy sets find widespread application in association rule mining [16, 203, 361], fuzzy information storage and retrieval [132], functional dependency [47], data summarization [67, 181], web mining [177, 237], granular computing [26, 27], microarray data analysis [33, 83], and so forth.
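The idea that a pattern belongs to every class with some degree of membership can be made concrete with a fuzzy c-means style membership computation: given fixed cluster centers, each point receives memberships in [0, 1] that sum to one, weighted by inverse distance. The centers and fuzzifier value below are hypothetical.

```python
import numpy as np

def fuzzy_memberships(x, centres, m=2.0):
    """Fuzzy c-means style membership degrees of point x in each
    cluster: values lie in [0, 1] and sum to one across clusters."""
    d = np.linalg.norm(centres - x, axis=1)       # distance to each center
    if np.any(d == 0):                            # x coincides with a center
        hits = (d == 0).astype(float)
        return hits / hits.sum()
    inv = d ** (-2.0 / (m - 1.0))                 # standard FCM weighting
    return inv / inv.sum()

# hypothetical centers; the test point is three times closer to the first
centres = np.array([[0.0, 0.0], [4.0, 0.0]])
u = fuzzy_memberships(np.array([1.0, 0.0]), centres)
print(u)  # [0.9 0.1]
```

Contrast this with a hard classifier, which would report only the nearer class; the membership vector retains how ambiguous the assignment actually is.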
The theory of rough sets [265, 271, 316] has gained popularity in modeling and propagating uncertainty, and in dealing with vagueness and incompleteness. Hence, rough sets have emerged as a potential pattern recognition tool. The main idea of rough sets is to construct or synthesize approximations, in terms of upper and lower bounds, of concepts, properties, or relations from the acquired data. Information granules and reducts are the key notions here. Information granules formalize the concept of finite-precision representation of objects in real-life situations, while reducts represent the core of an information system, in terms of both objects and features, in a granular universe. Rough set theory also provides a mathematical framework to capture uncertainties associated with the human cognition process [208]. It is turning out to be methodologically significant to artificial intelligence and the cognitive sciences, especially in the representation of, and reasoning with, vague and/or imprecise knowledge, data classification, data analysis, machine learning, and knowledge discovery [246, 277, 278, 316]. This approach is relatively new compared to the connectionist and fuzzy set theoretic approaches. Rough set theory has been applied successfully to pattern classification [10, 32, 122, 123, 289, 314, 320, 322, 323], clustering [15, 82, 142, 196, 207, 208, 256], feature selection [71, 75, 211, 214, 319, 321, 323], microarray data analysis [68, 98, 211, 318, 324, 350], prediction of biological activity of molecules [210], and image processing [158, 235, 259, 363–365].
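The lower and upper approximations can be computed directly once an indiscernibility relation is fixed: objects with identical feature values form equivalence classes (information granules), and a granule contributes to the lower approximation only if it lies wholly inside the target concept, and to the upper approximation if it merely overlaps it. The toy information table below, with a single "colour" feature, is hypothetical.

```python
def approximations(universe, target, key):
    """Rough-set lower/upper approximations of `target` under the
    indiscernibility relation induced by `key` (equal feature values)."""
    classes = {}                          # equivalence classes (granules)
    for obj in universe:
        classes.setdefault(key(obj), set()).add(obj)
    lower, upper = set(), set()
    for granule in classes.values():
        if granule <= target:             # wholly inside: certain members
            lower |= granule
        if granule & target:              # overlapping: possible members
            upper |= granule
    return lower, upper

# hypothetical toy table: objects 1..6 described by one feature, colour
colour = {1: 'r', 2: 'r', 3: 'g', 4: 'g', 5: 'b', 6: 'b'}
target = {1, 2, 3}
lower, upper = approximations(set(colour), target, colour.get)
print(sorted(lower), sorted(upper))  # [1, 2] [1, 2, 3, 4]
```

Object 3 is indiscernible from object 4, so the concept {1, 2, 3} cannot be described exactly in this granular universe; the gap between the two approximations is precisely its boundary region.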
There have been several attempts over the last two decades to evolve new approaches to pattern recognition and to derive their hybrids by judiciously combining the merits of several techniques [249, 261], involving mainly fuzzy logic, artificial neural networks, genetic algorithms, and rough set theory, in an efficient paradigm called soft computing [387]. Here integration is done in a cooperative, rather than a competitive, manner. The result is a more intelligent and robust system providing a human-interpretable, low-cost, approximate solution, as compared to traditional techniques. The neuro-fuzzy approach is perhaps the most visible hybrid paradigm realized so far in the soft computing framework [51, 228–230, 254, 261]. Besides the generic advantages, the neuro-fuzzy approach provides corresponding application-specific merits [76, 111, 120, 330, 368, 388, 394]. Rough–fuzzy [209, 250, 265] and neuro-rough [68, 158, 264] hybridizations are also proving to be fruitful frameworks for modeling human perceptions and providing means for computing with words. Rough–fuzzy computing provides a powerful mathematical framework to capture uncertainties associated with the data. Other hybridized models for pattern recognition and data mining include neuro-genetic [46, 215, 251, 293, 362], rough-genetic [43, 317, 369], fuzzy-genetic [14, 61, 64, 116, 180, 233], rough-neuro-genetic [167], rough-neuro-fuzzy [11, 24, 243, 244, 262], and neuro-fuzzy-genetic [187, 195, 283, 290, 307, 358] approaches.