is the mixing matrix (sometimes also called the dictionary). Typically, t is an acquisition variable over which a sample of the mixture (a column, for a discrete acquisition variable) is collected. The most common acquisition variables are time and frequency; however, position, wave number, and other indices can be used, depending on the nature of the physical process under investigation. In addition to the sources being unknown, no knowledge about the mixing process is assumed. The generative model of the problem in its standard form can be written as:
$\mathbf{x}(t) = \mathbf{A}\,\mathbf{s}(t) + \mathbf{n}(t)$  (1)

where $\mathbf{n}(t)$ denotes additive noise.
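To make the model concrete, the following NumPy sketch generates a hypothetical instance of Eq. (1) with two synthetic sources (a sinusoid and a square wave), a random 3 × 2 mixing matrix, and additive noise; all signal choices are illustrative assumptions rather than part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

t = np.linspace(0, 1, 1000)                           # acquisition variable (here: time)
S = np.vstack([np.sin(2 * np.pi * 5 * t),             # source 1: 5 Hz sinusoid
               np.sign(np.sin(2 * np.pi * 3 * t))])   # source 2: 3 Hz square wave

A = rng.normal(size=(3, 2))                           # unknown 3 x 2 mixing matrix
N = 0.05 * rng.normal(size=(3, t.size))               # additive sensor noise

X = A @ S + N                                         # observed mixtures (matrix form of Eq. (1))
```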
A classical illustrative example of the BSS model is the cocktail party problem, where a mixture of sound signals from simultaneously speaking individuals is available (see Fig. 1 for a simple illustration). In a nutshell, the goal in BSS is to identify and extract the sources (Fig. 1b) from the available mixture signals (Fig. 1a). This problem has caught the attention of many researchers due to its wide applicability in different scientific research areas. A general setup of the BSS problem in computational neuroscience is depicted in Fig. 2. Any surface (or scalp) noninvasive cognitive activity recording can be used as a specific example. Depending upon the scenario, the mixture can be EEG, MEG, or fMRI data. Typically, physical substances such as the skull, brain matter, muscles, and the electrode–skull interface act as mixers. The goal is to identify the internal source signals and thereby reduce the effect of mixing in further analysis.
Fig. 2: BSS setup for the human brain
Currently, most BSS approaches in computational neuroscience are based on the statistical independence assumption. There are very few approaches that exploit sparsity in the signals. Sparsity assumptions can be considered more flexible than the independence assumption, since independence requires the sources to be at least uncorrelated. In addition, if the number of sources is larger than the number of mixtures (the underdetermined case), the statistical independence assumption cannot reveal the sources, although it can reveal the mixing matrix. For sparsity-based approaches, there are very few papers in the literature (compared to independence-based approaches) devoted to developing identifiability conditions and methods for uniquely identifying (or learning) the mixing matrix (3–6).
In this paper, an overview of the BSS problem will be presented. Sufficient identifiability conditions will be reviewed, and their implications for the solution methodology will be discussed. Different well-known approaches used to solve the BSS problem will also be briefly presented. The application of sparsity-based BSS methods in computational neuroscience will be highlighted through various experiments on real-world data. Due to the large number of approaches for Independent Component Analysis (ICA), only the most important ones, from our point of view, will be considered.
1.1 Other Look-Alike Problems
BSS is a special type of Linear Matrix Factorization (LMF) problem. There are many other methods that can be described in the form of LMF, for instance, Nonnegative Matrix Factorization (NMF), Morphological Component Analysis (MCA), and Sparse Dictionary Identification (SDI). The three properties that differentiate BSS from other LMF problems are:
The model is assumed to be generative
In BSS, the data matrix X is assumed to be a linear mixture of S.
Completely unknown source and mixing matrices.
Some of the LMF methods (like MCA) assume partial knowledge about mixing.
Identifiable source and mixing matrices
Some of the LMF methods (like NMF and SDI) focus on estimating A and S without any condition for identifiability. NMF can be considered a dimensionality reduction method, like Principal Component Analysis (PCA). Similarly, SDI estimates A such that X = A S and S is as sparse as possible. Although these problems look similar to BSS, NMF and SDI have no precise notion of the source signals or their identifiability.
2 The BSS Problem
From this point on, we assume that a flat representation of the mixture data is given, i.e., the mixture signals can be represented by a matrix containing a finite number of columns. Before presenting the formal definition of the BSS problem, consider the following notation that will be used throughout the paper: a scalar is denoted by a lowercase letter, such as y; a column vector is denoted by a bold lowercase letter, such as $\mathbf{y}$; and a matrix is denoted by a bold uppercase letter, such as $\mathbf{Y}$. For example, in this chapter, the mixtures are represented by the matrix X. The ith column of a matrix X is denoted by $\mathbf{x}_i$, the ith row by $\mathbf{x}_{i\bullet}$, and the element in the ith row and jth column by $x_{i,j}$.
Now, the BSS problem can be stated mathematically as follows: let $\mathbf{X} \in \mathbb{R}^{m \times t}$ be generated by a linear mixing of the sources $\mathbf{S} \in \mathbb{R}^{n \times t}$. Given X, the objective of the BSS problem is to find two matrices $\mathbf{A} \in \mathbb{R}^{m \times n}$ and S such that the three matrices are related as X = A S. In the rest of the paper, we ignore the noise factor. Although the problem may appear easy without noise, it can be seen from the very definition of the problem that the solution of the BSS problem suffers from uniqueness and identifiability issues. Thus, the notion of a "good" solution to the BSS problem must be precisely defined. In the remainder of this section, we briefly explain the uniqueness and identifiability issues.
Uniqueness:
Let $\mathbf{D} \in \mathbb{R}^{n \times n}$ and $\mathbf{P} \in \mathbb{R}^{n \times n}$ be a diagonal matrix and a permutation matrix, respectively. Let A and S be such that X = A S. Consider the following:

$\mathbf{X} = \mathbf{A}\mathbf{S} = \left(\mathbf{A}\mathbf{D}^{-1}\mathbf{P}^{-1}\right)\left(\mathbf{P}\mathbf{D}\mathbf{S}\right)$
Thus, even if A and S are known, there can be infinitely many equivalent solutions of the form $\left(\mathbf{A}\mathbf{D}^{-1}\mathbf{P}^{-1},\ \mathbf{P}\mathbf{D}\mathbf{S}\right)$. The goal of a good BSS algorithm should be to find at least one of these equivalent solutions. Due to the inability to find a unique solution, we lose not only the information regarding the order of the sources but also the information about the energy contained in the sources. Generally, normalization of the rows of S can be used to handle the scaling ambiguity, and a relative or normalized form of energy can be used in further analysis. Theoretically, any information pertaining to the order of the sources is impossible to recover; practically, however, problem-specific knowledge can help in identifying the correct order for further analysis.
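The scaling and permutation ambiguity can be verified numerically. The following sketch uses an arbitrary diagonal matrix D and permutation matrix P (chosen purely for illustration) and confirms that the rescaled, permuted pair reproduces the same mixtures X.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))                 # some mixing matrix
S = rng.laplace(size=(2, 500))              # some sources
X = A @ S                                   # mixtures

D = np.diag([2.0, -0.5])                    # arbitrary diagonal (scaling) matrix
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])                  # permutation matrix swapping the sources

A_equiv = A @ np.linalg.inv(D) @ np.linalg.inv(P)
S_equiv = P @ D @ S

# The rescaled and permuted pair reproduces exactly the same mixtures.
assert np.allclose(X, A_equiv @ S_equiv)
```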
Identifiability:
Let $\mathbf{B} \in \mathbb{R}^{n \times n}$ be any nonsingular matrix. Let A and S be such that X = A S. Consider the following:

$\mathbf{X} = \mathbf{A}\mathbf{S} = \left(\mathbf{A}\mathbf{B}^{-1}\right)\left(\mathbf{B}\mathbf{S}\right)$
Thus, even if A and S are known, there can be infinitely many non-identifiable solutions of the form $\left(\mathbf{A}\mathbf{B}^{-1},\ \mathbf{B}\mathbf{S}\right)$. The goal of a BSS algorithm is to avoid such non-identifiable solutions. Typically, the issue of identifiability arises from the dimension and structure of A and S. The key idea for correctly identifying both matrices (up to the unavoidable scaling and permutation ambiguity) is to impose structural properties on S while solving the BSS problem (see Fig. 3). Some widely known BSS solution approaches from the literature are summarized below:
–
Statistical Independence Assumptions: These approaches, collectively referred to as Independent Component Analysis (ICA), assume that the rows of S are statistically independent (see the discussion of ICA below).
–
Sparse Assumptions:
Apart from ICA, the other approaches that provide sufficient identifiability conditions are based on the notion of sparsity in the S matrix. These approaches are referred to as Sparse Component Analysis (SCA) approaches. There are two distinct categories of sparse assumptions:
1.
Partially Sparse Nonnegative Sources (PSNS): In this category, along with a certain level of sparsity, the elements of S are assumed to be nonnegative. Ideas of this type can be traced back to the Nonnegative Matrix Factorization (NMF) method. The basic assumption in NMF is that the elements of S (and A) are nonnegative (9). However, in the case of the BSS problem, the nonnegativity assumption on the elements of A can be relaxed (10) without damaging the identifiability of A and S.
2.
Completely Sparse Components (CSC):
In this category, no sign restrictions are placed on the elements of S, i.e., $s_{i,j} \in \mathbb{R}$. The only assumption used to define the identifiability conditions is the existence of a certain level of sparsity in every column of S (11).
Fig. 3: Overview of different approaches to solve the BSS problem
At present, these are the only known BSS approaches that provide sufficient identifiability conditions (uniqueness up to permutation and scaling). In fact, the sparsity-based approaches (see (4, 10)) are relatively new in the area of BSS compared to the traditional statistical independence approaches (see (8)). For neurological data, recent studies have shown the relevance of the sparsity assumption compared to the statistical independence assumption (12). In the following sections, a brief discussion of the important issues of the above-listed methods will be presented.
2.1 Statistically Independent Component Analysis
ICA is one of the oldest and most widely known solution approaches to the BSS problem. Herault and Jutten (1) proposed the idea of statistically independent components in sound signal processing. Since then, the method has been updated and modified in a number of ways (13) and applied to different areas of signal processing (8, 14, 15). The basic approach is to transform a given data matrix X into a matrix S whose rows are as statistically independent as possible. It would be out of the scope of this chapter to enumerate all possible methods to obtain such a transformation. Interested readers may refer to (13, 16, 17) for various approaches to ICA, and to some introductory-level texts and tutorials (18, 19) for the underlying mathematical concepts. Nevertheless, the most widely known ICA methods will be presented in this chapter. Next, the basic identifiability conditions for the successful application of any ICA approach are listed.
Sufficient Identifiability Conditions on A and S for ICA
ICA1
Not more than one row of S may follow a Gaussian distribution.
ICA assumes no knowledge of the distributions of the rows of S. However, it is typically assumed that the rows do not follow a Gaussian distribution. In practice, a single row with a Gaussian distribution is permissible for successful identification.
ICA2
m ≥ n.
Typically, most ICA algorithms are designed for m = n. Many algorithms have been designed for m > n, the overdetermined case (20), whereas very few have been developed for m < n, the underdetermined case (21). In the underdetermined case, only the mixing matrix is identifiable; the source matrix cannot be fully recovered.
ICA3
A has full rank.
This criterion is absolutely necessary for the successful identification of the sources.
Implication of the Identifiability Conditions of ICA
If one of the source signals (a single row of the S matrix) follows a Gaussian distribution, it is still possible to identify the sources. If more than one source follows a Gaussian distribution, then identification is not possible with higher-order statistics. However, second-order statistics methods can be used, which guarantee independence when all the sources are Gaussian. Therefore, it is assumed that the sources do not follow a Gaussian distribution. This fundamental assumption is used in order to exploit the central limit theorem in source identification. Thus, many classes of algorithms are designed to find a matrix W that maximizes the non-Gaussianity of $\mathbf{W}^T\mathbf{X}$ (equivalently, minimizes its Gaussianity).
A practical way to incorporate independence is to consider uncorrelatedness. Although uncorrelatedness does not imply independence, the converse is true. Thus, enforcing uncorrelatedness is one of the practical methods for incorporating independence. Another key implication of uncorrelatedness is the orthogonal nature of the mixing matrix A. Even if A is not orthonormal, a prewhitening method can be used to create orthonormal columns of A. This property, along with deflation (22, 23), is one of the criteria exploited in the sequential extraction of the sources.
Centering, Prewhitening, and Deflation
Typically, in any ICA-based approach, the basic preprocessing step is to convert the data in each row of X to zero mean, i.e., to use the transformation $\mathbf{x}_{i\bullet} \leftarrow \mathbf{x}_{i\bullet} - \operatorname{mean}(\mathbf{x}_{i\bullet})$. This process is called centering of X, and it in turn results in centering of S. Once A and S are recovered from the centered data, $\mathbf{s}_{i\bullet}$ can be updated as $\mathbf{s}_{i\bullet} \leftarrow \mathbf{s}_{i\bullet} + \left(\mathbf{A}^{-1}\bar{\mathbf{x}}\right)_i$, where $\bar{\mathbf{x}}$ is the vector of row means subtracted during centering, to recover the original un-centered sources.
In prewhitening, the idea is to convert A into an orthonormal matrix. This is done by an eigenvalue decomposition:

$\mathbf{X}\mathbf{X}^T = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$  (2)

where $\boldsymbol{\Lambda}$ is a square diagonal matrix whose elements are the eigenvalues of $\mathbf{X}\mathbf{X}^T$, and $\mathbf{Q}$ is a square orthonormal matrix of the eigenvectors of $\mathbf{X}\mathbf{X}^T$. Suppose $\mathbf{X}\mathbf{X}^T$ is positive definite, i.e., all the diagonal elements of $\boldsymbol{\Lambda}$ are positive. A transformation matrix $\mathbf{V}$ can then be defined as:

$\mathbf{V} = \boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T$  (3)
Now the BSS problem can be transformed as:

$\tilde{\mathbf{X}} = \mathbf{V}\mathbf{X} = \mathbf{V}\mathbf{A}\mathbf{S} = \tilde{\mathbf{A}}\mathbf{S}$  (4)

$\tilde{\mathbf{A}} = \mathbf{V}\mathbf{A}$  (5)
The above transformation reduces the ill-conditioning effect due to the mixing matrix. Since ICA assumes independence, the original sources are uncorrelated, i.e., $\mathbf{S}\mathbf{S}^T = \mathbf{I}$. Therefore, $\tilde{\mathbf{A}}$ is orthonormal, as shown below:

$\tilde{\mathbf{A}}\tilde{\mathbf{A}}^T = \tilde{\mathbf{A}}\mathbf{S}\mathbf{S}^T\tilde{\mathbf{A}}^T = \tilde{\mathbf{X}}\tilde{\mathbf{X}}^T$  (6)

$\tilde{\mathbf{X}}\tilde{\mathbf{X}}^T = \mathbf{V}\mathbf{X}\mathbf{X}^T\mathbf{V}^T = \boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T\left(\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T\right)\mathbf{Q}\boldsymbol{\Lambda}^{-1/2} = \mathbf{I}$  (7)
Prewhitening is not uniquely defined, as can be seen from the following example: let B be any orthonormal matrix. Then $\mathbf{B}\tilde{\mathbf{X}}$ is another prewhitened data matrix. Furthermore, eigenvalue decomposition is only one of the methods of prewhitening, and many other methods that do not use an eigenvalue decomposition can be designed.
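As an illustration, the following NumPy sketch performs centering followed by eigenvalue-decomposition-based prewhitening, as described above; the function name and interface are assumptions made for this example only.

```python
import numpy as np

def center_and_whiten(X):
    """Center each row of X and whiten the result via the eigenvalue decomposition of X X^T."""
    x_mean = X.mean(axis=1, keepdims=True)
    Xc = X - x_mean                              # centering: each row has zero mean

    eigvals, Q = np.linalg.eigh(Xc @ Xc.T)       # X X^T = Q diag(eigvals) Q^T, Eq. (2)
    V = np.diag(eigvals ** -0.5) @ Q.T           # whitening transform, Eq. (3)
    X_white = V @ Xc                             # transformed mixtures, Eq. (4)
    return X_white, V, x_mean

# After the transform, X_white @ X_white.T is (numerically) the identity matrix.
```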
Deflation, based on a Gram–Schmidt-like decorrelation scheme, is another technique often used in ICA approaches based on one-unit algorithms. The key idea is that the source signals can be extracted one by one. The idea can be briefly described as follows:
Let $\mathbf{w}_1 \in \mathbb{R}^{m}$ be a vector such that $\mathbf{s}_{1\bullet} = \mathbf{w}_1^T\mathbf{X}$. This means that $\mathbf{w}_1$ is proportional to one of the columns of A, say $\mathbf{a}_{i_1}$. Let $\mathbf{T} \in \mathbb{R}^{m \times (m-1)}$ be a transformation matrix formed by orthonormal columns spanning the orthogonal complement of $\mathbf{w}_1$.
Consider the following transformation of the data matrix:

$\mathbf{X}_1 = \mathbf{T}^T\mathbf{X}$  (8)

The matrix $\mathbf{X}_1 \in \mathbb{R}^{(m-1) \times t}$ is a new mixture data matrix which does not contain any trace of $\mathbf{s}_{1\bullet}$, i.e.,

$\mathbf{X}_1 = \mathbf{T}^T\mathbf{A}\mathbf{S} = \mathbf{A}_1\mathbf{S}$  (9)

where $\mathbf{A}_1 = \mathbf{T}^T\mathbf{A}$ is a new mixing matrix whose $i_1$th column consists of zeros, due to the orthonormal construction of T. This all-zeros column removes the trace of $\mathbf{s}_{1\bullet}$ in $\mathbf{X}_1$.
A particular transformation matrix, in which the columns span the orthogonal complement of $\mathbf{w}_1$, can be generated as:

$\mathbf{T} = \big[\,\mathbf{e}_1\ \cdots\ \mathbf{e}_{q-1}\ \ \mathbf{e}_{q+1}\ \cdots\ \mathbf{e}_m\,\big] - \frac{1}{w_{1,q}}\,\mathbf{e}_q\left(\mathbf{w}_1^{-q}\right)^{T}$  (10)

where $\mathbf{w}_1^{-q}$ represents $\mathbf{w}_1$ without the qth element, and $\mathbf{e}_i$ is a vector containing all zeros, except a 1 at the ith position. The qth element is selected such that $w_{1,q} \neq 0$. The above process can be repeated until the last source is extracted.
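A minimal sketch of one deflation step is given below, assuming the $\mathbf{e}_i$-based construction of T sketched above and that the extracted vector w has already been obtained (e.g., from prewhitened data); the helper name deflate is hypothetical.

```python
import numpy as np

def deflate(X, w):
    """Remove the contribution of the source extracted by w from the mixtures X.

    Builds a T whose columns span the orthogonal complement of w, following the
    e_i-based construction above, and returns X_1 = T^T X (one row fewer than X).
    """
    m = w.shape[0]
    q = int(np.argmax(np.abs(w)))                # choose q such that w[q] != 0
    cols = []
    for i in range(m):
        if i == q:
            continue
        t_i = np.zeros(m)
        t_i[i] = 1.0
        t_i[q] = -w[i] / w[q]                    # column e_i - (w_i / w_q) e_q, orthogonal to w
        cols.append(t_i)
    T = np.stack(cols, axis=1)                   # m x (m - 1) transformation matrix
    return T.T @ X                               # deflated mixtures, Eq. (8)
```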
Typical Objective Functions
Kurtosis Minimization (or Maximization)
Kurtosis is the classical measure of Gaussianity: for a Gaussian random variable, the kurtosis is zero. Usually, the absolute or squared value of the kurtosis is used to measure non-Gaussianity. However, it should be noted that there exist some non-Gaussian random variables with zero kurtosis; such variables are assumed to be very rare in practical scenarios. The basic structure of the kurtosis minimization (or maximization) problem can be written as:

$\min_{\mathbf{w}}\ (\text{or } \max_{\mathbf{w}})\quad \operatorname{kurt}\!\left(\mathbf{w}^T\mathbf{X}\right)$  (11a)

$\text{subject to}\quad \lVert\mathbf{w}\rVert = 1$  (11b)

where $\mathbf{w} \in \mathbb{R}^{m}$ is the only variable. The above minimization (or maximization) problem will result in obtaining one of the sources, $\mathbf{w}^T\mathbf{X}$. Kurtosis minimization (or maximization) is theoretically simple and computationally inexpensive. However, the kurtosis measure itself is non-robust (24), i.e., kurtosis is very sensitive to outliers. For a zero-mean random variable y, the kurtosis itself is given by

$\operatorname{kurt}(y) = E[y^4] - 3\left(E[y^2]\right)^2$  (11c)
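For illustration, the following sketch implements a simple gradient ascent on the absolute kurtosis of $\mathbf{w}^T\mathbf{X}$ over the unit sphere for prewhitened data; it is a didactic approximation of the kurtosis-based objective above, not one of the referenced algorithms, and the function names and step size are assumptions.

```python
import numpy as np

def kurt(y):
    """Kurtosis of a zero-mean sample: E[y^4] - 3 (E[y^2])^2."""
    return np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2

def one_unit_kurtosis(X_white, n_iter=200, step=0.1, seed=0):
    """Gradient ascent on |kurt(w^T X)| over the unit sphere (X_white is assumed prewhitened)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X_white.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ X_white
        # gradient of kurt(w^T X): 4 ( E[(w^T x)^3 x] - 3 E[(w^T x)^2] w ) for whitened x
        grad = 4.0 * (X_white @ (y ** 3) / y.size - 3.0 * np.mean(y ** 2) * w)
        w += step * np.sign(kurt(y)) * grad      # ascend the absolute kurtosis
        w /= np.linalg.norm(w)                   # enforce the constraint ||w|| = 1
    return w                                     # w @ X_white approximates one source
```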
Negentropy Minimization

From information theory, it is known that among all random variables of equal variance, a Gaussian random variable has the largest entropy (25). Using the idea of entropy, a related measure called negentropy, J, is defined as:

$J(r) = H(r) - H(z)$  (12)

where H is the entropy function of a random variable r, and z is a Gaussian random variable with the same covariance matrix as r. From Eq. (12), it can be seen that negentropy is zero for Gaussian variables and negative for all other variables. Negentropy (or entropy) is one of the optimal measures of non-Gaussianity with respect to statistical properties. However, the measure is computationally expensive due to the estimation of the probability distribution function. Therefore, a practical approach is to use approximate measures of negentropy. For example, in (8) a general structure of approximate negentropy is defined as:

$J(r) \approx \sum_{i} k_i \left( E[G_i(r)] - E[G_i(z)] \right)^2$  (13)

where $k_i$ is any positive weight, the $G_i$ are nonquadratic functions, and the random variables r and z have zero mean and unit variance. The generic function defined in Eq. (13) has a zero value for a Gaussian variable and a positive value for any other random variable. In addition, a suitable selection of $G_i$ will result in a robust measure, which is an added advantage apart from computational simplicity.
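A single-term version of the approximation in Eq. (13), with G(u) = log cosh(u) and k = 1 (both choices assumed here for illustration), can be sketched as follows; the Gaussian expectation is estimated by Monte Carlo sampling.

```python
import numpy as np

def approx_negentropy(y, n_gauss=100_000, seed=0):
    """Single-term approximation: J(y) ~ (E[G(y)] - E[G(z)])^2 with G(u) = log cosh(u)."""
    rng = np.random.default_rng(seed)
    y = (y - y.mean()) / y.std()                 # standardize: zero mean, unit variance
    z = rng.standard_normal(n_gauss)             # standard Gaussian reference sample
    G = lambda u: np.log(np.cosh(u))
    return (G(y).mean() - G(z).mean()) ** 2      # near zero for Gaussian y, positive otherwise
```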
Likelihood Maximization

Pham et al. (26, 27) defined the likelihood function as:

$L(\mathbf{W}) = \sum_{j=1}^{t} \sum_{i=1}^{n} \log f_i\!\left(\mathbf{w}_{i}^{T}\mathbf{x}_j\right) + t \log\left|\det \mathbf{W}\right|$  (14)

where $f_i$ is the probability density function of the component $\mathbf{s}_{i\bullet}$, and $\mathbf{w}_i^T$ is the ith row of the demixing matrix W. Practically, the definition given in Eq. (14) cannot be used directly, since the $f_i$ are unknown; thus, they are estimated as $\hat{f}_i$ from the data.
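The following sketch evaluates a log-likelihood of this form for a candidate demixing matrix W, assuming the common surrogate log-density log f(s) = −log cosh(s) + const in place of the unknown $f_i$; it illustrates Eq. (14) and is not the estimator of (26, 27).

```python
import numpy as np

def ica_log_likelihood(W, X):
    """Log-likelihood of a demixing matrix W for mixtures X (one sample per column).

    Uses the surrogate log-density log f(s) = -log cosh(s) + const for every component,
    since the true densities f_i are unknown in practice.
    """
    n_samples = X.shape[1]
    S_hat = W @ X                                          # estimated sources, rows w_i^T X
    log_f = -np.log(np.cosh(S_hat)).sum()                  # sum_j sum_i log f_i(w_i^T x_j)
    return log_f + n_samples * np.log(np.abs(np.linalg.det(W)))
```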
Mutual Information Minimization
A general measure of dependence between n random variables $\mathbf{r} = (r_1, r_2, \ldots, r_n)^T$ is the mutual information I, defined as:

$I(r_1, r_2, \ldots, r_n) = \sum_{i=1}^{n} H(r_i) - H(\mathbf{r})$  (15)

The key property of the mutual information measure is that it is zero for statistically independent variables and positive otherwise. Typically, while minimizing the mutual information, it is preferable to add a constraint that enforces uncorrelatedness between the rows of the estimated sources.
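As a simple illustration of this measure, the sketch below estimates the mutual information between two samples using histogram-based entropy estimates; the two-variable restriction and the bin count are simplifying assumptions made for the example.

```python
import numpy as np

def mutual_information_2d(r1, r2, bins=32):
    """Histogram estimate of I(r1; r2) = H(r1) + H(r2) - H(r1, r2)."""
    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    joint, _, _ = np.histogram2d(r1, r2, bins=bins)
    joint = joint / joint.sum()                    # empirical joint distribution
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)  # marginal distributions
    return entropy(p1) + entropy(p2) - entropy(joint)

# The estimate is close to zero for independent samples and positive otherwise.
```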
Typical ICA Approaches
The typical objective functions (described above) are minimized (or maximized) by different optimization algorithms. Here we discuss some well-known algorithms. Let us first recall one of the possible definitions of the cumulant for k given random variables $x_1, \ldots, x_k$:

$\operatorname{cum}(x_1, \ldots, x_k) = \sum_{\{P_1, \ldots, P_m\}} (-1)^{m-1}(m-1)!\; E\!\left[\prod_{i \in P_1} x_i\right] \cdots E\!\left[\prod_{i \in P_m} x_i\right]$

where the summation is taken over all possible partitions $\{P_1, \ldots, P_m\}$, $m = 1, \ldots, k$, of the set of natural numbers $\{1, \ldots, k\}$; the $P_j$ are disjoint subsets whose union is $\{1, \ldots, k\}$, and E is the expectation operator.
Of particular interest for us is the fourth-order cumulant,

$\operatorname{cum}(x_1, x_2, x_3, x_4) = E[x_1 x_2 x_3 x_4] - E[x_1 x_2]E[x_3 x_4] - E[x_1 x_3]E[x_2 x_4] - E[x_1 x_4]E[x_2 x_3]$

written here for zero-mean variables. When $x_1 = x_2 = x_3 = x_4 = x$, it is called the kurtosis of x:

$\operatorname{kurt}(x) = \operatorname{cum}(x, x, x, x) = E[x^4] - 3\left(E[x^2]\right)^2$
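A direct sample estimate of the fourth-order cumulant for centered (zero-mean) signals can be written as follows; the function name is assumed for illustration.

```python
import numpy as np

def fourth_order_cumulant(x1, x2, x3, x4):
    """Sample estimate of cum(x1, x2, x3, x4) for zero-mean (centered) signals."""
    E = lambda a: np.mean(a)
    return (E(x1 * x2 * x3 * x4)
            - E(x1 * x2) * E(x3 * x4)
            - E(x1 * x3) * E(x2 * x4)
            - E(x1 * x4) * E(x2 * x3))

# Setting all four arguments to the same signal x gives the kurtosis of x.
```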
Define a fourth-order cumulant matrix of the sensor signals as follows (see (28)):

$\left[\mathbf{C}_{\mathbf{x}}(\mathbf{M}, p)\right]_{i,j} = \sum_{k,l} \operatorname{cum}\!\left(x_i, x_j, x_{p,k}, x_{p,l}\right) m_{k,l}$

where $\mathbf{M} = (m_{k,l})$ is a parameter matrix, $\mathbf{x}_p$ is the vector of delayed signals (with delay p), and E denotes the mathematical expectation used in the cumulants.