Fig. 1
Vision-based hand gesture recognition system architecture
In the following sections we will describe the problems of hand posture classification and dynamic gesture classification.
2.1 Hand Posture Classification
For hand posture classification, hand segmentation and feature extraction are crucial steps in vision-based hand gesture recognition systems. The pre-processing stage prepares the input image and extracts the features used later by the classification algorithms [36]. The proposed system uses feature vectors composed of centroid distance values for hand posture classification. The centroid distance signature is a type of shape signature [36] expressed by the distance from the hand contour boundary points to the hand centroid (xc, yc), and is calculated in the following manner:
d(n) = √[(x(n) − x_c)² + (y(n) − y_c)²],  n = 0, 1, …, N − 1  (1)
In this way, a one-dimensional function representing the hand shape is obtained. The number of equally spaced points N used in the implementation was 16. Due to the subtraction of the centroid from the boundary coordinates, this operator is invariant to translation, as shown by Rayi Yanu Tara [32], and a rotation of the hand results in a circularly shifted version of the original signature. All feature vectors are z-normalized prior to training, by subtracting their mean and dividing by their standard deviation [1, 23] as follows,
x̂_i = (x_i − μ_i) / σ_i  (2)
where μ_i is the mean of instance i and σ_i is the respective standard deviation, achieving the desired scale invariance. The vectors thus obtained have zero mean and a standard deviation of 1. The resulting feature vectors are used to train a multi-class Support Vector Machine (SVM) that learns the set of hand postures shown in Fig. 2, used in the Referee Command Language Interface System (ReCLIS), and the hand postures shown in Fig. 3, used with the Sign Language Recognition System. The SVM is a supervised machine-learning technique for pattern recognition that works very well with high-dimensional data. SVMs select a small number of boundary feature vectors, the support vectors, from each class and build a linear discriminant function that separates the classes as widely as possible (Fig. 4), the maximum-margin hyperplane [40]. Maximum-margin hyperplanes have the advantage of being relatively stable, i.e., they only move if training instances that are support vectors are added or deleted. SVMs are non-probabilistic classifiers that predict, for each given input, the corresponding class. When more than two classes are present, there are several approaches that revolve around the two-class case [33]. The one used in the system is one-against-all, in which c classifiers have to be designed, each one trained to separate one class from the rest.
Fig. 2
The defined and trained hand postures
Fig. 3
Manual alphabet for the Portuguese Language
2.1.1 Model Training
For feature extraction, model learning and testing, a C++ application was built with openFrameworks [18], OpenCV [4], OpenNI [27] and the Dlib machine-learning library [15]. OpenCV was used for some of the vision-based operations, like hand segmentation and contour extraction, and OpenNI was responsible for the RGB and depth image acquisition. Figure 5 shows the main user interface of the application, with a sample feature vector for the posture being learned displayed below the RGB image.
Fig. 5
Static gesture feature extraction and model learning user interface
Two centroid distance datasets were built: the first for the first seven hand postures defined, with 7848 records, and the second for the Portuguese Sign Language vowels, with a total of 2170 records, obtained from four users. The features thus obtained were analysed with the help of RapidMiner (Miner) in order to find the best kernel in terms of SVM classification for the datasets under study. The best kernel obtained with a parameter optimization process was the linear kernel with a cost parameter C equal to 1. With these values, the final achieved accuracy was 99.4%.
In order to analyse how classification errors were distributed among classes, a confusion matrix for the two hand posture datasets was computed with the final results shown in Tables 1 and 2.
Table 1
Confusion matrix for the seven hand postures trained
| Predicted class \ Actual class | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1 | 602 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 712 | 0 | 0 | 1 | 0 | 1 |
| 3 | 0 | 1 | 578 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 12 | 715 | 3 | 0 | 0 |
| 5 | 0 | 1 | 1 | 13 | 542 | 1 | 3 |
| 6 | 1 | 2 | 0 | 1 | 5 | 701 | 12 |
| 7 | 0 | 0 | 0 | 2 | 0 | 1 | 751 |
Table 2
Confusion matrix for the Portuguese Sign Language vowels
| Predicted class \ Actual class | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 455 | 0 | 0 | 2 | 0 |
| 2 | 0 | 394 | 1 | 1 | 0 |
| 3 | 0 | 0 | 401 | 1 | 0 |
| 4 | 4 | 2 | 0 | 382 | 0 |
| 5 | 0 | 0 | 1 | 0 | 439 |
2.2 Dynamic Gesture Classification
Dynamic gestures are time-varying processes which show statistical variations, making Hidden Markov Models (HMMs) a plausible choice for modelling them [29, 42]. A Markov model is a typical model for a stochastic (i.e. random) sequence of a finite number of states [10]. When the true states of the model are hidden, in the sense that they cannot be directly observed, the Markov model is called a Hidden Markov Model (HMM). At each state an output symbol is emitted with some probability, and the model transitions to another state with some probability, as shown in Fig. 7. With a discrete number of states and output symbols, this model is sometimes called a "discrete HMM", and the set of output symbols is called the alphabet. In summary, an HMM has the following elements:
N: the number of states in the model;
M: the number of distinct symbols in the alphabet;
State transition probabilities: A = {a_ij}, where a_ij is the probability of moving from state i to state j;
Observation probabilities: B = {b_j(k)}, where b_j(k) is the probability of emitting symbol v_k in state j;
Initial state probabilities: π = {π_i}, where π_i is the probability of starting in state i;
and the model is compactly defined as λ = (A, B, π), where N and M are implicitly defined in the other parameters. The transition probabilities and the observation probabilities are learned during the training phase, with known data, which makes this a supervised learning problem [36].
In this sense, a human gesture can be understood as an HMM whose true states are hidden, in the sense that they cannot be directly observed. Thus, for the recognition of dynamic gestures, an HMM was trained for each possible gesture. HMMs have been widely and successfully used in speech recognition and handwriting recognition [28]. In the implemented system, the 2D hand trajectory points are labelled according to the distance to the nearest centroid, based on the Euclidean distance. The resulting vector is then translated to the origin, resulting in a discrete feature vector like the one shown in Fig. 6.
Fig. 6
Gesture path with respective feature vector