Gesture Recognition System Based in Computer Vision and Machine Learning



Fig. 1
Vision-based hand gesture recognition system architecture



In the following sections we will describe the problems of hand posture classification and dynamic gesture classification.


2.1 Hand Posture Classification


For hand posture classification, hand segmentation and feature extraction are crucial steps in vision-based hand gesture recognition systems. The pre-processing stage prepares the input image and extracts the features later used by the classification algorithms [36]. The proposed system uses feature vectors composed of centroid distance values for hand posture classification. The centroid distance signature is a type of shape signature [36] expressed by the distance of the hand contour boundary points from the hand centroid $$\left( {{x}_{c}},{{y}_{c}} \right)$$, and is calculated in the following manner:





$$d\left( i \right)=\sqrt{{{\left( {{x}_{i}}-{{x}_{c}} \right)}^{2}}+{{\left( {{y}_{i}}-{{y}_{c}} \right)}^{2}}},\quad i=0,\ldots ,N-1$$

(1)

This way, a one-dimensional function representing the hand shape is obtained. The number of equally spaced points N used in the implementation was 16. Due to the subtraction of the centroid from the boundary coordinates, this operator is invariant to translation, as shown by Tara et al. [32], and a rotation of the hand results in a circularly shifted version of the original signature. All the feature vectors are normalized with z-normalization prior to training, by subtracting their mean and dividing by their standard deviation [1, 23], as follows,





$$Z=\left( {{a}_{ij}}-\bar{a} \right)/\sigma $$

(2)

where 
$$\bar{a}$$
is the mean of instance i and σ is the corresponding standard deviation, achieving the desired scale invariance. The vectors thus obtained have zero mean and a standard deviation of 1.

The resulting feature vectors are used to train a multi-class Support Vector Machine (SVM), which learns the set of hand postures shown in Fig. 2, used in the Referee Command Language Interface System (ReCLIS), and the hand postures shown in Fig. 3, used in the Sign Language Recognition System. The SVM is a supervised machine-learning technique for pattern recognition that works very well with high-dimensional data. SVMs select a small number of boundary feature vectors, called support vectors, from each class and build a linear discriminant function that separates the classes as widely as possible (Fig. 4), the maximum-margin hyperplane [40]. Maximum-margin hyperplanes have the advantage of being relatively stable: they only move if training instances that are support vectors are added or deleted. SVMs are non-probabilistic classifiers that predict, for each given input, the corresponding class. When more than two classes are present, there are several approaches that revolve around the two-class case [33]. The one used in the system is one-against-all, where c classifiers have to be designed, each one separating one class from the rest.
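As a concrete illustration of Eqs. (1) and (2), the following minimal C++ sketch computes the 16-point centroid distance signature from a hand contour and z-normalizes it. The function name and the equally-spaced sampling scheme are illustrative assumptions; the contour and centroid are assumed to come from the segmentation stage.

```cpp
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

// Centroid distance signature (Eq. 1) sampled at N equally spaced boundary
// points, followed by z-normalization (Eq. 2): zero mean, unit std deviation.
// The contour is assumed non-empty.
std::vector<double> centroidDistanceFeatures(const std::vector<cv::Point>& contour,
                                             const cv::Point2d& centroid,
                                             int N = 16)
{
    std::vector<double> d(N);
    for (int i = 0; i < N; ++i) {
        // pick N equally spaced points along the hand boundary (assumption)
        const cv::Point& p = contour[i * contour.size() / N];
        d[i] = std::sqrt((p.x - centroid.x) * (p.x - centroid.x) +
                         (p.y - centroid.y) * (p.y - centroid.y));   // Eq. (1)
    }

    // Eq. (2): subtract the mean and divide by the standard deviation
    double mean = 0.0, sd = 0.0;
    for (double v : d) mean += v;
    mean /= N;
    for (double v : d) sd += (v - mean) * (v - mean);
    sd = std::sqrt(sd / N);
    for (double& v : d) v = (v - mean) / sd;
    return d;
}
```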





Fig. 2
The defined and trained hand postures





Fig. 3
Manual alphabet for the Portuguese Sign Language





Fig. 4
SVM: support vectors representation with maximum-margin hyperplane [31]


2.1.1 Model Training


For feature extraction, model learning and testing, a C++ application was built with openFrameworks [18], OpenCV [4], OpenNI [27] and the Dlib machine-learning library [15]. OpenCV was used for some of the vision-based operations, such as hand segmentation and contour extraction, and OpenNI was responsible for the RGB and depth image acquisition. Figure 5 shows the main user interface of the application, with a sample feature vector for the posture being learned displayed below the RGB image.
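As a rough sketch of the OpenCV part of this pipeline, assuming a binary hand mask has already been produced by the depth-based segmentation, the largest external contour can be taken as the hand and its centroid obtained from image moments (the function name and the largest-contour heuristic are assumptions, not the system's exact code):

```cpp
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <vector>

// Extract the hand contour and its centroid from a binary segmentation mask
// (OpenCV >= 3.2). Sketch: the largest external contour is assumed to be the
// hand, and the centroid follows from the spatial image moments.
bool handContourAndCentroid(const cv::Mat& mask,
                            std::vector<cv::Point>& hand,
                            cv::Point2d& centroid)
{
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);
    if (contours.empty())
        return false;

    hand = *std::max_element(contours.begin(), contours.end(),
        [](const std::vector<cv::Point>& a, const std::vector<cv::Point>& b)
        { return cv::contourArea(a) < cv::contourArea(b); });

    const cv::Moments m = cv::moments(hand);
    if (m.m00 == 0.0)
        return false;                                     // degenerate contour
    centroid = cv::Point2d(m.m10 / m.m00, m.m01 / m.m00); // (x_c, y_c)
    return true;
}
```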





Fig. 5
Static gesture feature extraction and model learning user interface

Two centroid distance datasets were built: the first for the first seven hand postures defined, with 7848 records, and the second for the Portuguese Sign Language vowels, with a total of 2170 records obtained from four users. The features thus obtained were analysed with RapidMiner in order to find the best kernel, in terms of SVM classification, for the datasets under study. The best kernel obtained with a parameter optimization process was the linear kernel with a cost parameter C equal to one. With these values, the final achieved accuracy was 99.4%.
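With Dlib handling the model learning, the training step described above can be sketched as a linear-kernel C-SVM (C = 1, per the optimization result) wrapped in Dlib's one-vs-all trainer. A minimal sketch, with the dataset loading elided and container names illustrative:

```cpp
#include <dlib/svm.h>
#include <vector>

using sample_type = dlib::matrix<double, 16, 1>;      // 16 centroid distances
using kernel_type = dlib::linear_kernel<sample_type>;

int main()
{
    std::vector<sample_type> samples;   // z-normalized feature vectors
    std::vector<double> labels;         // posture labels, e.g. 1..7
    // ... load the centroid distance dataset here (elided) ...
    if (samples.empty())
        return 0;

    // linear-kernel C-SVM with cost parameter C = 1 (Sect. 2.1.1)
    dlib::svm_c_trainer<kernel_type> base_trainer;
    base_trainer.set_c(1);

    // one-against-all: one binary classifier per posture class
    dlib::one_vs_all_trainer<dlib::any_trainer<sample_type>> trainer;
    trainer.set_trainer(base_trainer);

    auto df = trainer.train(samples, labels);  // multi-class decision function
    // double predicted = df(some_test_sample);
    return 0;
}
```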

In order to analyse how classification errors were distributed among classes, a confusion matrix was computed for each of the two hand posture datasets, with the final results shown in Tables 1 and 2.




Table 1
Confusion matrix for the seven hand postures trained

Actual class    Predicted class
                  1      2      3      4      5      6      7
1               602      0      0      0      0      0      0
2                 2    712      0      0      1      0      1
3                 0      1    578      1      0      0      0
4                 0      0     12    715      3      0      0
5                 0      1      1     13    542      1      3
6                 1      2      0      1      5    701     12
7                 0      0      0      2      0      1    751



Table 2
Confusion matrix for the Portuguese Sign Language vowels

Actual class    Predicted class
                  1      2      3      4      5
1               455      0      0      2      0
2                 0    394      1      1      0
3                 0      0    401      1      0
4                 4      2      0    382      0
5                 0      0      1      0    439


2.2 Dynamic Gesture Classification


Dynamic gestures are time-varying processes that show statistical variations, making Hidden Markov Models (HMMs) a plausible choice for modelling them [29, 42]. A Markov model is a typical model for a stochastic (i.e. random) sequence over a finite number of states [10]. When the true states of the model 
$$S=\left\{ {{s}_{1}},{{s}_{2}},{{s}_{3}},\ldots ,{{s}_{N}} \right\}$$
are hidden, in the sense that they cannot be directly observed, the Markov model is called a Hidden Markov Model (HMM). At each state an output symbol from 
$$O=\left\{ {{o}_{1}},{{o}_{2}},{{o}_{3}},\ldots ,{{o}_{M}} \right\}$$
is emitted with some probability, and the state transitions to another state with some probability, as shown in Fig. 7. With a discrete number of states and output symbols, this model is sometimes called a "discrete HMM", and the set of output symbols is called the alphabet. In summary, an HMM has the following elements:





  • N: the number of states in the model 
$$S=\left\{ {{S}_{1}},{{S}_{2}},\ldots ,{{S}_{N}} \right\}$$
;


  • M: the number of distinct symbols in the alphabet 
$$V=\left\{ {{v}_{1}},{{v}_{2}},\ldots ,{{v}_{M}} \right\}$$
;


  • State transition probabilities:





$$A=\left[ {{a}_{ij}} \right]\quad \text{where}\quad {{a}_{ij}}\equiv P\left( {{q}_{t+1}}={{S}_{j}}\mid {{q}_{t}}={{S}_{i}} \right)\quad \text{and}\quad {{q}_{t}}\ \text{is the state at time}\ t$$





  • Observation probabilities:





$$B=\left\{ {{b}_{j}}\left( m \right) \right\}\quad \text{where}\quad {{b}_{j}}\left( m \right)\equiv P\left( {{O}_{t}}={{v}_{m}}\mid {{q}_{t}}={{S}_{j}} \right)\quad \text{and}\quad O\ \text{is the observation sequence}$$





  • Initial state probabilities: 
$$\Pi =\left[ {{\pi }_{i}} \right]\quad \text{where}\quad {{\pi }_{i}}\equiv P\left( {{q}_{1}}={{S}_{i}} \right)$$
.

An HMM is thus defined as 
$$\lambda =\left( A,~B,~\Pi \right)$$
, where N and M are implicitly defined in the other parameters. The transition probabilities and the observation probabilities are learned during the training phase from known data, which makes this a supervised learning problem [36].
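Once each gesture has its own trained model, recognition reduces to evaluating P(O | λ) under every model and picking the most likely one. The standard way to compute this likelihood for a discrete HMM is the forward algorithm; the sketch below is a minimal illustration under assumed matrix layouts (the text does not prescribe an implementation), and a production version would rescale or work in log-space to avoid numerical underflow.

```cpp
#include <vector>

// Forward algorithm: P(O | lambda) for a discrete HMM lambda = (A, B, Pi).
// A: N x N state transition matrix, B: N x M observation matrix,
// Pi: N initial state probabilities, O: observed symbol indices.
double forwardLikelihood(const std::vector<std::vector<double>>& A,
                         const std::vector<std::vector<double>>& B,
                         const std::vector<double>& Pi,
                         const std::vector<int>& O)
{
    const std::size_t N = Pi.size();
    if (O.empty()) return 1.0;             // empty sequence: trivial likelihood

    // initialization: alpha_1(i) = pi_i * b_i(o_1)
    std::vector<double> alpha(N);
    for (std::size_t i = 0; i < N; ++i)
        alpha[i] = Pi[i] * B[i][O[0]];

    // induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for (std::size_t t = 1; t < O.size(); ++t) {
        std::vector<double> next(N, 0.0);
        for (std::size_t j = 0; j < N; ++j) {
            double s = 0.0;
            for (std::size_t i = 0; i < N; ++i)
                s += alpha[i] * A[i][j];
            next[j] = s * B[j][O[t]];
        }
        alpha.swap(next);
    }

    // termination: P(O | lambda) = sum_i alpha_T(i)
    double p = 0.0;
    for (double a : alpha) p += a;
    return p;
}
```

Classification then evaluates this likelihood under every gesture's trained model and returns the gesture with the maximum value.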

In this sense, a human gesture can be understood as an HMM whose true states are hidden, in the sense that they cannot be directly observed. Therefore, for the recognition of dynamic gestures, an HMM was trained for each possible gesture. HMMs have been used widely and successfully in speech recognition and handwriting recognition [28]. In the implemented system, the 2D hand trajectory points are labelled according to the distance to the nearest centroid, based on the Euclidean distance. The resulting vector is then translated to the origin, resulting in a discrete feature vector like the one shown in Fig. 6.
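A minimal sketch of this quantization step, under two stated assumptions: the 2D codebook of centroids is given (e.g. learned offline with k-means over training trajectories), and the translation to the origin is applied to the raw points before labelling. Names and types are illustrative, not the system's actual code.

```cpp
#include <cmath>
#include <vector>

struct Pt { double x, y; };   // 2D hand trajectory point

// Turn a 2D hand trajectory into a discrete HMM observation sequence:
// translate the trajectory to the origin, then replace each point by the
// index of its nearest codebook centroid (Euclidean distance).
std::vector<int> quantizeTrajectory(std::vector<Pt> traj,
                                    const std::vector<Pt>& codebook)
{
    std::vector<int> symbols;
    if (traj.empty() || codebook.empty()) return symbols;

    const Pt origin = traj.front();
    for (Pt& p : traj) { p.x -= origin.x; p.y -= origin.y; }  // translate

    symbols.reserve(traj.size());
    for (const Pt& p : traj) {
        int best = 0;
        double bestDist = std::hypot(p.x - codebook[0].x, p.y - codebook[0].y);
        for (std::size_t k = 1; k < codebook.size(); ++k) {
            const double dist = std::hypot(p.x - codebook[k].x,
                                           p.y - codebook[k].y);
            if (dist < bestDist) { bestDist = dist; best = static_cast<int>(k); }
        }
        symbols.push_back(best);   // discrete observation symbol
    }
    return symbols;
}
```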
