To build an AdaBoost classifier, two sets of images must be chosen: the positive set, which contains the object one wants to map, and the negative set, which contains other objects. After defining these two groups of images, three tools provided by the OpenCV library [2] are used: Objectmarker, CreateSamples, and Traincascade.
Objectmarker is responsible for marking the positive images of the objects of interest, creating a text file containing each image name and the coordinates of the marked region. This text file is converted into a vector by the CreateSamples tool, which normalizes brightness and lighting and scales the marked windows to a uniform size as they are cropped from the positive images. The default size chosen for the images in this chapter is 20 by 20 pixels. The greater the number of images and the greater their variation in illumination, reflection, background, scale, rotation, etc. at this step, the more accurate the resulting classifier.
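The annotation file produced in this step typically lists, for each positive image, the number of marked objects followed by the bounding-box coordinates (x, y, width, height). The file names and values below are purely illustrative:

positives/img001.jpg 1 120 85 60 60
positives/img002.jpg 2 40 30 55 55 200 110 50 50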
According to reference [9], each stage of the cascade should be independent of the others, which allows building a simple tree. When the accuracy of the classifier needs to be increased, more images or more stages must be added to the tree. Many references, such as [13, 14, 22, 23], suggest that about 10,000 images are necessary to reach an accurate classifier.
This project used 2,000 images acquired with image-capture software written in Java. This number was defined empirically by tuning the number of images used to build the tree. The process started with 500 images, and it was found that, as the number of images increased, each stage became stronger, progressively improving the classifier.
Another relevant aspect that must be observed is the resolution of the images used. While the literature indicates the use of images with dimensions of 640 × 480 pixels, this study used images with a resolution of 320 × 240 pixels, obtained from a camera with a native resolution of 12 megapixels. This greatly increased the number of features perceived at each stage and yielded performance far superior to that obtained with 640 × 480 images.
Finally, after these two steps, the vector of positive images and the folder containing the negative images are submitted to the Traincascade algorithm, which performs the training and creates the cascade of classifiers. This algorithm compares the positive images against the negative images, used as background, attempting to find edges and other features [17]. This is the most time-intensive step to execute, so it was important to monitor the estimates displayed on the screen and judge whether the classifier would be effective based on the hit and false-alarm rates at each stage. Reference [9] indicates that at least 14 stages are needed before the recognition of an object can begin.
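As an illustration only, and assuming the standard OpenCV command-line tools are used for these steps, the vector creation and training might be invoked as follows; the file names, sample counts, and stage count are placeholders chosen to match the values discussed above:

opencv_createsamples -info positives.txt -vec samples.vec -num 2000 -w 20 -h 20
opencv_traincascade -data cascade/ -vec samples.vec -bg negatives.txt -numPos 1800 -numNeg 1000 -numStages 14 -featureType HAAR -w 20 -h 20

The -numPos value is usually set somewhat below the total number of samples in the vector, so that later stages can draw replacement samples as earlier stages reject some of them.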
The Traincascade algorithm trains the classifier with the submitted samples and generates a cascade using Haar-type features. Beyond the importance of determining texture, detecting the shape of an object is a recurring problem in machine vision. References [9, 11] proposed the use of rectangular features, known as Haar-like features, rather than color intensities, to improve the inference of an object's shape and increase the accuracy of the classifier, building on a concept called the integral image. From the integral image it is possible to calculate the sum of values in any rectangular region in constant time, simplifying and speeding up feature extraction in image processing.
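The following is a minimal Java sketch of the integral-image idea, written for illustration rather than taken from the chapter's implementation: each entry stores the sum of all pixels above and to the left, so the sum over any rectangle reduces to four table lookups, and a Haar-like feature is simply the difference between two such rectangle sums.

// Integral image over a grayscale image stored as int[rows][cols].
static long[][] integralImage(int[][] img) {
    int rows = img.length, cols = img[0].length;
    long[][] sum = new long[rows + 1][cols + 1];   // extra zero row and column
    for (int y = 1; y <= rows; y++) {
        for (int x = 1; x <= cols; x++) {
            sum[y][x] = img[y - 1][x - 1]
                      + sum[y - 1][x] + sum[y][x - 1] - sum[y - 1][x - 1];
        }
    }
    return sum;
}

// Sum of the rectangle with top-left corner (x, y), width w and height h, in constant time.
static long rectSum(long[][] sum, int x, int y, int w, int h) {
    return sum[y + h][x + w] - sum[y][x + w] - sum[y + h][x] + sum[y][x];
}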
An image is composed of pixels containing the intensities of its color layers, ranging from 0 (darker) to 255 (lighter) for each color channel. The most widely used color systems have three components, such as RGB and HSV [21]. These representations require greater computational effort and more storage space than binary ones. Thus, the use of binary vision systems, when proven adequate, allows much faster processing and a more compact representation. Such images are extremely important for real-time applications, in which feature extraction must be performed quickly so that the results can be delivered to the recognition algorithm. In general, binary vision systems are useful in cases where the contour contains enough information to recognize objects even in environments with uneven lighting. The vision system typically uses a binary threshold to separate objects from the background. The appropriate value of this threshold depends on the lighting and on the reflective characteristics of the objects. Effective object-background separation requires that the object and background have sufficient contrast and that the intensity levels of both are known [21].
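As a hedged sketch of this binarization step, the fragment below uses the OpenCV Java bindings to convert a color frame to grayscale and apply a fixed binary threshold; the threshold value passed in is an assumption and, as noted above, depends in practice on the lighting and reflectance conditions.

import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

// Illustrative binarization: the threshold depends on lighting and object reflectance.
static Mat binarize(Mat colorFrame, double threshold) {
    Mat gray = new Mat();
    Imgproc.cvtColor(colorFrame, gray, Imgproc.COLOR_BGR2GRAY);
    Mat binary = new Mat();
    Imgproc.threshold(gray, binary, threshold, 255, Imgproc.THRESH_BINARY);
    return binary;
}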
In order to create an integral image, reference [9] used binarized images to simplify the description of the features. The result of the cascade training process is saved in an Extensible Markup Language (XML) file.
3.2 Image Processing
A software module was developed to enable the camera to capture images, process them, and submit them to the classifier. Recognition of multiple gestures was achieved through the use of threads.
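The module itself is not listed in this chapter; the sketch below is a hypothetical Java example, assuming the OpenCV 3.x Java bindings, of how a frame could be captured and submitted to a trained cascade (names such as GestureDetector and gesture_cascade.xml are assumptions). One such detector could be run per gesture, each in its own thread, as described above.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfRect;
import org.opencv.core.Rect;
import org.opencv.imgproc.Imgproc;
import org.opencv.objdetect.CascadeClassifier;
import org.opencv.videoio.VideoCapture;

public class GestureDetector {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        // Cascade produced by Traincascade (XML file).
        CascadeClassifier cascade = new CascadeClassifier("gesture_cascade.xml");
        VideoCapture camera = new VideoCapture(0);
        Mat frame = new Mat();
        Mat gray = new Mat();
        while (camera.read(frame)) {
            Imgproc.cvtColor(frame, gray, Imgproc.COLOR_BGR2GRAY);
            MatOfRect detections = new MatOfRect();
            cascade.detectMultiScale(gray, detections);   // run the cascade on the frame
            for (Rect r : detections.toArray()) {
                System.out.println("Gesture candidate at " + r);
            }
        }
        camera.release();
    }
}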
To increase the possibility of using the algorithm in different environments, several image-processing methods were used to minimize noise and to remove elements that do not belong to the gestures mapped by the classifiers. Overall, the technical literature divides an object and gesture recognition system into four parts [7]: pre-processing, segmentation, feature extraction, and statistical classification. The following sub-sections describe the main features of each of them.
3.2.1 Pre-Processing
System calibration tasks, geometric distortion correction, and noise removal take place in the pre-processing stage. One of the concerns in pre-processing is the removal of noise caused by many factors, such as the resolution of the equipment used, lighting, the distance of the object or gesture from the camera, etc. Salt-and-pepper noise often appears in the images. The white pixels scattered across the image, called salt noise, are pixels of high value surrounded by low-value pixels; pepper noise is the opposite situation. There are two ways to treat this noise: applying morphological transformations, or applying Gaussian smoothing to approximate the values of neighboring pixels and reduce the visibility of the noise.
In a digital image represented on a grid, a pixel shares a common border with four pixels and a common corner with four additional pixels. Two pixels are said to be 4-neighbors if they share a common border, and 8-neighbors if they share at least one corner [8, 10]. For example, the 4-neighbors of a pixel at location [i, j] are [i + 1, j], [i - 1, j], [i, j + 1], and [i, j - 1]. Its 8-neighbors include these four nearest neighbors plus [i + 1, j + 1], [i + 1, j - 1], [i - 1, j + 1], and [i - 1, j - 1]. Figure 2 shows how the 4-neighbors and 8-neighbors of a pixel are arranged (Fig. 2).
Fig. 2
Pixel neighborhood
The morphological operations used in this study were erosion, which removes pixels that do not meet the minimum requirements of the neighborhood, and dilation, which adds pixels back into the image produced by the erosion, also according to a predetermined neighborhood. After applying the morphological transformations, a smoothing operation takes place. This transformation approximates the values of neighboring pixels, attempting to blur or filter out the noise and other fine-scale or dispersed structures. The model used in this project was the 3 × 3 Gaussian blur, also known as Gaussian smoothing. The visual effect of this technique is a soft blur, similar to viewing the image through a translucent screen.
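A sketch of this erosion, dilation, and 3 × 3 Gaussian smoothing sequence, using the OpenCV Java bindings; the 3 × 3 rectangular structuring element is an assumed choice.

import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;

// Reduce salt-and-pepper noise: erode, dilate back, then apply a 3 x 3 Gaussian blur.
static Mat denoise(Mat input) {
    Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    Mat eroded = new Mat();
    Imgproc.erode(input, eroded, kernel);     // remove isolated high-value pixels (salt)
    Mat dilated = new Mat();
    Imgproc.dilate(eroded, dilated, kernel);  // restore pixels removed from real objects
    Mat smoothed = new Mat();
    Imgproc.GaussianBlur(dilated, smoothed, new Size(3, 3), 0);  // Gaussian smoothing
    return smoothed;
}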
3.2.2 Segmentation
Image segmentation consists of the extraction and identification of the objects of interest contained in the image, where an object is any region with semantic content relevant to the desired application. After segmentation, each object is described by its geometric and topological properties; for example, attributes such as area, shape, and texture can be extracted and used later in the analysis process. Image segmentation can be performed from the basic properties of gray-level values, by detecting discontinuities or similarities. The discontinuities may be points, lines, or edges, to which a mask can be applied to highlight the type of discontinuity present.
Afterwards, filters are used to detect similarities and merge them into edges. The first filter used was Sobel, an operator that calculates finite differences, giving an approximation of the intensity gradient of the image pixels. The second filter used was Canny [3], which smooths the noise and finds edges by combining a differential operator with a Gaussian filter.
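A sketch of the two edge filters with the OpenCV Java bindings; the Canny hysteresis thresholds (50 and 150) are illustrative assumptions.

import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

// Approximate the intensity gradient with Sobel, then trace edges with Canny.
static Mat detectEdges(Mat gray) {
    Mat gradX = new Mat(), gradY = new Mat();
    Imgproc.Sobel(gray, gradX, CvType.CV_16S, 1, 0);   // horizontal finite differences
    Imgproc.Sobel(gray, gradY, CvType.CV_16S, 0, 1);   // vertical finite differences
    Mat absX = new Mat(), absY = new Mat();
    Core.convertScaleAbs(gradX, absX);
    Core.convertScaleAbs(gradY, absY);
    Mat gradient = new Mat();
    Core.addWeighted(absX, 0.5, absY, 0.5, 0, gradient);  // Sobel gradient magnitude (approx.)

    Mat edges = new Mat();
    Imgproc.Canny(gray, edges, 50, 150);               // Gaussian smoothing plus edge tracing
    return edges;
}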
3.2.3 Feature Extraction
A feature extractor is used to reduce the space of significant image elements; that is, it facilitates the classification process and is often applied not only for the recognition of objects but also to group similar characteristics together in the image segmentation process [24]. Feature extraction is therefore a way to achieve dimensionality reduction. This task is especially important in real-time applications, because they receive a stream of input data that must be processed immediately. Usually, there is a high degree of redundancy in such a data stream (much data with repeated information), which needs to be reduced to a set of representative features. If the extracted features are carefully chosen, this set is expected to carry the information relevant to performing the task. The steps taken here to sift the significant pixels from the data stream were motion detection and skin detection, which are detailed next.
3.2.4 Motion Detection
The technique chosen to perform motion detection consisted of background subtraction, removing the pixels that have not changed since the previous frame and thereby decreasing the number of pixels to be subjected to the subsequent gesture recognition process. The algorithm performed the following steps:
Capture two frames;
Compare the colors of the pixels in each frame;
If the color is the same, replace the original color by a white pixel. Otherwise leave it unchanged.
This algorithm, while reducing the number of pixels in the image presented to the gesture recognition process, can still leave elements that are unrelated to the gesture itself, such as the user's clothing or other objects that happen to be moving in the captured images, which only increase the processing load without adding anything relevant to the end result.
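A minimal sketch of this frame-differencing step with the OpenCV Java bindings; the difference threshold of 10 is an assumed tolerance, whereas the steps above compare colors for exact equality.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Scalar;
import org.opencv.imgproc.Imgproc;

// Paint pixels that did not change between consecutive frames white,
// leaving only the moving regions for the recognition step.
static Mat removeStaticPixels(Mat previousGray, Mat currentGray, Mat currentFrame) {
    Mat diff = new Mat();
    Core.absdiff(previousGray, currentGray, diff);                // per-pixel change
    Mat unchanged = new Mat();
    Imgproc.threshold(diff, unchanged, 10, 255, Imgproc.THRESH_BINARY_INV); // mask of unchanged pixels
    Mat result = currentFrame.clone();
    result.setTo(new Scalar(255, 255, 255), unchanged);           // replace unchanged pixels with white
    return result;
}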
A second way to reduce the number of pixels is to apply a color filter. As the goal is to track gestures, the choice was skin detection.
3.2.5 Skin Detection
There may be many objects in the environment that have the same color as human skin, whose appearance varies with hue, with the intensity and position of the illumination source, with the environment the person is in, etc. In such cases, even a human observer cannot determine whether a particular color came from a region of skin or from another object in the image without taking contextual information into account. An effective skin-color model should resolve this ambiguity between skin and other objects.
It is not a simple task to build a skin-color model that works in all possible lighting conditions, yet a good model must have some robustness to succeed even as the lighting varies. A robust model requires an algorithm for color classification and a color space in which all objects are represented. There are many such algorithms, including multilayer perceptrons [15], self-organizing maps, linear decision boundaries [6, 15], and methods based on probability density estimation [24]. The choice of color space is also varied: RGB [21], YCbCr [6], HSV [6], CIE Luv [24], Farnsworth UCS [15], and normalized RGB [21].
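As one illustration of such a model, the sketch below thresholds the YCbCr color space (called YCrCb in OpenCV) using the Java bindings; the Cr/Cb bounds are commonly cited approximate values and are assumptions, not parameters tuned in this chapter.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Scalar;
import org.opencv.imgproc.Imgproc;

// Rough skin-color segmentation in the YCrCb color space.
static Mat skinMask(Mat bgrFrame) {
    Mat ycrcb = new Mat();
    Imgproc.cvtColor(bgrFrame, ycrcb, Imgproc.COLOR_BGR2YCrCb);
    Mat mask = new Mat();
    // Keep pixels whose Cr and Cb fall inside an approximate skin range,
    // leaving Y unconstrained to reduce sensitivity to illumination.
    Core.inRange(ycrcb, new Scalar(0, 133, 77), new Scalar(255, 173, 127), mask);
    return mask;
}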