Fig. 1 Kinect main components
This work presents a survey of the main recent works in the literature on 3D scanning using RGBD cameras, in particular the Kinect sensor. The goal is to provide a broad overview of the area, with references and an introduction to the methodologies and applications, ranging from the reconstruction of small static objects to the continuously updated mapping of dense or large scenarios, in order to motivate and enable further work on this topic.
2 Methods
In this section, we present the survey of methods found in the literature. For each method, we list its references, the responsible institution, the release date, and the availability of software or source code, followed by a general overview and a brief description of the approach.
Given the currently high rate of improvement and innovation in this topic, the methods in this section are limited to the progress made up to the time this work was concluded.
2.1 KinectFusion
Developed at: Microsoft Research Cambridge
Released in: October 2011
Availability: There is an open-source implementation in C++, called Kinfu, in the PCL (Point Cloud Library) project [3]. An implementation within Microsoft’s Kinect for Windows SDK [4] will be released.
General Description
In this method, only the depth image is used to track the sensor pose and reconstruct 3D models of the physical scene in real time, limited to a fixed-resolution volume (typically 512³), through a GPU implementation. The RGB camera is used only for texture mapping. Although the approach targets real-time rates, it is not GPU-memory efficient: a 512³ grid of 32-bit voxels alone occupies 512³ × 4 B = 512 MB, before any color information is stored.
The authors show some potential applications of KinectFusion by modifying or extending the GPU pipeline for use in 3D scanning, augmented reality, object segmentation, physics simulation and direct user interaction in front of the sensor. Because of its speed and accuracy, the method has spawned various improved extensions for different applications.
Approach
The system continuously tracks the six-degrees-of-freedom (6-DOF) pose of the camera and fuses, in real time, the camera depth data into a single global 3D model of a fixed-size 3D scene. The reconstruction is incremental: the model is refined as the camera moves (even slightly, by vibration), so that new viewpoints of the real scene are revealed and fused into the global model.
The main GPU pipeline consists of four steps, implemented in the CUDA language and executed concurrently:
1. Surface measurement: the depth map acquired directly from the Kinect is converted into a vertex map and a normal map, with bilateral filtering applied to reduce the inherent sensor noise. Each CUDA thread works in parallel on one pixel of the depth map, projecting it as a vertex in the camera coordinate space to build the vertex map; each thread also computes the normal vector of its vertex, producing the normal map (a CPU sketch of this step is given after this list).
2. Camera pose tracking: a GPU implementation of the ICP (Iterative Closest Point) algorithm is applied to each 640 × 480 depth measurement to track the camera pose at every depth frame, using the vertex and normal maps. A 6-DOF rigid transformation is estimated to approximately align the oriented points with those of the previous frame, and the estimated transformations are incrementally applied to the transformation that defines the Kinect's global pose.
3. Volume integration: a fixed-resolution 3D volume is predefined, mapping the specific dimensions of a fixed 3D space, and subdivided uniformly into a 3D grid of voxels. The oriented points, converted to global coordinates using the global camera pose, are integrated into the voxels through a GPU implementation of the volumetric TSDF (Truncated Signed Distance Function); a CPU sketch of this update is given after the pipeline description. The complete 3D grid is allocated in the GPU as linear aligned memory.
4. Surface prediction: raycasting of the volumetric TSDF is performed at the estimated frame pose to extract views of the implicit surface, for depth-map alignment and rendering. Each GPU thread traces a single ray and renders a single pixel of the output image.
The rendering pipeline allows conventional polygon-based graphics to be composed into the raycast view, enabling the fusion of real and virtual scenes, including shadowing, all through a single algorithm. Moreover, this step generates data that improves the camera tracking performed by the ICP algorithm.
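The volumetric integration of step 3 can be illustrated by the minimal CPU sketch below, which fuses a single depth frame into a TSDF grid through a weighted running average. The volume extent, truncation distance, intrinsics and maximum weight are illustrative assumptions, the tracked camera pose is omitted (the camera is assumed at the volume origin), and the real implementation processes the voxels in parallel on the GPU and packs the TSDF value and weight into 32 bits.

```cpp
// Minimal CPU sketch of the volumetric TSDF update (step 3).
#include <algorithm>
#include <cmath>
#include <vector>

const int   RES   = 512;        // voxels per axis (512^3 grid)
const float SIZE  = 3.f;        // metres spanned by the volume (assumed)
const float TRUNC = 0.03f;      // truncation distance mu (assumed)
const int   W = 640, H = 480;
const float FX = 525.f, FY = 525.f, CX = 319.5f, CY = 239.5f;

// The real implementation packs TSDF value and weight into a single 32-bit voxel.
struct Voxel { float tsdf = 1.f; float weight = 0.f; };

// Fuse one depth frame (camera assumed at the volume origin for brevity; the
// full method first transforms each voxel centre by the tracked camera pose).
void integrate(std::vector<Voxel>& vol, const std::vector<float>& depth) {
    const float voxelSize = SIZE / RES;
    for (int z = 0; z < RES; ++z)
      for (int y = 0; y < RES; ++y)
        for (int x = 0; x < RES; ++x) {
            // Voxel centre in camera coordinates (volume centred in front of camera).
            float px = (x + 0.5f) * voxelSize - SIZE / 2;
            float py = (y + 0.5f) * voxelSize - SIZE / 2;
            float pz = (z + 0.5f) * voxelSize;
            // Project the voxel centre into the depth image.
            int u = int(px * FX / pz + CX), v = int(py * FY / pz + CY);
            if (u < 0 || u >= W || v < 0 || v >= H) continue;
            float d = depth[v * W + u];
            if (d <= 0.f) continue;          // invalid depth measurement
            float sdf = d - pz;              // signed distance along the viewing ray
            if (sdf < -TRUNC) continue;      // voxel is far behind the surface
            float tsdf = std::min(1.f, sdf / TRUNC);
            Voxel& vox = vol[(z * RES + y) * RES + x];
            // Weighted running average of the TSDF across frames.
            vox.tsdf = (vox.tsdf * vox.weight + tsdf) / (vox.weight + 1.f);
            vox.weight = std::min(vox.weight + 1.f, 128.f);
        }
}
```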
2.2 Moving Volume KinectFusion
Developed at: Northeastern University/CCIS
Released in: September 2012
Availability: implementation to be included in the PCL project.
General Description
Moving Volume KinectFusion is an extension of KinectFusion with additional algorithms that allow the 3D volume, which is fixed in the base approach, to translate and rotate as the camera moves. The main goal is its application to mobile robot perception on rough terrain, providing visual odometry and a dense local map of the scenario simultaneously.
The implementation is based on Kinfu [3], the open-source implementation in the Point Cloud Library (PCL) maintained by Willow Garage, and the modifications were submitted for inclusion in the project.
The processing requirements are similar to those of the original implementation; for their tests the authors used an Intel Xeon W3520 processor (4 cores, 2.8 GHz, 12 GB RAM) and an NVidia GeForce GTX 580 GPU (512 cores, 1.5 GHz, 3 GB RAM).
Approach
The method simultaneously performs global camera pose tracking and builds a spatial map of the local surroundings.
Following the KinectFusion approach, after the tracking step the method determines whether a new volume frame is needed by computing the linear and angular offsets relative to the local camera pose.
To introduce a new volume frame, remapping is used: the TSDF values of the previous volume are interpolated at the grid points corresponding to the samples of the new rotated and translated volume. A swap buffer of the same size as the TSDF buffer is kept in GPU memory, and the buffers are swapped after remapping, thereby defining the new volume transformation.
A fast memory-displacement resampling algorithm is used: during resampling, the nearest neighbor is looked up and, if it lies within the truncation interval, trilinear interpolation is performed.
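Below is a minimal CPU sketch of such a remapping pass into a swap buffer, assuming a pure translation expressed in voxel units; rotation handling and the nearest-neighbor truncation test are left out, and the grid resolution, the out-of-volume default and the function names are assumptions rather than details of the actual GPU implementation.

```cpp
// Minimal CPU sketch of remapping a TSDF grid into a swap buffer by
// trilinear interpolation; the caller swaps the two buffers afterwards.
#include <cmath>
#include <vector>

const int RES = 512;

inline float at(const std::vector<float>& v, int x, int y, int z) {
    return v[(z * RES + y) * RES + x];
}

// Trilinear interpolation of the TSDF at a continuous grid coordinate.
float sampleTrilinear(const std::vector<float>& tsdf, float x, float y, float z) {
    int x0 = int(std::floor(x)), y0 = int(std::floor(y)), z0 = int(std::floor(z));
    if (x0 < 0 || y0 < 0 || z0 < 0 || x0 + 1 >= RES || y0 + 1 >= RES || z0 + 1 >= RES)
        return 1.f;                            // outside the old volume: treat as empty
    float fx = x - x0, fy = y - y0, fz = z - z0;
    float c00 = at(tsdf, x0, y0,     z0)     * (1 - fx) + at(tsdf, x0 + 1, y0,     z0)     * fx;
    float c10 = at(tsdf, x0, y0 + 1, z0)     * (1 - fx) + at(tsdf, x0 + 1, y0 + 1, z0)     * fx;
    float c01 = at(tsdf, x0, y0,     z0 + 1) * (1 - fx) + at(tsdf, x0 + 1, y0,     z0 + 1) * fx;
    float c11 = at(tsdf, x0, y0 + 1, z0 + 1) * (1 - fx) + at(tsdf, x0 + 1, y0 + 1, z0 + 1) * fx;
    float c0 = c00 * (1 - fy) + c10 * fy;
    float c1 = c01 * (1 - fy) + c11 * fy;
    return c0 * (1 - fz) + c1 * fz;
}

// Remap the whole grid for a pure translation of (tx, ty, tz), in voxel units.
void remap(const std::vector<float>& oldTsdf, std::vector<float>& swapBuf,
           float tx, float ty, float tz) {
    swapBuf.resize(oldTsdf.size());
    for (int z = 0; z < RES; ++z)
        for (int y = 0; y < RES; ++y)
            for (int x = 0; x < RES; ++x)
                swapBuf[(z * RES + y) * RES + x] =
                    sampleTrilinear(oldTsdf, x + tx, y + ty, z + tz);
    // After remapping, the caller swaps the buffers (e.g. std::swap of the vectors).
}
```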
2.3 Kintinuous
Developed at: National University of Ireland Maynooth and CSAIL/MIT
Released in: July 2012 (first version, RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras); September 2012 (current version, submitted to ICRA'13).
Availability: implementation to be included in the PCL project.
General Description
Kintinuous is another KinectFusion extension, aimed at mapping large-scale scenarios in real time. The algorithm is modified so that the region covered by the fixed volume can change dynamically, with the corresponding point cloud continuously extended and represented as a triangular mesh.
As in Moving Volume KinectFusion, the authors used the Kinfu implementation from PCL as a basis. The hardware used for the evaluation tests was an Intel Core i7-2600 3.4 GHz CPU, 8 GB of 1333 MHz DDR RAM and an NVidia GeForce GTX 560 Ti 2 GB GPU, running 32-bit Ubuntu 10.10.
Approach
The method is an extension of KinectFusion in the sense that the original algorithm is still used within the region bounded by the movement threshold; however, several modifications were incorporated:
Continuous representation: The area mapped by the TSDF is allowed to move over time, in both tracking and reconstruction. The basic process of the system is: (i) determine the camera distance to the origin; if it exceeds a specific threshold, virtually translate the TSDF so as to re-center the camera; (ii) add the surface leaving the TSDF region to a pose graph, and initialize the new region entering the TSDF as unmapped.
Implementation: (i) A cyclic vector (ring buffer) of TSDF voxels is used (see the sketch after this list). (ii) Surface points are extracted by orthogonal raycasting through each TSDF slice, which is zeroed afterwards, with the zero crossings extracted as reconstructed surface vertices. (iii) A voxel-grid filter is applied to remove possible duplicate points produced by the orthogonal raycasting.
Pose graph representation: A pose graph is used to represent the meshes external to the current volume, where each pose stores a surface slice.
Mesh generation: Uses the greedy mesh triangulation algorithm described by Marton et al. [8].
Visual odometry: The ICP odometry estimation is replaced by a GPU implementation of the dense RGB-D-based odometry algorithm presented by Steinbruecker et al. [9], integrated into the KinectFusion GPU pipeline.
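To illustrate the cyclic buffer of TSDF voxels mentioned under "Implementation", the sketch below shows one way such a buffer can be indexed and shifted without copying voxel data; the shift threshold, volume size and the handling of vacated voxels are assumptions and not details taken from the Kintinuous code.

```cpp
// Minimal sketch of a cyclic TSDF buffer: instead of copying voxel data when
// the mapped region moves, an integer origin offset is advanced and all voxel
// accesses wrap around modulo the grid resolution.
#include <cmath>
#include <vector>

const int   RES       = 512;         // voxels per axis (typical 512^3 grid)
const float VOXEL     = 3.f / RES;   // metres per voxel (assumed 3 m volume)
const float THRESHOLD = 0.5f;        // camera offset (m) that triggers a shift (assumed)

struct CyclicTsdf {
    std::vector<float> voxels = std::vector<float>(RES * RES * RES, 1.f);
    int ox = 0, oy = 0, oz = 0;      // cyclic origin of the volume, in voxels

    // Wrapped access: logical coordinates are mapped into the fixed buffer.
    float& at(int x, int y, int z) {
        int cx = ((x + ox) % RES + RES) % RES;
        int cy = ((y + oy) % RES + RES) % RES;
        int cz = ((z + oz) % RES + RES) % RES;
        return voxels[(cz * RES + cy) * RES + cx];
    }

    // Shift the volume by whole voxels; the slice that leaves the volume is the
    // one whose surface points are extracted and handed to the pose graph.
    void shift(int dx, int dy, int dz) { ox += dx; oy += dy; oz += dz; }
};

// Called after each tracked frame with the camera position in volume coordinates.
void maybeShift(CyclicTsdf& vol, float camX, float camY, float camZ) {
    if (std::fabs(camX) > THRESHOLD || std::fabs(camY) > THRESHOLD ||
        std::fabs(camZ) > THRESHOLD) {
        vol.shift(int(std::round(camX / VOXEL)),
                  int(std::round(camY / VOXEL)),
                  int(std::round(camZ / VOXEL)));
        // The vacated voxels would be reset to "unmapped" here, and the camera
        // pose re-expressed relative to the shifted volume (both omitted).
    }
}
```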
2.4 RGB-D Mapping
Developed at: University of Washington
Released in: December 2010 (first version, ISER); March 2012 (current version)
Availability: No implementation available.
General Description
RGB-D Mapping is a complete 3D mapping system aimed at the 3D reconstruction of indoor scenarios, building a global model using surfels to enable compact representation and visualization of 3D maps. It introduces a new joint optimization algorithm that combines visual features with shape-based alignment. The system is implemented using the Robot Operating System (ROS) framework [12].
Approach
The method basically consists of:
1. Use of FAST (Features from Accelerated Segment Test) keypoints and Calonder descriptors to extract sparse visual features from two frames and associate them with their respective depth values, generating 3D feature points.
2. For the correspondences between two RGB-D frames, the two-stage RGB-D ICP algorithm is used: sparse feature points are extracted to incorporate visual appearance, and their correspondences are established by RANSAC (Random Sample Consensus); a sketch of this RANSAC stage is given after this list.
3. Alignment between the two frames using the proposed joint optimization algorithm over appearance and shape correspondences.
4. Loop-closure detection by matching data frames against a previously collected set of frames.
5. Sparse Bundle Adjustment (SBA) is used for global optimization, with the ICP constraints incorporated into the SBA.
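As a rough illustration of the RANSAC stage of step 2, the sketch below repeatedly fits a rigid transformation to random triplets of matched 3D feature points (using Eigen's closed-form Umeyama fit) and keeps the hypothesis with the most inliers. The inlier threshold, iteration count and function names are assumptions; the actual RGB-D ICP then jointly optimizes appearance and shape terms, which is not shown.

```cpp
// Minimal sketch of RANSAC over sparse 3D feature correspondences.
#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <random>
#include <vector>

using Points = std::vector<Eigen::Vector3d>;   // matched 3D feature points

// Closed-form rigid fit (rotation + translation) over a subset of matches.
Eigen::Matrix4d fitRigid(const Points& src, const Points& dst,
                         const std::vector<int>& idx) {
    Eigen::Matrix3Xd a(3, idx.size()), b(3, idx.size());
    for (size_t i = 0; i < idx.size(); ++i) { a.col(i) = src[idx[i]]; b.col(i) = dst[idx[i]]; }
    return Eigen::umeyama(a, b, /*with_scaling=*/false);
}

Eigen::Matrix4d ransacTransform(const Points& src, const Points& dst) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, int(src.size()) - 1);
    const double inlierThresh = 0.03;   // metres (assumed)
    const int    iterations   = 200;    // assumed

    Eigen::Matrix4d best = Eigen::Matrix4d::Identity();
    int bestInliers = -1;
    for (int it = 0; it < iterations; ++it) {
        // Random minimal sample (duplicate/degenerate picks are tolerated here).
        std::vector<int> sample = {pick(rng), pick(rng), pick(rng)};
        Eigen::Matrix4d T = fitRigid(src, dst, sample);
        int inliers = 0;
        for (size_t i = 0; i < src.size(); ++i) {
            Eigen::Vector3d p = (T * src[i].homogeneous()).head<3>();
            if ((p - dst[i]).norm() < inlierThresh) ++inliers;
        }
        if (inliers > bestInliers) { bestInliers = inliers; best = T; }
    }
    // The full method refines this estimate with the joint appearance/shape
    // optimization (RGB-D ICP) rather than returning the raw RANSAC pose.
    return best;
}
```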
2.5 RGB-D SLAM
Developed at: University of Freiburg/TUM.
Released in: April 2011 (RGB-D Workshop on 3D Perception in Robotics at the European Robotics Forum, 2011).
Availability: Open-source implementation available.
General Description
This method allows a robot to generate colored 3D models of objects and indoor scenarios. The approach is similar to RGB-D Mapping. It won first prize in the "most useful" category of the ROS 3D challenge organized by Willow Garage.
Approach
The method consists of four steps:
1. Extraction of SURF (Speeded-Up Robust Features) keypoints from the input RGBD images and their matching against previous images.
2. By evaluating the depth image at the locations of these feature points, a set of point-wise 3D correspondences between the two frames is obtained. Based on these correspondences, RANSAC is used to estimate the relative transformation between the frames.
3. The Generalized ICP algorithm proposed by Segal et al. [15] is used to refine the initial estimate.
4. Optimization of the resulting pose graph, using HOG-Man, proposed by Grisetti et al. [16], to make the pose estimates between frames consistent in the global 3D model.
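Before the optimization of step 4, the pairwise estimates produced by steps 2 and 3 are chained into global camera poses, as in the small sketch below; the function name and the use of Eigen matrices are illustrative choices, and the pose-graph optimization that redistributes the accumulated drift is not shown.

```cpp
// Minimal sketch of chaining frame-to-frame transforms into global poses.
#include <Eigen/Dense>
#include <vector>

// relative[i] maps coordinates of frame i+1 into frame i (from RANSAC + Generalized ICP).
std::vector<Eigen::Matrix4d>
chainPoses(const std::vector<Eigen::Matrix4d>& relative) {
    std::vector<Eigen::Matrix4d> global;
    global.push_back(Eigen::Matrix4d::Identity());   // frame 0 defines the world origin
    for (const Eigen::Matrix4d& T : relative)
        global.push_back(global.back() * T);         // drift accumulates frame by frame
    return global;
}
```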
2.6 Solony et al.
Reference: [17]
Developed at: Brno University of Technology
Released in: June 2011 (STUDENT EEICT)
Availability: No implementation available.
General Description
This method aims to build dense 3D maps of indoor environments from multiple Kinect depth images. It can be used to effectively produce dense 3D maps of small workspaces.
The algorithm accumulates errors caused by small inaccuracies in the camera pose estimation between consecutive frames, since no loop-closure algorithm is used.
Therefore, compared to modern map-construction algorithms, this solution is not a SLAM (Simultaneous Localization and Mapping) algorithm, because it does not estimate the uncertainty in the camera position.
Approach
The camera position is initialized at the origin of the global coordinate system, with its rotation aligned with the negative z axis. A sparse set of feature points is extracted, and the new pose is estimated during camera tracking:
1. The SURF algorithm is applied to extract the set of visual feature points and their descriptors. The correspondence is established heuristically.
2. The RANSAC algorithm and epipolar constraints are used to check the validity of the point correspondences.
3. The camera pose is recovered from the valid correspondences, which provide the equations relating the 3D coordinates to the 2D image coordinates.
In the map construction process:
1. The depth measurements are used to compute the 3D positions of the points, which are mapped into the global coordinate system (a sketch of this step is given after this list).
2. Overlapping points are checked against the map, and the consecutive frames are processed.
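As an illustration of the map-construction stage, the sketch below transforms camera-frame points into the global coordinate system and uses a coarse voxel hash as one simple way of detecting overlapping points; the cell size, key packing and function names are assumptions rather than details of the original method.

```cpp
// Minimal sketch: map depth-derived points to global coordinates and skip
// points that fall into an already occupied cell of a coarse voxel grid.
#include <Eigen/Dense>
#include <cmath>
#include <cstdint>
#include <unordered_set>
#include <vector>

const double CELL = 0.01;  // 1 cm grid for duplicate detection (assumed)

// Pack the quantised coordinates into a single hashable key (21 bits per axis).
static uint64_t cellKey(const Eigen::Vector3d& p) {
    auto q = [](double v) { return uint64_t(int64_t(std::floor(v / CELL)) & 0x1FFFFF); };
    return (q(p.x()) << 42) | (q(p.y()) << 21) | q(p.z());
}

void addFrameToMap(const std::vector<Eigen::Vector3d>& cameraPoints,
                   const Eigen::Matrix4d& cameraPose,     // camera -> global
                   std::vector<Eigen::Vector3d>& map,
                   std::unordered_set<uint64_t>& occupied) {
    for (const Eigen::Vector3d& p : cameraPoints) {
        Eigen::Vector3d g = (cameraPose * p.homogeneous()).head<3>();
        if (occupied.insert(cellKey(g)).second)  // true only if the cell was empty
            map.push_back(g);                    // otherwise treat it as an overlap
    }
}
```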
2.7 OmniKinect
Reference: [18]