Processing chain for 3D histogram of gradients based real-time object recognition

3D object recognition has been a cutting-edge research topic since the popularization of depth cameras. These cameras enhance the perception of the environment and so are particularly suitable for autonomous robot navigation applications. Advanced deep learning approaches for 3D object recognition are based on complex algorithms and demand powerful hardware resources. However, autonomous robots and powered wheelchairs have limited resources, which affects the implementation of these algorithms for real-time performance. We propose to use instead a 3D voxel-based extension of the 2D histogram of oriented gradients (3DVHOG) as a handcrafted object descriptor for 3D object recognition, in combination with a pose normalization method for rotational invariance and a supervised object classifier. The experimental goal is to reduce the overall complexity and the system hardware requirements, and thus enable a feasible real-time hardware implementation. This article compares the 3DVHOG object recognition rates with those of other 3D recognition approaches, using the ModelNet10 object data set as a reference. We analyze the recognition accuracy for 3DVHOG using a variety of voxel grid selections, different numbers of neurons (N_h) in the single hidden layer feedforward neural network, and feature dimensionality reduction using principal component analysis. The experimental results show that the 3DVHOG descriptor achieves a recognition accuracy of 84.91% with a total processing time of 21.4 ms. Although this accuracy is lower than that of the current state-of-the-art deep learning approaches, it is close to them while enabling real-time performance.


Introduction
Over the last decade, object recognition through visual cameras has been a fundamental computer vision research question. The introduction of consumer depth cameras in recent years has led to an extension of computer vision from 2D to 3D data, thus enabling a real-world visual perception. Methods for object recognition therefore need to be extended to extract 3D object shapes and volumetric features. In addition to 2D data, 3D data can provide geometrical information and true distance measurements for the objects, and ideally is insensitive to illumination variations. Therefore, 3D data can be used to improve the overall performance compared to 2D.
This study was motivated by the desire to develop a contactless control of a powered wheelchair using the caregiver's position as a reference to drive the wheelchair in a side-by-side procession. Hence, the powered wheelchair needs to measure the relative distances (d) to the surrounding objects while at the same time distinguishing the caregiver from any other object (Figure 1). A depth camera is the most suitable selection for the camera system. However, although depth data processing and 3D object recognition are simple tasks for human perception, they are a huge challenge for computer vision due to limitations of the 3D image data acquisition and the computational power required for real-time 3D data processing. 1 These limitations must be evaluated in advance to choose a proper depth data processing approach.
Deep learning is a cutting-edge approach for object recognition which tries to imitate the human learning behavior by extracting information directly from the raw images. This requires complex algorithms such as convolutional neural networks (CNNs), to extract and classify a hierarchy of increasingly abstract features to detect and recognize the objects in the scene. Despite their good performance, CNNs pose high requirements on computational and memory resources, especially for large data amounts as in 3D point clouds. Unfortunately, powered wheelchairs and autonomous robots have severe constraints in terms of power, space, heat dissipation, and hardware resources, 1 meaning that real-time CNN implementations are unfeasible for our application.
As an alternative to CNNs, the use of handcrafted features is the classic computer vision approach for object recognition. The idea is to extract different features from the raw image to generate an object descriptor, after which a supervised classifier learns patterns from the descriptor to estimate the object's class. The computational requirements for this approach depend mainly on the total number of features (N_Features) in the descriptor to be processed by the classifier. We expect a lower N_Features to require less computational resources than CNN approaches, and thus to be plausible for implementation in real-time robotics applications. However, reducing N_Features before the classification can lead to a performance decrease, and so a balance between performance and N_Features is required.
In this article, we evaluate a 3D handcrafted object descriptor that was developed by Dupre and Argyriou 2 as an extension of the original 2D histogram of oriented gradients (HOG) 3 to support volumetric 3D data (3DVHOG). This descriptor is applied in combination with a supervised support vector machine (SVM) or a single hidden layer feedforward neural network (SLFN) classifier for 3D object recognition. The scientific contribution of this article is to explore firstly the 3DVHOG descriptor for 3D object recognition 2 and secondly the combination of data preprocessing, post-processing, and classifier settings to reduce both the computational cost and the power requirements while balancing the classification performance. Our study therefore provides the base information required to enable implementation in an embedded system for robotics and real-time applications.
We analyze the effect of reducing the extremely high dimensionality of the 3DVHOG features by applying principal component analysis (PCA) as well as the effect of choosing different numbers of hidden neurons (N_h) in an SLFN classifier. We use the Princeton ModelNet10 data set 4 with volumetric images of 10 different object classes as a reference to train, validate, and test the overall data processing steps and also to compare our recognition rates with those of others. Despite targeting an embedded system to detect the caregiver in the end, we perform our analysis on standard personal computer (PC) hardware at this stage. We do this to compare the performance of the proposed processing chain in general with other object recognition approaches based on the ModelNet10 data set. In the future, we plan to evaluate the system using real data and focus on the caregiver detection. Moreover, the proposed method is not limited to embedded systems or caregiver detection; it is applicable to any other object recognition task as well.

Related works
The extensive existing work on 3D object recognition uses several different approaches that can generally be classified in terms of input data type: (1) RGB-D data approaches, (2) multi-view CNN (MVCNN) approaches, (3) volumetric CNN approaches, and (4) handcrafted 3D object descriptors.

RGB-D data approaches
Depth cameras such as the Microsoft Kinect provide an additional 2D output channel, parallel to RGB, that encodes the depth information (RGB-D). RGB-D approaches thus extend the 2D image architecture to four channels by adding the depth information for each camera pixel (2.5D). Due to the popularization of depth cameras, along with the possibility of using well-known 2D image recognition frameworks, there is a very extensive body of research using RGB-D approaches for 3D object recognition in combination with 2D handcrafted object descriptors 5,6 or 2D CNN approaches. 7,8 However, RGB-D is still a 2D image and so does not fully exploit the complete 3D volumetric information of the objects. We believe that using the entire 3D information will provide better results than the 2D recognition approaches to RGB-D data processing.

MVCNN approaches
MVCNN approaches transform 3D object recognition into a series of 2D image recognition tasks by rendering each 3D object from different 2D viewpoints and extracting 2D features for each image projection. MVCNN approaches reuse the same well-developed CNN recognition frameworks for 2D images. [9][10][11][12] In comparison with volumetric approaches, MVCNN approaches have a lower feature dimensionality, are more efficient to compute, and are more robust against noise and artifacts such as holes. 13 Thus, they are more suitable for real-time applications and noisy camera data. A 2D cylindrical panoramic projection (DeepPano) 14 enables rotation invariance and achieves an 88.66% accuracy on the ModelNet10 data set. Su et al., 13 instead, used an MVCNN approach including 80 rendered object views, achieving a maximum recognition accuracy of 90.1% on the ModelNet40 data set. Johns et al. 15 extended the idea of MVCNN to use generic multi-view camera trajectories and achieved a maximum recognition accuracy of 92.8% on the ModelNet10 data set. Sfikas et al. 16 concatenated the spatial and orientation domains to create an augmented panoramic view that feeds a CNN, achieving a recognition accuracy of 91.1% on the ModelNet10 data set. Finally, Yavartanoo et al. 17 proposed a cutting-edge multi-view approach (SPNet) that achieved a recognition accuracy of 97.25% on the ModelNet10 data set. This approach uses a stereographic mapping to project the 3D surfaces onto a 2D planar image and has lower processing and memory requirements than other recognition approaches. However, it requires the use of a powerful graphical processing unit (GPU), which involves high power consumption and high heat dissipation. Hence, it is not suitable for robotics or powered wheelchair applications due to their real-time operation and hardware constraints.

Volumetric CNN approaches
Volumetric approaches extract 3D volumetric features through a CNN directly from the 3D data, thus exploiting the complete 3D geometry of the objects without including additional 2D features. The object data are first preprocessed through a voxelization step, in which the point cloud of each object is converted into a uniform 3D grid of binary voxels. 18 RGB-D data can be preprocessed to convert the depth information into a point cloud representation and then voxelized, 19 and so this approach is suitable for depth cameras. The first volumetric approach to be proposed was 3DShapeNets, 4 which represents the 3D mesh as a probability distribution of binary voxels. The 3D shape distributions are learned by a five-layer convolutional deep belief network. This approach achieves a recognition accuracy of 83.5% on the ModelNet10 data set. Hegde and Zadeh 20 then proposed the FusionNet CNN volumetric approach, which uses up to two CNNs in combination with the AlexNet-based 21 MVCNN approach. The three CNN subnetworks are fused to combine multiple data representations and improve the recognition accuracy to 93.1%. Brock et al. 22 presented the state-of-the-art approach, which uses a 45-layer 3D volumetric CNN and a large augmented data set for training, achieving a maximum recognition accuracy of 97.25% on the ModelNet10 data set. Despite their good recognition performance, 3D volumetric CNN approaches are large, complex, and computationally and memory demanding, meaning that none of the above mentioned volumetric approaches are suitable for real-time operation. 18,17 Maturana and Scherer 23 proposed the VoxNet approach, which considerably reduces the number of model parameters. This enables real-time operation while at the same time increasing the recognition accuracy to 92% on the ModelNet10 data set. Qi et al. 24 proposed PointNet, a real-time CNN for object recognition based on a point density occupancy grid data representation, achieving a recognition accuracy of 77.6% on the ModelNet10 data set. Zhi et al. 18 proposed a real-time 3D object recognition approach called LightNet, which combines the tasks of subvolume supervision and orientation prediction to learn discriminative 3D features through multitask learning. LightNet achieves a recognition accuracy of 93.94% on the ModelNet10 data set.
Uniform voxel grids lead to excessive use of memory and processing resources. Consequently, the approaches reviewed above use a relatively small spatial resolution to map the 3D data onto the uniform volumetric voxel grid, typically 30^3 voxels. The image resolution is therefore not comparable to that of a 2D camera. Larger grids would lead to intractable processing and memory requirements, as these increase cubically with the resolution. 25 Small spatial resolutions mean less detailed objects and hence lower recognition accuracies. To solve this problem, it has been proposed that the 3D object shape should be mapped onto adaptively subdivided hierarchical grids, with dense cells near the object's surface. [25][26][27] The recognition accuracies achieved with this technique are comparable to those of other approaches while requiring less computational power and memory. However, these methods still require a GPU, and so they are not suitable for real-time operation under strong hardware and power dissipation constraints.

Handcrafted descriptors
Handcrafted descriptors are the classical approach to object recognition. The idea is to extract a set of features from the object to generate an object descriptor that can be used as an object signature. By training a supervised classifier, it is possible to learn from the descriptors to identify a pattern regarding the object's class. This classical 2D approach can also be applied to 3D data by using specific 3D object descriptors. The efficiency of the approach relies on the effectiveness of the descriptor in capturing the object class information. A large number of handcrafted approaches exist. [28][29][30] Generally, handcrafted descriptors can be divided into local and global features. 31 Local feature descriptors focus on the shape around a set of detected key points. They tend to be more computationally expensive, and thus are not suitable for real-time robotics applications with strong hardware and heat dissipation limitations. The most popular local descriptors are spin images, 32 fast point feature histograms, 33 and 3D SURF. 34 Conversely, global feature descriptors capture shape information using the overall appearance of the object. They are increasingly used in object recognition, object manipulation, and geometric characterization. They are efficient in terms of computation time, thus allowing real-time performance. 35 Uses in 3D object shape recognition include ensemble of shape functions, 36 global fast viewpoint feature histograms, 33 and 3DVHOG. 2,37 However, global descriptors ignore local object details, leading to lower performance. In both of the 3DVHOG implementations reviewed, 2,37 the level of local object detail is configured by different 3DVHOG parameters and thus can be modified according to the recognition requirements. However, there is a compromise between the local detail level, the descriptor feature dimensionality, and the elapsed processing time. Analyzing this compromise is the aim of the present article.

Method
Our goal was to evaluate the 3DVHOG handcrafted object descriptor to reduce the computational cost as much as possible compared to deep learning approaches for 3D object recognition tasks. Our proposed processing steps are summarized in Figure 2.

Data preprocessing
We chose the Princeton ModelNet10 data set as a common reference to validate our classification results. This data set includes volumetric 3D representations of 10 different object classes (N_Classes), with separate data sets for training and test (cf. Table 1 and Figure 3). As a preprocessing step, the 3D volumetric objects are first scale normalized to [0, 1] and then quantized into a mesh grid of cubic cells, also called voxels. The number of quantized voxels (N_Voxels) for each object is an input parameter that can be modified to include more detailed object information in the analysis.
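The preprocessing described above can be sketched as follows. This is a minimal illustration of scale normalization to [0, 1] followed by quantization into a binary voxel grid; the function name and the uniform handling of the bounding box are our assumptions, not the authors' implementation.

```python
import numpy as np

def voxelize(points, n_voxels=30):
    """Scale-normalize a point cloud to [0, 1] and quantize it into a
    binary occupancy grid of n_voxels^3 cubic cells (voxels)."""
    points = np.asarray(points, dtype=float)
    # Scale normalization: map the object's bounding box into [0, 1],
    # dividing by the largest extent to preserve the aspect ratio.
    mins = points.min(axis=0)
    span = points.max(axis=0) - mins
    normalized = (points - mins) / span.max()
    # Quantization: assign each point to a voxel index in [0, n_voxels - 1].
    idx = np.minimum((normalized * n_voxels).astype(int), n_voxels - 1)
    grid = np.zeros((n_voxels,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid
```

The grid size `n_voxels` corresponds to the N_Voxels input parameter, so the same routine covers the 20^3 to 40^3 grids evaluated later.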

Pose normalization
We propose an additional data preprocessing step to achieve rotation invariance in the object classification. Rotation invariance is crucial for detecting objects whose pose is rotated with respect to the camera position. As a HOG-based descriptor, the 3DVHOG is not rotation invariant: when an object is rotated, its 3DVHOG descriptor changes, making it impossible for the classifier to estimate the correct object class. To solve this, we proposed a pose normalization method based on PCA pose normalization in combination with the standard deviation of the data (PCA-STD). 38 To include the pose normalization preprocessing step in the recognition results, we first rotate each test data set object randomly along the three axes and then normalize its pose using the PCA-STD method.
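The core of PCA-based pose normalization can be sketched as below. This shows only the PCA alignment step; the full PCA-STD method 38 additionally disambiguates the axis signs using the standard deviation of the data, which is omitted here for brevity.

```python
import numpy as np

def pca_pose_normalize(points):
    """Rotate a point cloud so that its principal axes align with x, y, z.
    Sketch of the PCA part of PCA-STD (sign disambiguation omitted)."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # Eigenvectors of the covariance matrix give the principal axes.
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    # Sort axes by decreasing variance and project the points onto them.
    order = np.argsort(eigvals)[::-1]
    return centered @ eigvecs[:, order]
```

After this step, any random rotation applied to the object is undone up to axis-sign ambiguity, which is what makes the subsequent 3DVHOG descriptor approximately pose independent.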

Feature extraction
The 3DVHOG object descriptor 2 was originally developed for environment hazard detection and risk evaluation by detecting the presence of dangerous 3D objects in the 3D scene. Here, we instead use it as a general object descriptor to extract volumetric features from the objects. As in the 2D implementation of the HOG, several input parameters configure the descriptor, but in this 3D implementation they are extended to a 3D representation and hence include the number of angle bins (θ_Bins, φ_Bins), the block size (Block_Size), the step size (Step_Size), and the cell size (Cell_Size). The total number of blocks (N_Blocks) (1) and the total number of features (N_Features) (2) for each object depend on the configuration of these parameters and are calculated as follows

N_Blocks = ((N_Cells − Block_Size) / Step_Size + 1)^3 (1)

N_Features = N_Blocks × Block_Size^3 × θ_Bins × φ_Bins (2)

with the number of cells per object (N_Cells) calculated as

N_Cells = N_Voxels / Cell_Size (3)

It is important to evaluate the impact of each parameter to choose a proper descriptor setup that minimizes the required N_Features while at the same time maximizing the classification accuracy.
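The dependence of the descriptor size on these parameters can be computed directly from equations (1) and (2). The following sketch assumes that all sizes are expressed per axis and divide evenly; the example values in the usage note are illustrative, not the actual Table 3 configuration.

```python
def vhog_dimensions(n_voxels, cell_size, block_size, step_size,
                    theta_bins, phi_bins):
    """Compute N_Cells, N_Blocks, and N_Features for one 3DVHOG setup.
    Assumes n_voxels is a multiple of cell_size (per-axis quantities)."""
    n_cells = n_voxels // cell_size                       # cells per axis
    blocks_per_axis = (n_cells - block_size) // step_size + 1
    n_blocks = blocks_per_axis ** 3                       # total 3D blocks
    # Each block holds block_size^3 cells, each carrying a
    # theta_bins x phi_bins gradient orientation histogram.
    n_features = n_blocks * block_size ** 3 * theta_bins * phi_bins
    return n_cells, n_blocks, n_features
```

For instance, a hypothetical 30^3 voxel grid with a cell size of 5, blocks of 2^3 cells, a step size of 1, and the minimal bin setting (θ_Bins = 2, φ_Bins = 4) yields 125 blocks and 8000 features, showing how quickly N_Features grows with the grid and bin settings.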

Data postprocessing
The different N_Blocks are vectorized as single vector descriptors of dimension N_Features. All the vector descriptors of each object are then reshaped into a matrix of features, as illustrated in Figure 4. Depending on the number of angle bins (θ_Bins × φ_Bins) and N_Cells, the final vector descriptor of N_Features, and by extension the feature matrix, can have an extremely high dimensionality. This high dimensionality limits the real-time operation of the overall processing chain and also requires more computational and memory resources to process all the data.
We therefore apply PCA as a method to reduce the feature dimensionality. The final number of recomputed features depends on the number of principal components (N_PC) used to project the original data (Figure 5). Therefore, the minimum N_PC must be found that maximizes the classification rate while at the same time reducing the required N_Features as much as possible; see equation (4). Once N_PC is determined, we project the 3DVHOG descriptors from the test object data set onto the PCs. The resulting reduced 3DVHOG descriptor has a lower feature dimensionality

N_Features(reduced) = N_PC ≤ N_Objects − 1 (4)

The maximum N_PC is limited by the total number of objects in the training data set (N_Objects = 3991; see Table 1). Thus, at most N_Objects − 1 components are available for the projection. If a larger N_PC is required, it will be necessary to increase N_Objects in the training data set by performing a training data set augmentation.
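The PCA reduction step can be sketched with a plain SVD, as below. This is a minimal illustration under our own naming; the authors' implementation may differ (e.g. using a library PCA routine).

```python
import numpy as np

def fit_pca(train_feats, n_pc=100):
    """Learn n_pc principal components from the training feature matrix
    (shape N_Objects x N_Features); returns (mean, components)."""
    mean = train_feats.mean(axis=0)
    # SVD of the centered matrix. Its rank is at most N_Objects - 1,
    # which is what bounds the usable number of components.
    _, _, vt = np.linalg.svd(train_feats - mean, full_matrices=False)
    return mean, vt[:n_pc]

def project(feats, mean, components):
    """Project 3DVHOG descriptors onto the components (-> n_pc features)."""
    return (feats - mean) @ components.T
```

The components are fitted once on the training set and then reused to project every test descriptor, so the per-object cost at run time is a single matrix-vector product.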

Object classifiers: SVM and SLFN
Information regarding the object classes must be extracted from the feature vector. Once we have learned this information from the training data set, it is possible to estimate the class of a new input object. The learning process is performed by training a supervised multiclass classifier. We have evaluated two different classifiers: an SVM and an SLFN. We expect different classification results when they work in combination with 3DVHOG and PCA, and an evaluation of this was one of the aims of the present study. Configuration and data parameters of the classifiers are shown in Table 2.
Regarding the N_h used in the SLFN classifier, there is no specific rule for choosing an N_h that maximizes the classification accuracy. 39 As shown in Table 4, we need to deal with extremely large N_Features vectors, and we therefore chose the Shibata and Ikeda criterion 40 (N_hSI) (5) to obtain a low N_h with respect to N_Features

N_hSI = sqrt(N_Features × N_Classes) (5)

However, we also evaluated N_h as a design criterion to minimize the elapsed processing time and to measure the dependence of the classification accuracy on N_h. We selected the N_h criterion according to N_Features. Other criteria, which involve a higher N_h and thus higher memory and computing requirements, were not considered due to the real-time performance and hardware constraints.

Experiments
We defined several experiments to evaluate the effect of the different preprocessing and postprocessing parameters on the classification accuracy (Figure 2). The order of these experiments follows the logical design flow that must be considered for a proper parameter configuration and data analysis (Figure 6).

Experiment 1: Voxel grid and PCA
In experiment 1, we analyzed the effect of choosing different voxel grid configuration parameters while reducing the N_Features dimensionality using PCA. We used the classification accuracy as a key measurement for both classifiers, considering pose normalization for rotational invariance. Classification accuracy was defined as the averaged class accuracy (ACC_Class)

ACC_Class = (1/N) Σ_C (Tp_C + Tn_C) / (Tp_C + Tn_C + Fp_C + Fn_C) (6)

where Tp_C are the class C true positives, Tn_C the class C true negatives, Fp_C the class C false positives, Fn_C the class C false negatives, and N the number of classes. Increasing the voxel grid provides more detailed volumetric information about the objects but can also increase the differences between objects of the same class. By contrast, when the voxel grid is decreased, the objects are less detailed but the differences between objects of the same class are smaller. Smaller intraclass differences make it easier for the classifier to extract a pattern from the feature matrix and hence can increase the classification performance. However, they can also entail smaller differences between objects of different classes, thus decreasing the classification performance. It is therefore necessary to find the combination of the lowest voxel grid value, the minimum required N_PC, and the right classifier approach that improves the classification accuracy while reducing as far as possible the amount of data that requires processing.

In line with the literature and the total N_Features, we chose voxel grids ranging from 20^3 to 40^3 voxels for the experiment. The 3DVHOG initial configuration parameters (Cell_Size, Block_Size, Step_Size, θ_Bins, φ_Bins) were the same for each voxel grid value (Table 3), so the total N_Features depends only on the voxel grid in each configuration. A higher voxel grid means a higher N_Features (Table 4), which in practice can cause difficulty for the SVM and SLFN classifiers due to the high dimensionality of the feature matrix. A higher dimensionality leads to higher processing times and memory requirements, making it unsuitable for real-time operation. However, postprocessing the feature matrix with PCA reduces N_Features to N_PC; see equation (4).
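The averaged class accuracy used throughout the experiments can be computed from a confusion matrix as sketched below; the function name and the row/column convention (rows = true class, columns = predicted class) are our assumptions.

```python
import numpy as np

def averaged_class_accuracy(conf):
    """Averaged per-class accuracy ACC_Class from an N x N confusion
    matrix (rows: true class, columns: predicted class)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    accs = []
    for c in range(conf.shape[0]):
        tp = conf[c, c]                 # class c true positives
        fn = conf[c].sum() - tp         # class c false negatives
        fp = conf[:, c].sum() - tp      # class c false positives
        tn = total - tp - fn - fp       # class c true negatives
        accs.append((tp + tn) / total)  # per-class accuracy
    return float(np.mean(accs))
```

A perfect classifier yields ACC_Class = 1.0, and systematic confusion between two classes (such as desk/table later in Table 10) lowers the two corresponding per-class terms simultaneously.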
Classification accuracy, standard deviation, and data variance after all the preprocessing and postprocessing steps illustrated in Figures 2 and 5 for each voxel grid case are shown in Figure 7 for the SVM and Figure 8 for the SLFN. These results were calculated by averaging 10 different measurements. The figures also show the classification accuracy without applying PCA.
Classification accuracy improved for both classifiers when PCA was applied. The maximum value was achieved using approximately 100 PCs in both cases (SVM: Figure 7, SLFN: Figure 8). The improvement was significantly larger for the SVM classifier, which achieved a maximum classification accuracy of 85.5%. Both classifiers performed best with the highest voxel grid analyzed (40^3), but this was only marginally better than a grid of 30^3 voxels. A higher voxel grid considerably increases the size of the N_Features vector (Table 4) and consequently also the memory requirements and processing time. We would therefore choose a 40^3 voxel grid in combination with 100 PCs to maximize the recognition accuracy, but we would also consider a 30^3 voxel grid for a real-time application while maintaining an acceptable recognition accuracy.

Experiment 2: Explore 3DVHOG bins
In experiment 2, we evaluated the impact of using different θ_Bins and φ_Bins to compute the 3DVHOG descriptor, considering pose normalization for rotational invariance. As with the voxel grid, if we increase θ_Bins and φ_Bins, the 3DVHOG captures more detailed information about the objects, but it becomes harder for the classifier to extract a class pattern from the feature matrix. In addition, the total N_Features, and by extension the total size of the feature matrix, depends on the product of θ_Bins and φ_Bins, as shown in equation (2). Therefore, it was necessary to evaluate the impact of θ_Bins and φ_Bins in combination with the previous PCA results to reduce the total N_Features, and therefore the computational requirements, as far as possible. However, oversizing φ_Bins and θ_Bins may complicate the classification process, and it also quadratically increases the N_Features vector length (2) (Table 5). Note that φ_Bins is defined between 0° and 360° and θ_Bins between 0° and 180°, meaning that φ_Bins requires twice as many angle bins as θ_Bins to sample an angle with the same resolution. The minimum values are φ_Bins = 4 and θ_Bins = 2 to measure gradients in all the 3D directions. The goal of this experiment was to find the lowest combination of φ_Bins, θ_Bins, and N_PC that maximizes the classification rate. Experimental results for all the angle bin and voxel grid cases defined in Table 5 are shown in Figure 9 for the SVM and SLFN classifiers. Classification accuracy and standard deviation were calculated by averaging 20 measurements. For both classifiers, the recognition accuracy remains essentially constant with respect to the number of angle bins. Thus, a low number of angle bins is enough to capture the object class information. In addition, the SVM classifier performs better than the SLFN in all analyzed cases.
The maximum recognition accuracy achieved is 85.84%, corresponding to φ_Bins = 10, θ_Bins = 5, and a grid of 40^3 voxels. Only in the lowest angle bin case (φ_Bins = 4 and θ_Bins = 2) is the recognition accuracy marginally lower for both classifiers and voxel grids. However, the difference in recognition accuracy with respect to the best case is small, while N_Features is reduced significantly (cf. Table 5).

Experiment 3: Explore numbers of hidden neurons
In experiments 1 and 2, we used the N_hSI criterion (5) to choose a proper N_h for the SLFN classifier. We also used PCA for feature dimensionality reduction and aimed to determine the best combination of PCs and angle bins to improve the recognition accuracy. In experiment 3, we investigated the possibility of using a lower N_h as a method of reducing the processing time instead of using PCA, and so the data postprocessing indicated in Figure 2 was excluded. The experimental goal was to evaluate the minimum N_h required to maintain the SLFN classification accuracy. We hypothesized that a lower N_h would reduce the SLFN classification processing time due to the lower computational requirements and the avoidance of computing the PCA. Averaged results and data variance of 10 different measurements are shown in Figure 10 for 40^3 and 30^3 voxel grids with φ_Bins = 4 and θ_Bins = 2.
As shown in Figure 10, the SLFN classification accuracy was largely invariant to the N_h used, in agreement with the results of experiment 2. However, a very low N_h (N_h < 20) significantly decreased the classification accuracy. The best results in terms of classification accuracy and standard deviation were achieved with N_h = 30; increasing N_h beyond this did not change the classification accuracy, but the standard deviation of the 10 measurements increased.

Experiment 4: Pose normalization
In experiment 4, we evaluated the pose normalization preprocessing step shown as the second operation in Figure 2. This pose normalization achieves rotational invariance for the 3DVHOG descriptor using the PCA-STD method defined by Vilar et al. 38 Performance is evaluated in terms of the ACC_Class averaged over 20 different measurements. For the evaluation, we rotated all objects in the test data set randomly between 0° and 360° around the three axes. The results were then compared with those obtained without pose normalization and without rotating the objects.
An initial evaluation without pose normalization showed a recognition accuracy of 88.48% for nonrotated test data set objects but only 43.25% when the objects were rotated, due to the pose dependency of the 3DVHOG descriptor. When pose normalization was included as in Figure 2, the recognition accuracy increased to 84.91% for rotated objects, while for nonrotated objects it was 84.61% and thus comparable. These results are summarized in Table 6.

Experiment 5: Explore processing time
In experiment 5, we measured the mean elapsed processing time for the overall object recognition chain. This experiment compared all the processing durations according to the 3DVHOG parameter configuration and pose normalization, and also evaluated the processing time improvements achieved by using PCA and a lower N_h. Experiments 1 and 2 gave us information on the best 3DVHOG parameter configuration to balance the classification accuracy against the required N_PC and N_h. According to the previous results, we chose N_PC = 100, N_h = 30, φ_Bins = 4, and θ_Bins = 2 to reduce the computational cost as much as possible and therefore enable real-time performance.
With these preprocessing and postprocessing parameter configurations, we trained the SVM and SLFN classifiers to estimate the object class of the test data set objects, as shown in Figure 11. We computed, object by object, the elapsed processing time for the 3DVHOG descriptor computation (t_HOG), the data projection onto the PCs (t_PC), and the data classification through the trained SVM and SLFN classifiers (t_Class). The total elapsed time for all processing chain steps was also computed (t_Total). The time required for training the SVM and SLFN classifiers was not considered in this analysis, since training is an offline data processing operation. Processing times were measured on a low-range laptop computer with the specifications shown in Table 7.
Although these measurements are not from an embedded system, they provide a valuable reference to qualitatively compare and analyze the different processing times. As such, our analysis gives an early insight into the expected processing times and helps to select an appropriate variant for later implementation in the real embedded system. Summarized mean values of the different processing times are shown in Table 8 for a 30^3 voxel grid and in Table 9 for a 40^3 voxel grid.
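Per-stage timing of the kind reported in Tables 8 and 9 can be gathered with a small wall-clock helper such as the one below; the helper name and the averaging over repeated runs are our assumptions about how such measurements are typically taken.

```python
import time

def time_stage(fn, *args, repeats=10):
    """Measure the mean elapsed wall-clock time of one processing stage
    (e.g. descriptor computation or classification) over several runs."""
    elapsed = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / len(elapsed)  # mean seconds per call
```

Applying it to each stage (descriptor, PCA projection, classifier) and summing the means gives an estimate of t_Total per object, and hence of the achievable frame rate.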
The measured t_N and t_PC were relatively small and almost negligible in comparison with t_Class and t_HOG. By choosing N_PC and N_h according to the results of experiments 1 and 3, it was possible to considerably reduce t_Class for both the SVM and the SLFN, while the classification accuracy remained comparable.

Object classification analysis
The confusion matrix for the highest recognition accuracy analyzed in Tables 8 and 9 is shown in Table 10. Most of the object classes were relatively well classified, but there was clear misclassification between classes 4 (desk) and 9 (table) and between classes 5 (dresser) and 7 (nightstand) (Table 1).

Comparison with previous results
The 3DVHOG descriptor in combination with PCA-STD pose normalization and PCA feature dimensionality reduction achieved a recognition accuracy of 84.91% and a t Total of 21.6 ms (Table 9). This recognition accuracy is lower than, but close to, the state-of-the-art approaches shown in Table 11. In addition, our approach enables frame rates of 46 fps and thus real-time performance without the need for a GPU. By contrast, all the reviewed approaches use CNNs and GPUs and so are not suitable for robotics applications where power consumption, real-time processing, and limited hardware resources are important constraints. Although some of the CNN approaches achieve better results in general, our results show that handcrafted descriptors can also be considered, especially when real-time processing and low power are required.

Classification analysis
As shown in Table 10, there was substantial misclassification between classes 4 (desk) and 9 (table) and between classes 5 (dresser) and 7 (nightstand) (Table 1). These classification errors considerably decreased the final recognition accuracy. The main reason for the misclassified objects is the high similarity between these classes (Figure 12). The 3DVHOG descriptor cannot capture enough local detail from the classes to allow the classifier to differentiate between them. This problem could be addressed by using a higher voxel grid resolution to capture more local detail. However, increasing the local detail can also reduce the similarities between objects of the same class and consequently reduce the overall recognition accuracy. In addition, a higher voxel grid resolution leads to a considerable increase in N Features (Table 4), and hence also in both t Total and memory requirements. Yet increasing the voxel grid from 30³ to 40³ voxels (cf. Figures 7 and 8) leads to only a relatively small increment in the recognition accuracy. Therefore, we believe that higher voxel grid resolutions will not significantly increase the recognition accuracy.
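The growth of N Features with the voxel grid resolution can be illustrated with a back-of-the-envelope sketch. The cell size and bin counts below are assumptions (the φ Bins = 4 and θ Bins = 2 values come from the chosen configuration, the 10-voxel cell size is hypothetical), so the absolute numbers do not reproduce Table 4; only the cubic scaling trend matters.

```python
def n_features(grid, cell=10, phi_bins=4, theta_bins=2):
    """Assumed descriptor length: cells per grid times orientation bins.

    grid: voxel grid resolution per axis (e.g. 30 or 40)
    cell: assumed HOG cell size in voxels (hypothetical value)
    """
    cells_per_axis = grid // cell
    return cells_per_axis ** 3 * phi_bins * theta_bins

n30 = n_features(30)   # 3^3 cells * 8 bins = 216 features
n40 = n_features(40)   # 4^3 cells * 8 bins = 512 features
growth = n40 / n30     # descriptor grows ~2.4x for a modest accuracy gain
```

The cubic cell count is what drives the t Total and memory cost of finer grids discussed above.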

Experimental design flow
The results from this study can guide us in choosing the best classifier, voxel grid, and configuration of the parameters N Features and N h . However, the chosen configuration may be optimal only for the ModelNet10 data set. We can expect any optimal configuration of these parameters to depend mainly on the degree of similarity between objects from the same class and from different classes. A higher degree of similarity between objects from the same class, compared to objects from different classes, will require a smaller N Features to estimate the class to which an object belongs. This means that the experimental results depend largely on the data set. As a consequence, the reported system exploration illustrated in Figure 6 must be repeated if the same chain of data processing operations is to be reused on ground-truth data for the wheelchair application.

Conclusions
Experimental results show that the 3DVHOG descriptor in combination with PCA-STD pose normalization achieves a classification accuracy of 84.91% and a total processing time of 21.6 ms on the ModelNet10 data set. This accuracy is lower than, but close to, the accuracy achieved by state-of-the-art CNN approaches, while real-time processing is enabled on low-end CPUs. Our approach is thus suitable for embedded robotic vision constrained by requirements of low power and real-time performance. Less tightly constrained applications can instead benefit from using CNNs to achieve the highest possible accuracy.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.