Depth-based human activity recognition via multi-level fused features and fast broad learning system

Human activity recognition using depth videos remains a challenging problem, especially in applications where the available training samples are limited. In this article, we propose a new method for human activity recognition that combines an integrated descriptor for depth sequences, called multi-level fused features, with a fast broad learning system based on matrix decomposition for classification. First, surface normals are computed from the original depth maps; a histogram of surface normal orientations is obtained as a low-level feature by accumulating the contributions of the normals, and a high-level feature is then acquired by sparse coding and pooling over aggregated polynormals. After that, principal component analysis is applied to the concatenation of the two-level features to obtain a low-dimensional and discriminative fused feature. Finally, a fast broad learning system based on matrix decomposition is proposed to accelerate the training process and enhance the classification results. Recognition results on three benchmark data sets show that our method outperforms state-of-the-art methods in terms of accuracy, especially when the number of training samples is small.


Introduction
Human activity recognition (HAR) is a research hotspot in computer vision and pattern recognition, with wide applications such as intelligent video surveillance, 1 human-computer interaction, 2 ambient assisted living, 3 and virtual reality. 4 Early research on HAR mainly focused on recognizing activities from RGB videos, and many successful approaches have been proposed. 5,6 However, the effectiveness of RGB cameras deteriorates under illumination changes and cluttered, disordered surroundings. 7 The invention of cost-effective depth sensors such as Microsoft Kinect and Asus Xtion Pro has spurred renewed interest in activity recognition. The rich structural information captured by depth sensors is insensitive to illumination variations, robust to complex backgrounds, and valuable for obtaining geometric information. 8 Much research on HAR has therefore been carried out with depth maps. [9][10][11] However, when depth training samples are limited, because collecting depth data of human activity is costly, most of these methods cannot achieve the required accuracy due to weak descriptors and coarse classifiers.
Existing research has investigated a number of human body representations, including skeleton joints, 12 cloud points, 13 local interest points, 14 projected depth maps, 15 and surface normals. 10 In Luo et al., 16 a skeleton-based discriminative dictionary learning approach was proposed by utilizing group sparsity and geometry constraints. Vemulapalli et al. 12 treated skeletons as points and actions as curves in a Lie group by using the three-dimensional (3D) relative geometry between body parts. However, skeletons are usually noisy due to the difficulty in localizing body parts, self-occlusions, and sensor range errors. 17 In contrast to skeleton joints, cloud points are more robust to occlusions and noise. In Wang et al., 13 local occupancy patterns (LOP) were designed to subdivide the local 3D subvolumes around skeleton joints into a set of spatial grids, and the number of cloud points falling into each grid was counted. Rahmani et al. 18 designed the histogram of oriented principal components (HOPC) to capture the local geometric information around each point in 3D cloud point videos, which is robust to viewpoint, scale, and temporal variations. To effectively suppress noise in depth sequences, local spatiotemporal interest points (STIPs) were extracted from depth videos with a carefully designed filter that selects task-related interest points in Xia and Aggarwal. 14 To transform the depth data from 3D to two-dimensional (2D), Yang et al. 19 proposed depth motion maps (DMM), generated by projecting the depth maps onto three orthogonal planes and thresholding the difference of consecutive depth frames for each projected view, and then applied the histogram of oriented gradients (HOG) to each 2D projected view to extract features. However, the DMM features employed in Chen et al. 15 and Yang et al. 19 cannot capture temporal information and thus suffer from temporal disordering.
Surface normals have been shown to extract valuable shape and structure information from depth maps. 20 In Oreifej and Liu, 10 the histogram of surface normal orientations in four-dimensional (4D) space (HON4D), spanning time, depth, and the 2D viewing plane, was designed to capture complex joint shape-motion cues at the pixel level. Although it is a low-level feature, it captures motion and geometry cues effectively while being robust to occlusion. Yang and Tian 21 proposed a new high-level representation called the super normal vector (SNV), built by aggregating low-level polynormals and concatenating the feature vectors extracted from each adaptive spatiotemporal grid to encode spatiotemporal information. SNV is robust to noise; it not only captures spatial and temporal order but also provides more distinctive local motion and appearance information for complex activities. In this article, data fusion is employed to obtain robust and discriminative features that take advantage of both of the above features.
In recent years, deep learning methods have been widely used to automatically learn features from raw data and have enabled successful computer vision applications, 22,23 especially in HAR. [24][25][26] Nonetheless, deep learning methods usually require large-scale training sets, which are difficult to obtain because of economic and technical limits. Some recent works exploited transfer learning [27][28][29] to deal with the lack of training samples. Nevertheless, choosing the parameters and models of deep learning methods remains a challenging problem.
The broad learning system (BLS) 30 was proposed as an improvement of the random vector functional link neural network (RVFLNN). 31,32 Compared with deep neural network models, RVFLNN dramatically reduces training time and provides comparable generalization ability through a combination of random functions. In BLS, the mapped features generated from the input data form the feature nodes of the network; they are then transformed into enhancement nodes (EN) by randomly generated weights. Finally, all mapped features and EN are directly connected to the output, and the corresponding output coefficients can be derived from the pseudoinverse. 30 BLS has been successfully applied to some image classification tasks, 30,33 where it outperformed common classifiers such as k-nearest neighbor (KNN), 34 support vector machine (SVM), 35 and extreme learning machine (ELM) 36 with limited labeled samples. However, most real-world regression and classification problems are complex and need very broad-scale feature nodes, leading to extremely long training times. Therefore, we propose a fast BLS based on matrix decomposition (FBLS-MD) to resolve this problem.
To handle HAR tasks with limited training samples, we develop in this article a robust descriptor of depth sequences called multi-level fused features (MLFF). To fully exploit the validity of MLFF, FBLS-MD is further proposed. MLFF are generated by concatenating two different features extracted at low and high levels, respectively. Since the low-level feature captures the statistical patterns of shape changes of human activities while the high-level feature provides a more comprehensive representation of spatial and temporal variations, MLFF are strongly robust to noise and occlusion. Principal component analysis (PCA) is further adopted to obtain lower-dimensional features. Finally, FBLS-MD is applied to classify activities and relieve the problem of heavy computation. The main contributions of our work are as follows: (a) We propose a new descriptor that is more discriminative and effective due to the complementarity between low-level and high-level features. (b) To the best of our knowledge, this is the first work to introduce BLS into HAR classification; the proposed FBLS-MD relieves the time-consuming training process caused by the large number of nodes.
The rest of this article is organized as follows: the ''Proposed method'' section introduces the detailed framework of our proposed method; the ''Experiments'' section reports experiments on three well-known data sets and analyzes the results; and the ''Conclusion'' section summarizes the article.

Proposed method
Our proposed method has two major steps. First, MLFF of depth sequences are acquired by concatenating HON4D and SNV; then we employ PCA for dimensionality reduction. Second, FBLS-MD algorithm is introduced for efficient training and classification. The overview of our method is shown in Figure 1.

MLFF
All features in our work are calculated in 4D space (i.e. the 2D image coordinates (x, y), depth (z), and time (t)). The steps to obtain the MLFF are illustrated in Figure 2.
Feature extraction. First, we extract HON4D in the same way as in the study by Oreifej and Liu. 10 A depth sequence of human activity can be viewed as a hypersurface in 4D space defined by the mapping R^3 -> R^1: z = f (x, y, t), where the set of points (x, y, t, z) constitutes a surface satisfying S(x, y, t, z) = f (x, y, t) - z = 0. The normal of the surface S is therefore

n = \nabla S = (\partial f / \partial x, \partial f / \partial y, \partial f / \partial t, -1),

and the unit normal n^ is computed by normalizing n. Next, the 4D orientation space is quantized by the 120 vertices of the 600-cell polychoron, which divide it regularly. The HON4D descriptor is then calculated by accumulating the projections of the unit normals n^ onto the 120 vertices. To further exploit the spatiotemporal information of a depth sequence, it is divided into 4 x 3 x 3 spatiotemporal subvolumes; the final descriptor is a concatenation of the HON4Ds acquired from all subvolumes.
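For illustration, the normal computation and histogram accumulation above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `hon4d_histogram` is ours, and arbitrary unit projectors stand in for the 120 vertices of the 600-cell polychoron.

```python
import numpy as np

def hon4d_histogram(depth_video, projectors):
    """Sketch of a HON4D-style histogram for one spatiotemporal cell.

    depth_video: array of shape (T, H, W), z = f(x, y, t).
    projectors:  array of shape (P, 4), unit vectors quantizing the 4D
                 orientation space (120 polychoron vertices in the paper;
                 any set of unit vectors works for illustration).
    """
    # Gradient of S(x, y, t, z) = f(x, y, t) - z gives the (unnormalized)
    # surface normal n = (df/dx, df/dy, df/dt, -1).
    dt, dy, dx = np.gradient(depth_video.astype(float))
    n = np.stack([dx, dy, dt, -np.ones_like(dx)], axis=-1).reshape(-1, 4)
    n_hat = n / np.linalg.norm(n, axis=1, keepdims=True)   # unit normals
    # Accumulate the (non-negative) projections onto each projector.
    proj = np.maximum(n_hat @ projectors.T, 0.0)
    hist = proj.sum(axis=0)
    return hist / (hist.sum() + 1e-12)                     # normalize
```

Concatenating the outputs of this function over all 4 x 3 x 3 subvolumes would yield the cell-wise descriptor described above.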
Second, we extract aggregated spatiotemporal features based on an improved spatiotemporal pyramid, as in the study by Yang and Tian. 37 The method forms high-level features by clustering normals from a local spatiotemporal neighborhood into polynormals. The N normals in the local neighborhood S of each cloud point are concatenated into a polynormal p, denoted as

p = [n^_1, n^_2, . . . , n^_N].

The neighborhood S is a spatiotemporal depth subvolume determined by two parameters r and t, where r denotes the number of neighboring points in space and t indicates the number of neighboring depth maps in the time series.
Sparse coding 38 is then utilized to learn a set of dictionary vectors encoding the polynormals. Average pooling is applied spatially to aggregate the coefficient-weighted differences:

u_k(t) = (1 / N_t) Σ_i a_{k,i} (p_i - d_k),

where u_k(t) represents the pooled difference vector of the kth visual word d_k in the tth frame, a_{k,i} is the coefficient of the sparse decomposition of polynormal p_i, and N_t is the number of polynormals in the frame. Max pooling is then applied over the temporal subsequence to aggregate the vectors from T frames and obtain u_k, which represents the kth visual word in the whole volume. The final vector U is the concatenation of the u_k vectors of the K visual words, and is K x M dimensional.
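The average-then-max pooling step can be sketched as below. This is an illustrative simplification assuming the sparse codes a_{k,i} have already been computed by some sparse coder; the function and argument names are ours, not the authors'.

```python
import numpy as np

def pool_polynormals(polynormals, dictionary, codes, frame_ids, num_frames):
    """Average pooling of coefficient-weighted differences per frame,
    followed by max pooling over frames.

    polynormals: (P, M) polynormal vectors p_i.
    dictionary:  (K, M) visual words d_k learned by sparse coding.
    codes:       (P, K) sparse coefficients a_{k,i}.
    frame_ids:   (P,) frame index of each polynormal.
    Returns a K*M-dimensional vector U.
    """
    K, M = dictionary.shape
    u = np.zeros((num_frames, K, M))
    counts = np.zeros(num_frames)
    for p, a, t in zip(polynormals, codes, frame_ids):
        # a_{k,i} * (p_i - d_k) for every visual word k at once
        u[t] += a[:, None] * (p[None, :] - dictionary)
        counts[t] += 1
    u /= np.maximum(counts, 1)[:, None, None]  # spatial average per frame
    U = u.max(axis=0)                          # temporal max pooling
    return U.reshape(-1)                       # concatenate the K words
```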
To measure motion energy and characterize movement changes accurately, the ith frame M_i is projected onto three orthogonal planes to acquire the projected maps M_i^v, v ∈ {1, 2, 3}. The difference between two consecutive maps projected on each of the three planes is then binarized with a specified threshold. We calculate the motion energy by accumulating the sum of the non-zero elements of the binary maps:

E(i) = Σ_{v=1}^{3} Σ_{j=1}^{i-1} sum(|M_{j+1}^v - M_j^v| > ε_v),

where E(i) is the motion energy of the ith frame, ε_v is the threshold for the vth projected view, and sum(·) returns the total number of non-zero elements in a binary map. The motion energy of the ith frame is thus the superposition of the energies from the first frame to the ith frame.
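The threshold-binarize-count-accumulate structure described above can be sketched as follows. This is a simplified illustration: true projected views map depth pixels onto the xz and yz planes, whereas here the side and top views are approximated with max-projections along the width and height axes (our own simplification); the function name and default thresholds are also ours.

```python
import numpy as np

def motion_energy(depth_video, thresholds=(0.05, 0.05, 0.05)):
    """Accumulated motion energy per frame from three projected views.

    depth_video: (T, H, W) depth maps. Consecutive projected maps are
    differenced, binarized with a per-view threshold, and the non-zero
    counts are accumulated from the first frame onward.
    """
    T = depth_video.shape[0]
    views = [depth_video,                # front view (x, y)
             depth_video.max(axis=2),    # side view, approximated
             depth_video.max(axis=1)]    # top view, approximated
    E = np.zeros(T)
    for v, eps in zip(views, thresholds):
        diff = np.abs(np.diff(v.astype(float), axis=0))       # consecutive maps
        E[1:] += (diff.reshape(T - 1, -1) > eps).sum(axis=1)  # non-zero counts
    return np.cumsum(E)   # energy superposition from the first frame
```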
To obtain information along the spatial dimensions, each frame is divided into h x w blocks, and the entire activity is divided into T levels to capture the temporal dimension. Eventually, each activity sequence is divided into h x w x (2^T - 1) blocks. This article uses a three-level pyramid in the time dimension: {[t_0, t_4]}, {[t_0, t_2], [t_2, t_4]}, {[t_0, t_1], [t_1, t_2], [t_2, t_3], [t_3, t_4]}, as shown in Figure 3.
Finally, the features U extracted from each grid of the improved spatiotemporal pyramid are concatenated to form the high-level feature, that is, SNV. 37

Feature fusion. Feature fusion has been demonstrated to be an effective way to boost the performance of HAR systems. [39][40][41][42] It is usually conducted through feature normalization and feature selection or transformation, owing to the highly correlated feature set and the curse of dimensionality. 43 In our method, the HON4D 10 of each cell is extracted, and the final low-level feature is denoted as

H = [H_1, H_2, . . . , H_N],

where N represents the number of spatiotemporal cells.
The final high-level descriptor V can be written as

V = [U_1, U_2, . . . , U_c],

where c indicates the number of space-time grids of the spatiotemporal pyramid. We denote the normalized H and V as H~ and V~, and concatenate the two features to obtain F = [V~, H~]. Feature fusion usually produces representations in a higher-dimensional space. Although pooling can eliminate data redundancy, its dimensionality reduction is usually a by-product rather than a direct goal. PCA 44 is useful for dimensionality reduction, increasing interpretability while at the same time minimizing information loss; it maximizes variance by creating new uncorrelated variables and has become an adaptive data analysis technique. Therefore, we employ PCA to reduce the dimension of the fused features, thus improving the efficiency of the algorithm. Finally, we obtain the MLFF as the representation of a depth sequence.
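The normalize-concatenate-reduce pipeline can be sketched as below; a minimal sketch using SVD-based PCA, with the function name and L2 normalization as our own illustrative choices.

```python
import numpy as np

def fuse_and_reduce(H, V, out_dim):
    """Normalize each feature, concatenate as F = [V~, H~], and reduce
    the dimension with PCA (via SVD of the centered data).

    H: (n_samples, d1) low-level features (e.g. HON4D).
    V: (n_samples, d2) high-level features (e.g. SNV).
    """
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    F = np.hstack([Vn, Hn])            # fused feature F = [V~, H~]
    Fc = F - F.mean(axis=0)            # center before PCA
    # rows of Vt are the principal directions, ordered by variance
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:out_dim].T         # low-dimensional MLFF
```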
The MLFF descriptors have the following advantages: (1) they are more robust and discriminative than previous representations; (2) with a lower dimension, MLFF greatly improve the running speed of the algorithm while also increasing the recognition rate.

Classification
To perform classification with the designed features, we feed the MLFF to the FBLS-MD classifier, which accelerates training by using the block matrix inversion lemma to decompose the large matrix inversion. The details of the algorithm are introduced next.
Given the input data set X containing N samples, each with M dimensions, the output matrix Y belongs to R^{N x C}. The ith group of mapped features Z_i is obtained according to

Z_i = φ(X W_{e_i} + β_{e_i}), i = 1, . . . , n,

where the weights W_{e_i} and biases β_{e_i} are randomly generated with the proper dimensions. Note that a different function φ can be chosen for each group of mapped nodes (MN). All the MN are collected as Z^n = [Z_1, . . . , Z_n], the concatenation of the n groups of mapped features. Furthermore, the mth group of EN can be written as

H_m = ξ(Z^n W_{h_m} + β_{h_m}),

and all EN are collected as H^m = [H_1, . . . , H_m]. Finally, the broad learning model can be defined as

Y = [Z^n | H^m] W^m,

where W^m = [Z^n | H^m]^+ Y contains the connecting weights of the broad structure. In the original BLS, it is computed through the ridge regression approximation of [Z^n | H^m]^+:

W^m = (λI + G^T G)^{-1} G^T Y,

where G = [Z^n | H^m] and λ is the l2-norm regularization parameter.
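The construction of the mapped and enhancement nodes can be sketched as follows. The function name is ours, and the choice of ReLU for φ and tanh for ξ is an illustrative assumption (BLS leaves these functions open).

```python
import numpy as np

def bls_feature_layer(X, n_groups, nodes_per_group, m_groups, enh_nodes, seed=0):
    """Sketch of BLS feature construction: random mapped-feature groups
    Z_1..Z_n followed by enhancement groups H_1..H_m; returns G = [Z^n | H^m]."""
    rng = np.random.default_rng(seed)
    Zs = []
    for _ in range(n_groups):
        We = rng.normal(size=(X.shape[1], nodes_per_group))
        be = rng.normal(size=nodes_per_group)
        Zs.append(np.maximum(X @ We + be, 0))        # phi: e.g. ReLU mapping
    Zn = np.hstack(Zs)                               # Z^n = [Z_1, ..., Z_n]
    Hs = []
    for _ in range(m_groups):
        Wh = rng.normal(size=(Zn.shape[1], enh_nodes))
        bh = rng.normal(size=enh_nodes)
        Hs.append(np.tanh(Zn @ Wh + bh))             # xi: e.g. tanh enhancement
    Hm = np.hstack(Hs)                               # H^m = [H_1, ..., H_m]
    return np.hstack([Zn, Hm])                       # G = [Z^n | H^m]
```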
In the FBLS-MD algorithm, the connecting weights W^m are decomposed into two parts:

W^m = [W_1; W_2],

where W_1 ∈ R^{L_1 x C}, W_2 ∈ R^{L_2 x C}, and L_1 + L_2 = L, with L the total number of mapped features and EN. Accordingly, the matrix G is decomposed into two smaller matrices:

G = [G_1 | G_2], G_1 ∈ R^{N x L_1}, G_2 ∈ R^{N x L_2}.

Based on the BLS algorithm, the weights satisfy

W^m = (λI + G^T G)^{-1} G^T Y.   (14)

Through the block matrix inversion lemma, 45 the L x L inversion in formula (14) can be computed from two smaller inversions. Writing A_11 = λI + G_1^T G_1, A_12 = G_1^T G_2, and A_22 = λI + G_2^T G_2, the connecting weights can be written as

W_2 = (A_22 - A_12^T A_11^{-1} A_12)^{-1} (G_2^T Y - A_12^T A_11^{-1} G_1^T Y),
W_1 = A_11^{-1} (G_1^T Y - A_12 W_2),

where I denotes an identity matrix of the appropriate size; only an L_1 x L_1 and an L_2 x L_2 matrix need to be inverted. The structure of FBLS-MD is shown in Figure 4. In summary, the training steps of the FBLS-MD algorithm are shown in Table 1.
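The decomposition can be checked numerically with the sketch below; this is our own reconstruction via the Schur complement, not the authors' code, and it agrees with the one-shot ridge solution.

```python
import numpy as np

def ridge_weights_direct(G, Y, lam=1e-3):
    """Original BLS: W = (lam*I + G^T G)^{-1} G^T Y, one L x L solve."""
    L = G.shape[1]
    return np.linalg.solve(lam * np.eye(L) + G.T @ G, G.T @ Y)

def ridge_weights_blockwise(G1, G2, Y, lam=1e-3):
    """FBLS-MD-style sketch: split G = [G1 | G2] and apply the block matrix
    inversion lemma, so only L1 x L1 and L2 x L2 systems are solved."""
    A11 = lam * np.eye(G1.shape[1]) + G1.T @ G1
    A12 = G1.T @ G2
    A22 = lam * np.eye(G2.shape[1]) + G2.T @ G2
    b1, b2 = G1.T @ Y, G2.T @ Y
    A11_inv_A12 = np.linalg.solve(A11, A12)
    S = A22 - A12.T @ A11_inv_A12            # Schur complement (L2 x L2)
    W2 = np.linalg.solve(S, b2 - A12.T @ np.linalg.solve(A11, b1))
    W1 = np.linalg.solve(A11, b1) - A11_inv_A12 @ W2
    return np.vstack([W1, W2])
```

The blockwise path never forms or inverts the full L x L matrix, which is where the training-time saving comes from when L is large.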

Experimental setup
The proposed method is extensively evaluated on three benchmark data sets, including MSR Action 3D data set, MSR Hand Gesture 3D data set, and 3D Action Pairs data set. For each activity, we extract its MLFF descriptors.
In the experiments, each video sequence is divided into space-time grids of size 4 x 3 x 7 in width, height, and frame number, respectively. The number of visual words used when obtaining the high-level descriptors is set to 100, and the size of the local neighborhood is 3 x 3 x 3.
We evaluate the performance of our proposed method against the state-of-the-art methods using the same experimental settings as in Yang and Tian. 37 For the three data sets, the activities of s (s = 1, 2, 3, 4, 5) randomly chosen actors are used for training, while the remaining samples are used for testing. For each s, the selection is repeated randomly five times and the average is reported as the final recognition rate. In addition, the performance of FBLS-MD is compared with the original BLS in terms of training time as the numbers of MN and EN increase gradually, verified on the whole data set. It is worth noting that the experimental results of the BLS and FBLS-MD algorithms are obtained by averaging 10 runs. All experiments are performed in MATLAB on a computer with a 3.60 GHz Intel Core i7-4790 CPU and 16 GB RAM.
Experimental results and analysis

MSR Action 3D data set. MSR Action 3D 46 is one of the most classical data sets for HAR in the related research literature. It includes 20 different actions, each performed by 10 actors two or three times. Inevitably, there are some missing and erroneous depth sequences. It is a challenging data set for HAR owing to the presence of similar actions. The specific actions in this data set are shown in Figure 5.
With the same experimental setup as in Wang et al. 46 (first five actors for training, and the rest for testing), we compare our results with the state-of-the-art methods on this data set and present the results in Table 2. This setting is much more challenging than the one used in Li et al., 48 because evaluation on the whole action set increases the chance of confusion, which often occurs when recognizing similar actions. As the results show, our method is superior to the other classical methods. The confusion matrix is shown in Figure 6. It can be observed that our method yields a significant improvement in recognizing the ''hand catch'' and ''forward punch'' actions compared with the results of Yang and Tian. 37

We then evaluate the performance of our proposed descriptors. Table 3 compares the MLFF descriptors with the single-feature methods on this data set. The results show that our descriptors provide a more powerful representation than HON4D or SNV. The fewer the training samples, the more noticeable the improvement in recognition rate: MLFF show a gain in classification accuracy of nearly 10% when s is 1 or 2. In addition, the standard deviations of our method are smaller than those of the other methods, which means MLFF are more robust.
Next, we verify the validity of the FBLS-MD classifier. In the third experiment on this data set, we compare the performance of the FBLS-MD classifier with the four other classifiers. The results are shown in Table 4 and Figure 7. From the experimental results with the same feature set and varied classifiers, we can see that BLS and FBLS-MD are remarkably good at distinguishing activities. Notably, only our method exceeds 90% when s is 5.
Finally, the performance comparison of BLS and FBLS-MD on the MSR Action 3D data set is shown in Table 5 (where Ratio refers to the proportion of training time saved). When the number of feature nodes is small, the reduction in training time is not obvious, but as the number of mapped nodes increases, the training time shrinks greatly while the recognition rate is maintained. As can be seen, once the feature nodes reach a certain number, the recognition rate reaches its highest value and then decreases slightly.
MSR Hand Gesture 3D data set. MSR Hand Gesture 3D 24 consists of 12 dynamic American Sign Language (ASL) gestures captured by a Kinect device. The whole data set contains 333 depth sequences and exhibits self-occlusions. Each gesture is performed two or three times by each of 10 actors, and the depth map size varies between gestures. Some samples are shown in Figure 8.
In the experiments, we conducted leave-one-subject-out cross-validation (LOO-CV), as in the study by Wang et al., 51 to evaluate the performance of our algorithm. Table 6 compares our proposed method with the state-of-the-art methods on this data set. Our method improves on the single-feature methods HON4D and SNV by 4.49% and 2.19%, respectively, outperforms all compared approaches, and achieves a high recognition rate of 96.05%. The confusion matrix under the experimental setup of Wang et al. 46 (the first five actors for training and the rest for testing) is shown in Figure 9, and Figure 10 gives the confusion matrix of the method of Yang and Tian 37 under the same setup. From the two confusion matrices, it is clear that the ''blue,'' ''finish,'' ''green,'' ''hungry,'' ''milk,'' ''j,'' and ''z'' gestures are identified more precisely, and the overall recognition rate is increased by nearly 7%.
We then compare our proposed method with the single-feature methods, and find that it outperforms those presented in Table 7 by a large margin with small samples. Compared with the best of the other methods, our method achieves a 2.59% improvement in recognition rate when s = 4 and more than 4% improvement in the other cases. Table 8 and Figure 11 show the performance comparison of the five classifiers. Although the result of ELM is slightly higher than that of BLS and FBLS-MD when s equals 3, the overall results show that FBLS-MD significantly outperforms the other classifiers. The performance comparison of the BLS and FBLS-MD algorithms on the MSR Hand Gesture 3D data set is given in Table 9, which again shows that FBLS-MD is useful for raising the training speed.

3D Action Pairs data set. The actions in the 3D Action Pairs data set 10 are paired activities captured by a depth camera. This data set contains 12 activities performed by 10 actors, each actor performing three times. Some of them are shown in Figure 12. Each pair of activities has similar movements and shapes. The challenge of this data set is that some activities differ only in their temporal order, such as picking up and dropping. Therefore, the temporal order of the frames is one of the most important factors in activity recognition on this data set.
As shown in Table 10, our proposed method outperforms the state-of-the-art methods on this data set. Table 11 indicates that the single-feature methods are still inferior to our method on this third data set. With fewer training samples, our method achieves a greater improvement: its accuracies are 5.25%, 2.36%, 0.95%, 0.64%, and 0.89% higher than the best single-feature method when s is 1, 2, 3, 4, and 5, respectively. From Table 11, we can also see that the HON4D feature is more suitable than the SNV feature on the 3D Action Pairs data set, which again confirms the complementarity of the two features across data sets.
As presented in Table 12 and Figure 13, our method is superior to the other classifiers. However, on this third data set, the Softmax, SVM, ELM, BLS, and FBLS-MD classifiers show only slight differences in classification performance; the recognition rates differ by less than 3%. Table 13 compares the performances of the BLS and FBLS-MD algorithms on the 3D Action Pairs data set. On this data set, as the feature nodes continue to increase, the training time decreases significantly while the recognition rate reaches its maximum and no longer drops. This again proves the value of FBLS-MD when large-scale feature nodes are required to train models.

Discussion
Our proposed method obtains the highest performance even when the training samples are small. This may be attributed to two factors. First, our fused features exploit the complementarity of HON4D and SNV, thus increasing the recognition accuracy. Second, the comparison experiments show that FBLS-MD performs very favorably against other commonly used classifiers. In addition, FBLS-MD is demonstrated to effectively shorten the training time when the computational burden is increased by a large number of feature nodes; as the feature nodes increase, the cost of inverting the growing matrix becomes ever larger and matrix decomposition plays a vital role. The large standard deviations of some experimental results are attributed to the large individual differences in the data sets and the missing data in the first two data sets. The adjustable parameters in FBLS-MD include the number of feature nodes per window, the number of windows of feature nodes, the number of EN, the l2 regularization parameter, and the shrinkage scale of the EN. In this article, we conducted extensive experiments by keeping some of the parameters fixed and continually adjusting the numbers of MN and EN to obtain the best recognition results. The relationships between the recognition rate and the MN or EN of FBLS-MD on the three data sets are shown in Figure 14, where the first two data sets are evaluated with the same experimental setting as in the study by Wang et al. 46 and the third with the setting of Oreifej and Liu. 10 The results demonstrate that (a) as MN and EN increase, the recognition results improve; however, once the number of nodes reaches a certain value, the recognition rate peaks and then gradually declines; and (b) too few EN and MN lead to a low recognition rate, while excessively abundant EN and MN incur additional computation.
Therefore, we set the MN and EN to 800-6000, 400-3000, and 400-6000 for the three data sets, respectively. In addition, the l2 regularization parameter and the shrinkage scale of the EN are set to 0.05 and 0.9, respectively.

Conclusion
In this article, we have presented a new method for HAR with depth videos, consisting of our proposed MLFF features and FBLS-MD. The MLFF descriptors are designed to describe spatiotemporal and motion information more richly, and they are robust to noise and occlusion. FBLS-MD is proposed to effectively reduce the training time while obtaining satisfactory classification results. Extensive experiments on three benchmark data sets verify the effectiveness of our method. The experimental results show that our method outperforms state-of-the-art methods, and that it holds a particular advantage with a small training set.