A Spatiotemporal Robust Approach for Human Activity Recognition

Human activity recognition is now considered one of the fundamental topics in computer vision and in related areas such as human-robot interaction. In this work, a novel method is proposed that uses the depth and optical flow motion information of human silhouettes from video for human activity recognition. The method applies enhanced independent component analysis (EICA) to depth silhouettes, combines the result with optical flow motion features, and uses hidden Markov models (HMMs) for recognition. Local features are extracted from a collection of depth silhouettes exhibiting various human activities. Optical flow-based motion features are also extracted from the depth silhouette area and augmented with the local features to form spatiotemporal features. Next, the augmented features are enhanced by generalized discriminant analysis (GDA) for better activity representation. These features are then fed into HMMs to model and recognize human activities. The experimental results show that the proposed approach outperforms the conventional ones.


Introduction
As human activity recognition (HAR) technology provides computers with a way of sensing human activity information through video cameras, it can contribute significantly to consumer systems by responding to users' conditions [1]-[3]. Recently, HAR has received considerable attention in the field of robotics as well as in the computer vision research community, as robots can cooperate with humans in many applications, such as smart home healthcare [4], [5].
Binary silhouettes are the most widely used representation for HAR from which useful features are derived [6]-[9]. In [6] and [7], the authors applied binary silhouette-based features to represent some human activities for recognition. In [8], the author applied a binary body lookup table to represent the 3-D poses of human movements. In [9], the author utilized binary silhouettes to recover distinguished 3-D human body poses. Binary silhouettes have been applied in gait analysis as well [10], [11]. However, binary silhouettes are inadequate for representing the human body in activity videos because of their two-level pixel value distribution: they cannot discern the far and near parts of the human body. In this regard, depth information-based whole-body representation can be a good choice for HAR, as mentioned in [12]. Depth information has also been applied in gesture recognition [13], [14] and 3-D mesh construction [15]. Furthermore, it would be wise to combine motion features (i.e., optical flows) with the spatial features, as each activity contains motion information that differs from that of every other activity. Thus, by augmenting the local silhouette features with the optical flow features, one should be able to obtain a robust HAR system. For motion information-based HAR, several works have been published describing various human activities [5], [6], [16], [17]. In [16], the authors applied optical flows to compute the likelihood of the observed activity. In [5] and [6], the authors used optical flows for human activity feature representations. In [17], the author utilized optical flows to extract features from the activity video frames. However, these studies on motion information have seen limited success in HAR.
Generalized discriminant analysis (GDA), a nonlinear approach for separating patterns of different classes, has recently been used in many pattern analysis applications, where it outperforms linear discriminant analysis (LDA), which classifies the samples linearly [18], [19]. Hence, GDA may be useful in obtaining robust features for better HAR.
Among the various applications in which hidden Markov models (HMMs) have been applied successfully to recognize complex time-sequential events, HAR is one active area that utilizes time-sequential information from video to recognize various human activities [4]-[6], [9], [12]. In an HMM, the underlying process is usually not directly observable, but it can be observed through another set of stochastic processes that produce observations. The common method for HAR using HMMs is to model key features from time-sequential activity images, in which various activities are represented as time-sequential silhouettes. Once the silhouettes are extracted from the video, each activity is recognized by comparison with the trained activity features. Feature extraction, training and recognition therefore play the main roles in HAR. The HMM is adopted in this work because it is considered to be a robust tool for modelling time-sequential information, such as activity video.
In this work, enhanced independent component analysis (EICA) is applied first in order to extract prominent features from the silhouettes. In contrast to principal component analysis (PCA), which produces global features based on second-order statistics, EICA is a higher-order statistical technique that extracts prominent local features from depth silhouettes [12]. Next, once the depth features have been shown to be effective for HAR, the silhouette features are augmented with the optical flow features.
Finally, generalized discriminant analysis (GDA) is applied on the augmented features and combined with HMMs to build a robust HAR system. The proposed system is compared against binary silhouette-based systems and outperforms them. One of the aims of the proposed HAR system is to be used in smart environments to monitor and recognize general human activities, which should allow the continuous daily, monthly and yearly analysis of human activity patterns, habits and requirements.
The rest of this work is structured as follows. Section 2 presents the methodology of the proposed system, Section 3 the experimental setups as well as the recognition results utilizing different feature-extraction approaches, and Section 4 the concluding remarks.

Proposed HAR Methodology
The proposed HAR system starts with the processing of the depth silhouette and optical flow information from the time-sequential activity video images. Fig. 1 shows the basic steps of the proposed HAR system.

Silhouette Acquisition and Pre-processing
The major objective of silhouette feature extraction is to find an efficient representation of the human body posture in an optimal feature space. Here, ICA is used to extract local features from the depth silhouettes.
Let us assume that in each video a human performs a single activity. The RGB and depth images of different activities are acquired by a commercial depth camera [20], which provides both the RGB and depth images simultaneously. The depth video indicates the range of each pixel in the scene to the camera as a grey-scale value, such that shorter-range pixels have brighter values and longer-range ones have darker values. Figs. 2(a), 2(b) and 2(c) represent sample RGB, depth and binary images, respectively, from the walking activity. In the depth image, a higher pixel intensity indicates a nearer distance and a lower intensity a farther one. For instance, in Fig. 2(b), the left-hand region is brighter than the right-hand region. Thus, the different body components used in the activity can be represented effectively by the depth map and, hence, can contribute more effectively to the feature generation.
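The difference between the depth and binary representations can be illustrated with a minimal sketch. It assumes a range given in millimetres, a hypothetical working interval of 1000 mm to 3000 mm, and `None` for background pixels; none of these values come from the camera used in this work, and the function names are illustrative only.

```python
def depth_to_grey(range_mm, near_mm, far_mm):
    """Map a body pixel's range to a grey value in 1..255 (nearer is brighter)."""
    clipped = min(max(range_mm, near_mm), far_mm)
    scale = (far_mm - clipped) / (far_mm - near_mm)
    # Body pixels start at 1 so that 0 stays reserved for the background.
    return 1 + round(254 * scale)


def depth_silhouette_row(ranges_mm, near_mm=1000, far_mm=3000):
    """Convert one image row of ranges (None = background) to grey values."""
    return [0 if r is None else depth_to_grey(r, near_mm, far_mm)
            for r in ranges_mm]


def binary_silhouette_row(depth_row):
    """Collapse a depth row to the two-level binary representation."""
    return [1 if g > 0 else 0 for g in depth_row]


# Hypothetical row: background, a near hand, a farther torso, background.
row_ranges = [None, 1200, 2000, None]
depth_row = depth_silhouette_row(row_ranges)
binary_row = binary_silhouette_row(depth_row)
print(depth_row, binary_row)
```

In the binary row the hand and the torso receive the same value, whereas the depth row keeps them apart, which is the property the depth silhouettes are meant to exploit.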

Spatial Feature Extraction
For spatial feature extraction, EICA (ICA on the principal components) is applied on the depth silhouettes. After the pre-processing of the silhouette vectors, PCA is applied first, considering Q as the covariance data matrix of the silhouette vectors. To obtain the principal components (PCs), Q can be decomposed as $Q = P \Lambda P^{T}$, where P indicates the eigenvector matrix and Λ the diagonal eigenvalue matrix. Fig. 4(a) shows the first 120 eigenvalues corresponding to the first 120 eigenvectors (i.e., PCs) after applying PCA for five different activities. The eigenvalues of the later PCs approach zero, which indicates that the corresponding PCs carry negligible information for EICA. Hence, in this work, 100 PCs are selected for EICA. Fig. 4(b) shows eight PC feature images of human depth silhouettes that represent global features, such as average silhouettes. Next, ICA is applied on the PCs to focus on the local feature extraction of the silhouettes as a part of EICA. ICA is a blind source separation method that decomposes a mixture of observed variables into a linear combination of some unknown components and their mixing matrix.
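The PCA step described above can be written out a little more explicitly. The following is a sketch in standard notation rather than the exact formulation of this work: the symbols $x_k$ (the $k$th pre-processed silhouette vector), $\bar{x}$ (the mean silhouette vector), $K$ (the number of training vectors), $p_j$ (the $j$th eigenvector) and $P_m$ (the matrix of the $m$ leading eigenvectors) are introduced here for illustration.

```latex
\begin{align}
Q   &= \frac{1}{K}\sum_{k=1}^{K} (x_k - \bar{x})(x_k - \bar{x})^{T}, \\
Q   &= P \Lambda P^{T}, \qquad
       \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots), \quad
       \lambda_1 \ge \lambda_2 \ge \cdots \ge 0, \\
P_m &= [\, p_1, p_2, \ldots, p_m \,], \qquad m = 100 .
\end{align}
```

Keeping only the eigenvectors with the largest eigenvalues discards the directions whose eigenvalues are close to zero, which matches the choice of 100 PCs as the input to ICA.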
Let Y and X be the collections of the basis and input silhouette vectors, respectively. The relation between X and Y can be modelled as $X = GY$, where G is an unknown mixing matrix. The ICA algorithm then tries to find a separating matrix W such that $V = WX$, where V represents the estimated independent sources. Independent Components (ICs) are selected for EICA feature representation.

Temporal Feature Extraction
For temporal feature extraction, optical flow-based motion features are derived and augmented with the spatial depth silhouette features.
After extracting the depth images by means of the depth camera, the optical flows of the silhouette region are obtained from the consecutive depth images. For optical flow computation, the Lucas-Kanade method has been utilized [21].
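To make the optical flow step concrete, the following is a minimal single-window sketch of the basic Lucas-Kanade least-squares estimate. It is not the implementation used in this work: the function name, the window size and the synthetic test pattern are assumptions made for illustration, and a practical system would normally rely on a pyramidal implementation from an image processing library.

```python
def lucas_kanade_window(prev, curr, cx, cy, half):
    """Estimate one flow vector (u, v) for the window centred at (cx, cy).

    prev and curr are consecutive frames stored as lists of rows.
    Solves the 2x2 normal equations of Ix*u + Iy*v = -It over the window.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # spatial gradient in x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # spatial gradient in y
            it = curr[y][x] - prev[y][x]                  # temporal difference
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return 0.0, 0.0  # flat or edge-only window: flow is not recoverable
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


def pattern(x, y):
    """Smooth synthetic intensity pattern (illustrative only)."""
    return x * x + 2.0 * y * y + x * y


def make_frame(size, centre, dx=0.0, dy=0.0):
    """Frame showing `pattern` translated by (dx, dy) pixels."""
    return [[pattern(x - centre - dx, y - centre - dy) for x in range(size)]
            for y in range(size)]


prev_frame = make_frame(21, 10)
curr_frame = make_frame(21, 10, dx=0.2, dy=-0.1)
u, v = lucas_kanade_window(prev_frame, curr_frame, cx=10, cy=10, half=5)
print(u, v)  # should be close to the true shift (0.2, -0.1)
```

Applying such an estimate at each location inside the silhouette region yields a field of flow vectors, which can then be stacked into a motion feature vector for the frame.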

Spatiotemporal Features
Since the EIC features represent the local spatial features of the depth silhouettes and the PC features of the optical flows represent the motion-based temporal features, the two feature sets for a frame can be augmented together to obtain more robust features. The augmented motion and depth silhouette features of the $i$th frame can thus be represented as the concatenation of that frame's EIC silhouette feature vector and its PC-based optical flow feature vector.

Activity Modelling, Training and Recognition
An HMM is a collection of finite states connected by transitions, where every state contains a transition probability to another state and a symbol observation probability. In an HMM, the underlying hidden process is observed through another set of stochastic processes that produce observation symbols. The basic theory of HMMs was developed by Baum et al., and it has been applied successfully in many applications [22]. An HMM is denoted as $H = \{\Xi, \pi, \tau, \zeta\}$, where Ξ represents the possible states, π the initial probability of the states, τ the state transition probability matrix, and ζ the observation symbols' probability matrix.
Before applying a discrete HMM for HAR, each activity image is symbolized through a codebook generated from the training feature vectors. As such, an efficient codebook of vectors is generated first using vector quantization from the training feature vectors. To recognize the five activities, a codebook with a size of 32 is generated using Linde, Buzo and Gray's (LBG) [22] clustering algorithm. The index numbers of the codebook vectors are then used as symbols to apply on the HMM. Thus, the symbols are the observations, denoted by $O$.
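The symbolization step can be sketched as a nearest-codeword lookup. The example below assumes that a codebook has already been trained (for instance by LBG, which is not shown) and uses a hypothetical four-entry, two-dimensional codebook in place of the 32-entry codebook described above; the function names are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature, codebook):
    """Return the index of the codebook vector nearest to `feature`."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(feature, codebook[k]))


def symbolize(feature_sequence, codebook):
    """Turn a sequence of per-frame feature vectors into HMM symbols."""
    return [quantize(f, codebook) for f in feature_sequence]


# Hypothetical toy codebook and per-frame features (not from the paper).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_frames = [[0.1, 0.1], [0.9, 0.2], [0.8, 0.9]]
symbols = symbolize(toy_frames, toy_codebook)
print(symbols)
```

Each video thus becomes a sequence of integer indices, which is the form of input a discrete HMM expects.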
In HMM learning, each HMM corresponding to a distinct activity is optimized by the symbol sequences obtained from the training image sequences of that activity. Thus, each activity is represented by a distinct trained HMM, and for N activities there will be a dictionary of N trained HMMs. Fig. 7 represents the [...]
To recognize an activity in a depth video, an observation symbol sequence is obtained and applied on all the trained HMMs to calculate the likelihood, and the HMM with the highest probability is chosen. For instance, to test an observation symbol sequence $O$, the appropriate HMM is found as $\arg\max_{i=1,\ldots,N} \Pr(O \mid H_i)$, where $H_i$ denotes the $i$th trained HMM.
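The recognition rule above can be sketched with the forward algorithm, which computes the likelihood of a symbol sequence under a discrete HMM. The two toy models below, their parameter values and the activity labels attached to them are hypothetical and serve only to show the mechanics; they are not the trained models of this work.

```python
def forward_likelihood(obs, pi, trans, emit):
    """Pr(obs | HMM) for a discrete HMM, computed by the forward algorithm.

    pi[s]       : initial probability of state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][k]  : probability of emitting symbol k in state s
    """
    n = len(pi)
    alpha = [pi[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


def recognize(obs, models):
    """Return the label of the model with the highest likelihood for obs."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Hypothetical two-state, two-symbol left-to-right models.
toy_models = {
    "walking": ([1.0, 0.0],
                [[0.5, 0.5], [0.0, 1.0]],
                [[0.9, 0.1], [0.2, 0.8]]),
    "sitting down": ([1.0, 0.0],
                     [[0.5, 0.5], [0.0, 1.0]],
                     [[0.1, 0.9], [0.8, 0.2]]),
}
test_sequence = [0, 1, 1]
print(recognize(test_sequence, toy_models))
```

For long symbol sequences the forward probabilities become very small, so a practical implementation would typically use scaling or log-probabilities; that refinement is omitted here for brevity.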

Experimental Setup and Results
The depth silhouette-based activity database consisted of five activities (viz., walking, running, skipping, sitting down and standing up). For training and testing, 15 and 40 videos of different lengths were used, respectively.
The silhouette-based HAR was tried first. Four different feature extraction methods (i.e., PCA, LDA on PCA, general ICA and EICA) were utilized to evaluate their performances on the binary and depth silhouette-based activity recognition, respectively. Table 1 shows the recognition results of the binary silhouette-based experiments utilizing the different feature extraction approaches, in which the EICA features show the highest recognition rate (i.e., 84.50%). Next, the depth silhouette-based methods were evaluated with the same feature extraction techniques as the binary silhouette-based ones. Utilizing the optical flow features with HMM, a good mean recognition rate (i.e., 90.50%) was obtained. In addition, applying PCA on the optical flow vectors yielded a better mean recognition rate (i.e., 92%). Table 3 demonstrates the recognition results based on the optical flow features only, as mentioned above.
Finally, the EIC silhouette features were augmented with the PC-based motion features and further enhanced by GDA before being applied with HMM, which achieved the highest recognition rate (i.e., 98%) among all the approaches. Table 4 shows the results using the proposed spatiotemporal features, indicating their superiority over the other feature extraction approaches for robust HAR.

Conclusion
In this paper, a novel approach has been proposed for robust HAR from video based on depth silhouette and optical flow motion features. The spatiotemporal (i.e., silhouette and motion) feature extraction approach with HMM provides a higher recognition rate than the conventional ones, and so indicates a robust HAR system. The proposed HAR system can be adapted to various environments for smart human-computer interaction.

Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2008-0061908).