Deep aligned feature extraction for collaborative-representation-based face classification with group dictionary selection

Face recognition plays an important role in many robotic and human–computer interaction systems. In recent years, sparse-representation-based classification and its variants have drawn extensive attention in compressed sensing and pattern recognition. For image classification, one key to the success of a sparse-representation-based approach is to extract consistent image feature representations for images of the same subject captured under a wide spectrum of appearance variations, for example, in pose, expression and illumination. These variations can be categorised into two main types: geometric and textural. To eliminate the difficulties posed by different appearance variations, this article presents a new collaborative-representation-based face classification approach using deep aligned neural network features. To be more specific, we first apply a facial landmark detection network to an input face image to obtain its fine-grained geometric information in the form of a set of 2D facial landmarks. These facial landmarks are then used to perform 2D geometric alignment across different face images. Second, we apply a deep neural network for facial image feature extraction, owing to the robustness of deep image features to a variety of appearance variations. We use the term deep aligned features for this two-step feature extraction approach. Last, a new collaborative-representation-based classification method is used to perform face classification. Specifically, we propose a group dictionary selection method for representation-based face classification to further boost the performance and reduce the uncertainty in decision-making. Experimental results obtained on several facial landmark detection and face classification data sets validate the effectiveness of the proposed method.


Introduction
In robotic and human–computer interaction systems, it is crucial for a computer to know information about a user, such as identity, gender, age, behaviour and emotion. [1][2][3][4][5] Among these labels/attributes, the identity of a user might be the most important information for a robot. With this information, the robot can provide customised services and thus improve the user's experience significantly. To achieve this goal, a facial recognition system can be deployed on the robot or on a cloud server. For example, as shown in Figure 1, a healthcare robot can capture the facial image of a patient and send the image to a cloud server via a high-speed 5G wireless communication network. The cloud server then processes the received facial image by searching the database and sends the identity and other relevant information of the patient back to the robot. With the information delivered by the cloud server, the robot can provide customised advice or services to the patient. For example, the robot can remind a patient with hypertension to take her/his medicine or show the latest diagnosis report to the patient.
A typical face recognition or classification system has two main steps: feature extraction and face matching. For feature extraction, the use of deep neural networks (DNNs) has become mainstream in recent years owing to their outstanding performance in extracting robust facial features. 6,7 Once we obtain the DNN features of a user, we can perform face matching by comparing these facial features with those of all the existing users in a database. To perform face matching, the sparse-representation-based classification (SRC) method has been widely and successfully used in recent years. SRC has also drawn extensive attention in a variety of signal processing and image analysis applications, for example, signal encoding, image compression, feature representation, video analysis and image classification. [8][9][10][11][12][13][14][15][16] For face matching or classification, the key idea of SRC is to obtain a high-fidelity representation of a test sample using a dictionary with sparsity constraints, leading to promising classification results. To be more specific, SRC aims to reconstruct a test sample using a dictionary consisting of all the training samples of all the classes, while the reconstruction coefficients are regularised by the ℓ1 norm. The label of the class with the smallest reconstruction error is assigned to the test sample. However, it is time-consuming to solve such an ℓ1-regularised optimisation problem. To address this issue, Song et al. proposed a quadratic optimisation method with dimensionality reduction to improve the speed of SRC. 7 Zhang et al. proposed the collaborative-representation-based classification (CRC) method, 17 in which the representation coefficients are regularised by the ℓ2 norm. In contrast to SRC, the reconstruction coefficients of CRC can be computed very efficiently with a closed-form solution. Additionally, CRC achieves better face classification performance in terms of accuracy. Note that both SRC and CRC use a dictionary that consists of all the samples of all the classes. In this case, all the training samples contribute to the reconstruction of a new test sample, and a representation coefficient vector is obtained by solving the ℓ1- or ℓ2-regularised optimisation problem across all the classes. In contrast to SRC and CRC, Naseem et al. proposed a class-specific method, namely linear-regression-based classification (LRC), in which the training samples of each class are used to form a class-specific dictionary. These class-specific dictionaries are used to reconstruct a new test sample separately and the label of the one with the smallest reconstruction error is assigned to the test sample. In this article, we propose a new CRC-based face classification framework owing to the efficiency as well as the promising classification accuracy of CRC.
Despite the success of the aforementioned representation-based classification algorithms, it is still a very challenging task to perform robust face classification under unconstrained scenarios in the presence of a wide spectrum of appearance variations, for example, in pose, expression, illumination and occlusion. To address these issues and further strengthen the representative and discriminative capabilities of a representation-based classification method, a number of approaches have been developed in recent years. For example, Deng et al. 18 proposed an extended SRC method (ESRC) for face classification, in which an auxiliary intraclass variant dictionary is used to address the small sample size problem as well as the difficulties posed by occlusion and illumination variations. Yang et al. 19 and Zhu et al. 20 introduced feature similarity and distinctiveness to present a more general model of CRC. Xu et al. proposed an efficient SRC method by means of an improved norm minimisation. 21,22 Guo et al. proposed two weighted discriminative collaborative competitive representation methods with ℓ1-norm fidelity for robust image classification. 23 To improve the performance of a representation-based method for pose-invariant face classification, Song et al. proposed to use a 3D morphable face model for dictionary augmentation, in which a 3D face model is fitted to a 2D face image and the reconstructed 3D face is used to synthesise auxiliary 2D faces with different views. 11 Mounsef and Karam proposed an augmented SRC framework to improve the performance of the conventional SRC method in the presence of different facial appearance variations. 24 One common characteristic of these improved algorithms is that they all try to eliminate the difficulties posed by either textural or geometric appearance variations. To further address this issue and improve the performance of a representation-based face classification method, this article presents a new framework using deep aligned neural network features and a group dictionary selection (GDS) strategy.
As shown in Figure 2, the proposed framework includes three steps: geometric face alignment, robust deep textural feature extraction and collaborative-representation-based face classification with GDS.
The first step, geometric face alignment, is used to address the difficulties posed by geometric appearance variations such as expression, pose and other rigid transformations (translation, scale and rotation). Face alignment, also known as facial landmark detection, plays a very important role in many facial image analysis tasks, for example, face recognition, face tracking, face animation and 3D face modelling (https://www.nist.gov/programs-projects/facerecognition-grand-challenge-frgc). [25][26][27] As a preprocessing stage, geometric face alignment influences the performance of a facial image analysis application to a great extent. 6 However, it is still very challenging to perform robust face alignment under unconstrained scenarios in the presence of large pose variations, abrupt illumination changes, extreme facial expressions and heavy occlusions. To improve the accuracy of face alignment, cascaded shape regression has been proposed and has become very popular in recent years. [28][29][30][31][32] However, cascaded shape regression usually relies on hand-crafted features and weak regressors, which cannot address the difficulties posed by appearance variations very well. More recently, DNNs have become the dominant approach in face alignment. [33][34][35][36][37][38] For example, Feng et al. proposed a Wing loss function for convolutional neural network (CNN)-based face alignment, which improves the performance of regression-based face alignment with CNNs significantly. 39 In this article, we use a modified regression visual geometry group (VGG) architecture to obtain fine-grained geometric facial features in the form of a set of 2D facial landmarks. These facial landmarks are then used to perform 2D geometric normalisation across different face images, using the piecewise affine warp method.
Apart from geometric facial image alignment/normalisation, robust textural facial image feature extraction methods have also been developed to enhance the performance of representation-based classification algorithms.
A number of studies have demonstrated that robust image feature extraction methods can improve the performance of pattern classification significantly. Classical representation-based classification methods, such as SRC, CRC and LRC, are usually based on raw image intensities and thus perform poorly in unconstrained scenarios. To address this issue, many robust image feature descriptors, for example, local binary patterns 40,41 and Gabor features, 42,43 have been used in representation-based face classification and have demonstrated significant improvements in accuracy. More recently, with the great success of DNNs, CNNs have been proven to be very effective in extracting robust image features for a variety of image classification tasks. 6,[44][45][46] However, the use of deep image features in the representation-based classification paradigm has been less investigated by the community. Most existing deep-learning-based image classification methods simply use the nearest neighbour classifier. This motivates us to further explore the use of deep CNN features for representation-based face classification. To this end, we extract deep CNN features from geometrically aligned facial images for CRC-based face classification.
Classical representation-based classification methods, such as SRC and CRC, try to find a representation coefficient for a new test sample using a dictionary consisting of all the training samples of all the classes. In this case, all the classes contribute to the reconstruction of the test sample, which brings uncertainty in decision-making. Although the use of geometric face alignment and DNN features is able to reduce the uncertainty to some extent, the information of a dictionary consisting of all the training samples is redundant. To obtain a compact dictionary and reduce the uncertainty in decision-making, we propose a GDS approach. The proposed GDS approach has two steps. In the first step, we use the classical CRC method to calculate the representation coefficients. Then a measure is used to calculate the response of each class and only the ones with higher responses are selected to form a new dictionary for the final decision-making step. Experimental results demonstrate that the use of our proposed GDS method improves the accuracy of face classification further.
To summarise, the main contributions of the proposed method include: a new collaborative-representation-based face classification framework using robust deep aligned image features; an effective GDS approach that reduces information redundancy and uncertainty in decision-making; and promising experimental results for both face alignment and face classification on several well-known face benchmarking data sets.
In the next section, we first introduce the classical CRC method, which is the foundation of the proposed algorithm. The details of the proposed method are then presented in the 'Proposed framework' section. To validate the performance of the proposed method, we report the experimental results obtained on several well-known face alignment and face classification data sets in the 'Experimental results' section. Last, in the 'Conclusion and future work' section, we draw conclusions and outline our future work.

Background
In this section, we introduce the classical CRC method, which is the foundation of the proposed method in the next section.
Given K classes, each with M training samples, we can form a dictionary $X = [x_{1,1}, \ldots, x_{K,M}] \in \mathbb{R}^{P \times KM}$, where P is the dimensionality of each training sample and KM is the total number of samples, also known as atoms, of the dictionary. With this dictionary, a new test sample $y \in \mathbb{R}^{P}$ can be approximated by a linear combination of all the samples in the dictionary

$$ y \approx X\alpha = \sum_{k=1}^{K} \sum_{m=1}^{M} a_{k,m}\, x_{k,m} \quad (1) $$

where $a_{k,m}$ is the entry of the coefficient vector corresponding to the mth training sample of the kth class, that is, $x_{k,m} \in \mathbb{R}^{P}$. The entry $a_{k,m}$ indicates the response of the corresponding training sample to the representation of the test sample y. To obtain the reconstruction of the test sample in equation (1), CRC solves the optimisation problem regularised by the ℓ2 norm

$$ \hat{\alpha} = \arg\min_{\alpha} \; \|y - X\alpha\|_2^2 + \lambda \|\alpha\|_2^2 \quad (2) $$

where $X \in \mathbb{R}^{P \times KM}$ is the dictionary matrix consisting of all the training samples of all the classes and $\alpha = [a_{1,1}, \ldots, a_{K,M}]^T$ is the coefficient vector estimated by solving the ℓ2-norm regularised minimisation problem. The optimisation in equation (2) is a typical regularised least squares problem, which has the closed-form solution

$$ \hat{\alpha} = (X^T X + \lambda I)^{-1} X^T y \quad (3) $$

where $\lambda$ is a small positive constant controlling the influence of the ℓ2 regularisation term, and I is the identity matrix.
Once the coefficient vector is obtained, we can measure the propensity of the kth class to the representation of the test sample as

$$ c_k = X_k \alpha_k \quad (4) $$

where $X_k = [x_{k,1}, \ldots, x_{k,M}]$ and $\alpha_k = [a_{k,1}, \ldots, a_{k,M}]^T$ are the sub-dictionary and coefficient sub-vector associated with the kth class, and $c_k$ is the signal of the test sample reconstructed using the training samples of the kth class. Then we can measure the reconstruction error of the test sample using the kth class as

$$ e_k = \|y - c_k\|_2 \quad (5) $$

which describes the dissimilarity between the test sample and the kth class. Last, the label of the test sample y is assigned as the label of the class with the smallest reconstruction error

$$ \text{label}(y) = \arg\min_{k} e_k \quad (6) $$

In spite of the success of CRC-based approaches in face classification, there are some issues with the existing CRC methods. One key issue is that a CRC-based face classification method is sensitive to the geometric variations of a human face. To mitigate this issue and further improve the face classification accuracy, we propose a new framework for CRC-based face classification in the next section.
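For concreteness, the following minimal NumPy sketch illustrates the classical CRC classifier described above; the variable names and the regularisation value are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def crc_classify(X, labels, y, lam=0.001):
    """Classical CRC: X is a P x KM dictionary (columns are training samples),
    labels is a length-KM array of class labels, y is a P-dim test sample."""
    P, KM = X.shape
    # Closed-form solution of the l2-regularised least squares problem, eq. (3)
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(KM), X.T @ y)
    errors = {}
    for k in np.unique(labels):
        idx = labels == k
        c_k = X[:, idx] @ alpha[idx]         # class-specific reconstruction, eq. (4)
        errors[k] = np.linalg.norm(y - c_k)  # reconstruction error, eq. (5)
    # Assign the label of the class with the smallest reconstruction error, eq. (6)
    return min(errors, key=errors.get)
```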

The proposed framework
To improve the accuracy of the classical CRC-based face classification method, we propose a new geometrically aligned facial feature extraction method in this section. In addition, we present a novel GDS method for a further performance boost. The proposed CRC-based face classification framework using deep aligned features (DAFs) and GDS has three main steps: geometric face alignment, deep textural image feature extraction and CRC-based face classification with GDS. The pipeline of the proposed method is depicted in Figure 2. To perform robust geometric face alignment, we first use a deep convolutional neural network to predict fine-grained 2D facial landmarks. Then the piecewise affine warp method is applied to both training and test images for geometric face alignment using the predicted 2D facial landmarks. Next, we use another deep convolutional neural network to extract robust facial features. Last, the deep aligned facial features are used to perform face classification using a novel CRC-based classification method with GDS.

Geometric face alignment
To perform geometric face alignment, we first detect the 2D facial landmarks of each training or testing image, using a state-of-the-art deep convolutional neural network, that is, the VGG face model. 47 However, the classical VGG face model is designed for the image classification task, whereas facial landmark detection is a regression task. To meet the requirements of regression-based 2D facial landmark detection, we modify the classical VGG network architecture. The classical VGG network has 12 convolution layers, 5 max pooling layers, 3 fully connected layers and a softmax layer. In VGG, each convolution layer is followed by a ReLU non-linear activation layer. We replace the softmax layer and the last fully connected layer with a densely connected regression layer. The architecture of the modified regression VGG network is shown in Figure 3. As depicted in the figure, the input of the network is a $224 \times 224 \times 3$ colour image, I, and the output is a $136 \times 1$ vector, $s = [x_1, y_1, \ldots, x_{68}, y_{68}]^T$, consisting of the 2D coordinates of 68 facial landmarks.
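As an illustration, the PyTorch sketch below shows one way to convert a VGG-style classification network into a landmark regression network by replacing the final classifier with a 136-dimensional regression layer; the use of torchvision's VGG-16 as a stand-in backbone and the hidden-layer sizes are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGLandmarkRegressor(nn.Module):
    """VGG backbone with the softmax classifier replaced by a regression head
    that predicts 68 (x, y) facial landmark coordinates (136 values)."""
    def __init__(self, num_landmarks=68):
        super().__init__()
        vgg = models.vgg16()          # convolutional + pooling backbone (random init)
        self.features = vgg.features
        # Keep two fully connected blocks, end with a densely connected regression layer
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, 2 * num_landmarks),
        )

    def forward(self, x):                         # x: (N, 3, 224, 224)
        return self.regressor(self.features(x))  # (N, 136)
```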
To train the facial landmark detection network, we use the 300W face data set. 48 Each face image in the 300W data set has 68 manually annotated 2D facial landmarks. More details of the 300W data set are given in the 'Experimental results' section. As the L2 loss function is sensitive to outliers, 33,49 we use the L1 loss function for network training

$$ L = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{68} \left( |dx_{n,i}| + |dy_{n,i}| \right) \quad (7) $$

where $(dx_{n,i}, dy_{n,i})$ are the differences between the ground-truth coordinates and the coordinates predicted by the network for the ith facial landmark of the nth sample in each mini-batch, and N is the mini-batch size. To train the neural network, we use the stochastic gradient descent optimisation method with momentum. The network is trained for 400 epochs and we set the mini-batch size to N = 16.
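A minimal training-loop sketch for this loss is given below; it assumes the hypothetical `VGGLandmarkRegressor` class from the previous sketch, a data loader yielding image batches with 136-dimensional landmark targets, and an illustrative learning rate.

```python
import torch
import torch.nn as nn

model = VGGLandmarkRegressor()
criterion = nn.L1Loss()  # mean absolute error over predicted landmark coordinates
optimiser = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

def train_one_epoch(loader):
    model.train()
    for images, landmarks in loader:  # images: (16, 3, 224, 224), landmarks: (16, 136)
        optimiser.zero_grad()
        loss = criterion(model(images), landmarks)
        loss.backward()
        optimiser.step()
```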
After training the facial landmark detection network, we apply it to all the training and testing samples. Then we apply the piecewise affine warp method to perform geometric face image alignment/normalisation. For more details of the piecewise affine warp method, readers are referred to Matthews and Baker. 50 To be more specific, we map the texture of an input face image, I, from its original shape, s, to the mean shape, $\bar{s}$. The mean shape is calculated using the facial landmarks of all the training samples in a dictionary

$$ \bar{s} = \frac{1}{KM} \sum_{k=1}^{K} \sum_{m=1}^{M} s_{k,m} \quad (8) $$

where $s_{k,m}$ is the predicted 2D facial landmark vector of the mth sample in the kth class, obtained using our facial landmark detection network. Some examples of the face image normalisation step using the piecewise affine warp are shown in Figure 4. To maintain the background of an input face image, 4 anchor points are added to the obtained 68 landmarks for the piecewise affine warp, as shown in the figure. We use the term I′ for a geometrically aligned face image. From the figure, we can see that the original input images of the same subject may vary significantly owing to different geometric variations, such as the pose variation in the first column and the scale variations in the third column. These geometric variations may cause difficulties in CRC-based face classification. After geometric face alignment, the facial parts with the same semantic meanings are very well aligned, and thus the robustness and accuracy of a face classification method can be improved.
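As a rough illustration of this normalisation step, the sketch below warps a face image from its detected landmarks to the mean shape using scikit-image's piecewise affine transform; the corner anchor points and the output size are illustrative assumptions.

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def align_to_mean_shape(image, landmarks, mean_shape, size=224):
    """Warp `image` so that its 68 detected landmarks move onto `mean_shape`.
    landmarks, mean_shape: (68, 2) arrays of (x, y) coordinates."""
    h, w = image.shape[:2]
    # Add 4 corner anchor points so the background is preserved by the warp
    corners_src = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]])
    corners_dst = np.array([[0, 0], [size - 1, 0], [0, size - 1], [size - 1, size - 1]])
    src = np.vstack([landmarks, corners_src])
    dst = np.vstack([mean_shape, corners_dst])
    # warp() needs a transform mapping output (mean-shape) coordinates back to
    # input (original-image) coordinates, hence we estimate from dst to src
    tform = PiecewiseAffineTransform()
    tform.estimate(dst, src)
    return warp(image, tform, output_shape=(size, size))
```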

DAF extraction
Recently, DNNs, especially CNNs, have been successfully used for a wide range of image classification tasks. A DNN is able to extract robust facial features for accurate face classification in the presence of appearance variations. Instead of the original image intensity features that have been widely used in many representation-based face classification approaches, we use deep CNN features in our proposed framework. Note that the proposed feature extraction method is applied to a geometrically aligned facial image, I′, as introduced in the last subsection, rather than the original input image I. We use the term DAFs for the proposed CNN-based feature extraction method.

[Table 1. Comparison of the proposed facial landmark detection network with state-of-the-art methods on the 300W data set. The normalised mean error by inter-ocular distance is used as the evaluation metric. The bold font indicates the best results in each subset.]
A common practice is to use a deep network trained for face classification as a feature extractor. Such a face classification network is usually pre-trained on a large-scale face data set with thousands of identities and is expected to generalise well to new identities. To perform face classification for a new subject, the image features of all the gallery images and a probe image are extracted by the pre-trained deep network, and the nearest neighbour classifier is usually used to perform face classification. The label of the gallery image with the shortest distance, for example, the cosine distance, to the probe image is assigned to the probe image. However, according to our preliminary experimental results, we found that the combination of the cosine distance and the nearest neighbour classifier does not work very well in such a practical application setting.
In this article, we propose to use a representation-based classifier, that is, CRC, for image classification.
To extract deep CNN features, we use the well-known VGG face model. 47 Specifically, for each geometrically aligned face image in a dictionary, $I'_{k,m}$, we apply the classical VGG face model to extract facial features

$$ x_{k,m} = f(I'_{k,m}) \quad (9) $$

where $f(\cdot)$ stands for the VGG face model, which is in effect a non-linear mapping function. We use the 4096-D output vector of the second fully connected layer of the classical VGG face model as the facial features. Note that we also apply the VGG face model to each geometrically aligned test image for robust DAF extraction.
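The following sketch illustrates this kind of feature extraction with a VGG-style backbone in PyTorch, taking the activations of the second fully connected layer as a 4096-D descriptor; using torchvision's VGG-16 as a stand-in for the VGG face model is an assumption for illustration.

```python
import torch
from torchvision import models

# Stand-in backbone; in practice this would be the VGG face model with its own weights
vgg = models.vgg16()
vgg.eval()

# Keep everything up to (and including) the second fully connected layer + ReLU
feature_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:5],  # fc6 -> ReLU -> Dropout -> fc7 -> ReLU
)

@torch.no_grad()
def extract_daf(aligned_image):
    """aligned_image: (3, 224, 224) tensor of a geometrically aligned face.
    Returns a 4096-D deep aligned feature vector."""
    return feature_extractor(aligned_image.unsqueeze(0)).squeeze(0)
```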

GDS for CRC-based face classification
As discussed in the last two subsections, to perform face classification, we use a pre-trained DNN to extract deep aligned CNN features. The main reason for using a pre-trained DNN is that we usually have very few training samples for training or fine-tuning a DNN in many practical applications such as access control. However, one issue with using a pre-trained deep CNN is that the network may not generalise well to a new domain. The extracted facial image features are redundant and the standard nearest neighbour classifier cannot address this issue. To mitigate this difficulty, we propose to use a representation-based face classification method, that is, CRC. However, the classical CRC method still has difficulties in addressing the issues posed by information redundancy. Redundant facial image features may lead to uncertainty in decision-making, especially when we have a large number of classes. To be more specific, all the samples of all the classes in a dictionary contribute to the reconstruction of a test sample in CRC. Sometimes, a training sample that does not share the label of the test sample may obtain a high response in CRC, which leads to classification errors. To reduce the information redundancy resulting from the use of a pre-trained deep CNN, as well as the uncertainty in decision-making, we propose a dictionary optimisation approach, namely GDS. Given a test image, Î, we first use our 2D face alignment network to obtain its 68 2D facial landmarks for geometric image alignment, which outputs the normalised face image Î′.
Then a pre-trained deep CNN, that is, the classical VGG face model, is used to obtain the DAFs of the test image, denoted by $y \in \mathbb{R}^{P}$. Suppose we have a dictionary $X \in \mathbb{R}^{P \times KM}$ that consists of the deep aligned CNN features of all the gallery images. We first apply the classical CRC method to obtain the representation coefficient vector, $\alpha = [a_{1,1}, \ldots, a_{K,M}]^T \in \mathbb{R}^{KM}$, using equation (3), where $a_{k,m}$ is the reconstruction coefficient of the mth sample of the kth class for the representation of the test sample. Then an ℓ2-based measure is applied to the elements of the kth class

$$ r_k = \|\alpha_k\|_2 = \left( \sum_{m=1}^{M} a_{k,m}^2 \right)^{1/2} \quad (10) $$

which measures the contribution of the training samples from the kth class to the representation of the test sample. Then, we rank the contributions of all the classes and select only a pre-defined proportion of the higher-ranking classes to create a new dictionary $\tilde{X}$. We use the parameter $S \in (0, 1)$ for the proportion of classes selected by the proposed method. In this article, we set the selection proportion to S = 0.15, which means that only 15% of the classes are selected to create the optimised dictionary for the final decision-making. Note that we use the term GDS for the proposed method because we select all the samples of a class according to the contribution of the whole class to the representation of a new test sample, as described in equation (10). Then the new dictionary is used to perform face classification based on the CRC method introduced in the 'Background' section. The proposed CRC-based face classification method with GDS is summarised in Algorithm 1.
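Putting the pieces together, the following NumPy sketch implements the GDS step on top of the closed-form CRC solution of equation (3); the function names and helper structure are illustrative assumptions rather than the authors' exact Algorithm 1, while the 15% selection ratio follows the description above.

```python
import numpy as np

def crc_coefficients(X, y, lam=0.001):
    """Closed-form CRC coefficients, eq. (3)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def gds_crc_classify(X, labels, y, S=0.15, lam=0.001):
    """CRC with group dictionary selection.
    X: P x KM dictionary of deep aligned features, labels: length-KM class labels,
    y: P-dim deep aligned feature of the test image."""
    alpha = crc_coefficients(X, y, lam)
    classes = np.unique(labels)
    # Class responses r_k = ||alpha_k||_2, eq. (10)
    responses = np.array([np.linalg.norm(alpha[labels == k]) for k in classes])
    # Keep only the top S proportion of classes to form the reduced dictionary
    n_keep = max(1, int(np.ceil(S * len(classes))))
    kept = classes[np.argsort(responses)[::-1][:n_keep]]
    mask = np.isin(labels, kept)
    X_sel, labels_sel = X[:, mask], labels[mask]
    # Final CRC classification on the reduced dictionary, eqs. (4)-(6)
    alpha_sel = crc_coefficients(X_sel, y, lam)
    errors = {k: np.linalg.norm(y - X_sel[:, labels_sel == k] @ alpha_sel[labels_sel == k])
              for k in kept}
    return min(errors, key=errors.get)
```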

Experimental results
In this section, we evaluate the performance of the proposed method in terms of accuracy. To be more specific, we first evaluate the proposed facial landmark detection network on the 300W data set 48 in terms of accuracy. Then we compare the proposed CRC-based face classification method using deep aligned facial features and GDS with state-of-the-art approaches on several face classification data sets. In addition, we analyse the effect of each component in the proposed framework, including geometric face alignment, deep CNN feature extraction and GDS, on those face classification data sets. It should be highlighted that the face images in these data sets were captured under a wide spectrum of appearance variations in illumination, pose and expression.

Evaluation on facial landmark detection
As the accuracy of 2D facial landmark detection is crucial for our proposed DAF extraction method, we first evaluate the proposed facial landmark detection network on the 300W face data set, 48 in comparison with a number of state-of-the-art facial landmark detection methods. The 300W data set contains facial images selected from different face data sets, including XM2VTS, 51 LFPW, 52 HELEN, 53 Face Recognition Grand Challenge (FRGC) 54 and AFW. 55 300W has been widely used for benchmarking 2D facial landmark detection algorithms. In this article, we follow the protocol used in Ren et al. 56 This protocol uses AFW and the training sets of LFPW and HELEN to create the training set, which has 3148 images in total. The test set of the protocol consists of IBUG and the test sets of LFPW and HELEN. In total, the test set has 689 images, which are divided into two subsets: the common and challenging subsets. The evaluation metric is the normalised mean error, which is the mean Euclidean distance between a predicted landmark and its ground-truth position over all the facial landmarks, normalised by the inter-ocular distance. We compare the proposed 2D facial landmark detection network with a set of state-of-the-art approaches in Table 1. From the table, we can see that the proposed facial landmark detection network outperforms the state-of-the-art approaches in terms of accuracy on both the common subset and the full set. It also achieves competitive results on the challenging subset, being only slightly worse than the TR-DRN method. This validates the robustness and accuracy of the proposed method in facial landmark detection. We show some results of the detected 2D facial landmarks on the 300W test set using our proposed facial landmark detection network in Figure 5. We can see that the proposed facial landmark detection method is very robust to a variety of facial appearance variations, such as pose, expression and illumination.
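For reference, the normalised mean error used here can be computed as in the sketch below; the indices used for the outer eye corners follow the common 68-point annotation convention and are stated as an assumption.

```python
import numpy as np

def normalised_mean_error(pred, gt):
    """pred, gt: (68, 2) arrays of predicted and ground-truth landmark coordinates.
    Returns the mean landmark error normalised by the inter-ocular distance."""
    # Outer eye corners in the 68-point scheme (0-based indices 36 and 45, assumed)
    inter_ocular = np.linalg.norm(gt[36] - gt[45])
    per_landmark_error = np.linalg.norm(pred - gt, axis=1)
    return per_landmark_error.mean() / inter_ocular
```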

Evaluation on face classification
In this part, we first compare the proposed face classification framework with a number of representation-based approaches on several well-known face classification data sets, including FERET, 65 FRGC, 1 LFW 66 and CMU-PIE. 67 Then we conduct an ablation study for the proposed method, in which we analyse each component of the proposed approach, including geometric face alignment, deep CNN feature extraction and the proposed GDS method.
Results on FERET. The FERET face database is an output of the FERET program, which was sponsored by the US Department of Defence through the DARPA program. This database has become a widely used benchmark for the evaluation of face classification techniques. The proposed algorithm was evaluated on a subset of the FERET database, which includes 1400 images of 200 subjects, each with seven different images. Some example images of the FERET data set are shown in Figure 6. To apply the VGG face model to these images for DAF extraction, we converted each FERET image to a colour image by copying the single-channel grey-level image to each RGB channel. For the FERET database, we used q (q = 3, 4, 5) samples per class as the gallery set, and the remaining samples were used for testing. The classification results of our method are compared to those of various representation-based methods, including CRC, 17 LRC, 68 ESRC, 18 RRC, 69 RCR, 19 TPTSR, 70 SLC-ADL, 71 TS-LSRC, 14 Homotopy, 72 DALM, 73 FISTA 74 and DSRL2. 75 As shown in Table 2, the proposed method consistently achieves much better classification results than the other algorithms, regardless of the number of gallery samples.
Results on FRGC. The FRGC version 2 database consists of both constrained and unconstrained facial images. The constrained images have good image quality, whereas the low-quality unconstrained images were captured against complex backgrounds. In this article, we select 100 subjects, each with 30 different images, from FRGC to construct our experimental subset. Some example images of the FRGC database are shown in Figure 7.
For the experiments conducted on the FRGC database, we selected q (q = 5, 10, 15) samples per subject as the gallery set, and the remaining face images were used for testing. The proposed method is compared with CRC, 17 LRC, 68 ESRC, 18 TPTSR, 70 SLC-ADL 71 and TS-LSRC 14 on FRGC in terms of face classification accuracy. Table 3 reports the results of each method using different numbers of training samples. According to the table, the proposed method achieves much better results than the others.
Results on PIE. The CMU-PIE face database consists of 41,368 images of 68 individuals with mixed intraclass variations introduced by three types of interference: 13 different poses, 43 different illumination conditions and 4 different expressions. This database has also become a benchmark for the evaluation of face classification algorithms. In this article, the proposed algorithm was evaluated on a subset of the CMU-PIE database, which includes 6800 images of the 68 individuals, with 20 different images (10 poses and 10 illuminations) of each subject. Some example images of the CMU-PIE database are shown in Figure 8.
For the experiments on the CMU-PIE database, we used q (q = 5, 10, 15) samples per class to construct the gallery set and the remaining images were used for testing. As shown in Table 4, the proposed method achieves better classification accuracy than the other methods, including CRC, 17 LRC 68 and ESRC, 18 across different numbers of training samples.
Results on LFW. Some example images of the LFW database are shown in Figure 9.
For the experiments conducted on the LFW database, q (q = 1, 2, 3, 4) samples per subject were randomly selected as gallery images and the remaining images were used for testing. A comparison of the proposed framework with various methods, including CRC, 17 LRC, 68 ESRC, 18 TPTSR, 70 SLC-ADL 71 and TS-LSRC, 14 is presented in Table 5. Again, the proposed method outperforms all the other methods by a large margin in terms of face classification accuracy on the LFW data set, which demonstrates the superiority of the proposed method. The key to the success of the proposed method is the proposed DAF extraction method as well as the use of the GDS strategy.
Ablation study. To analyse the effects of each component of the proposed framework, we conduct an ablation study, the results of which are reported in Table 6. For the FERET and LFW data sets, three samples of each subject were used as the dictionary. For FRGC and CMU-PIE, five samples of each subject were used as the dictionary. All the remaining samples of each subject were used as testing images. From Table 6, we can see that 2D geometric face alignment/normalisation is able to improve the face classification accuracy of CRC significantly (AF-CRC vs. CRC), especially for data sets with pose variations such as FERET and PIE. The main reason is that the geometric normalisation step provides semantic consistency across different face images at the pixel level, which is crucial for a representation-based face classification approach. In addition, the use of deep aligned CNN features (DAF-CRC) further improves the accuracy of CRC for face classification. Last, with the proposed GDS method (DAF-CRC-GDS), the face classification accuracy is further improved on all the data sets. This experiment validates the effectiveness of the proposed method as well as the different advocated elements in the processing pipeline.
Figure 10. The YouTube-Faces data set. The first row shows some example faces that are used for the enrolment of our database. The second row shows some example faces that are used for testing the proposed method.
Simulation. To further evaluate the proposed method in practical applications, we simulate a robotic application scenario on a laptop using the YouTube-Faces data set. 77 To be more specific, we select 200 videos of 100 identities from the YouTube-Faces data set. We enrol these 100 identities in our database as the gallery set using the frames of one video of each identity. For testing, the other video of each identity is used, which is different from the video used for identity enrolment. Some example faces from the YouTube-Faces data set are shown in Figure 10.
To simulate the proposed face recognition method in practical scenarios using the YouTube-Faces data set, we use the Single Shot multibox face Detector 78 to detect the face in each frame of a test video. Then the proposed VGG-based facial landmark detection method is used to obtain facial landmarks and perform geometric face alignment. Last, the proposed CRC-based face classification method with GDS is used for face recognition. We show the simulation results for the 10th, 20th, 30th, 40th and 50th frames of a video from the YouTube-Faces data set in Figure 11. According to the simulation, we can see that the proposed method performs well in practical scenarios.

Conclusion and future work
In this article, we propose a face classification framework using DAFs. The proposed method uses facial landmarks and the piecewise affine warp to perform geometric face alignment for robust deep convolutional neural network based facial feature extraction. In addition, a new GDS approach is proposed to further improve the performance of the collaborative-representation-based face classification method. The experimental results obtained on several benchmarking data sets and the simulation results obtained on the YouTube-Faces data set demonstrate the effectiveness of the proposed method. However, based on our simulation results, we find that the proposed method only performs well when the yaw rotation of a facial image is smaller than around 60°. For facial images with very large pose variations, the proposed face recognition system becomes unstable, resulting in inaccurate results and failures. The other drawback of the proposed face recognition method is that it cannot deal with motion blur very well. For video-based face recognition applications, we find that fast movement of human faces may lead to severe image blur that degrades the performance of the proposed face recognition system. The above findings motivate us to further improve our face recognition system in the presence of large pose variations and motion blur.
The other challenge or limitation of the proposed method is the deployment of our system in robotic applications. One main reason is that the proposed method is based on large-capacity deep CNNs that require high-performance GPU devices to achieve real-time inference speed in practical applications. This can only be done on a cloud server at the current stage owing to the energy costs of GPU devices. In our future work, we aim to address this issue by reducing the computational complexity of the proposed method. For example, we intend to design new lightweight DNNs that perform as well as a large-capacity network in facial landmark detection and face classification. Our ultimate goal is to perform accurate and real-time face classification on lightweight edge computing platforms, such as a robot.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article:

Supplemental material
Supplemental material for this article is available online.