Emotion Walking for Humanoid Avatars Using Brain Signals

Interaction between humans and humanoid avatar representations is very important in virtual reality and robotics, since the humanoid avatar can represent either a human or a robot in a virtual environment. Many researchers have focused on providing natural interactions for humanoid avatars, or even for robots, through camera tracking, gloves, speech, brain interfaces and other devices. This paper provides a new multimodal interaction control for avatars that combines brain signals, facial muscle tension recognition and glove tracking to change the facial expression of humanoid avatars according to the user's emotional condition. The signals from brain activity and muscle movements are used as the emotional stimulator, while the glove acts as the emotion intensity control for the avatar. This multimodal interface can determine when the humanoid avatar needs to change its facial expression or its walking power. The results show that humanoid avatars have different timelines of walking and facial expressions when the user stimulates them with different emotions. This finding is believed to provide new knowledge on controlling the facial expressions and walking of robots and humanoid avatars.


Introduction
Research on virtual reality and games still needs many improvements, especially in how to immerse users and provide them with attractive interactions. Interactivity and immersiveness are considered the main goals to be achieved in virtual reality games (VRG), particularly for educational purposes, virtual training, virtual institutions, etc. Thus, the avatar is responsible for providing better interaction in virtual environments (Basori et al., 2009a, Basori et al., 2008b, Bogdanovych, 2007, Roussou, 2004, Yahaya, 2009). Why do virtual reality games need improvements? The reason is that a virtual reality application is not as natural as the actual world. Acosta (2001) mentioned that realistic virtual reality applications could "look real, act real, sound real, and feel real". In that sense, the synthetic world, the "world that look(s) real", must consist of complex visualizations and animations (complex models, navigation, user interfaces, etc.). The remark "act real" means that every object in the virtual environment, especially virtual humans, should be able to behave like human beings. "Sound real" can be interpreted as sound effects, such as a voice from the real world. Last, but not least, "feel real" refers to the presence of objects in the virtual environment. If all these requirements are fulfilled, the virtual reality game will provide full mental immersion in which users cannot differentiate between the virtual world and the real world (Acosta, 2001). Furthermore, the integration of visuals, acoustics and haptics is proposed as a means to increase the realism of avatars in a virtual environment (Basori et al., 2009b, Basori et al., 2008a). Controlling the expressions of avatars is another issue facing those conducting research into facial animation. The facial animation controller can be changed from devices such as a mouse, a keyboard or a joystick to a camera tracker, a special glove or a brain-computer interface (Basori et al., 2011a).
Faller et al. (2010) have proposed a brain interface that can be used by disabled and non-disabled people alike. The brain signal used in their research is the steady-state visual evoked potential (SSVEP), which provides a fast information transfer rate (Faller et al., 2010). Other researchers have focused on controlling avatars by using conversations across a chat device and camera tracking (Neviarouskaya et al., 2009, Zhan et al., 2007).

Background to the Research
Facial expression and walking are features needed for humanoid avatars or robots to interact with humans. Motion planning in robotics plays an important role in allowing robots to be able to move intelligently. Algorithms such as the hierarchical memetic algorithm (MA) have been used for motion planning in robotics (Lin et al., 2012). This paper will focus on walking behaviour based on emotional conditions rather than on motion planning.
Research on the facial expression of emotion was initiated by Ekman (1982). A standard guideline for the emotional facial expression of humans, called FACS (Facial Action Coding System), was introduced in 1978 (Ekman and Friesen, 1978). Work on facial expressions has driven improvements in facial animation, such as the parameterized facial model (Parke, 1971), facial rigging based on blend shapes (Neuberger, 2010), facial rigging using clusters (Grubb, 2010) and facial rigging using interfaces (James, 2010). Fabri et al. (1999) showed that non-verbal communication in Collaborative Virtual Environments (CVEs) can be performed using face, gaze, gesture or even body posture. Researchers are still extending this work to provide human likenesses that increase interaction and communication between computer and user (Fabri et al., 1999). Wang et al. (2005) mentioned that there are two main problems in creating a virtual human: first, the construction of emotion, and second, the generation of the affection model, which is purposely created to improve the virtual human's presentation. The avatar does not only represent a human in terms of its physical representation; it also needs some context to make it believable. The current 3D humanoid models require improvement because they lack believability (Rojas et al., 2006). Rojas et al. (2006) proposed an individualization method that gives the 3D humanoid model personality, emotion and gender. Zagalo and Torres (2008) suggested that the 3D humanoid model may turn into a character and be able to express its emotions through an act of touching between two characters. Melo and Paiva (2007) made some innovations in expressing the emotion of virtual characters without relying on body parts. They used elements like shadow, light, composition and filter as tools for conveying the characters' emotion (Melo and Paiva, 2007). With regard to humanoid avatars, researchers have reached the stage where they add emotion to the avatar.
Here, a different approach for expressing avatar emotions is proposed. Based on the existing facial expression controls for avatars, it is found that there is still room for improving natural interface controls. The previous discussion clearly shows that facial expression control using brain computer interfaces has not been greatly explored. Of course, from the research point of view, this challenge portrays a new landscape to overcome and new models of natural interactions to propose.

Critical Analysis of Facial Expression Control
Functions in the facial rigging are responsible for controlling joints, blend shapes and clusters to manipulate the face surface of the 3D model. Functions can be written as equations that relate the control parameters to the expected effect on the face surface. Functions can also be extended to a user interface to provide the user with easier control of the facial region. In GUI mode, each control value for joint angles, cluster transformations, blend shapes and functional expressions has a particular key frame position or particular time. Applying a GUI control to a particular desired area makes it easy to control the facial expression of an avatar. The best example is shown in Figure 1. The facial expression coding system proposed by Ekman (Ekman, 1982, Ekman, 2003, Ekman and Friesen, 1978) has six basic emotions: anger, joy, sadness, fear, disgust and surprise. These emotions are used as a basis for creating the emotionally expressive avatar. As a continuation of this research, Faigin (1990) presented a popular argument that emotions are mainly determined by three meaningful regions, namely the eyebrows, eyes and mouth, which became the basis for the universal expression of the avatar (see Fig. 2). The expression of anger drags the eyebrows close to each other and lower than their normal position. In strong anger, a human will usually open their mouth or even shout (see the illustration in Fig. 2-A). Joy or happiness is an expression shown through a relaxing of the facial muscles: the lips are widely opened and the eyebrows seem calm (Fig. 2-B). Sadness makes the eyebrows appear to stretch upwards, and the mouth is closed but not tightly. The lower eyelids are pulled downwards to make crying eyes (Fig. 2-C). Figure 3 shows an emotional expression of happiness, with a wide and closed smile.
Another aspect of the appearance is that the inner and outer eyebrows look more relaxed and the character is also shown walking with full confidence. The previous discussion has shown the techniques that are widely used in facial animation applications. Interpolation is one of the best-known techniques used in facial animation. This technique is further enhanced by facial rigging to provide the user with easy interaction control. This study uses blend shape interpolation of the facial region to perform emotional facial expressions. The process of interpolation starts from a neutral expression named 'base', which then changes into a desired pose according to the interpolation value. Bee et al. (2009) came up with an emotional facial expression control using an Xbox joystick. The user is able to create a particular facial expression by pressing a specific button on the joystick. Their method helps the user to interact with the avatar's facial expression. The approach has inspired this research to come up with another control for facial expression using brain activity and hand tracking.
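The interpolation from the 'base' pose toward a desired pose can be illustrated with a short sketch. It assumes plain linear blending of vertex positions and an interpolation value clamped to [0, 1]; the function name blend_shape and the list-of-tuples mesh representation are hypothetical and are not taken from the system described in this paper.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def blend_shape(base: List[Vertex], target: List[Vertex], weight: float) -> List[Vertex]:
    """Linearly interpolate every vertex from the base pose toward the target pose.

    weight = 0.0 returns the neutral 'base' pose, weight = 1.0 the full target pose.
    Values outside [0, 1] are clamped (an assumption of this sketch).
    """
    if len(base) != len(target):
        raise ValueError("base and target must have the same number of vertices")
    w = min(max(weight, 0.0), 1.0)
    return [
        tuple(b + w * (t - b) for b, t in zip(vb, vt))
        for vb, vt in zip(base, target)
    ]


# Hypothetical two-vertex mouth corner: neutral pose and a 'happy' target pose.
base_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
happy_pose = [(0.0, 2.0, 0.0), (1.0, 4.0, 0.0)]
half_smile = blend_shape(base_pose, happy_pose, 0.5)
```

In a setup of this kind, the interpolation value would be the quantity driven by the emotion intensity, so a stronger emotion moves the face further from the neutral pose.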

Methodology
Emotions are also expressed by changing the lip shape. By referring to FACS, we have created certain facial expressions for an avatar, such as anger and happiness; we will concentrate on these two kinds of emotions. According to theories of emotion, anger is usually related to something that makes people feel uncomfortable. The aforementioned facial expressions involve several action units but do not consider the intensity or level of each emotion, because they are mainly concerned with producing a realistic imitation of an emotional expression. Russell (1980) stated that 'angry' has a high Y value and a high negative (-X) value (see Figure 5). In addition, researchers also study certain levels of colour, saturation and brightness that carry some emotional information, like feelings of joy or sadness (Melo and Paiva, 2007).
The muscle and alpha signals from the user are used to determine the emotional classification. The signals are classified using Circumplex theory. Happiness and anger are the two emotions that are clearly detected in this experiment. Happiness lies on the pleased axis and slightly towards the excited side, while anger is high in excitation but unpleased. Based on these criteria, we turn the muscle and alpha signals into certain emotions. Furthermore, another input, the 5DT glove, is used to adjust the intensity of the emotion according to the finger tracking.
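The quadrant reading of the Circumplex model described above can be sketched as follows. This is only an illustration: it assumes the signal has already been mapped to a valence (X, pleasure) and arousal (Y, excitation) pair, and that mapping is not specified here. The function name classify_circumplex and the "unclassified" fallback are hypothetical.

```python
def classify_circumplex(valence: float, arousal: float) -> str:
    """Assign one of the two emotions used in this study from circumplex coordinates.

    valence: X axis, negative = unpleased, positive = pleased.
    arousal: Y axis, positive = excited.
    Anger: unpleased and excited. Happiness: pleased and not below neutral arousal.
    """
    if valence < 0 and arousal > 0:
        return "anger"
    if valence > 0 and arousal >= 0:
        return "happiness"
    return "unclassified"  # other quadrants are outside the scope of this sketch


angry_example = classify_circumplex(-0.7, 0.8)
happy_example = classify_circumplex(0.8, 0.2)
```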

Simulation Results
The Nia mind controller has several sensors attached to the user's head and is able to record brain activity during the interaction. The mind controller recognizes and analyses the brain activity signal and produces a classifying signal based on the emotional condition. Furthermore, the glove produces a signal that is interpreted according to the hand gesture shape. In this system, the glove only acts as the intensity controller of emotional expression, e.g. if the user's emotion is recognized as anger, then the strength of the anger is decided through the gesture shape. Based on previous work, this study also uses gesture postures, such as the rounded fist gesture to represent anger proposed by Mubin et al. (2007). After reading the input, the system continues the process in the OGRE game engine, which loads the 3D facial model and prepares the interaction management to communicate with external input/output (I/O) APIs such as a glove API, a mind controller API, a sound API or a haptic API. Afterwards, the system produces its output, which is a facial animation with a natural interaction control using brain signals and finger tracking. The details of the process are described in Figure 6.
The computation of facial expressions is based on a calculation of each Action Unit, as shown in Figure 7. Each Action Unit is responsible for the strength and type of the emotion expressed. In this case, the expression of anger is more complex than the expression of happiness. The complexity increases as the level of anger rises (refer to Figure 7). AU1 and AU2 manage the eyebrow muscle and, together with AU4, perform the happiness expression. Lip control and AU15 manage lip movement while the emotion is being generated. All elements work together to produce the appearance of happiness, while the level is used as a power control that determines the strength of the happiness expression to be rendered. Figure 9 shows the facial expressions of avatars at various levels of happiness.
All these sensors correlate with excitation or the level of activity happening inside the human body. For that reason, these sensors are suitable for recognizing human emotion through the level of tension in the brain or the forehead facial muscle. The sensors described in Figure 10 are able to recognize changes in brain activity and muscle tension during interaction between the user and the 3D avatar. A glance sensor will not be used in this study because eye muscle movement is not used as an input stimulation in this system. Ekman and Hager (1983) have proposed FACS, which combines several Action Units (AUs) to recognize and produce emotion from facial expression. The FACS in Figure 11 shows that muscle tension has a high correlation with emotional condition. Therefore, this system chooses the muscle sensor and one of the alpha sensors as stimulation inputs. Alpha and Beta sensors use a similar method of measuring brain activity, which is why Alpha1 is used to represent brain activity. The mean value of the muscle and Alpha sensors is used as the final input for the stimulation, so the emotional recognition depends on the average signal intensity of the two sensors (refer to Equation 1). Equation 1 calculates the emotional signal from two sources: the muscle signal and the alpha signal. The emotional signal determines what kind of emotion will be performed by the facial expression, haptic vibration and acoustic effects. The tension level of the muscle and Alpha signals can be divided into four main zones, Z1, Z2, Z3 and Z4, as shown in Fig. 12. Z1-Z4 are the intensity zones of each signal, with which the level of excitation is classified, as shown in Fig. 9. These four zones can be understood as low, medium lower, medium higher and high. Low and medium lower are associated with Z1 and Z2, with the intention of capturing the feeling of relaxation, which is suitable for happiness.
In contrast, Z3 and Z4 are the medium higher and high zones, which are associated with high levels of tension. This high level of tension makes the muscles appear stressed, such as when anger occurs, which is why Z3 and Z4 are used to stimulate the expression of anger. The facial expression is stimulated and affected by these signals once they are classified into zones. For example, if the signal reaches Z3 or Z4, then the 3D humanoid will be in anger mode: its face changes into a state of anger, and the magnitude of the force of vibration is also stimulated to represent anger. Otherwise, if the signal decreases to Z1 or Z2, then the stimulation changes to happiness, which puts the 3D humanoid into happiness mode with a happy facial expression. The magnitude of the force and the acoustic effects are adjusted as well according to the emotional state.
As mentioned in the previous section, inputs into this system come from a 5DT glove or a Nia mind controller, which act as stimulators from the user. The user can choose whether to use the glove, the mind controller or both. However, the glove is only used as the intensity controller for emotional expression, e.g. if the emotion is anger, then a rounded fist gesture is able to strengthen the intensity of the anger. This intensity affects the magnitude of the force as well.
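The zone-based selection of the emotion can be sketched in a few lines. The averaging of the muscle and Alpha intensities and the mapping of Z1-Z2 to happiness and Z3-Z4 to anger follow the description above; the normalization of both signals to [0, 1] and the four equal-width zones are assumptions of this sketch, since the actual zone boundaries are given only in the figure. All function names are hypothetical.

```python
def emotional_signal(muscle: float, alpha: float) -> float:
    """Mean intensity of the muscle and Alpha signals (both assumed in [0, 1])."""
    return (muscle + alpha) / 2.0


def zone(signal: float) -> int:
    """Map a normalized signal to zone 1..4, assuming four equal-width zones."""
    if not 0.0 <= signal <= 1.0:
        raise ValueError("signal must be normalized to [0, 1]")
    # int(signal * 4) yields 0..4; the upper edge 1.0 is folded into Z4.
    return min(int(signal * 4), 3) + 1


def emotion_for_zone(z: int) -> str:
    """Z1 and Z2 (relaxed) select happiness; Z3 and Z4 (tense) select anger."""
    return "happiness" if z <= 2 else "anger"


relaxed = emotion_for_zone(zone(emotional_signal(0.5, 0.25)))  # mean 0.375, Z2
tense = emotion_for_zone(zone(emotional_signal(1.0, 0.75)))    # mean 0.875, Z4
```

With measured zone boundaries, only the zone function would need to change; the averaging and the zone-to-emotion mapping would stay the same.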
Features for controlling the intensity of the avatar's emotion with a data glove are based on calculations of each sensor position. The fingers' movement is read by a sensor, which then sends a signal containing data pertaining to the position of each finger. The finger position value is used to calculate the intensity of emotion by comparing this data with the maximum value from the finger sensor. Only two emotions are involved in the stimulation process: anger and happiness. That is why, to use these intensity values, a threshold needs to be defined in order to classify the intensity. This threshold is divided into two: the anger threshold and the happiness threshold. The anger threshold is defined as the minimum value of the finger position for it to be considered as having the anger shape (refer to Figure 13). Figure 14 is a neutral position where the value for each finger element is equivalent to 255 (the maximum value for the finger position). Based on the finger tracking using a 5DT glove, it is found that the threshold for anger and happiness has a similar value. If the finger as shown in Fig. 14 has a maximum value of 255, it can be assumed that the finger value in Fig. 13 is half of the maximum value, as shown in Table 2. On the other hand, the happiness mode is a bit different because all the fingers except the thumb need to be closed tightly, while the thumb stays at a middle position between fully closed and open (flat position). Consequently, the intensity is only calculated from the position of the thumb.

No  Emotion    Thumb  Index  Middle  Ring  Little  Threshold
1   Anger      128    128    128     128   128     128
2   Happiness  128    0      0       0     0       26
Table 2. Threshold for Anger and Happiness

The threshold for anger is 128, and the value decreases to zero as the hand reaches a fully closed, tight shape, as shown in Figure 15. On the other hand, the happiness threshold starts from 128 for the thumb and zero for the other finger values; the thumb value can then increase up to 255 to perform the strongest happiness, as shown in Figure 16. The value of the finger position determines the strength of the emotion performed by the facial expression. Figure 17 is pseudocode that describes the details of the finger sensor reading process using the function fdgetSensorScaled. Each finger is captured by two driver sensor indices that are different from one another. The smallest number starts from the thumb and the number for the driver sensor index rises to 13. Numbers 16 and 17 are correlated with pitch and roll capture. In Table 3, the glove sensors are listed; almost all sensors are correlated with the intensity control of emotion, except pitch and roll. 'Flexed' is a condition of the finger in an open position and 'unflexed' is a condition where the finger is in a closed position, as shown in the previous example (Figures 13-14). The data from finger tracking, as discussed before, ranges from 0 to 255. The value of the finger sensor can vary, as shown in Table 4.

No  Thumb  Index  Middle  Ring  Little
1   136    93     96      83    99
2   134    93     94      83    98
3   134    92     93      82    97
4   129    90     92      81    95
5   126    89     92      80    95
6   124    88     91      80    92
7   120    88     91      81    92
8   118    88     92      82    92
9   117    89     94      82    93
10  117    91     97      83    94
Table 4. Data sample for finger tracking

The facial appearance of the avatar changes according to finger position movements. Refer to Figs. 18-22 for illustrations of interpreting emotion with the mind controller and controlling the intensity of emotion using a glove. The mind controller records the tension level of the user's facial muscle and brain activity.
If the tension reaches an anger zone (Z3 or Z4, refer to Figure 9), then the emotion is recognized as anger. The glove is designed to capture finger shape, e.g. if the user makes a "fist" in the glove, it is interpreted as "intensity for anger", and the intensity of the anger changes according to this value. On the other hand, if the muscle or Alpha signal drops to Z1 or Z2, then the emotion is interpreted as happiness. Fig. 19 illustrates how the user shows the emotion "happiness" by smiling and controls the intensity of the emotion by raising the thumb. The finger positions that are considered take two main forms, i.e. a full fist for anger and a 'thumbs-up' position for happiness.
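The glove-based intensity control can be sketched as follows. The 0-255 sensor range, the threshold of 128, the fist shape for anger and the thumb-only reading for happiness come from the description above; the normalization of intensity to [0, 1] and the use of the mean finger value for anger are assumptions of this sketch, and the function names are hypothetical (they are not part of the glove SDK).

```python
from typing import Sequence

MAX_FINGER = 255       # fully open (neutral) finger value
ANGER_THRESHOLD = 128  # mean finger value at which the anger shape starts
THUMB_THRESHOLD = 128  # thumb value at which the happiness shape starts


def anger_intensity(fingers: Sequence[int]) -> float:
    """Intensity in [0, 1] that grows as the hand closes from 128 to a tight fist (0)."""
    mean = sum(fingers) / len(fingers)
    if mean > ANGER_THRESHOLD:
        return 0.0  # hand too open to count as the anger shape
    return (ANGER_THRESHOLD - mean) / ANGER_THRESHOLD


def happiness_intensity(thumb: int) -> float:
    """Intensity in [0, 1] that grows as the thumb rises from 128 to 255."""
    if thumb < THUMB_THRESHOLD:
        return 0.0
    return (thumb - THUMB_THRESHOLD) / (MAX_FINGER - THUMB_THRESHOLD)


# Finger order: thumb, index, middle, ring, little.
half_fist = anger_intensity([64, 64, 64, 64, 64])
full_thumbs_up = happiness_intensity(255)
```

The emotion itself would still be chosen by the mind controller; a value like half_fist or full_thumbs_up would only scale how strongly the already selected expression is rendered.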

Conclusion
The feedback from users is encouraging, with 67% of users giving a strong and positive response to the system. The utilization of the brain interface and glove is believed to give users a strong impression and sense of believability, and even to strengthen the interactivity and immersiveness of a virtual reality or robotic application itself. This may be because natural interaction is more attractive and more interesting for most users of games or virtual realities. This work has wide scope for future development, especially if it is used to express other emotions such as sadness, disgust or surprise, or even more extreme expressions. There are other signals that have not been used in this study, such as Beta, Mu, Theta and Delta. More detailed emotion recognition will be addressed in a further study as emotion recognition processes mature.