An interactive system for humanoid robot SHFR-III

Natural interaction between humans and robots is challenging but indispensable. In this article, a human–robot interactive system is designed for the humanoid robot SHFR-III. The system consists of three subsystems: a multi-sensor positioning subsystem, an emotional interaction subsystem, and a dialogue subsystem. The multi-sensor positioning subsystem improves positioning accuracy; the emotional interaction subsystem uses a bimodal emotion recognition model and a fuzzy emotional decision-making model to recognize the emotions of interactive objects and feed expressions back to them; and the dialogue subsystem with personal information generates responses consistent with the default information and avoids conflicting replies. The experimental results show that the multi-sensor positioning subsystem has good environmental adaptability and positioning accuracy, the emotional interaction subsystem can achieve human-like emotional feedback, and the dialogue subsystem can achieve more natural, logical, and consistent responses.


Introduction
Human–robot interaction (HRI) was first proposed in 1975. 1 HRI is an important interdisciplinary research field spanning computer science, ergonomics, cognitive science, and other disciplines, and is also an important topic in engineering psychology. At present, HRI is developing toward personification, intellectualization, and naturalization.
Many researchers are now dedicating their efforts to studying interactive modalities such as facial expressions, natural language, and gestures, which makes communication between robots and people more natural. 2 Gunes et al. 3 analyzed human participants' nonverbal behavior and predicted their facial action units, facial expressions, and personality in real time while they interacted with a small humanoid robot. Ali et al. 4 designed a sign language educational humanoid that possesses stereo vision, stereo microphones, and stereo audio for intuitive interaction. De Jong et al. 5 designed a humanoid robot for social interaction that combines vision, gesture, speech, and input from an onboard tablet, a remote mobile phone, and external microphones. Kumra and Kanan 6 presented a novel robotic grasp detection system that predicts the best grasping pose of a parallel-plate robotic gripper for novel objects using an RGB-Depth image of the scene. This article considers the sustainability of the product 7 and conducts interactive research on SHFR-III from the perspectives of positioning, emotion, and dialogue.
Robots should be lightweight, energy-efficient, and high-performing, 8 so we designed and built the humanoid emotional robot SHFR-III (see Figure 1). 9 SHFR-III has 22 degrees of freedom and can realize 8 basic expressions, including calm, happiness, and so on.
Target positioning is an important part of humanoid robot research. 10 In HRI, a robot first needs to recognize the interactive objects in the environment. Laurenzi et al. 11 introduced a set of modules based on visual object localization. Gala et al. 12 used auditory sensors for positioning. However, the positioning performance of a single sensor is greatly affected by external factors: 13 for example, auditory positioning is limited by noise and visual positioning by illumination. A multi-sensor system can mitigate these limitations.
In real life, people usually do not judge each other's emotional state based on a single modality: both visual information and voice signals are very important for emotional judgment. Multi-modal emotion recognition identifies emotional states by using multi-modal information. 14 Affective computing was first proposed by Picard 15 and can measure and analyze the external manifestations of human emotions and influence them.
As argued by Vinyals and Le, 16 current conversation systems are still unable to pass the Turing test, and the lack of consistent personal information is one of the most challenging constraints. In recent years, Li et al. 17 learned interactive object-specific conversational styles by embedding users into a sequence-to-sequence model. Al-Rfou et al. 18 used similar user embedding techniques to simulate user personalization. Both studies required conversational data from each user to simulate her/his personality. Qian et al. 19 used bidirectional decoders to generate predefined personality, but a large amount of data is needed to mark the position of the information.
To make HRI more natural and harmonious, this article makes the following contributions for SHFR-III: A multi-sensor positioning subsystem is designed to reduce dependence on the working environment and improve overall positioning accuracy through the data fusion of multiple sensors. An emotion recognition model based on facial expression and speech is used to handle situations in which a single modality would fail, and a fuzzy algorithm is used to simulate emotional decision-making. Default information is used to solve the problem of inconsistent personal information in dialogue, and maximum mutual information is taken as the objective function to reduce meaningless replies in the dialogue model.

Overview
Our interactive system works as follows (see Figure 2): First, the multi-sensor positioning subsystem finds the exact location of the interactive object. The robot then adjusts its angle toward the interactive object and obtains facial and voice information through cameras and microphones. The emotional interaction subsystem recognizes the emotional state and makes an emotional decision, and SHFR-III displays the result of the emotional decision with a facial expression. The dialogue subsystem with personal information generates responses and presents them in the form of a voice.

Multi-sensor positioning subsystem
In this article, a multi-sensor positioning subsystem is designed, which includes an infrared positioning module, an auditory positioning module, and a binocular vision positioning module.

Design of positioning module
Infrared positioning based on infrared sensor array. Four pyroelectric infrared sensors with the same parameters form the sensor array shown in Figure 3. The four sensors are distributed vertically and equidistantly, and the horizontal angle between adjacent sensors is 30°. The output results of the four sensors from top to bottom are recorded as S1, S2, S3, and S4, respectively. The sensor array can cover the frontal range of 0°-210°.
The positioning function Fr(S1, S2, S3, S4) is defined as the mapping from the sensor output values S1, S2, S3, S4 to the target location information θ. The output value of each sensor is 1 or 0, corresponding to the high and low levels of the sensor output. Therefore, Fr(S1, S2, S3, S4) can be described by the truth table presented in Table 1.
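A truth-table positioning function of this kind is naturally implemented as a lookup from the four binary sensor outputs to a coarse bearing. The sketch below illustrates the idea; the sector angles are illustrative placeholders, since the actual mapping is the one given in Table 1 of the article.

```python
# Sketch of the infrared positioning function Fr(S1, S2, S3, S4):
# a lookup from the four binary sensor outputs to a coarse target
# bearing in degrees.  The angles below are illustrative only --
# the real mapping is the truth table in Table 1.
FR_TABLE = {
    (1, 0, 0, 0): 30.0,   # only the top sensor fires
    (1, 1, 0, 0): 60.0,   # overlap of sensors 1 and 2
    (0, 1, 0, 0): 90.0,
    (0, 1, 1, 0): 120.0,
    (0, 0, 1, 0): 150.0,
    (0, 0, 1, 1): 180.0,
    (0, 0, 0, 1): 210.0,
}

def infrared_bearing(s1, s2, s3, s4):
    """Return the estimated bearing (degrees), or None if the output
    pattern does not correspond to a target inside the coverage area."""
    return FR_TABLE.get((s1, s2, s3, s4))
```

Because each table entry is a whole sector rather than a point, this module can only bound the target's polar angle, which is consistent with the infrared module providing no distance estimate.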
Auditory positioning based on auditory sensor array. The three sensors form an isosceles triangle in the vertical plane. Sensors 2 and 3 are arranged horizontally at a distance p apart, and sensor 1 is located on the perpendicular bisector of sensors 2 and 3 at a distance q. The sound source is recorded as P(x, y, z). The distribution model is shown in Figure 4.
The time difference of sound arrival between sensor 2 and sensor 1 is defined as t21, and the time difference of sound arrival between sensor 3 and sensor 1 as t31. Given the sensor sampling frequency (50 kHz) and the speed of sound v = 340 m/s, a time difference of arrival T corresponds to a distance difference D = vT. The distance from the source P to the origin is r = √(x² + y² + z²), and the polar angle of the sound source P relative to the origin is θ.
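The sample-lag-to-distance conversion, and a simplified bearing estimate from the horizontal sensor pair, can be sketched as follows. This is a far-field approximation under assumed geometry (sensors 2 and 3 spaced p metres apart), not the article's exact near-field solution.

```python
import math

V_SOUND = 340.0   # speed of sound, m/s
FS = 50_000       # sampling frequency of the auditory sensors, Hz

def distance_difference(sample_lag):
    """Distance difference D = v * T for a time difference measured
    in samples at the 50 kHz sampling rate."""
    return V_SOUND * sample_lag / FS

def farfield_azimuth(t21, t31, p):
    """Far-field azimuth (radians) of the source seen by the horizontal
    sensor pair 2-3 spaced p metres apart.  t21 and t31 are arrival-time
    differences (seconds) relative to sensor 1, so t2 - t3 = t21 - t31."""
    s = V_SOUND * (t21 - t31) / p
    s = max(-1.0, min(1.0, s))   # clamp numerical noise into asin's domain
    return math.asin(s)
```

The near-field case solved in the article additionally recovers the range r from the two hyperbolic constraints; the far-field form above only recovers the angle.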
The polar coordinates of the sound source are (r, θ), where r is the distance from the origin and θ is the polar angle.
Visual positioning based on binocular stereovision. The model of binocular vision positioning is shown in Figure 5. The focal lengths of the left and right eyes are (fx1, fy1) and (fx2, fy2), respectively, and the optical centers are (cu1, cv1) and (cu2, cv2), respectively. The binocular spacing is 2B. The left and right eye cameras have the same model and focal length, which can therefore be considered approximately equal.
The transformation between image-plane coordinates (u, v) and world coordinates (xw, yw, zw) is the standard pinhole projection; substituting the parameters of the left and right eyes into this relationship yields the binocular positioning equations.

Fusion strategy analysis of multi-sensor positioning subsystem
The working environment of the system is a closed room of 5 × 6 m². The coverage area of the positioning subsystem is shown in Figure 6, where O is the positioning center of the multi-sensor subsystem and DEGF represents the whole room. The coverage area of the visual positioning module is the triangular region ABC; its polar angle positioning error (Eθ) and distance positioning error (Er) increase with target distance. The coverage of the infrared positioning module is MNFG; its Eθ decreases as target distance increases. The coverage of the auditory positioning module is DEGF; its Er increases with target distance while Eθ remains unchanged. According to these working characteristics, the multi-level positioning fusion method shown in Figure 7 is proposed. In the triangular region ABC, all three positioning modules work normally, and the three sets of positioning data are fused. In the rectangular region MNGF, only the infrared and auditory positioning modules work, and their two sets of positioning data are fused. In the rectangular region DENM, only the auditory positioning module works normally, and its positioning data are output directly.
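The multi-level fusion strategy reduces to deciding which modules are available in each region and under the current conditions. A minimal sketch, with region names taken from Figure 6 and darkness/noise modeled as the failure conditions described later in the article:

```python
def active_modules(region, dark=False, noisy=False):
    """Multi-level fusion strategy sketch.  `region` is one of
    'ABC' (visual coverage), 'MNGF' (infrared coverage beyond the
    visual zone), or 'DENM' (auditory-only zone).  Darkness disables
    the visual module; noise disables the auditory module."""
    modules = set()
    if region == 'ABC' and not dark:
        modules.add('visual')
    if region in ('ABC', 'MNGF'):
        modules.add('infrared')
    if not noisy:
        modules.add('auditory')
    return modules
```

The returned set determines whether three-way fusion, two-way fusion, or direct output is used, mirroring the decision flow of Figure 7.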

Weighted fusion algorithms with variable weights
The final result of the multi-sensor positioning subsystem, that is, the coordinates of the interactive target relative to the robot in the horizontal plane, is expressed in polar coordinates (r, θ). The positioning data of the infrared positioning module is θr (angle only), that of the auditory positioning module is (rs, θs), and that of the visual positioning module is (rv, θv). The weighted fusion algorithm is constructed as

θ = mr·θr + ms·θs + mv·θv,  r = ns·rs + nv·rv

where mr, ms, and mv are the angle weighting coefficients of the infrared, auditory, and visual positioning modules, respectively, and ns and nv are the distance weighting coefficients of the auditory and visual positioning modules, respectively. The three positioning modules were tested separately and their positioning accuracies calculated: the average Eθ of the infrared module is 15° (Eθr); the average Eθ of the auditory module is 3° (Eθs) and its average Er is 340 mm (Ers); the average Eθ of the visual module is 1° (Eθv) and its average Er is 100 mm (Erv).
The weighting coefficients are inversely proportional to the positioning accuracy, and the weighting coefficients are calculated according to the sum of the coefficients being 1.
In the resulting weighted fusion equation (equation (3)), the weighting coefficients are constant, so it applies only when all three positioning modules are working normally. However, as the location of the interactive target and the external environment change, one or more positioning modules may fail. Therefore, based on the positioning accuracy of each module determined experimentally, a weighted fusion algorithm with variable weights is proposed.
When the interactive object is outside the range of the visual positioning module or in a dark environment, θr and θs have data and θv does not, so only the infrared and auditory data are fused. When only the auditory positioning module works, the interactive object is in a dim environment and outside the range of the infrared module; the positioning subsystem then directly outputs the results of the auditory positioning module.
In a noisy environment, the auditory positioning module stops working. In that case, when θr and θv have data, the interacting object is in the triangular region ABC and the light is abundant.
When only the infrared positioning module works, θr is output directly, and the positioning subsystem has no distance information to output.
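The variable-weight rule, weights inversely proportional to each available module's measured error and normalized to sum to 1, can be sketched as a single function that simply omits failed modules:

```python
def fuse(values, errors):
    """Weighted fusion with variable weights: each available module's
    weight is proportional to 1/error, normalised so the weights sum
    to 1.  `values` and `errors` are dicts keyed by module name;
    modules that have failed are simply absent from `values`."""
    inv = {k: 1.0 / errors[k] for k in values}
    total = sum(inv.values())
    return sum(values[k] * inv[k] / total for k in values)

# Angle fusion with the article's measured accuracies:
# infrared 15 deg, auditory 3 deg, visual 1 deg (readings illustrative).
theta = fuse({'ir': 32.0, 'aud': 30.5, 'vis': 30.1},
             {'ir': 15.0, 'aud': 3.0, 'vis': 1.0})
```

Dropping a key from `values` reproduces the degraded cases above: with only `ir` and `aud` present the two-module angle fusion results, and with a single module the function returns that module's reading unchanged.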

Bimodal emotion recognition
In this article, facial expression and voice emotion are fused by decision-level fusion.
Facial emotion recognition. The Noldus Facial Expression Analysis System (FaceReader) is used for facial emotion recognition. This article is based on discrete emotion classification. The FaceReader output is based on Paul Ekman's six basic expressions plus a calm expression, forming a seven-dimensional probability matrix that describes the emotional state.
Speech emotion recognition. Speech emotion recognition is a new research hotspot involving traditional speech signal processing, pattern recognition, human psychology, artificial intelligence, and other fields. The research of speech emotion recognition is based on discrete emotion classification system.
Feature extraction: The main emotional features in a speech signal are prosodic features, spectrum-based features, and sound quality features. 20,21 Based on the study of the emotional features of expression, this article extracts acoustic parameters including the root mean square of energy, zero-crossing rate, fundamental frequency, voicing probability, MFCC, and the frequency and bandwidth of the first, second, and third formants, with the corresponding first-order differences as the dynamic parameters of the speech signal. Statistical features of the acoustic and dynamic parameters are used as feature vectors for speech emotion recognition, yielding a total of 382-dimensional features.
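The 382-dimensional vector is built by applying statistical functionals to each frame-wise acoustic contour. A small illustrative subset of such functionals (the article's exact functional set is the one provided by openSMILE):

```python
import statistics

def functionals(contour):
    """Apply statistical functionals to one frame-wise acoustic contour
    (e.g. fundamental frequency or RMS energy).  Only a small, assumed
    subset of the statistics pooled into the 382-dimensional vector."""
    return {
        'mean': statistics.fmean(contour),
        'std': statistics.pstdev(contour),
        'min': min(contour),
        'max': max(contour),
        'range': max(contour) - min(contour),
    }
```

Concatenating such functional values over every acoustic parameter and its first-order difference produces the fixed-length utterance-level feature vector that the SVM consumes.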
Hierarchical support vector machine (SVM) classifier: First, all categories are divided into two subclasses, and these subclasses are further divided until a single category is obtained. This method groups subclasses according to the degree of confusion between classes.
The degree of confusion between category i (Gi) and category j (Gj) is Mixij.
The higher the degree of confusion, the more difficult it is to distinguish categories i and j. When determining the first level of classification, if the degree of confusion between two categories is greater than 0.1, they are grouped into one subclass; if all of a category's confusions are less than 0.02, it is treated as a subclass on its own.
When the degree of confusion between a certain category and the other categories is greater than 0.02 and less than 0.1, its assignment cannot be judged directly. By calculating the total degree of confusion between the category and each subclass, the subclass with the highest total confusion is selected.
The total degree of confusion is Mix(a, B) = Σj∈B Mixaj, where a is an emotional category and B is a subclass containing several emotional categories.
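The total-confusion rule for assigning a borderline category to a subclass can be sketched directly from the definitions above (the confusion matrix values here are placeholders, not the measured ones in Table 5):

```python
def total_confusion(mix, a, subclass):
    """Total confusion Mix(a, B): the sum of the pairwise confusions
    Mix_aj over every category j in subclass B."""
    return sum(mix[a][j] for j in subclass)

def choose_subclass(mix, a, subclasses):
    """For a category whose pairwise confusions all fall in (0.02, 0.1),
    assign it to the subclass with the highest total confusion,
    following the rule stated in the text."""
    return max(subclasses, key=lambda b: total_confusion(mix, a, b))
```

Applying this rule level by level yields the binary tree of classifiers that the hierarchical SVM traverses at recognition time.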
Decision-level fusion. Through experiments, this article chooses the weighted summation method for decision-level fusion.
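Weighted summation at the decision level combines the two modal probability vectors element by element. A minimal sketch; the weight value is an assumption, since the article chose its weights experimentally:

```python
def fuse_decisions(p_face, p_speech, w_face=0.5):
    """Decision-level fusion by weighted summation: combine the facial
    and speech emotion probability vectors element-wise.  w_face is an
    assumed, tunable weight; (1 - w_face) goes to the speech modality."""
    return [w_face * f + (1.0 - w_face) * s
            for f, s in zip(p_face, p_speech)]
```

The fused vector's argmax gives the final emotion label, which is what the bimodal recognition rates in Table 7 are computed from.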

Fuzzy emotional decision-making model
Fuzzy emotional decision-making takes the robot's initial emotional state and the emotional state of the current interactive object as input, and combines them through fuzzy reasoning rules to generate the robot's emotion. This article quantifies the emotional states following the approach of reference 22: the seven emotional states are quantified as interval values, and, considering external stimuli, the intervals are proportionally expanded to [0, 7]. The Mamdani algorithm is used to construct the fuzzy affective decision model.
After analysis, the normal distribution curve of the Gaussian function fits the characteristic that emotional influence weakens gradually from the central point outward. It expresses the degree to which an emotional input belongs to a fuzzy emotional subset and is convenient to calculate. Assuming that all emotional centers have the same influence range and strength, the Gaussian membership function is

μ(x) = exp(−(x − c)² / (2s²))

where c is the position of the center of the function and s is the width of the curve; s is the same for every fuzzy subset, while c differs. Based on a male volunteer, the fuzzy rules of the orthogonal combinations are formulated and the probability of each rule is evaluated. As presented in Table 2, the horizontal rows represent the robot's emotional state at the previous moment, the vertical columns represent the external stimulus, and the numerical values in the table represent the weight of each rule.
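The fuzzification step, computing each emotional input's membership in the seven Gaussian subsets, can be sketched as follows. The subset centres and the common width are illustrative assumptions over the [0, 7] range; the article does not list its exact values:

```python
import math

def gauss_membership(x, c, s):
    """Membership of emotional input x in the fuzzy subset centred at c
    with width s (s is shared by all subsets; only c differs)."""
    return math.exp(-((x - c) ** 2) / (2.0 * s ** 2))

# Seven emotional states quantified on [0, 7]; centres are illustrative.
CENTRES = [0.5 + i for i in range(7)]
WIDTH = 0.5

def fuzzify(x):
    """Membership degrees of input x in each of the seven subsets."""
    return [gauss_membership(x, c, WIDTH) for c in CENTRES]
```

These membership degrees, weighted by the rule table (Table 2) and defuzzified, produce the slow, continuous emotional transitions shown in Figure 10.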

Dialogue subsystem with personal information
In this article, a dialogue subsystem with personal information is proposed (see Figure 8). By giving the chat robot specific personal information, the robot can generate responses consistent with its given information. The system first uses a question classifier to determine whether the question needs to be handled by the personal information dialogue model. If so, the model retrieves the most similar question from the predefined personal information and generates a response consistent with it; otherwise, the question is handled by the open-domain dialogue model.

Question classification
The classification model determines whether an input question needs to be processed by the personal information dialogue model, which is a binary classification problem. It models P(z|x), z ∈ {0, 1}, where z = 1 indicates that the personal information dialogue model is needed: for example, for "How old are you this year?" P(z = 1|x) ≈ 1, and for "How old is your brother this year?" P(z = 1|x) ≈ 0. This article adopts a support vector machine model based on bag-of-words features.
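The bag-of-words featurisation that feeds the SVM can be sketched as follows; the SVM itself (e.g. a linear SVM such as scikit-learn's LinearSVC) would be trained on the resulting vectors, so only the featurisation is shown here, and whitespace tokenisation is a simplifying assumption (Chinese text would need word segmentation first):

```python
def bag_of_words(sentences):
    """Build a bag-of-words vocabulary and vectoriser from a corpus of
    training questions.  Returns the sorted vocabulary and a function
    mapping a sentence to its term-count vector."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    def vectorise(sentence):
        v = [0] * len(vocab)
        for w in sentence.lower().split():
            if w in index:        # out-of-vocabulary words are dropped
                v[index[w]] += 1
        return v

    return vocab, vectorise
```

Each question is mapped to such a vector, and the SVM's sign output corresponds to the binary label z.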

Personal information dialogue model
This article constructs a personal information dialogue model based on the twin (Siamese) network idea. First, the two objects to be matched are represented by a deep learning model; their matching degree is then obtained by computing the similarity between the two representations. This article uses a bidirectional long short-term memory (BiLSTM) network to represent the semantic information of sentences. 23 The loss function used is the contrastive loss, 24 which is often used in twin neural networks and effectively handles the relationship between paired data. The contrastive loss for one pair is

L = s·d² + (1 − s)·max(margin − d, 0)²

where d = ||an − bn||₂ is the Euclidean distance between the two sample features, s is the label indicating whether the two samples match, and margin is a set threshold.
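The per-pair contrastive loss can be computed directly from the definitions in the text, matched pairs are pulled together, unmatched pairs pushed apart up to the margin:

```python
import math

def contrastive_loss(a, b, s, margin=1.0):
    """Contrastive loss for one pair of twin-network embeddings a, b.
    d is their Euclidean distance; s = 1 for a matched pair (penalise
    distance), s = 0 for an unmatched pair (penalise closeness up to
    the margin)."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return s * d ** 2 + (1 - s) * max(margin - d, 0.0) ** 2
```

In training, this per-pair term is averaged over the batch; the BiLSTM sentence embeddings play the role of a and b, and the margin is the tunable threshold mentioned above.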

Dialogue model based on maximum mutual information
The open-domain dialogue model is based on seq2seq. However, if we rely only on maximum likelihood estimation, even when trained on a large amount of data the seq2seq model tends to generate safe answers such as "I don't know," "Ha ha ha," and "Well." We therefore adopt the anti-language model proposed by Li et al. 25 and take maximum mutual information as the objective function of seq2seq:

T̂ = argmax_T { log P(T|S) − λ log P(T) }

The term log P(T|S) is the same as in the maximum log-likelihood model. The term λ log P(T) acts as a penalty on candidate words that have high probability regardless of the input, controlled by the parameter λ. Because of this penalty term, the neural network no longer simply chooses the highest-probability words, which avoids generic answers. However, the penalty can affect sentence structure and fluency, so a piecewise function g(k) is introduced:

g(k) = 1 if k ≤ γ, and g(k) = 0 otherwise

Words generated early have a greater impact on sentence diversity than words generated later. To preserve fluency as much as possible, only the high-probability candidate words generated at the early decoding steps are penalized. The γ in the equation is set to 1, so only the first word of the sentence is penalized, preserving the coherence of the sentence as far as possible.
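Scoring one candidate reply under this objective reduces to subtracting the penalised language-model terms from the conditional log-likelihood. A minimal sketch, taking per-token log-probabilities as inputs (the token-level decomposition here is an assumption about how the penalty is applied during decoding):

```python
def mmi_score(log_p_t_given_s, log_p_t, lam=0.5, gamma=1):
    """Maximum-mutual-information score of one candidate reply T:
    score = log P(T|S) - lam * sum_k g(k) * log p(t_k), where
    g(k) = 1 for k <= gamma and 0 otherwise, so only the first
    `gamma` tokens receive the anti-language-model penalty.
    log_p_t is the list of per-token language-model log-probs."""
    penalty = sum(lp for k, lp in enumerate(log_p_t, start=1) if k <= gamma)
    return log_p_t_given_s - lam * penalty
```

With gamma = 1, as in the article, only the first token's language-model probability is penalised, so generic openers are demoted while the rest of the sentence keeps its fluent maximum-likelihood ordering.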
The model consists of an encoder and a decoder. The encoder is a two-layer BiLSTM with 512 units per layer, and the word-vector dimension is set to 300. Bahdanau's attention mechanism is used. During training, dropout with a retention rate of 0.5 is applied, the Adam learning rate is set to 0.001, the batch size to 32, and the number of data iterations to 128.

Experiment
Experimental results of multi-sensor positioning subsystem. Each positioning module is tested in different environments.
To verify that the multi-sensor positioning system can be applied to various working environments, target positioning experiments were carried out in a normal lighting environment, a dark environment, and a noisy environment. The experimental results are given in Table 3. From the results we can see that some of the positioning modules designed in this article fail in particular scenarios, but the remaining modules still work normally and show good stability.
Fusion experiment results of multi-sensor positioning subsystem. The multi-sensor positioning system is used to locate points in the environment, and the results are fused. Some statistical results are given in Table 4.
The experimental results show that the positioning accuracy after fusion is higher than that of single positioning system, and the stability of positioning has been greatly improved.

Experiments on emotional interaction
Speech emotion recognition. This article uses the CASIA Chinese Emotional Corpus for model training, openSMILE for feature extraction, principal component analysis (PCA) with a 95% contribution threshold, and the LIBSVM toolkit developed by Professor Lin of National Taiwan University (the kernel function is a third-degree polynomial). The degrees of confusion obtained by the experiments are given in Table 5.
According to the degree of confusion, the hierarchical SVM as shown in Figure 9 is designed.
The comprehensive recognition rate and recognition rate of each level classifier are given in Table 6.
Bimodal emotional fusion. Some data in eNTERFACE'05 multi-modal emotion database is used to verify the validity. The five emotional expressions of happiness, surprise, fear, sadness, and anger are screened by scoring principle. Thirty pieces of data are selected for each emotion for bimodal emotion recognition.
The results of single-modal and bimodal emotion recognition are compared in Table 7. The experimental results show that bimodal emotion recognition outperforms single-modal recognition, with an average recognition rate of 59.34%. Table 8 gives the results of emotion recognition. When the probability of detecting disgust is the highest, speech emotion recognition is not carried out, and disgust is taken as the final recognition result.
Fuzzy emotional decision-making. After setting the initial emotional state of the robot, with the change of external stimulus, the current emotional state of the robot is simulated and calculated.
When the initial emotional state of the robot is calm, the emotional state of the interactive object changes in turn. The emotional change curve of the robot is given in Figure 10(a). When the robot changes to the state of surprise, as the external stimulus gradually turns to fear, the robot will change to fear. When it encounters the external stimulus of sadness, the robot will then turn to sadness, and keep the sad state under the stimulus of disgust. Figure 10(b) shows the simulation of experimental results with sad initial state.
The simulation results show that under the fuzzy affective decision-making model, the change of the robot's emotional state under the external stimulus is slow and continuous, and the calculation results are in accordance with the reasoning rules and human's emotional changes.

Experiments of dialogue subsystem
In this article, accuracy, F1 score (F1), and area under the curve (AUC) are selected to evaluate the performance of the question classification model; the values are 87.69%, 0.8754, and 0.8797, respectively.
In the personal information reply model, since five identities are set in this article, accuracy refers to whether the category of the most similar sentence belongs to the labeled category; it is 87.4%.
The open-domain dialogues are evaluated by manual assessment and the bilingual evaluation understudy (BLEU) metric. In the experiment, the penalty coefficient λ of the maximum mutual information model is 0.5 and γ is set to 1. The BLEU scores of the two models are 0.17 and 0.25, so the mutual information model is better. In the manual evaluation, most raters considered the results of the mutual information model similar to those of the maximum likelihood model, while 30% considered the mutual information model better.
In the overall dialogue system, this article randomly selected some dialogues (given in Table 9) and asked volunteers to evaluate the following aspects: naturalness (whether the generated response is natural and smooth); logic (whether the generated response is logically related to the question); information consistency (whether responses involving personal information are consistent); and variety (whether there are multiple ways to respond to a question).
From Table 10, we can see that our model is superior to the ordinary seq2seq model in each index, especially in information consistency, because this article adds personal information reply model to ensure the consistency of personal information.

Conclusion and future work
This article designs an interactive system for the humanoid robot SHFR-III. The system uses a multi-sensor positioning subsystem to locate targets accurately in complex environments, uses a bimodal emotion recognition model and a fuzzy emotional decision-making model to conduct human–robot emotional interaction through the robot's facial expressions, and uses a dialogue subsystem with personal information to generate responses consistent with the default information. The system is easy to implement and highly interactive. It can be applied to elderly care, where HRI helps us understand users' emotional state and chatting relieves their loneliness, and it can also be used in robot-assisted teaching, autism treatment, and other fields.
Our work is only a small step toward achieving harmonious HRI, and there are many future directions. Positioning system: constructing a three-dimensional auditory sensor array to improve positioning accuracy and optimizing the fusion strategy. Emotional interaction: this article studies discrete emotion recognition; later, we can study dimensional emotion recognition, which facilitates joint research with dimensional emotional models while taking personality, psychology, morality, and other factors into account.
Dialogue system: Integrating emotional information into the dialogue system to generate a response with emotional information style. Introducing reinforcement learning into dialogue system, and improving the naturalness and logic of response by adding a "teacher." In addition, specific subsystems should be added for different application scenarios, such as robot hand grasping, motion trajectory control, smart home control, and so on.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.