Efficient and robust approaches for three-dimensional sound source recognition and localization using humanoid robot sensor arrays

Efficient and robust sound source recognition and localization is a basic capability that humanoid robots need in order to react to their environment. The fixed sensor arrays and limited computation resources of humanoid robots make sound source recognition and localization challenging. This article proposes a sound source recognition and localization framework that realizes real-time, precise recognition and localization using humanoid robots' sensor arrays. The audio type is recognized from the cross-correlation function, and a steered response power-phase transform function in a discrete angle space is used to search for the sound source direction. The framework presents a new multi-robot collaboration system to obtain a precise three-dimensional sound source position and introduces a distance weighting revision to optimize localization performance. Experiments carried out on the humanoid robot NAO demonstrate that the proposed approaches recognize and localize the sound source efficiently and robustly.


Introduction
Humanoid robots are designed to interact with people and react to their environment. As a mark of intelligence, auditory perception is essential for recognizing environmental changes. For example, humanoid robots in the RoboCup competition need to recognize the whistle blown by the referee as the signal to start the game. In this situation, sound source recognition and localization (SSRL) is employed to classify whether the audio is a whistle and to localize the sound source position accurately, avoiding misidentification of whistles from other fields. SSRL is challenging because the sensor arrays in humanoid robots are fixed, and a sound event can occur in an arbitrary direction in three-dimensional (3D) space with noise and reverberation. In addition, many other techniques such as visual recognition must run simultaneously on humanoid robots with limited computation resources. As a result, computationally efficient and robust real-time 3D SSRL methods are increasingly required.
In the last few years, many SSRL algorithms have been applied to intelligent robots.[1][2][3] For sound recognition, frequency domain analysis is frequently adopted:4 detected principal frequency components are classified against a predefined frequency domain space to determine whether the audio is of a certain type. Sensor arrays in humanoid robots, however, are usually fixed and equipped with low-sample-rate microphones, so false recognition often occurs. To solve this problem, a new sound source recognition method that is not limited to frequency discrimination is adopted in this article.
Although sensor arrays differ between applications, the commonly used sound source localization (SSL) cues are the inter-channel time difference (ICTD) and the inter-channel level difference. Rascon and Meza5 and Argentieri et al.6 present surveys on SSL in robotics. Typical ICTD-based methods such as generalized cross-correlation phase transform (GCC-PHAT)[7][8][9] and steered response power-phase transform (SRP-PHAT)10,11 estimate the sound source from the time difference of arrival (TDOA) feature. Some multiple signal classification (MUSIC)-based localization methods are presented in Takeda and Komatini,12 Birnie et al.,13 and Hoshiba et al.14 The basic principle of the MUSIC algorithm is to decompose the covariance matrix of the output data into a signal subspace and a noise subspace; the incident direction of the signal is then estimated using the orthogonality of the two subspaces. However, these methods often cannot meet the strict requirements of humanoid robots, such as limited computation resources, noisy environments, and the need for reliability and efficiency.
Some applications are implemented on robots with fixed microphone array configurations in Valin et al.,15 Lee et al.,16 Grondin and Michaud,17 and Bustamante et al.18 In Valin et al.,15 the mobile robot can localize the sound source over a range of 3 m with a precision of 3°. In Athanasopoulos et al.,19 an application on the humanoid robot NAO is presented, which reaches a precision of 5° when locating a sound source at the same elevation as the microphone array's horizontal plane. Furthermore, with more computation resources, deep neural networks for SSL are introduced in He et al.,20 Zhang and Wang,21 Yalta et al.,22 and Sun et al.23 The methods mentioned above are still unable to meet the requirements of the RoboCup soccer competition. In the RoboCup SPL League, all teams must use NAO humanoid robots manufactured by SoftBank Robotics. The match is held on a field of length 10.4 m and width 7.4 m, and the whistle may be blown from the robots' own field or a neighboring field, so the strict requirements on distance range and positioning accuracy are challenging for an SSRL algorithm. In 2019, a directional whistle challenge was held by the organizing committee to investigate the possibility of localizing the point where the referee's whistle is blown; details can be found at https://spl.robocup.org/technical-challenges-2019/. This article takes the RoboCup SPL League competition and whistle challenge environment as one of the testbenches for the proposed methods.
In this article, we propose an SSRL framework addressing SSRL in humanoid robots. A cross-correlation-based recognition method classifies the audio type, and a multi-robot collaboration system locates the sound source. The cross-correlation-based recognition takes a prerecorded reference signal, rather than frequency domain features, as the basic cue to distinguish the audio type. The single robot sound source direction estimator (SSDE) estimates the source direction by searching for the maximum of the SRP-PHAT function in a discrete angle space, while the 2D multi-robot SSL algorithm and 3D SSL determine the source position. SSRL aims at resolving the SSRL problem in both indoor and outdoor circumstances. Figure 1 shows the structure of SSRL.
The contributions of this article are as follows: (1) a robust, noise-resistant sound recognition algorithm based on the cross-correlation feature; (2) an improved SRP-PHAT function and a simplified discrete angle space search with lower computation cost; (3) a multi-robot collaboration localization system with distance weighting revision for more precise localization results; and (4) a real-time application on humanoid robots NAO equipped with human-like microphone arrays.
The rest of the article is structured as follows. In the second section, we discuss the sound recognition algorithm in SSRL and explain the cross-correlation algorithm with noise filtering. We introduce the SSL algorithm, including 2D SSL and 3D SSL, in the third section, where the enhanced SRP-PHAT function and distance weighting revision are also presented. Experiments and result analysis are shown in the fourth section. Finally, in the fifth section, the article is concluded with a discussion and possible future research directions.

Sound source recognition algorithm
Considering the noise and reverberation mixed into the sampled audio data, misidentification may occur if we classify the audio type by detecting only the principal frequency components. The main procedure of the sound source recognition algorithm in this article is demonstrated in Algorithm 1. The audio data are acquired and sampled from the microphones in the robot's head. After transferring the time-domain signal to the frequency domain using the fast Fourier transform, a cross-correlation function between the signal and a prerecorded reference signal is calculated, and the audio is identified as the certain type if the function's value is over the preset threshold.

Preprocessing: Noise filter
The noise in the audio data originates from two sources: external noise from the noisy environment and internal noise caused by the processor's ventilation fan, gyroscopes, and so on. It is essential to reduce the noise with a noise filter before recognition.24 A band-pass filter is used in our algorithm, and its upper and lower limits are determined by the audio type. For instance, the frequency of a whistle is 2.5-3.5 kHz in most cases, so we can set the limits of the band-pass filter [f_L, f_H] to 1.5-4.5 kHz to filter out the noise. The noise filter is executed before calculating the cross-correlation function to suppress noise interference effectively.
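As an illustrative sketch of this preprocessing step, the band-pass filter can be approximated by zeroing FFT bins outside [f_L, f_H]. The article does not specify the filter design, so this brick-wall implementation and the signal parameters below are assumptions:

```python
import numpy as np

def bandpass_fft(signal, fs, f_lo, f_hi):
    """Simple FFT brick-wall band-pass: zero out all spectral bins whose
    frequency lies outside [f_lo, f_hi], then transform back."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spec, n=len(signal))

# Example: a 3 kHz "whistle" plus 200 Hz hum, sampled at 48 kHz for 1 s.
fs = 48000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 3000 * t) + 0.8 * np.sin(2 * np.pi * 200 * t)
clean = bandpass_fft(noisy, fs, 1500.0, 4500.0)
```

A real implementation would more likely use a designed IIR or FIR band-pass (brick-wall filtering can ring on short frames), but the effect on stationary interference is the same.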

Cross-correlation function
The audio type identification cue in SSRL is the cross-correlation function[25][26][27] between the sampled audio data segments and a prerecorded reference sound source.
Given two sound source signals x_1(n) and x_2(n), let x_1(n) be the audio segment to be analyzed and x_2(n) the reference audio segment. The sampling frequency of both is f and the number of samples is N.
For two continuous signals, the cross-correlation function is defined as equation (1)

R(\tau) = \int_{-\infty}^{+\infty} x_1(t)\, x_2(t + \tau)\, dt \quad (1)

When we process the audio data in humanoid robots, discrete sampling is carried out first. The cross-correlation function of two discrete signals is defined as equation (2), and the length of R(\tau_n) is 2N - 1

R(\tau_n) = \sum_{m=0}^{N-1} x_1(m)\, x_2(m + n), \quad n = -(N-1), \ldots, N-1 \quad (2)

However, computing the cross-correlation directly in the time domain as above is time-consuming on humanoid robots with limited computation resources. Hence, we perform the calculation in the frequency domain.
According to the Wiener-Khinchin theorem, the correlation function of a wide-sense-stationary random process has a spectral decomposition given by the power spectrum of that process

R(\tau) = \int_{-\infty}^{+\infty} P(\omega)\, e^{j\omega\tau}\, d\omega \quad (3)

where P(\omega) is the cross power spectrum of x_1 and x_2.

Algorithm 1. Sound source recognition.

It can be simplified by the exchange of integration order and the shift property of the Fourier transform

P(\omega) = F_1(\omega)\, F_2^{*}(\omega) \quad (4)

Thus, the cross-correlation function can be presented as

R(\tau) = \int_{-\infty}^{+\infty} F_1(\omega)\, F_2^{*}(\omega)\, e^{j\omega\tau}\, d\omega \quad (5)

and its discrete representation is

R(\tau_n) = \mathrm{IFFT}\left[ F_1(k)\, F_2^{*}(k) \right] \quad (6)

For any audio data segment from the microphones, we calculate the cross-correlation function between it and the reference sound. The signal is identified as type T if the ratio of the maximum value of the cross-correlation function \max(R(\tau_n)) to the maximum value of the reference sound auto-correlation function \max(R_0(\tau_n)) is larger than the preset threshold of type T

\frac{\max(R(\tau_n))}{\max(R_0(\tau_n))} > \varepsilon_T \quad (7)

The empirical value of the threshold is usually set to 0.4 in practical use. As a result, we can deduce the audio type accurately based on the cross-correlation feature. The cross-correlation-based method is more robust than detection of principal frequency components.
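The recognition pipeline above can be sketched as follows. The 0.4 threshold follows the text, while the reference whistle, the synthetic test segments, and the 2048-sample window are illustrative assumptions:

```python
import numpy as np

def xcorr_fft(x1, x2):
    """Cross-correlation via the frequency domain: IFFT of the cross power
    spectrum F1(w) * conj(F2(w)). Zero-padding to 2N-1 gives the linear
    (non-circular) correlation."""
    n = len(x1) + len(x2) - 1
    F1 = np.fft.fft(x1, n)
    F2 = np.fft.fft(x2, n)
    return np.real(np.fft.ifft(F1 * np.conj(F2)))

def is_type(segment, reference, threshold=0.4):
    """Identify the segment as the reference's type when the peak of the
    cross-correlation, normalized by the reference auto-correlation peak,
    exceeds the preset threshold."""
    r = np.max(xcorr_fft(segment, reference))
    r0 = np.max(xcorr_fft(reference, reference))
    return bool(r / r0 > threshold)

# Illustrative signals: a 3 kHz reference "whistle", a phase-shifted and
# attenuated whistle segment, and a noise burst standing in for a clap.
fs = 48000
t = np.arange(2048) / fs
reference = np.sin(2 * np.pi * 3000 * t)
whistle = 0.9 * np.sin(2 * np.pi * 3000 * t + 0.3)
clap = 0.5 * np.random.default_rng(0).standard_normal(2048)
```

Unlike a principal-frequency test, the normalized correlation peak stays low for broadband noise even when its spectrum overlaps the whistle band.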

SSL algorithm
In this section, the proposed SSL algorithm is described. It consists of a single robot SSDE and a multi-robot collaboration SSL algorithm. In the "Single robot SSDE" section, the single robot SSDE is introduced. The "Multi-robot 2D SSL using distance weighting revision" and "Three-dimensional sound source localization" sections give more detailed introductions to the 2D SSL and 3D SSL algorithms.

Single robot SSDE
As depicted in Figure 2, the humanoid robot NAO is equipped with four microphones. The array configuration is analogous to human ears, as the microphones are distributed on the left and right sides. An important cue for SSL is the TDOA,2,28,29 but the TDOA model between two microphones alone is not enough to locate the sound source because the cone of confusion leads to a mirror-image position. The classical GCC-PHAT estimates the angle from TDOA under the far-field assumption.30 As Figure 3 shows, the angle of the sound arriving at mic1 and mic2 is \alpha and the difference in arrival distance is c\Delta T, so we can infer the direction of the sound source from the geometric relationship based on the TDOA and the distance between the microphones. In this section, we propose a single robot SSDE that uses all four microphones to determine the source direction.
The theoretical basis of SSL is the TDOA model of each microphone pair. When the sound source S_i is close to the microphone pair, the sound wave received by microphone M_i can be considered a spherical wave. Under the spherical-wave propagation assumption, the time of arrival at each microphone can be estimated as the ratio of the source-to-microphone distance to the sound speed c in air

\tau_i = \frac{\lVert S - r_i \rVert}{c} \quad (8)

The coordinates that describe the source direction are shown in Figure 4.
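A minimal sketch of this TDOA model (the two-microphone geometry below is invented for illustration):

```python
import numpy as np

C_AIR = 343.0  # sound propagation speed in air, m/s

def toa(source, mic):
    """Time of arrival under spherical propagation: distance / c."""
    return np.linalg.norm(np.asarray(source, float) - np.asarray(mic, float)) / C_AIR

def tdoa(source, mic_l, mic_m):
    """Time difference of arrival between microphones l and m."""
    return toa(source, mic_l) - toa(source, mic_m)

# Example: two mics 0.1 m apart, source 2 m away along the axis of mic l,
# so mic l hears the sound slightly earlier than mic m.
dt = tdoa((2.0, 0.0), (0.0, 0.0), (0.0, 0.1))
```

Note that |TDOA| can never exceed the inter-microphone distance divided by c, which is what bounds the achievable angular resolution discussed below.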
Let the height of the source be h. SSDE describes the source direction using azimuth \alpha and elevation \beta relative to the robot's coordinates. The SRP-PHAT feature is unique for each direction in the discrete angle space, so the direction can be deduced from a set of microphone signals.
Given X_M(n) as the signal segment received by the Mth microphone and q(x, y, z) as the hypothesized sound source position, the SRP-PHAT feature function31 is defined as equation (10)

P(q(x,y,z)) = \sum_{l=1}^{M} \sum_{m=l+1}^{M} \hat{R}_{lm}\left[ \tau_{lm}(q(x,y,z)) \right] \quad (10)

where \hat{R}_{lm}[\tau_{lm}(q(x,y,z))] is the generalized cross-correlation function between the lth and mth microphone signals

\hat{R}_{lm}(\tau) = \int_{-\infty}^{+\infty} \frac{F_l(\omega)\, F_m^{*}(\omega)}{\left| F_l(\omega)\, F_m^{*}(\omega) \right|}\, e^{j\omega\tau}\, d\omega \quad (11)

where F_m(\omega) is the fast Fourier transform of X_m(n) and \tau_{lm}(q(x,y,z)) is the TDOA between the lth and mth microphone signals. Let r_l and r_m represent the microphone positions and c the sound propagation speed in air (usually 343 m/s); then \tau_{lm}(q(x,y,z)) is given by

\tau_{lm}(q(x,y,z)) = \frac{\lVert q - r_l \rVert - \lVert q - r_m \rVert}{c} \quad (12)

Therefore, the estimate of the real sound source position can be presented as equation (13)

\hat{q} = \arg\max_{q \in Q} P(q(x,y,z)) \quad (13)

As mentioned before, it is time-consuming to search the 3D space point by point. We therefore propose SSDE, which discretizes the space Q into a discrete azimuth-elevation angle space Q'(\alpha, \beta).
As the sound source height is assumed to be h, we can get \tau(q'(\alpha, \beta)) from q'(\alpha, \beta). Given r_o = (x_o, y_o, z_o) as the head chain origin position in world coordinates, the 3D coordinates of q'(\alpha, \beta) can be approximated as equation (14)

q'(\alpha, \beta) \approx r_o + \frac{h - z_o}{\sin\beta} \left( \cos\beta \cos\alpha,\; \cos\beta \sin\alpha,\; \sin\beta \right) \quad (14)

Now we can write the SSDE in discrete angle space as equation (15)

(\hat{\alpha}, \hat{\beta}) = \arg\max_{(\alpha, \beta) \in Q'} P(q'(\alpha, \beta)) \quad (15)

It has to be noted that the height assumption may introduce some estimation error, but searching in the discrete angle space reduces the computation load so that real-time processing on the humanoid robot NAO becomes possible. In the "Multi-robot 2D SSL using distance weighting revision" and "Three-dimensional sound source localization" sections, we revise the 3D position to improve the result precision. The SSDE algorithm is summarized as Algorithm 2.
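The discrete angle-space search can be sketched as below. The microphone layout, grid steps, assumed height h, and synthetic test signal are invented for illustration and do not reflect NAO's actual configuration:

```python
import numpy as np

C = 343.0  # sound speed in air, m/s

def srp_phat_direction(signals, mics, fs, h, origin, az_grid, el_grid):
    """Search a discrete azimuth-elevation grid for the direction that
    maximizes the PHAT-weighted steered response power. `h` is the assumed
    source height used to map (az, el) to a 3D point."""
    n = signals.shape[1]
    F = np.fft.rfft(signals, axis=1)
    w = 2.0 * np.pi * np.fft.rfftfreq(n, 1.0 / fs)
    pairs = [(l, m) for l in range(len(mics)) for m in range(l + 1, len(mics))]
    best, best_dir = -np.inf, None
    for az in az_grid:
        for el in el_grid:
            # Map (az, el) to the 3D point at the assumed source height h.
            r = (h - origin[2]) / np.sin(el)
            q = origin + r * np.array([np.cos(el) * np.cos(az),
                                       np.cos(el) * np.sin(az),
                                       np.sin(el)])
            p = 0.0
            for l, m in pairs:
                tau = (np.linalg.norm(q - mics[l])
                       - np.linalg.norm(q - mics[m])) / C
                cross = F[l] * np.conj(F[m])
                phat = cross / (np.abs(cross) + 1e-12)  # PHAT weighting
                p += np.real(np.sum(phat * np.exp(1j * w * tau)))
            if p > best:
                best, best_dir = p, (az, el)
    return best_dir

# Synthetic check: white noise delayed per microphone for a known direction.
rng = np.random.default_rng(1)
n, fs = 2048, 48000
mics = np.array([[0.05, 0.05, 1.5], [0.05, -0.05, 1.5],
                 [-0.05, 0.05, 1.5], [-0.05, -0.05, 1.5]])
origin = np.array([0.0, 0.0, 1.5])
h, true_az, true_el = 1.7, 0.6, 0.4
r = (h - origin[2]) / np.sin(true_el)
src = origin + r * np.array([np.cos(true_el) * np.cos(true_az),
                             np.cos(true_el) * np.sin(true_az),
                             np.sin(true_el)])
S = np.fft.rfft(rng.standard_normal(n))
w = 2.0 * np.pi * np.fft.rfftfreq(n, 1.0 / fs)
sigs = np.array([np.fft.irfft(S * np.exp(-1j * w * np.linalg.norm(src - m) / C), n)
                 for m in mics])
az_hat, el_hat = srp_phat_direction(sigs, mics, fs, h, origin,
                                    np.arange(0.0, 2 * np.pi, 0.2),
                                    np.arange(0.1, 1.2, 0.1))
```

Because the synthetic delays and the steering delays come from the same geometry, the search recovers the true grid direction exactly; on a real robot, noise, reverberation, and the height assumption introduce error.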
Besides, the localization resolution in discrete angle space is limited by the humanoid robot's microphone array and sampling frequency, since we use TDOA as our localization cue.32 Suppose the sampling frequency is f_max and the distance between the lth and mth microphones is d_lm. The maximum time delay that can be measured is d_lm / c, so (d_lm / c) \times f_max delay values, and hence angles, can be distinguished at that sampling frequency. Given M microphones, the possible maximum resolution is

\Delta q = \frac{180^{\circ}}{C_M^2 \cdot \frac{d_{lm}}{c} f_{max}} \quad (16)

The distance between the microphones and the sampling frequency are the main factors that limit the SSDE resolution. As a result, the discretization interval should be set larger than \Delta q in angle space.
Multi-robot 2D SSL using distance weighting revision

SSDE provides the estimated sound source direction relative to a single robot. SSRL introduces a multi-robot collaboration SSL algorithm built on SSDE. As shown in Figure 5, a 2D position can be estimated by intersecting the azimuth rays and applying distance weighting revision. After combining the elevation angle, the height can be revised to obtain a more precise 3D position. The 2D SSL algorithm based on distance weighting revision is presented in this section.
It needs to be stated that the robots rely on Wi-Fi communication. After each robot finishes processing the sound signal, its angle information is sent to the master robot, which executes the 2D SSL and 3D SSL algorithms in real time.
To illustrate 2D SSL, assume there are R humanoid robots NAO (R = 3 in this article) with initial poses L_r = [x_r, y_r, \theta_r], r = 1, \ldots, R. For each robot, the relative sound source direction q_r(\alpha_r, \beta_r) is calculated using SSDE, where \alpha_r is the azimuth and \beta_r is the elevation.
Intersecting the R robots' azimuth rays gives C_R^2 intersections P_i, i = 1, \ldots, C_R^2. After averaging, an estimated 2D position is obtained as equation (17)

P_{uncorrected} = \frac{1}{C_R^2} \sum_{i=1}^{C_R^2} P_i \quad (17)

When the sound source distance is far greater than the microphone interval, we can analyze the sound propagation by assuming the microphones are located at the head chain origin. For an angle-resolution sector, the larger the radius, the longer the arc (L = r\alpha). All points on the arc are regarded as the same angle at the current angle resolution. That is, within the same angle-resolution zone, the closer the microphone is to the sound source, the smaller the arc length L and the smaller the area it represents. Correspondingly, a small shift of the sound source \Delta q is reflected in a small deviation of the angle \Delta\alpha. Thus, a closer distance to the microphone means higher credibility of the identified sound source direction.
Using this criterion, 2D SSL revises the estimated 2D position by distance weighting revision. For each robot, the distance between it and the uncorrected sound source position is calculated as d_r, r = 1, \ldots, R. Then the closest robot is selected to revise the position. The corrected position is obtained by rotating the uncorrected position onto the azimuth ray of the closest robot

P_{corrected} = L_{r_i} + d_{r_i} \left( \cos(\theta_{r_i} + \alpha_{r_i}),\; \sin(\theta_{r_i} + \alpha_{r_i}) \right) \quad (18)

where r_i is the index of the robot closest to the sound source. In 2D SSL, not only is all robots' direction information combined into the uncorrected position, but the closest robot's distance and corresponding angle are also used to correct it. The corrected 2D position incorporates distance information and is more reliable than the uncorrected one. 2D SSL in this section provides the 2D estimated position of the sound source; the 3D estimated position is inferred in the next section.
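Under our reading of the distance weighting revision, i.e. averaging the pairwise ray intersections and then snapping the estimate onto the closest robot's azimuth ray at the same distance, the 2D step can be sketched as follows (for simplicity, azimuths here are already expressed in world coordinates):

```python
import numpy as np

def ray_intersection(p1, a1, p2, a2):
    """Intersection of two 2D rays starting at p1 (azimuth a1) and p2 (a2)."""
    d1 = np.array([np.cos(a1), np.sin(a1)])
    d2 = np.array([np.cos(a2), np.sin(a2)])
    A = np.column_stack([d1, -d2])
    t = np.linalg.solve(A, np.asarray(p2, float) - np.asarray(p1, float))
    return np.asarray(p1, float) + t[0] * d1

def ssl_2d(robot_xy, azimuths):
    """2D multi-robot SSL sketch: average all pairwise ray intersections,
    then rotate the estimate onto the closest robot's azimuth ray while
    keeping its distance (our reading of distance weighting revision)."""
    R = len(robot_xy)
    pts = [ray_intersection(robot_xy[i], azimuths[i], robot_xy[j], azimuths[j])
           for i in range(R) for j in range(i + 1, R)]
    p_unc = np.mean(pts, axis=0)
    dists = [np.linalg.norm(p_unc - np.asarray(p, float)) for p in robot_xy]
    i = int(np.argmin(dists))
    d = np.array([np.cos(azimuths[i]), np.sin(azimuths[i])])
    return np.asarray(robot_xy[i], float) + dists[i] * d

# Noise-free check: three robots whose azimuths point exactly at (1, 1).
src = np.array([1.0, 1.0])
robots = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
az = [np.arctan2(*(src - p)[::-1]) for p in robots]
est = ssl_2d(robots, az)
```

With exact azimuths all C_R^2 intersections coincide with the source, so the correction is a no-op; with noisy azimuths the snap onto the most credible ray is what improves the estimate.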

Three-dimensional sound source localization
This section aims at deducing the 3D estimated position of the sound source. Once we get the 2D position from 2D SSL, we can combine it with the elevation and work out the height of the sound source to revise the height assumed before. Figure 6 shows the 3D SSL coordinates. The main procedure is described as follows: given two vectors with fixed starting points, \vec{n} and \vec{b}, where \vec{n} is the normal vector of the horizontal plane with starting point P_{corrected} and direction along the vertical, and \vec{b} is the elevation ray vector with starting point L_{r_i} and direction given by the elevation \beta_{r_i}.
Intersecting \vec{n} and \vec{b}, we get P_{cross}, which is taken as the 3D estimated sound source position P_{3D}.
The whole 3D SSL algorithm is summarized in Figure 7. SSDE first works out the azimuth and elevation from the acquired microphone signals. Combining the angle information with the robots' positions, the 2D position is deduced by 2D SSL. Next, the elevation angle is used to revise the height and obtain the 3D position by 3D SSL.
The main procedure is described in Algorithm 3.
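A geometric sketch of the 3D lift: intersecting the vertical line through the corrected 2D position with the closest robot's elevation ray reduces, in coordinates, to height = robot height + horizontal distance × tan(elevation). This is an assumed simplification of the vector intersection described above:

```python
import numpy as np

def ssl_3d(p2d, robot_pos, elevation):
    """Lift the corrected 2D estimate to 3D. `robot_pos` is the closest
    robot's head origin in world coordinates; the source height follows
    from the horizontal distance and the elevation angle."""
    p2d = np.asarray(p2d, dtype=float)
    robot_pos = np.asarray(robot_pos, dtype=float)
    horiz = np.linalg.norm(p2d - robot_pos[:2])
    z = robot_pos[2] + horiz * np.tan(elevation)
    return np.array([p2d[0], p2d[1], z])

# Robot head at 0.5 m height, source 2 m away horizontally; the elevation
# is chosen so the true source height is 1.5 m.
robot = (0.0, 0.0, 0.5)
p2d = (2.0, 0.0)
el = np.arctan2(1.5 - 0.5, 2.0)
p3d = ssl_3d(p2d, robot, el)
```

This recovers the height that was only assumed during the SSDE grid search, which is why the 3D step improves on the raw single-robot estimate.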

Experiments and analysis
We perform two experiments and one application to evaluate the proposed SSRL framework. In the first experiment, we compare the proposed sound recognition algorithm with a classical algorithm based on the short-time Fourier transform (STFT),33,34 and perform a numerical analysis to show the accuracy and anti-interference ability of our system. Then we test our framework in indoor and outdoor environments to evaluate the performance of SSDE, comparing SSDE with the GCC-PHAT-based and classical SRP-PHAT-based algorithms to illustrate the efficiency and robustness of our framework. Additionally, we apply the proposed system on humanoid robots NAO and test it in the RoboCup SPL Technical Challenge.

Sound source recognition test
We evaluate our sound recognition algorithm on the humanoid robot NAO (V5), which is equipped with four microphones on its head. The configuration parameters of the microphone arrays are presented in Table 1.
We put the robot in an indoor environment and compare our algorithm with the STFT-based method. Whistle, speech, and clap sounds are collected to test the whistle recognition accuracy and anti-interference ability. To test the anti-interference ability of our algorithm, we collect the same kind of whistle with noise. The noise is simulated by a sound player located 0.5 m away from the robot, and the test audio is played 1 m away from the robot. The level and frequency of the audio are presented in Table 2. The frequencies of the clap sound and speech overlap with the whistle, which is why STFT fails in some cases.
The audio sampling frequency is set to 48 kHz in this experiment. The sampled microphone signals, including whistle, speech, and clap, are normalized as shown in Figure 8. The signal segments differ in frequency, as the figure clearly shows, and the data are consistent with Table 2; the frequencies of the whistle and the clap sound are very close to each other.
The STFT-based method detects the principal frequency components using the STFT. For instance, the STFT frequency spectrum of a section of whistle is shown in Figure 9. The range for whistle identification is 2.5-3.5 kHz, so the signal can be taken to be a whistle if its principal frequency is within this range. The SSRL proposed in this article instead uses the cross-correlation function between the current audio segment and the reference whistle as the standard for identifying the whistle. The ratio of the cross-correlation function between the two signals to the reference signal's auto-correlation function during a whistle-blowing event is shown in Figure 10.
Two hundred samples are collected to compare the two algorithms. The identification results are shown in Table 3 and Figure 11.
The criterion used to judge the accuracy in this article is

\mathrm{accuracy} = \frac{N_{correct}}{N_{target}} \quad (19)

where N_{correct} is the number of correctly identified whistle samples among the N_{target} whistle samples, and the criterion of the anti-interference ability is

\mathrm{anti\text{-}interference} = 1 - \frac{N_{false}}{N_{noise}} \quad (20)

where N_{false} is the number of false identifications among the N_{noise} non-target or noisy samples. As we can see from the results, the two algorithms perform differently on the test set. The STFT-based recognition method achieves an accuracy of 0.52 and an anti-interference ability of 0.89, while the SSRL proposed in this article reaches an accuracy of 0.75 and an anti-interference ability of 0.98. SSRL thus outperforms STFT in both accuracy and anti-interference ability.
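Under our reading of the two criteria, the fraction of target samples correctly identified and one minus the false-alarm rate on non-target samples, they reduce to simple count ratios:

```python
def accuracy(n_correct, n_target):
    """Accuracy: correctly identified target sounds over all target-sound
    samples (our reading of equation (19))."""
    return n_correct / n_target

def anti_interference(n_false, n_non_target):
    """Anti-interference ability: one minus the false-alarm rate on
    non-target samples (our reading of equation (20))."""
    return 1.0 - n_false / n_non_target
```

For example, 75 correct detections out of 100 whistle samples gives an accuracy of 0.75, and 2 false alarms out of 100 non-whistle samples gives an anti-interference ability of 0.98; the sample counts here are illustrative, not the article's data.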

SSDE test
In this section, we evaluate the efficiency and robustness of three source direction estimators, SSDE, GCC-PHAT, and SRP-PHAT, in indoor and outdoor environments. The angle estimation accuracy and time cost are measured to determine which algorithm is better. To test robustness, the corresponding audio is collected in both environments. The indoor environment is a totally enclosed room of length 13.0 m and width 8.0 m, while the outdoor environment is a very large stadium. There is reverberation caused by wall reflection in the indoor environment, and there are noisy voices in the outdoor environment.
Both SRP-PHAT and SSDE use the same kind of SRP feature. The classical SRP-PHAT algorithm estimates the source direction by searching the 3D space in 3D coordinates, while SSDE searches in angle space. The SRP-PHAT searching step is set to 0.05 m in this experiment, and the searching domain is 9 × 6 m² with a height of 1.3-1.9 m.
In indoor environment with reverberation, sound source in different azimuth-elevation position is acquired to test the angle estimation effect of three algorithms. Only some selected angle ranges are tested and results are shown in Tables 4 and 5. The error distribution and time cost are shown in Figure 12 and Table 6.
Similarly, we set up the same experiments in outdoor environments and test the three algorithms. The results are shown in Tables 7 to 9 and Figure 13.
The different error rates for different angles may be caused by random noise, but it is clear from the results that the angle estimation error of SSDE is less than that of GCC-PHAT and the average time cost of SSDE is less than that of SRP-PHAT. In summary, the GCC-PHAT-based method is faster than the other two but does not perform well in angle estimation in complicated environments. The SRP-PHAT-based method performs well in angle estimation but has low efficiency and spends too much time searching. Our proposed SSDE shows higher accuracy and efficiency and is robust in both indoor and outdoor environments.

Application
We apply SSRL for 3D SSL on humanoid robots NAO. To test recognition and localization simultaneously, we design a whistle recognition and localization experiment on the robot soccer competition field. The experiment environment configuration is shown in Figure 14: the robots are put in predefined initial positions and stay still. Robot 1 is placed at (2.3 m, 1.7 m) facing -62.1°, robot 2 is placed at (-1.5 m, -1.3 m) facing 40.9°, and robot 3 is placed at (-3.3 m, 0.6 m) facing 0°. Once the referee blows the whistle in any place, the robots detect the signal and recognize the whistle. Meanwhile, the audio data segments from all microphones are used to locate the whistle. The robots contact each other via Wi-Fi and share their SSDE results. Finally, the robots calculate the 3D sound source position using 3D SSL, and the console presents the result.
Three indexes are used to measure the localization result in this article: The first one is the 3D absolute position error, the second one is the relative distance error to the robot with closest distance to whistle, and the third one is the relative azimuth rotation error.
Suppose the actual position of the whistle is q_{GT} and the SSRL estimated position is q_S; the three indexes are defined as equation (21)

e_1 = \lVert q_S - q_{GT} \rVert, \quad e_2 = \left| \lVert q_S - L_{r_{min}} \rVert - \lVert q_{GT} - L_{r_{min}} \rVert \right|, \quad e_3 = \left| \alpha_S - \alpha_{GT} \right| \quad (21)

where L_{r_{min}} is the position of the robot closest to the whistle position, \alpha_{GT} is the angle to the whistle position in the closest robot's coordinates, and \alpha_S is the angle to the estimated position in the closest robot's coordinates. All errors are normalized to their absolute values. We take the robot soccer competition field as the test field and place the robots at the set places in Figure 15. The whistle is blown by a standing adult in every 0.5 × 0.5 m² area. The resulting error distribution throughout the whole field is shown in Figures 15 to 17. In the field of size 9 × 6 m², the error is very small in most areas, but there still exist some points with large error. The possible reasons are misidentification caused by too close or too far a distance, and the accumulated angle estimation error of the robots.
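The three indexes, under our reading of equation (21), can be computed as:

```python
import numpy as np

def localization_errors(q_gt, q_s, robot_closest, a_gt, a_s):
    """e1: 3D absolute position error; e2: relative distance error seen
    from the closest robot; e3: relative azimuth rotation error. All
    returned as absolute values (a sketch of equation (21))."""
    q_gt = np.asarray(q_gt, float)
    q_s = np.asarray(q_s, float)
    L = np.asarray(robot_closest, float)
    e1 = np.linalg.norm(q_s - q_gt)
    e2 = abs(np.linalg.norm(q_s - L) - np.linalg.norm(q_gt - L))
    e3 = abs(a_s - a_gt)
    return e1, e2, e3

# Illustrative values, not the article's measurements: ground truth at the
# closest robot's origin, estimate 5 m off, azimuths 0.5 vs 0.7 rad.
e1, e2, e3 = localization_errors((0, 0, 0), (3, 4, 0), (0, 0, 0), 0.5, 0.7)
```

Separating absolute (e1) from robot-relative (e2, e3) errors makes it possible to tell global position drift apart from per-robot angle estimation error.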
The average error of three indexes is shown in Table 10.
In addition, we applied SSRL on humanoid robots NAO and took part in the RoboCup 2019 Standard Platform Directional Whistle Technical Challenge. The real competition field is set in a big exhibition hall, and the whole area (18 × 30 m²) contains four SPL standard fields. There are walking spectators and noise in the hall, which brings many challenges to our system. Some of the localization data are shown in Table 11.
As we can see from the results, using humanoid robot microphone arrays, the SSRL framework proposed in this article can locate the sound source in both indoor and outdoor environments. The 3D absolute distance error is approximately 0.8636 m, the relative distance error is 0.6760 m, and the relative azimuth angle error is 4.8325°. In the directional whistle technical challenge, the minimum distance error reaches 0.049 m and the minimum angle error reaches 1.352°.
The SSRL algorithm proposed in this article is shown, through application testing, to locate the sound source in indoor and outdoor environments on humanoid robots NAO under limited computation resources and critical real-time requirements. Compared with other methods, SSRL achieves higher efficiency and robustness.

Figure 14. SSRL test environment. The robot is standing on the robot soccer standard platform competition field and the whistle is blown by the referee. SSRL: sound source recognition and localization.
Figure 15. Error distribution map of e_1.
Figure 16. Error distribution map of e_2.

Conclusions
This article presents an SSRL framework implemented for 3D SSRL on humanoid robots. The humanoid robots' microphone arrays are fixed and configured like human ears, and the limited computation resources make humanoid robot SSRL a challenging problem. This article addresses the problem in the following ways. Sound recognition based on the cross-correlation feature classifies the sound type using sampled microphone signals, and the sound source direction is determined by a single robot SSDE that employs a new discrete angle space SRP-PHAT function. To obtain the 3D position of the sound source, a 2D multi-robot SSL with distance weighting revision is first presented, and the following 3D multi-robot SSL calculation works out the final 3D sound source position.
Compared with other methods, SSRL proved more efficient and robust in the experiments. However, extensive research is still necessary to further improve the system's accuracy and robustness. Deep neural network-based methods in the literature[20][21][22][23] inspire us to extract deep features of the audio data and use them as a new cue to locate the sound source. We believe that more progress on SSRL will be made to advance the field and improve robot audition in general.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China (grant nos 61673300, U1713211 and 61733013).