Abnormal Event Detection Method in Multimedia Sensor Networks

Detecting abnormal events in multimedia sensor networks (MSNs) plays an increasingly essential role in our lives. When video cameras cannot work (e.g., when the sightline is blocked), audio sensors can provide critical information (e.g., for detecting the sound of a gun-shot in the rainforest or of a car accident on a busy road). Audio sensors also have a price advantage. However, detecting abnormal audio events against a complicated background is a difficult problem, and few previous studies have offered good solutions. In this paper, we propose a novel method to detect unexpected audio elements in multimedia sensor networks. Firstly, we collect enough normal audio elements and train models for them offline with a statistical learning method. On the basis of these models, we establish a background pool using prior knowledge; the background pool contains the expected audio effects. Finally, we decide whether an audio event is unexpected by comparing it with the background pool. In this way, we reduce the complexity of online training while ensuring detection accuracy. We designed experiments to verify the effectiveness of the proposed method; the results show that the proposed algorithm achieves satisfying accuracy.


Introduction
Nowadays, multimedia sensor networks (MSNs) have become increasingly popular and important in our everyday lives [1,2]. By deploying video cameras or audio sensors, we can detect traffic accidents on a bustling road or illegal hunting in the rainforest.
Most monitoring systems utilize video cameras to detect abnormal events such as traffic accidents or forest fires [3]. However, video cameras cannot work well in some special situations, especially without sufficient light or when the sightline is blocked. Under these circumstances, audio sensors can provide enough information to make up for the limitations of video sensors, and they also have a price advantage. It is therefore becoming increasingly important to use audio sensors to improve the effectiveness of monitoring systems. Our research aims to utilize acoustic cues as complementary information to automatically discover and analyze abnormal situations. Audio-based surveillance systems have been studied for many years. In [4], the authors designed a novel method to detect human coughing in an office. In [5], the authors used an SVM-based method to build an office monitoring system that can detect impulsive sounds such as door alarms and crying. In [6], the authors designed an HMM-based method to detect special audio elements such as gun-shots and car crashes. However, in some monitoring systems (e.g., a forest monitoring system), there is no need to distinguish a gun-shot from an animal scream; what matters is whether an event is expected to happen at a specific time and location. Only a few studies have defined the background sounds and used them to detect target audio effects [7,8]. However, these approaches are usually designed for relatively quiet environments, such as office buildings, and thus cannot be directly applied to a noisy forest environment.
In summary, detecting abnormal audio events in a complicated environment would normally require a very large model of the expected events, which demands a large number of training samples and considerable computing power. In this paper, we instead establish a comprehensive background pool to cover all the expected sounds and then decide whether an audio event is unexpected by comparing it with the background pool. To build the background pool model, we first collect enough training samples for each expected audio effect and train them separately using HMMs. We then set the transition probabilities between these expected audio effects using prior knowledge. In this way, we establish a hierarchical model, the background pool model, to detect unexpected audio effects. The advantage of this approach lies in the fact that we can reduce the cost of online training by training each basic audio effect model offline. Moreover, this method has better flexibility and scalability: when the monitoring environment changes, there is no need to retrain the background model; we only need to add some new basic models to the background pool or remove some from it. The rest of this paper is organized as follows. In Section 2, we describe the system architecture briefly. Section 3 presents the feature extraction method. In Section 4, we introduce how to build the model of the background pool. In Section 5, we present the abnormal event detection process. In Section 6, we show the experimental results. Finally, we conclude the paper and discuss future work in Section 7.

Framework Overview
As is shown in Figure 1, the abnormal event detection system can be divided into two parts: an offline training process and an online testing process. In the offline training process, we first collect enough training samples for each expected audio element and train an HMM for each of them offline. Then, the relationships among the basic audio elements are determined by prior knowledge. In the online testing process, the audio sensor nodes capture the environmental sound and extract audio features. Then the similarity between the audio signal and the background pool is calculated by the Viterbi algorithm. Finally, the cluster head fuses the information in its cluster and makes the final decision.

Feature Extraction
Feature extraction plays a fundamental but essential role in pattern recognition, as it directly determines the accuracy of the recognition results. Many audio features have proved effective in previous research on audio classification [9,10], for example, short-term energy and short-time zero-crossing rate. Because they simulate the human auditory system, mel-frequency cepstral coefficients (MFCCs) have been widely used in audio classification systems in recent years. As suggested in [11], eight-order MFCCs are selected for the proposed method. MFCCs can be extracted as follows.
Step 1 (frame blocking). In this step, we block the continuous audio signal into several frames; each frame is composed of N samples, and adjacent frames overlap by M samples. Obviously, M < N. According to previous research, we set N = 256 and M = 100.
Step 2 (windowing). In this step, we reduce the discontinuities at the junctions between frames by windowing. Let x(n) denote the original signal of an audio frame and w(n) the window function; the signal of each frame after windowing is

y(n) = x(n) w(n), 0 ≤ n ≤ N − 1. (1)

Step 3 (fast Fourier transform). In this step, we apply a fast Fourier transform to the windowed signal; that is, we convert each frame from the time domain to the frequency domain:

Y(k) = Σ_{n=0}^{N−1} y(n) e^{−j2πkn/N}, k = 0, 1, ..., N − 1.

Step 4 (mel-frequency wrapping). In this step, we simulate the human auditory system with a filter bank. As is shown in Figure 2, the filter bank has triangular band-pass frequency responses, and the spacing is determined by a constant mel-frequency interval. Let K denote the number of mel spectrum coefficients; according to previous research, we set K = 20.
Step 5 (discrete cosine transform). In this step, we convert the log mel spectrum from the frequency domain back to the time domain using the discrete cosine transform (DCT), yielding the MFCCs. Denote the mel power spectrum coefficients resulting from the last step by S_k, k = 1, 2, ..., K; then the MFCCs C(m) can be calculated as

C(m) = Σ_{k=1}^{K} (log S_k) cos[m (k − 1/2) π / K], m = 1, 2, ..., 8.
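The five steps above can be sketched in Python with NumPy. The Hamming window, the triangular mel filter bank layout, and the 8 kHz sampling rate are illustrative assumptions, since the paper does not specify them; only N = 256, M = 100, K = 20, and the 8th-order output follow the text:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Step 4: triangular band-pass filters spaced uniformly on the mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / (c - l)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / (r - c)
    return fb

def mfcc(signal, sample_rate=8000, N=256, M=100, K=20, L=8):
    hop = N - M                                # adjacent frames share M samples
    # Step 1: frame blocking into N-sample frames.
    frames = [signal[s:s + N] for s in range(0, len(signal) - N + 1, hop)]
    window = np.hamming(N)                     # Step 2 (window shape assumed)
    fb = mel_filterbank(K, N, sample_rate)
    m = np.arange(1, L + 1)[:, None]
    k = np.arange(1, K + 1)[None, :]
    dct_basis = np.cos(m * (k - 0.5) * np.pi / K)   # Step 5: DCT basis
    feats = []
    for fr in frames:
        spec = np.abs(np.fft.rfft(fr * window))     # Step 3: FFT magnitude
        log_mel = np.log(fb @ (spec ** 2) + 1e-10)  # Step 4: log mel spectrum
        feats.append(dct_basis @ log_mel)           # Step 5: C(m), m = 1..8
    return np.array(feats)
```

Each row of the returned array is the 8-dimensional MFCC vector of one frame, which serves as one observation for the HMMs described in Section 4.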

Background Pool Modeling
In a complicated monitoring environment, multiple audio elements may occur at the same time, so how to build the models is an important issue in detecting abnormal audio events. It is relatively easy to use ICA (Independent Component Analysis) to separate different types of audio effects in a controlled environment, such as movies; however, in a real scene, such as a noisy rainforest, it is difficult to do so. What is more, because millions of training samples would be required, building one huge model is so difficult that satisfying results have rarely been achieved this way. Instead, we build a background pool in which the basic audio effects are trained separately to cover the expected events, and we then set the transition probabilities among these elements according to some specific rules. This solution lets us train the elements separately and efficiently. In addition, this method has better flexibility and scalability: even if the monitoring environment changes, there is no need to retrain the background model; we only need to add some new basic models to the background pool or remove some from it, without any extra training.

Basic Audio Element Modeling.
As is known, many previous studies on audio classification have proved the effectiveness of the Hidden Markov Model (HMM) [6,9]. In this paper, we utilize HMMs to model the single audio effects. The model for the ith basic audio element (BE_i) in the background pool is defined as λ_i = (N_i, M_i, A_i, B_i, Π_i), where:
(i) N_i is the number of hidden states in the ith model.
(ii) M_i is the number of observation symbols per state.
(iii) A_i is the transition probability distribution matrix between the states.
(iv) B_i is the observation probability distribution matrix for the ith model.
(v) Π_i is the initial state probability vector. According to some previous works [6], the initial state probabilities are set to be equal; that is, π_j = 1/N_i for j = 1, 2, ..., N_i.
The number of hidden states in each model directly determines the detection accuracy. On the one hand, the model states should be sufficient to describe the acoustical characteristics. On the other hand, a large number of states increases the complexity of the training and testing process. In this paper, we conducted a large number of experiments to balance energy consumption against detection accuracy and then set an appropriate model size for each basic audio element. We apply the proposed method to a noisy forest monitoring system, where 9 basic audio effects are collected to represent the background sound of the forest environment, namely, the crying of animals, the chirping of insects, the sounds of water, wind, rain, footsteps, and inciting wings, and other backgrounds. The model size of each basic element is shown in Table 1. These sizes are reasonable because a large number of experiments were used to verify their effectiveness.
For each basic audio element, we collect about 50-70 short clips as the training samples. We extract the MFCCs for each audio clip, and then the extracted MFCCs vectors are used as the input observations for the HMMs. According to some previous works [6], the Baum-Welch algorithm is then applied to estimate the transition probabilities between states and the observation probabilities in each state. After that, we have built the model for each basic audio element.

Background Pool Model.
As described above, the background pool is composed of several expected basic audio elements. For instance, in the forest environment, the background is often composed of the sounds of rain, footsteps, inciting wings, and so on. In many previous studies, researchers divided the audio signal into foreground sound and background sound. In this paper, we consider the basic audio elements that usually occur to be background sound and the audio elements that seldom occur to be foreground sound. For instance, in the forest environment, the sounds of wind and water occur frequently, while the crying of animals rarely appears. We introduce a background pool to store all of the expected audio elements; it consists of both the background sound and the foreground sound and will change in accordance with different monitoring environments.
For a given background pool P, let F be the set of foreground elements and B the set of background elements:

F = {BE^f_1, BE^f_2, ..., BE^f_m}, B = {BE^b_1, BE^b_2, ..., BE^b_n},

where BE^f_i is the ith audio element in the foreground set and BE^b_j is the jth audio element in the background set. We have P = F ∪ B. The background pool model is then defined as Λ = (F, B, E), where E = {⟨BE_i, BE_j⟩ | BE_i, BE_j ∈ F ∪ B and e_ij > 0} and e_ij is the transition probability from BE_i to BE_j. Below, we discuss how to obtain e_ij in detail. In the forest monitoring system, we built a background pool based on 9 basic elements (see Table 2).
We assume the following: (1) An element in the background set can transfer to other background elements and the elements in the foreground set.
(2) An element in the foreground set can only transfer to itself and the elements in the background set.
Given a basic audio element BE_i, we define its subsequent set Φ(BE_i) as the set of all basic audio elements to which BE_i can transfer, according to assumptions (1) and (2). In order to reduce the complexity of training the transition probabilities, for a given basic audio element BE_i and every BE_j ∈ Φ(BE_i), the transition probability from BE_i to BE_j is set uniformly: e_ij = 1/|Φ(BE_i)|. In the end, we connect the audio effect models by these rules to build the model of the background pool.
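The construction of the pool-level transition probabilities under rules (1) and (2) can be sketched as follows. The uniform split over each subsequent set is the simplification used to avoid training these transitions, and the element names are purely illustrative:

```python
# Sketch: background-pool transitions under rules (1) and (2).
# Rule (1) is read literally: a background element transfers to the *other*
# background elements and all foreground elements (no self-transition at the
# pool level); rule (2): a foreground element transfers to itself and all
# background elements. Probabilities split uniformly: e_ij = 1/|Phi(BE_i)|.

def subsequent_set(element, foreground, background):
    if element in background:
        return (background - {element}) | foreground   # rule (1)
    return {element} | background                      # rule (2)

def transition_probabilities(foreground, background):
    probs = {}
    for src in foreground | background:
        nxt = subsequent_set(src, foreground, background)
        for dst in nxt:
            probs[(src, dst)] = 1.0 / len(nxt)
    return probs

# Tiny illustrative pool (not the paper's nine elements).
E = transition_probabilities({"animal_cry"}, {"wind", "rain", "water"})
```

By construction, the outgoing probabilities of every element sum to 1, so the pool remains a valid Markov model however elements are added or removed.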

Abnormal Audio Event Detection
In the online testing stage, each sensor collects the audio signal in its own perception area. Firstly, the basic audio features, short-term energy and zero-crossing rate, are extracted to judge whether the clip is silent. If it is not a silent clip, the audio clip is evaluated against the background pool model and its log-likelihood value is calculated. Following previous research, we use the Viterbi algorithm to compute the similarity between each audio clip and the background pool. Then each sensor transmits the current log-likelihood value to the cluster head.
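As a minimal sketch of this scoring step, the following computes the Viterbi (best-path) log-score of an observation sequence under an HMM. It assumes discrete observation symbols for brevity, whereas the paper's models observe MFCC vectors:

```python
import numpy as np

def viterbi_log_score(log_pi, log_A, log_B, obs):
    """Log-probability of the best hidden-state path for a discrete-observation
    HMM, computed in the log domain to avoid underflow.
    log_pi: (S,) initial state log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, V) emission log-probs; obs: sequence of symbol indices."""
    delta = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # delta[i] + log_A[i, j], maximized over the previous state i
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return float(delta.max())
```

A lower score relative to the background pool model indicates that the clip is poorly explained by the expected sounds, which feeds the decision stage below.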
Consider a cluster with n sensor nodes. The cluster head fuses the collected information in its cluster as follows:

L = Σ_{i=1}^{n} w_i l_i,

where l_i denotes the log-likelihood value transmitted from the ith audio sensor node and w_i is the weight of the ith audio sensor, satisfying Σ_{i=1}^{n} w_i = 1. Obviously, the weight of each sensor node is determined jointly by many factors, such as its distance from the key location. In this paper, we set the weight value according to the instant short-term energy and the average short-term energy of each audio sensor node. Let E_i denote the instant short-term energy of the ith audio sensor and Ē_i its average short-term energy; then the relative energy change rate can be obtained as

r_i = (E_i − Ē_i) / Ē_i.

Generally, the closer the ith node is to the instant audio event, the higher r_i will be. In this paper, the average short-term energy is updated regularly.
The weight value of the ith node can then be obtained as

w_i = r_i / Σ_{j=1}^{n} r_j.

We now discuss how to determine whether there is an abnormal event based on the fused log-likelihood. In some previous research, researchers set a threshold to detect abnormal events: when an audio clip is sufficiently similar to the background pool, it is considered normal sound, and vice versa. However, in complicated environment monitoring, the background changes from time to time, so it is hard to determine a threshold that adapts to dynamic monitoring requirements. Moreover, in monitoring systems, different missed detections lead to different risks. Based on the above analysis, we make the final decision using minimum-risk Bayesian decision theory.
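The fusion step can be sketched as follows; the energy values and node count are illustrative, and the sketch assumes at least one positive change rate so that the normalized weights are well defined:

```python
import numpy as np

def fuse_log_likelihoods(loglik, instant_energy, avg_energy):
    """Cluster-head fusion sketch: L = sum_i w_i * l_i, with weights derived
    from relative energy change rates r_i = (E_i - Ebar_i) / Ebar_i and
    normalized so that sum_i w_i = 1. Assumes sum_i r_i > 0."""
    r = (instant_energy - avg_energy) / avg_energy  # relative energy change rate
    w = r / r.sum()                                 # w_i = r_i / sum_j r_j
    return float(w @ loglik)
```

Nodes whose instant energy rises sharply above their running average (i.e., those closest to the event) thus dominate the fused score.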
Let x be the observed audio clip and L its fused log-likelihood value; we define the following:
ω_1: x is a normal audio event.
ω_2: x is an abnormal audio event.
α_1: the decision that x is a normal audio event.
α_2: the decision that x is an abnormal audio event.
Let λ(α_i, ω_j) be the risk factor for making decision α_i while the true state is ω_j. In this paper, we define the risk decision ratio as θ = λ(α_1, ω_2) : λ(α_2, ω_1); this value should be set through extensive experiments.
Then we calculate the risk values for decisions α_1 and α_2, respectively, according to [8]. Let R_1 denote the risk of making decision α_1 and R_2 the risk of making decision α_2. We then conclude as follows: (i) the current audio clip is normal if R_1/R_2 ≤ 1; (ii) otherwise, it is abnormal.
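A minimal sketch of this decision rule, under the common minimum-risk assumption (not stated explicitly in the paper) that correct decisions carry zero risk, so that only the ratio θ matters:

```python
def is_normal(p_normal, p_abnormal, theta):
    """Minimum-risk Bayesian decision sketch. With zero risk for correct
    decisions, R1 = lambda(a1, w2) * P(w2 | x) and R2 = lambda(a2, w1) * P(w1 | x),
    where theta is the risk decision ratio lambda(a1, w2) : lambda(a2, w1).
    The clip is judged normal when R1 / R2 <= 1, i.e. theta * P(w2|x) <= P(w1|x)."""
    r1 = theta * p_abnormal  # risk of deciding "normal"
    r2 = p_normal            # risk of deciding "abnormal" (lambda(a2, w1) = 1)
    return r1 <= r2
```

Raising θ makes the rule more willing to flag clips as abnormal, which is why recall grows with the risk decision ratio in the experiments of Section 6.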

Experiments
In order to evaluate the performance of the proposed method, we deploy the algorithm in an audio wireless sensor network. As is shown in Figure 3, the selected cluster has 8 sensor nodes and one cluster head. In the experiment, we use a PC as the cluster head and the nodes transmit messages through the ZigBee wireless communication protocol.
The detailed parameters of the sensor nodes and the cluster head are described in Tables 3 and 4.

Evaluation of the Background Pool Model.
In this section, we choose 4 different types of abnormal audio elements to evaluate the performance of the background pool model (BGP), namely, engine sounds, animal screams, gun-shots, and the tapping sound of sticks. The expected data are collected from documentary films such as "Animal Legend," "Animal World," and "Wonderful Broadcasting: Battle for survival The Animals' Guide to Survival." The abnormal data are collected from documentary films and action movies. In this experiment, we compare the proposed method with both an SVM-based method and an HMM-based method. The SVM-based method is introduced in [5], with the Gaussian radial basis function used as the kernel function. The HMM-based method is introduced in [6] and has been widely used for detecting audio keywords in movies. Following [6], the state number for each abnormal audio type is set as in Table 5. For each target abnormal audio event, we use precision and recall to evaluate detection accuracy:

precision = N_c / N_d, recall = N_c / N_t,

where N_c represents the number of audio frames detected correctly, N_d represents the number of all audio frames determined to be the specified audio type, and N_t is the total number of frames of the specified audio type in the ground truth. The experimental results are shown in Figure 4.
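These frame-level metrics can be sketched as follows (the labels are illustrative):

```python
def precision_recall(predicted, truth, target):
    """Frame-level precision and recall for one target audio type, matching
    the definitions above: precision = N_c / N_d, recall = N_c / N_t."""
    n_c = sum(1 for p, t in zip(predicted, truth) if p == target and t == target)
    n_d = sum(1 for p in predicted if p == target)  # frames labeled as target
    n_t = sum(1 for t in truth if t == target)      # true target frames
    return (n_c / n_d if n_d else 0.0,
            n_c / n_t if n_t else 0.0)
```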
From Figure 4, we can see that, because of the complexity of the noisy forest monitoring environment, most previous methods need a large number of training samples to ensure detection accuracy. When the number of training samples is reduced, the detection accuracy of both the HMM-based method and the SVM-based method drops dramatically. By using the proposed background pool, we can reduce the complexity of online training while ensuring detection accuracy.
In addition, this method has better flexibility and scalability. That is, when the monitoring environment changes, we do not need to retrain the background model; we only need to add some new basic models to the background pool or remove some from the pool.
Evaluation of the Decision Algorithm.
As described in Section 5, how to set the risk decision ratio is very important in detecting abnormal events, since it directly determines the detection performance. We can see in Figure 5 that as the risk decision ratio grows from 1 to 20, the detection recall increases markedly. The reason lies in the fact that, in a complicated monitoring environment, several audio elements may occur at the same time, so abnormal audio elements are usually mixed into the background noise. Take the sound of a gun-shot, for example: since its duration is very short, a sampling window containing it may consist of at least two types of audio elements. With a threshold-based method, such a sampling window is easily classified as other audio elements, while the Bayesian-decision-based method significantly improves the detection recall for abnormal audio events. However, when the risk decision ratio increases beyond a certain point, the improvement in recall is no longer obvious. In addition, increasing the risk decision ratio also degrades the detection precision, especially when the value exceeds 25. We can see that better detection accuracy is achieved when the risk decision ratio is set between 10 and 20.

Evaluation of the Flexibility and the Scalability.
To detect abnormal audio events, the two most common approaches are modeling the normal environment and modeling the abnormal audio events. We compare the proposed method with these two approaches when the monitoring environment or the monitoring tasks change. The comparison results are shown in Table 6.
When the background changes, the environment-modeling method needs to collect enough samples and retrain the background model to achieve satisfying detection accuracy, which wastes a lot of time. What is more, when the environment is complex, the model is very difficult to converge. With the proposed method, only the transition probabilities need to be redefined, without any extra system retraining. When the abnormal events change, the abnormal-modeling method needs to collect enough samples of the new abnormal audio events, and its detection accuracy relies on the completeness of the training samples; however, it is difficult to collect enough samples of unexpected abnormal events in a short time. The proposed method is not affected by this change.

Conclusions
In this paper, we propose a novel method to detect abnormal audio events in complicated monitoring environments using audio sensor networks. Firstly, we collect enough normal audio elements and train models for them offline with a statistical learning method. On the basis of these models, we establish a background pool using prior knowledge; the background pool contains the expected audio effects. Finally, we decide whether an audio event is unexpected by comparing it with the background pool. In this way, we reduce the complexity of online training while ensuring detection accuracy.