Prediction-Based Filter Updating Policies for Top-k Monitoring Queries in Wireless Sensor Networks

Processing top-k queries in an energy-efficient manner is an important topic in wireless sensor networks. Installing filters on sensor nodes avoids redundant data transmission between the sensor nodes and the base station, which decreases communication overhead. However, existing algorithms such as FILA and DAFM consume much energy when updating the filter windows. In this paper, we propose a new top-k algorithm named PreFU, which uses prediction models to decide when to update the window parameters of filters. PreFU predicts the next s steps of sensor values with time series prediction models built from historical data. By estimating the cost of updating window parameters from the predicted sensor values, PreFU reduces the number of filter window updates and thus the cost of updating window parameters. Experimental results show that PreFU is more energy-efficient than existing algorithms while guaranteeing the accuracy of top-k query results.


Introduction
Despite all progress in research, energy consumption is still a major issue for wireless sensor networks [1,2]. In addition, the quantity of generated data is large and dense, yet users are usually interested only in the maximum or minimum objects among them. Thus, top-k query processing is in high demand for many applications in uncertain databases and relational databases [3][4][5]. The top-k query in wireless sensor networks differs from general queries in that the data on a single sensor node cannot by itself determine whether it belongs to the final result. An intuitive approach is for the base station to determine the top-k results after collecting data from all sensor nodes. For our top-k queries, k is not restricted, and the query determines the k highest observed values. The query also determines the full set of nodes that report the k highest values; it is executed periodically, starting at some point in time and reporting values for a number of subsequent rounds. Such centralized query processing incurs a large communication cost and wastes much energy. How to process queries in an energy-efficient manner, supplying top-k results to users while minimizing energy consumption, is therefore an important topic in wireless sensor networks. Generally speaking, query processing in wireless sensor networks is essentially different from that in traditional databases. A wireless sensor network can be viewed as a distributed system, but it differs from a general distributed system in that no single powerful node serves as a collection center for the data of all sensors. Each sensor transmits its data to the base station through multihop relays, and every transmission consumes energy.
From another aspect, as the major optimization objective for query processing in wireless sensor networks, the network lifetime is determined not only by the total energy consumption of all sensors but also by the maximum energy consumption among the sensors. The sensors near the base station consume more energy than the others because they relay data for the others, so they exhaust their batteries first. Once they run out of energy, the rest of the sensors are disconnected from the base station, no matter how much residual energy they have left. Consequently, the network is no longer functioning even if the total energy consumption per query is reasonably small. Hence, how to evaluate queries effectively and efficiently in wireless sensor networks poses great challenges.

Algorithm 1: PreFU algorithm.
top-k ← Query( ) ⊳ Initialize the top-k set after collecting data from all sensor nodes
if new values arrive at the base station then
    N ← reevaluation() ⊳ Obtain the set of sensor nodes whose filter windows may be updated
end if
j ← 0
repeat
    for i = 1 to |N| do
        $v_i^{pre}$ ← predict( ) ⊳ Using ARIMA models to predict
    end for
    j ← j + 1
    Calculate the violation cost, $C_{update}$, and $C_{old}$ for the j-step-ahead prediction
until j == s
if violation cost + $C_{update}$ < $C_{old}$ then
    Update-Window() ⊳ Update the filter windows for the nodes in N
end if

International Journal of Distributed Sensor Networks
A typical solution for answering top-k queries in wireless sensor networks is to use filters. Filters are broadcast into the network and used by each individual sensor node to decide whether its value is relevant to computing the query result. FILA (filter-based monitoring approach) [6] is an energy-efficient approach among these algorithms. The basic idea is to install a filter at each sensor node, which keeps redundant data from being transmitted to parent nodes or the base station. However, the approach consumes much energy on updating filters. Mai et al. [7,8] proposed the DAFM algorithm, which builds on FILA. The algorithm predicts the next sensed values of sensor nodes with a linear regression model; the base station then decides whether to update filtering windows based on the predicted benefits. When data follow complicated distributions and vary widely over time, the prediction performance of linear regression is unsatisfactory, and the performance of DAFM degrades. The contributions of this paper are summarized as follows.
(i) We adopt a powerful time series model, ARIMA, to predict the next s steps of sensor values from historical sensor data. The ARIMA time series model is suitable for sensor data because of their temporal correlations. Our proposed PreFU approach focuses on reevaluation when filters need updating. PreFU introduces a prediction mechanism to consume less energy than the eager and lazy filter update policies of the FILA approach. Also, compared to the DAFM approach, PreFU guarantees a smaller mean squared error and thus performs effectively (Algorithm 1).
(ii) Instead of the one-step-forward prediction of the DAFM approach, we adopt adaptive-step prediction based on prediction errors. By selecting a suitable value of s for the given sensor data, PreFU outperforms existing approaches.
(iii) Extensive experiments are conducted to evaluate the proposed PreFU approach using real data traces. The results show that PreFU outperforms DAFM, FILA-E, and FILA-L in terms of both energy consumption and network lifetime under single-hop and multihop network configurations.
The rest of this paper is organized as follows. In Section 2, we discuss related work. Typical filtering methods for top-k query processing are introduced in Section 3, and our proposed PreFU algorithm is presented in Section 4. In Section 5, extensive simulations are conducted to show the efficiency and accuracy of the proposed method. Finally, we conclude this paper in Section 6.

Related Work
A naive implementation of top-k query monitoring is a centralized approach in which all sensor readings are periodically collected by the base station, which then computes the top-k result set directly. However, a wireless sensor network can be viewed as a distributed network consisting of many energy-limited sensor nodes, in which communication is the main source of energy consumption. The centralized approach therefore consumes extra energy because of the transmission of massive data. To reduce the communication cost of data collection, Madden et al. proposed an in-network aggregation technique known as TAG [9], which, compared with centralized algorithms, keeps unneeded data from being transmitted. However, this approach incurs unnecessary updates in the network and is not really energy-efficient.
Wu et al. [6] propose the FILA approach for top-k monitoring, which installs a filter on each sensor node to filter out unnecessary data that do not contribute to the final results. Reevaluation and filter setting are the two critical aspects that ensure its correctness and effectiveness. When the new filters differ from the old ones maintained by the sensor nodes, the base station needs to update them. However, when the sensed values on nodes vary widely, the base station must frequently send updated filters to the related nodes, which leads to a large updating cost and degrades the performance of the algorithm. Mai et al. [8] propose the DAFM approach, which aims to reduce both the communication cost of sending probe messages during reevaluation and the transmission cost of filter updating in FILA. DAFM uses a linear regression model to predict the next value of each sensor node and thus whether that value will fall outside its filtering window. To some extent, this approach decreases the cost of filter updating. However, the sensed values are affected by various factors, and performance degrades when a linear regression model is applied to complicated data.
Besides these, filtering and aggregation are the two main strategies at present, and they complement each other in processing top-k queries in wireless sensor networks. Chen et al. propose the QF (quantile filter) approach [10], which treats a sensing value and its sensor as a point; a top-k query then returns the k points with the highest sensing values. The goal of the algorithm is to decrease energy consumption and prolong the network lifetime, not only by minimizing the total energy consumption but also by consuming less energy on each node. Liu et al. [11] propose a cross pruning (XP) aggregation framework for top-k queries in wireless sensor networks. The framework comprises a cluster-tree routing structure, which aggregates more objects locally, and a broadcast-then-filter approach. In addition, it provides an in-network aggregation technique that filters out redundant values and thereby enhances in-network filtering effectiveness. Abbasi et al. [12] proposed the MOTE (model-based optimization technique) approach, which assigns filters to nodes by model-based optimization. Nevertheless, obtaining the optimal filter settings for the top-k set is an NP-hard problem.
In addition, there are other related works on processing top-k queries in wireless sensor networks. Yeo et al. propose a technique called PRIM (priority-based top-k monitoring), which relies on the semantics of the top-k query [13]. Its basic idea is to gather data according to priority; that is, higher readings are collected earlier. Cho et al. propose the POT (partial ordered tree) algorithm, which exploits spatial correlation to maintain the sensor nodes with the highest sensing values [14]. Michel et al. propose the KLEE framework for processing top-k queries [15], which allows trading off efficiency against result quality and bandwidth savings against communications. Different from the above, Silberstein et al. propose a sampling-based approach to evaluate approximate top-k queries in wireless sensor networks [16].
Energy efficiency is a critical issue in wireless sensor networks and also an important indicator of the effectiveness and practicality of an algorithm. In recent years, some researchers have introduced time series models to wireless sensor networks. Tulone and Madden [17] apply autoregressive models to data collection in wireless sensor networks. The basic idea is to build a model on the base station and on each sensor node. The base station predicts the values of the nodes, and a node sends its sensed reading to the base station only when an outlier reading occurs. When a reading is not properly predicted by the model, the model is relearned to adapt to the change. The readings collected by this approach are therefore approximate. The approach in [18] is similar to that in [17]; both are based on time series prediction in wireless sensor networks. The main difference is that the former is based on ARIMA. Although time series models have been applied to wireless sensor networks, they have been used only to minimize the energy consumption of data collection, not for top-k query processing. In this paper, we propose a novel top-k monitoring algorithm, PreFU, that combines the time series model ARIMA with the FILA approach, avoiding unnecessary filter updating cost and minimizing sensor node energy consumption.

Filtering Method for Top-k Query Processing
A typical system architecture of a wireless sensor network includes a base station and a number of sensor nodes. The base station has sufficient energy, while sensor nodes are battery-powered and energy-limited. When the base station is beyond a sensor node's radio coverage, data are sent to the base station via other sensor nodes over multiple hops; otherwise, sensed data are sent to the base station directly. Sensor nodes sense physical phenomena such as temperature, humidity, and light at a fixed sampling rate. On receiving a top-k query request, the base station starts the query and returns the result set to users at regular intervals. Assume that the sensed value on node i is $v_i$; a top-k query returns the ordered list R of the k sensor nodes with the highest readings at every epoch. The results are maintained by the base station and finally returned to users. The goal in this paper is to prolong the lifetime of wireless sensor networks by minimizing the overall energy consumption.
Initially, after collecting the readings from all sensor nodes, the base station sorts the readings and obtains the initial top-k result set. The base station computes a filtering window [$l_i$, $u_i$] for each node i based on the initial top-k result set and then sends these windows to the corresponding sensor nodes. At the next sampling epoch, if the value on node i is within [$l_i$, $u_i$], node i need not update its value maintained by the base station. Otherwise, an update request is sent to the base station. The base station then reevaluates the top-k result and adjusts the filter settings for the affected sensor nodes. The base station sends the new filters to the relevant sensor nodes according to its updating strategy. FILA provides two strategies: lazy filter update (FILA-L) and eager filter update (FILA-E).
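The two basic operations in this monitoring loop, ranking at the base station and the window check at a sensor node, can be sketched as follows. This is a minimal illustration rather than FILA's implementation; the function names `top_k` and `violates` and the sample readings are hypothetical.

```python
def top_k(readings, k):
    """Return the ids of the k nodes with the highest readings, best first."""
    return sorted(readings, key=readings.get, reverse=True)[:k]


def violates(value, window):
    """True if a reading falls outside its filtering window [lower, upper]."""
    lower, upper = window
    return not (lower <= value <= upper)


# Hypothetical readings keyed by node id.
readings = {1: 30.0, 2: 25.0, 3: 21.0, 4: 18.0}
ranked = top_k(readings, 2)

# A node stays silent while its reading remains inside its window
# and sends an update request only when the window is violated.
stays_silent = not violates(26.0, (23.0, 27.5))
must_report = violates(28.0, (23.0, 27.5))
```

Only the second case would cost a message to the base station, which is where the energy saving of filtering comes from.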
To ensure the correctness of the algorithm, the space formed by all filtering windows should be continuous; that is, we set $u_{i+1}$ equal to $l_i$. FILA devises two filter setting approaches: uniform filter setting and skewed filter setting. In this paper, we adopt the uniform filter setting; Figure 1 shows an example of the filtering windows of a top-k result. To maximize the filtering capability, the upper bound of the top-1 node is set to +∞, and the lower bound of the non-top-k nodes is set to −∞. The filtering window shared by the non-top-k nodes is then [−∞, $u_{k+1}$]. As seen above, the base station needs to maintain only k + 1 filtering windows, as shown in Figure 1.

Figure 1: k + 1 windows in FILA. Nodes are ranked 1, 2, 3, ..., k, k + 1, ..., n, and adjacent windows share a boundary: $u_2 = l_1$, $u_3 = l_2$, $u_4 = l_3$, ..., $u_k = l_{k-1}$, $u_{k+1} = l_k$.
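As a rough sketch of how k + 1 contiguous windows could be built, the code below places each shared boundary at the midpoint between two consecutively ranked readings. The midpoint rule is an assumption made here for illustration (the text only requires that adjacent windows share a boundary), and `uniform_windows` is a hypothetical name.

```python
import math


def uniform_windows(values, k):
    """Build k + 1 contiguous windows (lower, upper), best rank first.

    Assumption: the boundary shared by ranks i and i + 1 is the midpoint
    of their readings. Requires len(values) > k.
    """
    ranked = sorted(values, reverse=True)
    # One boundary between each pair of consecutive ranks 1..k+1.
    bounds = [(ranked[i] + ranked[i + 1]) / 2 for i in range(k)]
    windows = []
    upper = math.inf  # the top-1 window is unbounded above
    for bound in bounds:
        windows.append((bound, upper))
        upper = bound  # the next window's upper bound is this lower bound
    windows.append((-math.inf, upper))  # shared by all non-top-k nodes
    return windows


example = uniform_windows([30.0, 25.0, 21.0, 18.0], 2)
```

With the four hypothetical readings above and k = 2, the result is three windows in which each upper bound equals the lower bound of the window ranked just above it.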
To some extent, FILA keeps redundant data from being transmitted and saves energy. However, a large amount of energy is still consumed unnecessarily. Assume that, after sampling in epoch t, the filtering windows are as shown in Figure 2.
In Figure 2(a), at epoch t, before sampling, the top-3 set has the values {$v_1$, $v_4$, $v_3$}, and the result set is {$v_1$, $v_4$, $v_3$} after sampling, as shown in Figure 2(b); that is, the value $v_3$ jumps out of its filter and falls into the filtering window installed on node 4. The base station needs to adjust the filtering windows of the relevant nodes, as shown in Figure 2(c). However, after sampling in epoch t + 1, as shown in Figure 2(d), the value $v_3$ changes again and jumps back into its former window. In FILA, such fluctuation of data causes frequent updates of the filter windows, which incur a large communication cost. Mai et al. [8] proposed a prediction-based updating algorithm that determines whether to update the filters according to the possible updating cost. The algorithm decreases the cost of updating the filters to some extent. However, this approach is limited by the prediction performance of the linear regression model.
In addition, as Figure 2 shows, the values on the non-top-k nodes vary within a small range, so their filtering windows are useful for filtering out irrelevant sensor nodes. However, the upper bound of the non-top-k nodes is determined by the values of the kth and (k + 1)th sensor nodes; when the value ranked kth varies, the filters on the non-top-k nodes are affected directly. A wireless sensor network generally consists of a large number of sensor nodes, whereas k is usually small in a top-k query, so a large number of sensor nodes fall into the non-top-k node set. If the algorithm updates the filters every time it reevaluates the top-k results, it consumes more energy on updating the filters of the non-top-k nodes.
The inherent defects of FILA and DAFM call for better mechanisms that avoid excessive communication while updating windows. We therefore propose a new updating algorithm, PreFU, based on prediction by autoregressive integrated moving average (ARIMA) models. The algorithm evaluates the possible communication cost based on the s-step-ahead predicted values, where s is usually smaller than 10. Then, the base station decides whether or not to update the relevant filtering windows.

PreFU Approach
In FILA, when the new filtering windows for nodes change after the top-k results are reevaluated, the filters on the non-top-k nodes are affected, and frequent fluctuation of data leads to unnecessary communication cost for updating windows. Considering these two aspects, the algorithm uses a prediction-based updating approach to decide whether to send the new filtering window parameters to the corresponding nodes. We cannot know the exact future values of each sensor node, but they can be estimated by prediction. In this paper, our PreFU approach predicts the next value(s) with ARIMA models.
Equation (2) is an autoregressive model, and p is the order of the model. The parameters of the model are $\phi_1, \phi_2, \ldots, \phi_p$, and $\varepsilon_t$ is a white noise series. Since a time series usually receives random shocks in a noisy environment, the MA component is introduced to capture the influence of random shocks on the future. We call it the moving average model MA(q), in which the random errors at times t and t − 1 in the time series satisfy a linear regression model and $\theta_1, \theta_2, \ldots, \theta_q$ are the moving average parameters of the model, which will be estimated. In general, a pure AR or MA model may need a high order to describe the dynamic structure adequately. Without the integrated component of ARIMA, the model simplifies to the ARMA model, in which $\phi_0$ is an initialized constant. However, the ARMA model usually assumes that the data are stationary; that is, the statistical properties of the data do not change over time. This assumption does not hold for most real data series; therefore, the integrated term is introduced to remove the impact of nonstationary data by differencing. The time series satisfies the ARMA model after being differenced several times. In general, first-order differencing is sufficient. Therefore, a time series is represented by an ARIMA(p, d, q) model, meaning that the time series is stationary after being differenced d times. Here p represents the amount of historical data with successive timestamps, and q represents the number of the latest random shocks on the time series.
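For reference, the three models described above can be written in their standard textbook forms as follows. The series symbol $x_t$, the parameter symbols, and the minus sign in front of the moving-average terms are conventions assumed here and may differ from the notation the paper uses.

```latex
\begin{align*}
\text{AR}(p):\quad      & x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + \varepsilon_t \\
\text{MA}(q):\quad      & x_t = \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q} \\
\text{ARMA}(p,q):\quad  & x_t = \phi_0 + \sum_{i=1}^{p} \phi_i x_{t-i} + \varepsilon_t - \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} \\
\text{First difference}:\quad & \nabla x_t = x_t - x_{t-1}
\end{align*}
```

An ARIMA(p, d, q) model applies the ARMA(p, q) form to the series obtained by differencing $x_t$ a total of d times.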

Time Series Model-Driven Prediction
Prediction with an ARIMA model usually consists of two steps. The first step is model identification and parameter estimation. The second is prediction.

Model Identification and Parameter Estimation.
Each parent node builds an ARIMA model to predict the next threshold filter. To make predictions, the parent node maintains a sufficiently long history of thresholds, indexed by epoch number, together with the sensor reading of the node at each epoch; the ARIMA model on the node is built from this history. For simplicity, we suppose that d = 0; that is, the time series is stationary and satisfies an ARMA(p, q) model.
To obtain p and q, we introduce the autocorrelation function (ACF) and the partial autocorrelation function (PACF). Autocorrelation describes the simple correlation between values in a time series; the autocorrelation coefficient at a given lag describes the correlation between sensor values separated by that lag. It is computed from the number of samples, the lag (the distance of the interval, in general 20), and the sample mean, which estimates the expectation of the time series. PACF describes the conditional correlation between the value at time t and the value a given number of steps earlier, given the values of the time series in between; the degree of correlation is estimated by the partial autocorrelation coefficients. We can obtain estimates of p and q from the PACF and the ACF. The order p is the lag at which the sample PACF cuts off. As for q, we obtain it from the ACF: for a time series, if the autocorrelation coefficient is nonzero for lags up to q and zero for lags greater than q, then the order of the MA component is q. However, these are only estimates of p and q; the final orders are decided by an information criterion, the AIC (Akaike information criterion). We choose the values p* and q* that minimize the AIC and view them as the optimal estimates of p and q. The AIC is computed from the sample size and $\hat{\sigma}^2$, the maximum-likelihood estimate of $\sigma^2$. After order determination of the ARMA model, we estimate the model parameters by the conditional least-squares method. We can then predict the next value of the threshold with the ARIMA model. In this procedure, the order of the model is not changed, but the parameters of the model are relearned while processing top-k queries, which keeps the prediction error acceptable.
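The sample autocorrelation used for order identification can be computed directly from its usual definition, as in the sketch below. It uses the common biased estimator, which may differ in detail from the paper's formula, and `sample_acf` is a hypothetical helper name. PACF and AIC are omitted; in practice a statistics library would provide them.

```python
import numpy as np


def sample_acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag of a 1-D series.

    Uses the common estimator: lagged products of the mean-centered
    series divided by the total sum of squares (must be nonzero).
    """
    x = np.asarray(series, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:len(x) - lag]) / denom
        for lag in range(max_lag + 1)
    ])


# A strictly alternating series is strongly negatively correlated at lag 1.
acf = sample_acf([1, -1, 1, -1, 1, -1], 2)
```

A sharp drop of these coefficients to near zero after some lag is the cut-off behavior that suggests the MA order q.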

Predicting.
As illustrated by (1) in Section 4.2, suppose that we are at time epoch h and that the time series is formed by the epochs just before h; the one-step-ahead predicted value $v^{pre}$ of the time series is then calculated from the fitted model.
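To make the prediction step concrete, the sketch below fits a pure autoregressive model by least squares and iterates it to obtain multi-step forecasts. This is a deliberate simplification: it omits the moving-average and differencing parts of ARIMA, and the names `fit_ar` and `forecast` are hypothetical.

```python
import numpy as np


def fit_ar(series, p):
    """Least-squares fit of x_t = c + a_1 x_{t-1} + ... + a_p x_{t-p}, p >= 1."""
    x = np.asarray(series, dtype=float)
    # The row for time t holds [1, x_{t-1}, ..., x_{t-p}].
    rows = [np.concatenate(([1.0], x[t - p:t][::-1])) for t in range(p, len(x))]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), x[p:], rcond=None)
    return coeffs  # [c, a_1, ..., a_p]


def forecast(series, coeffs, steps):
    """Iterate the fitted recursion, feeding each prediction back in."""
    p = len(coeffs) - 1
    history = list(np.asarray(series, dtype=float)[-p:])
    predictions = []
    for _ in range(steps):
        lags = history[::-1][:p]  # most recent value first
        nxt = float(coeffs[0] + np.dot(coeffs[1:], lags))
        predictions.append(nxt)
        history.append(nxt)
    return predictions


# Noise-free series generated by x_t = 1 + 0.5 * x_{t-1}, starting at 0.
series = [2 * (1 - 0.5 ** n) for n in range(8)]
coeffs = fit_ar(series, 1)
predictions = forecast(series, coeffs, 2)
```

Because each forecast is fed back as an input, errors accumulate with the number of steps, which is consistent with keeping the prediction horizon s small.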

Evaluation of the Cost.
After predicting the next values for each node in N, the algorithm evaluates the possible costs of adopting the new and the old filtering windows separately. We denote these costs by $C_{new}$ and $C_{old}$. The cost refers to the number of communications caused by node values violating their filters at the next sampling epochs. Only when $C_{new} < C_{old}$ does the base station send the new filtering windows to the nodes in N.
Let $v^{pre}_{i,j}$ be the jth-step predicted value of node i (i ∈ N); from these values we calculate $C_{old}$ and $C_{new}$. Here $C_{update}$ is the cost of updating the filter windows of the sensor nodes in N; for s-step-ahead prediction, each sensor node in N needs to be updated only once, that is, $C_{update} = |N|$. The remaining terms are computed in the same way as (1). If the value of sensor node i violates the filter, the new value sent to the base station is used for the calculation; otherwise, the old value maintained by the base station is used.
In our PreFU algorithm, if the base station sends updated filters to the nodes, the cost is composed of two parts: the cost of updating the new filters, denoted as $C_{update}$, and the cost of sending to the base station the updated values that violate the filters at the next sampling steps. If the base station does not update the filters, the cost of using the old filters at the next sampling epochs is referred to as $C_{old}$. When $C_{new} < C_{old}$, the base station sends the new filters. Otherwise, all nodes keep the old filters.
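The decision rule above can be sketched as follows. The sketch assumes that every violation and every window update costs one message and that windows stay fixed over the prediction horizon; both are simplifications made here for illustration, and `count_violations` and `should_update` are hypothetical names.

```python
def count_violations(predicted, windows):
    """Count predicted readings that fall outside their node's window.

    predicted: {node_id: [value at step 1, ..., value at step s]}
    windows:   {node_id: (lower, upper)}
    """
    return sum(
        1
        for node, values in predicted.items()
        for value in values
        if not (windows[node][0] <= value <= windows[node][1])
    )


def should_update(predicted, old_windows, new_windows):
    """Update only if the new windows are predicted to be cheaper overall."""
    c_old = count_violations(predicted, old_windows)
    c_update = len(new_windows)  # one update message per node in N
    c_new = c_update + count_violations(predicted, new_windows)
    return c_new < c_old


old_windows = {3: (20.0, 23.0), 4: (23.0, 27.5)}
new_windows = {3: (23.0, 27.5), 4: (20.0, 23.0)}

# Persistent change: both nodes are predicted to stay in the swapped order.
stable = {3: [24.0, 24.5, 24.2, 24.8], 4: [22.0, 21.5, 21.8, 21.9]}
# Fluctuation: both nodes are predicted to swap back at the next step.
fluctuating = {3: [24.0, 22.0], 4: [22.0, 24.0]}

update_when_stable = should_update(stable, old_windows, new_windows)
update_when_fluctuating = should_update(fluctuating, old_windows, new_windows)
```

In the fluctuating case the update is skipped, because paying for two window updates would not reduce the predicted number of violations.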

Experiment Settings.
We evaluate the performance of the proposed PreFU algorithm by MATLAB simulations. The experimental data come from the Intel Lab [19]. We use the temperature, humidity, and light data of March 1st, 2004, as experimental data. The data were collected from 54 sensor nodes, once every 31 seconds. In this paper, we average every two consecutive sample values into one.
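The pairwise averaging step could look like the sketch below. How an odd trailing sample is treated is not stated in the text; dropping it is an assumption made here, and `average_pairs` is a hypothetical name.

```python
import numpy as np


def average_pairs(samples):
    """Average every two consecutive samples into one value.

    Assumption: a trailing unpaired sample is dropped.
    """
    x = np.asarray(samples, dtype=float)
    usable = len(x) - len(x) % 2
    return x[:usable].reshape(-1, 2).mean(axis=1)


averaged = average_pairs([20.0, 22.0, 21.0, 25.0, 30.0])
```

This halves the number of epochs fed to the simulation and smooths out some sample-level noise.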
We compare our proposed PreFU algorithm with FILA-E, FILA-L, and DAFM under the uniform filter setting [6] for each window. We use the number of communications for a top-k query as the metric under different values of k. Considering that the energy consumption of sending differs from that of receiving, we simulate the Mica2Dot sensor node, for which the energy consumption of sending is 0.37 times that of receiving in this paper. The network lifetime is defined as the time duration before the first sensor node runs out of power [6]. We assume 2 AAA batteries with a total energy capacity of 15120, while establishing a connection costs 0.645 and sending one byte of data consumes 0.0144.

Comparison of Prediction Performance in Different Models.
We learn an ARIMA(2,0,2) model from historical data in the experiment. The parameters on different nodes are maintained dynamically by the base station. Figure 3 compares the prediction performance of the two models for s-step-ahead prediction. The ARIMA models predict better than the linear regression model: with ARIMA, the error rate is much smaller than with linear regression when s = 5, 6, 7, and 8. We set s to 5; that is, after every 5-step-ahead prediction we check whether to update the window parameters for each sensor.

Comparison of Four Algorithms in Different Network Structures and Data Distributions.
First, different network structures may affect the performance of the algorithm. We evaluate the effectiveness of the algorithm on single-hop and multihop networks, as shown in Figure 4 (the hop number is 2). Second, different data distributions affect the performance of the algorithm. We evaluate the algorithm on the sensor attributes temperature, humidity, and light in the different network structures to illustrate the efficiency of the proposed algorithm. Figures 5 and 6 show the results of our proposed algorithm and the existing algorithms in different network structures and data distributions.
As shown in Figures 5 and 6, under the two network structures (single-hop and multihop) and the three data types (temperature, humidity, and light), the proposed PreFU algorithm outperforms the existing algorithms. Taking Figure 5(b) as an example, when k = 10, PreFU requires 7000 fewer communications than FILA-L and 4000 fewer than DAFM. The advantages lie in two aspects. First, k is small compared with the total number of nodes in the wireless sensor network. In addition, the nodes in the non-top-k set share the same filter, which is affected by the value on the node ranked kth. In PreFU, when the value on the node ranked kth changes, the filtering windows of the non-top-k set need not be updated immediately. We therefore update the filters based on s-step-ahead prediction, which decreases the energy consumption. Second, the ARIMA model predicts better than the linear regression model.

Comparison of Lifetime under Different Approaches.
We now consider the lifetime of the DAFM, FILA-E, FILA-L, and PreFU approaches under single-hop (Figure 7) and multihop (Figure 8) network structures. In Figure 7, PreFU outperforms the other three approaches on temperature and humidity data, and it performs much better when k is small. However, on light data, PreFU is worse than DAFM when k is small. The reason is that light changes rapidly in the real world, and one-step prediction works better for small k values. When k increases and more sensor nodes must be considered as top-k results, PreFU performs better. In Figure 8, the same situation occurs for the same reason.

Conclusion
To cope with the unnecessary updating cost in top-k query processing, we propose a new top-k query algorithm called PreFU, which is based on time series prediction models. The algorithm evaluates the cost of updating the filters based on ARIMA prediction models, which are built on the historical sensor data. Our PreFU