Adaptive Filter Updating for Energy-Efficient Top-k Queries in Wireless Sensor Networks Using Gaussian Process Regression

Processing top-k queries with dynamic filtering windows installed on sensor nodes is an important research direction in wireless sensor networks. Filters reduce the transmission of redundant data. However, existing filter-based algorithms consume a large amount of energy on filter updating. In this paper, an energy-efficient top-k query technique based on adaptive filters is proposed. To reduce the energy spent on updating filters, we provide an algorithm named FUGPR, which processes top-k queries based on Gaussian process regression. When the filters change, the sensor readings are predicted to calculate the updating costs of the filters; FUGPR then decides whether the filters need to be updated. Thus, the energy consumption for updating filters is decreased. Experimental results on two distinct real datasets show that our approach efficiently reduces the energy consumed by filter updating.


Introduction
With improved understanding of the physical world and the rapid development of technologies such as electronics and wireless communication, wireless sensor networks have been applied to many areas such as military, medical treatment, and environmental surveillance. Their promise has attracted the attention of many scholars. In wireless sensor networks, sensor nodes are powered by batteries and are therefore energy-limited. In addition, various kinds of sensors generate a large amount of data. If all the data is sent to the base station, much energy will be consumed and the sensors will soon run out of energy.
Among the large quantity of data generated by wireless sensors, users are often interested in the maximum or minimum values. A top-k query, which returns the k sensor nodes with the highest readings, meets this requirement. Top-k queries have been widely used in wireless sensor networks [1]; for instance, they can be used to effectively monitor environmental and ecological changes. Assume that there are many bird feeders placed in a forest, each of which includes a sensor node. These sensor nodes can detect weight changes to count the number of birds landing at the feeder; thus ornithologists can estimate the number of birds in different areas and choose areas in which to observe the living habits of birds. Yet some sensor nodes frequently appear in the candidate sets of top-k queries. After several runs of top-k queries, some of these nodes will fail, because they frequently sense and transmit data. According to the top-k query semantics of the example above, the best sensor nodes are often the ones that fail. After that, the top-k query results cannot reflect the true distribution of the birds in the forest, because the energy of the failed nodes is exhausted.
To answer top-k queries in wireless sensor networks, a typical solution is to make use of filters. Filters are broadcast into the network and used by each sensor node to decide whether its value is relevant to computing the query result. FILA (filter based monitoring approach) [2,3] is an energy-efficient approach among these algorithms. The basic idea of FILA is to install a filter at each child sensor node, which avoids transmitting redundant data to parent nodes or the base station. However, the approach consumes much energy on updating filters. The eager filter update policy in FILA updates window parameters immediately once sensor readings violate the filters, while the lazy update policy cannot guarantee that the updated filters will not have to be changed back to the old ones in the following epochs. By predicting future sensor readings, we can decide whether it is necessary to update the current filters in order to save energy over a long period. Mai et al. [4,5] proposed the DAFM algorithm based on FILA. The algorithm predicts the next readings of sensor nodes with a linear regression model. Then, the base station decides whether to update the filtering windows based on the predicted benefits. When sensor data follows complicated distributions and varies widely over time, the prediction performance of linear regression is unsatisfactory and the performance of the DAFM algorithm degrades. These issues are essential to top-k queries in wireless sensor networks and are investigated in this paper. The contributions of this paper are summarized as follows:
(i) We adopt a powerful Gaussian process regression (GPR) model to predict the sensor readings of the next steps based on historical sensor data. The GPR model is suitable for sensor data because of its temporal correlations. Our proposed filter updating based on Gaussian process regression (FUGPR) approach focuses on the reevaluation performed when filters need updating. FUGPR introduces a prediction mechanism to achieve lower energy consumption than the eager and lazy filter update policies of the FILA approach. Also, compared with the DAFM approach, FUGPR achieves a smaller mean squared error and thus performs more effectively.
(ii) Instead of the one-step-forward prediction of the DAFM approach, we adopt adaptive step prediction based on prediction errors. By selecting suitable values for the specific sensor data, FUGPR outperforms existing approaches.
(iii) Extensive experiments are conducted to evaluate the proposed FUGPR approach using real data traces. The results show that FUGPR outperforms DAFM and the two FILA variants in terms of both energy consumption and network lifetime under single hop and multihop network configurations.
The rest of this paper is organized as follows. In Section 2, we discuss related work. Typical filtering methods for top-k query processing are introduced in Section 3, and our proposed FUGPR algorithm is presented in Section 4. In Section 5, extensive simulations are conducted to show the efficiency and accuracy of the proposed method. Finally, we conclude this paper in Section 6.

Related Work
A naive implementation for monitoring top-k queries is a centralized approach in which all sensor readings are periodically collected by the base station, which then computes the top-k result set directly. The most prominent algorithm for top-k processing is the Threshold Algorithm (TA) [6]. Building on TA, Cao and Wang [7] developed a three-phase protocol (TPUT) which decreases the number of remote accesses in large distributed networks. Theobald et al. [8] introduced a family of approximate top-k algorithms with probabilistic guarantees for multidimensional data. By extending the TPUT algorithm, Michel et al. [9] proposed the KLEE algorithm, an adaptive framework for approximate answers that allows efficiency to be traded off against result quality and network bandwidth. Zeinalipour-Yazti et al. [10] proposed the UB-K and UBLB-K algorithms, which return upper/lower bounds instead of exact answers. Approximate evaluation of top-k queries based on sampling has also been proposed by Silberstein et al. [11].
In the area of wireless sensor networks, energy-efficient query processing has always been an important issue. A wireless sensor network can be viewed as a distributed network consisting of many energy-limited sensor nodes, in which communication is the main source of energy consumption. Therefore, the centralized approach consumes extra energy because it transmits massive amounts of data. Filtering and aggregation are currently the two main strategies, used in combination, for processing top-k queries in wireless sensor networks. To reduce the communication cost of data collection, Madden et al. [12] proposed an in-network aggregation technique, known as TAG, which keeps unnecessary data from being transmitted, in contrast to centralized algorithms. Similar to TAG, Cougar [13] and Kspot [14] also employ a centralized optimizer to coordinate sensor nodes in an energy-efficient manner. Chen et al. proposed the QF (Quantile Filter) approach [15], which treats a sensing value and its sensor as a point. The goal of the algorithm is to decrease energy consumption and prolong the lifetime of the network, not only by minimizing the total energy consumption but also by consuming less energy on each node. More recently, they developed online algorithms for answering time-dependent top-k queries with different values of k through the dynamic maintenance of a materialized view that consists of historical top-k results [16]. Liu et al. [17] proposed a cross pruning (XP) aggregation framework for top-k queries in wireless sensor networks. The framework uses a cluster-tree routing structure to aggregate more objects locally and a broadcast-then-filter approach. In addition, it provides an in-network aggregation technique to filter out redundant values, which enhances in-network filtering effectiveness. Abbasi et al. [18] proposed the MOTE (model-based optimization technique) approach, which assigns filters to nodes by model-based optimization. Nevertheless, finding the optimal filter settings for the top-k set is an NP-hard problem.
However, these approaches incur unnecessary updates in the network and are not really energy-efficient. Olston et al. [19] proposed the range caching mechanism for approximate query processing. In this mechanism, the base station caches a value range for each sensor node and computes a tentative result based on the cached value ranges. The cached values are refreshed only when the new values on the sensor nodes violate the ranges. Olston et al. [20] also used filters to bound the error of query results over distributed streams, proposing an adaptive scheme that reduces the communication cost by adjusting the precision at each individual source. Inspired by these filter based ideas [19,20], Wu et al. [2,3] proposed the FILA approach for top-k monitoring, which installs a filter on each sensor node to filter out unnecessary data that does not contribute to the final results. Reevaluation and filter setting are two critical aspects that ensure its correctness and effectiveness. When the new filters differ from the old ones maintained by the sensor nodes, the base station needs to update them. But when the sensed values on nodes vary widely, the base station must frequently send updated filters to the related nodes, which leads to a large updating cost and degrades the performance of the algorithm. Mai et al. [4,5] proposed the DAFM approach, which aims to reduce the communication cost of sending probe messages in the reevaluation process as well as the transmission cost of filter updating in FILA. DAFM predicts the next readings of sensor nodes with linear regression models to determine whether these values fall inside or outside the filtering windows. To some extent, this approach decreases the cost of filter updating. However, sensed values are affected by various factors, and the performance is worse when linear regression models are adopted to predict future sensor values.
In our previous work [21], time series models were adopted to predict the next sensor values in order to avoid unnecessary filter updates. However, that approach is suited to sensor data that follows a time series distribution and is not energy-efficient for non-time-series sensor data.

Filter Based Top-k Query Methods
We first give a formal definition of the problem in Section 3.1. Then, Section 3.2 gives an overview of filter based top-k query methods and discusses the limitations of existing methods.

Problem Definition.
We consider $N$ sensor nodes in a wireless sensor network, labeled $1, 2, \ldots, N$, which compose a set $S = \{1, 2, \ldots, N\}$. Each sensor node $i$ ($i \in S$) measures the local physical phenomenon $v_i$ (e.g., temperature, voltage, light, or humidity) at a fixed sampling rate. We consider a top-k query that requests the list of $k$ sensor nodes with the highest readings, $R = \langle n_1, n_2, \ldots, n_k \rangle$, where $v_{n_i} \ge v_{n_j}$ for all $i < j$, and $v_l \le v_{n_k}$ for all $l \ne n_i$ ($i = 1, 2, \ldots, k$). The results are maintained by the base station and finally returned to users. The goal of this paper is to prolong the lifetime of wireless sensor networks by minimizing the overall energy consumption.
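The query semantics above can be illustrated with a minimal centralized sketch. The function name `top_k`, the dictionary layout, and the tie-breaking by node label are assumptions made for illustration only; they are not part of the definition.

```python
def top_k(readings, k):
    """Return the labels of the k nodes with the highest readings, highest first.

    readings: dict mapping node label -> current reading.
    Ties are broken by node label (an assumption; the definition leaves ties open).
    """
    ranked = sorted(readings, key=lambda node: (-readings[node], node))
    return ranked[:k]


# Hypothetical readings for five nodes.
readings = {1: 21.5, 2: 19.0, 3: 25.2, 4: 23.1, 5: 18.4}
result = top_k(readings, 2)
```

Every reading outside the returned list is no larger than the reading of the last returned node, which is the second condition of the definition.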

An Overview of Filter Based Methods.
To reduce network traffic in data collection, TAG [12] has been proposed as an in-network aggregation technique. The topology of the sensor network can be assumed to be a spanning tree rooted at the base station. In the TAG solution for monitoring top-k queries, every intermediate node of the spanning tree collects all values from its children and forwards the $k$ largest of these values toward the base station. Although TAG reduces the number of transmitted values by aggregation, the number of messages remains high. Consider the situation in which the final $k$ largest values all fall in one branch of the spanning tree at some level; the readings of the other branches do not need to be forwarded. Therefore, TAG incurs unnecessary transmissions in the network and is not energy-efficient.
The range caching method was first provided by Olston et al. [19,20] and later utilized by Wu et al. [2,3] to process top-k queries in wireless sensor networks. The base station caches a value range for the value at each sensor node. If the sensed value of sensor node $i$ is $v_i$, we take $u_i = v_i + w/2$ and $l_i = v_i - w/2$ as the upper bound and lower bound of the range, where $w$ is the width of the range. A sensor node sends an update to the base station only when its new value falls outside the range of the previously reported value. The range caching method is a simple way to process top-k queries. However, the ranges (a range can be regarded as a filter) of distinct sensor nodes may overlap, and a top-k query can only be answered after the base station has collected the values of all sensor nodes with overlapping ranges. This procedure consumes much energy, so the range caching method is not energy-efficient. Beyond this, the range width is difficult to choose.
To overcome the defects of the range caching method, Wu et al. [2,3] proposed FILA to process top-k queries in sensor networks. At the beginning of FILA, the base station collects the readings from all sensor nodes and sorts them to obtain the initial top-k result set. Then the base station calculates a filter $[l_i, u_i]$ for each sensor node $i$ and sends it to the corresponding sensor node for installation. In the next top-k query processing period, if the new reading of sensor node $i$ is within the filter $[l_i, u_i]$, then no update is sent to the base station. Otherwise, the new reading violates the filter and an update is sent to the base station. The base station then reevaluates the top-k result and adjusts the filters of the relevant sensor nodes.
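The TAG-style aggregation described above can be sketched as a recursion over the spanning tree. This is a simplified illustration under stated assumptions, not the TAG implementation: the tree is given as a hypothetical `tree` dictionary (parent to children), the base station is node 0 and has no reading of its own, and message scheduling is ignored.

```python
import heapq


def tag_top_k(tree, readings, node, k):
    """Return the k largest (reading, node) pairs found in the subtree rooted at node."""
    # A node contributes its own reading (the base station has none).
    candidates = [(readings[node], node)] if node in readings else []
    # Each child forwards only the k largest values of its own subtree.
    for child in tree.get(node, []):
        candidates.extend(tag_top_k(tree, readings, child, k))
    return heapq.nlargest(k, candidates)


# Hypothetical topology: base station 0 with two branches.
tree = {0: [1, 2], 1: [3, 4], 2: [5]}
readings = {1: 20.0, 2: 18.0, 3: 25.0, 4: 23.0, 5: 17.0}
answer = tag_top_k(tree, readings, 0, 2)
forwarded_by_node_2 = tag_top_k(tree, readings, 2, 2)
```

In this example both final values come from the branch under node 1, yet node 2 still forwards two values, which is the kind of unnecessary transmission the paragraph points out.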
(1) Filter Setting. At the beginning of the FILA method, the base station has obtained the sorted readings of all the sensor nodes. In FILA, each node in the top-k results has a separate filter while all the remaining non-top-k nodes share a common one. At any time, we only need $k + 1$ filters. In addition, to ensure the correctness of the FILA method, each filter $[l_i, u_i]$ should cover the current reading of its node but must not overlap with any other filter. To maximize the filtering capability of a feasible filter setting, the upper bound $u_1$ of the top-1 node's filter is set to $+\infty$, the lower bound $l_{k+1}$ of the non-top-k nodes' filter is set to $-\infty$, and $u_{i+1}$ is set equal to $l_i$ for all the remaining bounds. In FILA, two filter setting strategies, uniform and skewed, are provided. In this paper, we adopt the uniform filter setting for simplicity. That is, $u_{i+1}$ and $l_i$ are set at the midpoint of two adjacent sensor readings, as given in (1), where $v_i$ denotes the $i$th largest reading; Figure 1 illustrates the filter setting of the sensor nodes for a top-k query:
$$u_{i+1} = l_i = \frac{v_i + v_{i+1}}{2}, \quad i = 1, 2, \ldots, k. \tag{1}$$
(2) Query Reevaluation. In FILA, if at least one sensor reading is sent to the base station because of a filter violation, the top-k results become undecided at the base station. Then the base station should probe some related sensor node(s) to reevaluate the top-k results.
When the updated reading overlaps with the filter of any other sensor node, there are three situations for this update.
Internal Update. An update originating from a top-k node jumps into the filter of another top-k node (see Figure 2).
Join Update. An update from a non-top-k node jumps over the critical bound and falls into the filter of a top-k node (see Figure 2).
Leave Update. An update from a top-k node jumps over the critical bound and falls into the filter of the non-top-k nodes (see Figure 2).
If an update is an internal or a join one, then only the relevant top-k node whose filter covers the updated sensor reading needs to be probed to reevaluate the top-k result. Otherwise, a leave update from a node may require probing all non-top-k nodes to look for the new $k$th node. This introduces high energy consumption.
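The uniform filter setting and the three update types can be sketched as follows. This is an illustrative sketch, not FILA's code: the function names are hypothetical, readings are assumed to be already sorted in descending order with more than $k$ entries, and a reading exactly equal to the critical bound is assumed to belong to the non-top-k side.

```python
import math


def uniform_filters(sorted_readings, k):
    """Return k + 1 (lower, upper) filters for readings sorted in descending order.

    Filters 0..k-1 belong to the top-k nodes; the last one is shared by all
    non-top-k nodes. Adjacent bounds meet at the midpoint of adjacent readings.
    """
    filters = []
    upper = math.inf  # upper bound of the top-1 filter
    for i in range(k):
        lower = (sorted_readings[i] + sorted_readings[i + 1]) / 2
        filters.append((lower, upper))
        upper = lower  # the next filter starts where this one ends
    filters.append((-math.inf, upper))  # shared non-top-k filter
    return filters


def classify_update(was_top_k, new_reading, critical_bound):
    """Classify a reading that has already violated its own filter."""
    now_top_k = new_reading > critical_bound
    if was_top_k and now_top_k:
        return "internal"
    if not was_top_k and now_top_k:
        return "join"
    if was_top_k and not now_top_k:
        return "leave"
    return "none"  # a non-top-k reading still inside the shared filter


filters = uniform_filters([30.0, 26.0, 20.0, 15.0, 10.0], k=2)
critical = filters[-1][1]  # bound between the kth filter and the shared filter
```

With these hypothetical readings the critical bound is 23.0, so a top-k node dropping to 12.0 produces a leave update, the expensive case discussed above.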
(3) Filter Updating. When sensor readings violate their filters, the filter settings of the corresponding sensor nodes need to be recomputed and the base station sends the new filtering windows to the nodes for installation. There are two approaches for updating the filter of each node in FILA.
Eager Filter Update. If a new filtering window is different from the old one, then the new filter computed by the base station is immediately sent to the node to replace the old one. When the top-k result set changes, the filters of the sensor nodes also change.
Lazy Filter Update. If a new filtering window contains the old one, the update of that filter is delayed and the node keeps its old filter; only the other changed filters are sent.
We demonstrate the two updating policies using the example shown in Figure 3, in which we consider only four sensor nodes $n_i$, $n_{i+1}$, $n_{i+2}$, and $n_{i+3}$. When the readings $v_{i+1}$ and $v_{i+2}$ change, the four filters also change. According to the eager filter update policy, the base station needs to send all four filters to the sensor nodes. In contrast, according to the lazy filter update policy, because the new filter windows of nodes $n_i$ and $n_{i+3}$ contain the old ones, the base station only needs to send the filters of nodes $n_{i+1}$ and $n_{i+2}$ while delaying the filter update of nodes $n_i$ and $n_{i+3}$. Eager filter update thus consumes much energy, because it updates the filters of all sensor nodes even when only one sensor reading violates its filter. Lazy filter update consumes less energy because the sensor nodes whose new filter windows contain the old ones remain unchanged.
Nevertheless, the lazy filter update policy is naive and not energy-efficient. Because the filter update of some sensor nodes is delayed, gaps between adjacent filters occur, as illustrated in Figure 4, which decreases the filtering capability. In addition, lazy filter update is a suboptimal method: although the number of updates is reduced for a single filter update, the total number of filter updates remains high and consumes much energy.
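The difference between the two policies can be sketched as a function that returns which filters the base station would transmit. The names `contains` and `filters_to_send`, the string node labels, and the numeric windows are hypothetical; the windows are chosen to mimic the four-node situation of the example rather than taken from Figure 3.

```python
def contains(outer, inner):
    """True if the window outer = (lower, upper) fully contains the window inner."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def filters_to_send(old, new, policy):
    """Return the nodes whose new filter must be transmitted under a policy.

    old, new: dicts mapping node label -> (lower, upper) filter window.
    """
    changed = [node for node in new if new[node] != old[node]]
    if policy == "eager":
        return changed  # every changed filter is sent immediately
    if policy == "lazy":
        # A node keeps its old filter when the new window contains the old one.
        return [node for node in changed if not contains(new[node], old[node])]
    raise ValueError("policy must be 'eager' or 'lazy'")


# Hypothetical windows for four adjacent nodes before and after reevaluation.
old = {"n_i": (40.0, 50.0), "n_i+1": (30.0, 40.0),
       "n_i+2": (20.0, 30.0), "n_i+3": (10.0, 20.0)}
new = {"n_i": (38.0, 50.0), "n_i+1": (28.0, 38.0),
       "n_i+2": (22.0, 28.0), "n_i+3": (10.0, 22.0)}
eager_nodes = filters_to_send(old, new, "eager")
lazy_nodes = filters_to_send(old, new, "lazy")
```

Under the lazy policy the two outer nodes keep their narrower old windows, which is also where the gaps between adjacent filters come from.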

Defects of FILA.
To some extent, FILA avoids transmitting redundant data and saves energy. However, a large amount of energy is still consumed unnecessarily. Assume that after sampling in epoch $t$, the filtering windows are as shown in Figure 5 [5]. In Figure 5, the value $v_3$ jumps out of its filter and falls into the filtering window of $v_4$. The base station needs to adjust the filtering windows of the relevant nodes, as shown in Figure 5(c). However, after sampling in epoch $t + 1$, as shown in Figure 5(d), the value $v_3$ changes again and jumps back into its previous window. In FILA, fluctuation of the data causes frequent updates of the filter windows, which incurs a large communication cost. Mai et al. [5] proposed a prediction-based updating algorithm which determines whether to update the filters according to the expected updating cost. The algorithm decreases the cost of updating the filters to some extent. However, this approach is limited by the prediction performance of the linear regression model.
In addition, in Figure 5 the values of the non-top-k nodes vary within a small range, so their filtering windows are useful for filtering out irrelevant sensor nodes. However, the upper bound of the non-top-k nodes' filter is determined by the values of the $k$th and $(k + 1)$th sensor nodes; when the value ranked $k$th varies, the filters on the non-top-k nodes are affected directly. Since a wireless sensor network generally consists of a large number of sensor nodes and $k$ is usually small in a top-k query, a large number of sensor nodes fall into the non-top-k node set. If the algorithm updates the filters every time the top-k results are reevaluated, it consumes more energy on updating the filters of the non-top-k nodes.
In this paper, we adopt Gaussian process regression to alleviate the burden of filter updating in FILA. When the filters change, the sensor readings are predicted by Gaussian process regression to calculate the updating costs of the filters. We provide an algorithm for top-k processing under the FILA framework, named FUGPR. FUGPR reduces the cost of filter updating when sensor readings fluctuate and prolongs the lifetime of the wireless sensor network.

Filter Updating Algorithms
FUGPR is an efficient top-k processing method based on the predicted benefits obtained from Gaussian process regression [22,23]. Gaussian processes let the data "speak" more clearly for themselves. They extend multivariate Gaussian distributions to infinite dimensionality. Formally, a Gaussian process generates data located throughout some domain such that any finite subset of the range follows a multivariate Gaussian distribution. Under the Bayesian linear regression framework, a Gaussian process predictor calculates posteriors from priors over functions rather than from priors over parameters [24]. Gaussian process regression has been widely exploited in many research topics, such as human motion estimation and time series prediction [25]. Recently, the Gaussian process approach has been adopted for predicting monitoring models over distributed data streams [26] and for real-time sensor data processing in wireless sensor networks [27].
In FUGPR, each node in the top-k list has a separate filter while all the remaining non-top-k nodes share a common one. This setting follows the framework of FILA. The $k + 1$ filters must not overlap with one another. If the reading of a sensor node at the next epoch falls within its filter, the reading is not sent to the base station. Otherwise the violating reading needs to be sent to the base station. Due to the limited storage of sensor nodes (e.g., MICAz and telosB only have 4 KB and 10 KB of RAM, resp. [28]), each sensor node only preserves its current reading and the violating readings from its child nodes. The sensor nodes that store the most readings are those connected directly to the base station; the maximum number of stored readings is decided by the number of layers of the spanning tree (which is a value not less than 100). For typical sensor nodes such as MICAz and telosB, the storage is enough to preserve these readings. The base station reevaluates the filter setting of all the nodes based on the violating readings. According to the analysis in Section 3.2, the reevaluation process consumes much energy for filter updating. Instead, we utilize the historical sensor readings of each node to construct Gaussian process models that predict the future values of each sensor reading. Then, the costs of using the old filters and of using the new filters are evaluated to decide whether or not to update the filters. The predicted benefits of FUGPR save much energy by avoiding frequent filter updates.

Gaussian Process Regression for Prediction.
We predict the next epochs' readings of a sensor node based on their temporal correlations. The simplest example is to predict the next epoch's sensor value based on the current reading. For one-step prediction, we let $x^{(i)}$ be the current reading and $y^{(i)}$ the predicted value of the next epoch. We obtain a dataset $D = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n,\ y^{(i)} = x^{(i+1)}\}$, which can be used as the input of the Gaussian process regression. However, this method leads to a large error margin, because the current reading alone is far from enough to predict the next epoch's sensor reading.
To facilitate the calculation, we focus only on the energy consumed by communication and omit other aspects of energy consumption. The assumption is reasonable because communication is the main source of energy consumption in wireless sensor networks. The total amount of energy spent in sending a message with $x$ bytes of content is given by $s_m + s_b x$, where $s_m$ and $s_b$ are the per-message and per-byte sending costs, respectively. The total amount of energy consumed in receiving a message with $x$ bytes of content is given by $r_m + r_b x$, where $r_m$ and $r_b$ are the per-message and per-byte receiving costs, respectively. For MICA2 motes, typical values of $r_m$ and $r_b$ are roughly 60% less than their sending counterparts. For the quantitative evaluation of FUGPR, we set $r_m = 0.4 s_m$ and $r_b = 0.4 s_b$; the values of $s_m$, $s_b$, $r_m$, and $r_b$ are displayed in Table 1 [29].
From Table 1, we can see that the fixed per-message cost of a communication is higher than the per-byte cost of the data transmitted. To reduce the number of communications, a sensor node therefore sends more bytes in each transmission. In FUGPR, readings that fall within their corresponding filter need not be sent to the base station, while a violating sensor reading is sent to the base station together with the node's recent past readings. For each sensor node, the number $p$ of readings used as input is learned from the historical data at the base station. It cannot be large, because transmitting too much data also consumes much energy. In this paper, $p$ varies from 1 to 5, and we choose the value with the minimum mean squared error (MSE) for the Gaussian process regression in FUGPR. At the base station, we fit a Gaussian process regression model for each sensor node on its historical dataset $D$. The input is the vector of sensor readings $X$ and the output is the predicted value $y$, with $X = [x_t, x_{t-1}, \ldots, x_{t-p+1}]$ and $y = x_{t+1}$.
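The communication cost model above is linear in the message size, which a short sketch makes concrete. The numeric constants below are placeholders invented for illustration (the values of Table 1 are not reproduced here), and the names `send_cost`, `receive_cost`, `S_M`, `S_B`, and `READING_BYTES` are hypothetical; only the 0.4 receive-to-send ratio comes from the text.

```python
def send_cost(num_bytes, s_m, s_b):
    """Energy to send one message: per-message cost plus per-byte cost."""
    return s_m + s_b * num_bytes


def receive_cost(num_bytes, s_m, s_b, ratio=0.4):
    """Energy to receive one message, with receiving costs set to ratio * sending costs."""
    r_m, r_b = ratio * s_m, ratio * s_b
    return r_m + r_b * num_bytes


# Placeholder constants in arbitrary energy units (NOT the values of Table 1).
S_M, S_B = 10.0, 1.0
READING_BYTES = 2  # assumed size of one reading

# Sending five readings in one message versus in five separate messages.
batched = send_cost(5 * READING_BYTES, S_M, S_B)
separate = 5 * send_cost(READING_BYTES, S_M, S_B)
```

Whenever the per-message cost dominates the per-byte cost, batching several readings into one message is cheaper, which motivates sending a violating reading together with past readings.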
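Building the regression dataset from a node's history can be sketched as a sliding window over the time series. The function name `make_lagged_dataset` and the symbol `p` for the number of past readings used as input are naming assumptions for this sketch; each input row holds the current reading followed by earlier ones, and the target is the reading of the next epoch.

```python
def make_lagged_dataset(series, p):
    """Build (X, y) from a list of readings ordered by epoch.

    Each row of X is [x_t, x_{t-1}, ..., x_{t-p+1}] and the matching
    entry of y is x_{t+1}.
    """
    if p < 1 or len(series) <= p:
        raise ValueError("need p >= 1 and more than p readings")
    X, y = [], []
    for t in range(p - 1, len(series) - 1):
        X.append([series[t - j] for j in range(p)])
        y.append(series[t + 1])
    return X, y


# Hypothetical readings of one node over five epochs.
X, y = make_lagged_dataset([1.0, 2.0, 3.0, 4.0, 5.0], p=2)
```

Repeating this for each candidate window length from 1 to 5 and keeping the one with the lowest validation MSE would mirror the selection procedure described above.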
For each sensor node, we use several past readings as the input and the reading at the next epoch as the output; through a Gaussian noise model we obtain
$$y = f(\mathbf{x}) + \varepsilon,$$
where $\varepsilon$ is Gaussian noise. Users often expect the underlying function $f(\mathbf{x})$ to be linear, so that a least-squares method can be used to fit a straight line. In DAFM, Mai et al. [5] adopted a linear regression method to predict the next epochs' sensor values. A typical linear regression model for prediction has the form
$$y = a x + b + \varepsilon,$$
where $\varepsilon$ is a random error following the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ and the parameters $a$ and $b$ can be estimated by the least-squares method.
Because sensor readings from real applications in wireless sensor networks vary widely, the prerequisites of the linear regression method are often not satisfied. Beyond linear regression, one may use the principles of model selection to choose among other candidate models, since $f(\mathbf{x})$ may also be quadratic, cubic, or even nonpolynomial. Instead of claiming that $f(\mathbf{x})$ relates to some specific model, a Gaussian process extends multivariate Gaussian distributions to infinite dimensionality and lets the data speak for themselves. In geostatistics, Gaussian processes are known as "Kriging," but that literature concentrates on input spaces of only two or three dimensions, while Gaussian processes consider more general input spaces.
We illustrate Gaussian process regression in Figure 7. For simplicity, we consider a one-dimensional input space. The solid points are real sensor readings, the solid line indicates the predicted values for the real sensor readings, and the shaded region denotes the 95% confidence interval around the regression line. If the current sensor reading of a sensor node is $x_*$, the sensor value $y_*$ of the next epoch follows a Gaussian distribution centered at the value $\hat{y}_*$ predicted by Gaussian process regression:
$$y_* \sim \mathcal{N}(\hat{y}_*, \sigma^2). \tag{4}$$
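For comparison with the Gaussian process approach, the linear baseline can be sketched with ordinary least squares on pairs of consecutive readings. This is a generic textbook fit rather than the DAFM code; the function names and the parameter names `a` (slope) and `b` (intercept) are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the model y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = s_xy / s_xx  # undefined when all xs are equal
    b = mean_y - a * mean_x
    return a, b


def predict_next(series):
    """One-step-ahead prediction: regress x_{t+1} on x_t, then apply to the last reading."""
    a, b = fit_line(series[:-1], series[1:])
    return a * series[-1] + b


a, b = fit_line([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
next_value = predict_next([1.0, 2.0, 3.0, 4.0])
```

A straight line fits a steadily rising series exactly, but it cannot follow readings whose dependence on the past is nonlinear, which is the limitation discussed above.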

Gaussian Process Regression.
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. In general, a Gaussian process is completely specified by its mean function $m(x)$ and covariance function $k(x, x')$. Usually, for notational simplicity, we take the mean function to be zero. The covariance function $k(x, x')$ represents the influence of different sensor readings on one another.

International Journal of Distributed Sensor Networks
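If there are $n$ items in the training dataset $D$, an $n \times n$ positive definite covariance matrix $K$ can be constructed from $k(x, x')$ as follows:
$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix}.$$
When a new sensor reading $x_*$ arrives, the covariance vector $K_*$ between $x_*$ and the previous data items and the covariance $K_{**}$ of $x_*$ with itself are
$$K_* = [k(x_*, x_1), k(x_*, x_2), \ldots, k(x_*, x_n)], \quad K_{**} = k(x_*, x_*).$$
The joint distribution of $\mathbf{y}$ and $y_*$ is also multivariate Gaussian:
$$\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K & K_*^{\top} \\ K_* & K_{**} \end{bmatrix}\right).$$
According to Bayes' rule, we obtain
$$y_* \mid \mathbf{y} \sim \mathcal{N}\left(K_* K^{-1} \mathbf{y},\ K_{**} - K_* K^{-1} K_*^{\top}\right).$$
The estimate of the sensor reading is the mean of this distribution, $\hat{y}_* = K_* K^{-1} \mathbf{y}$, and its variance is $\sigma^2 = K_{**} - K_* K^{-1} K_*^{\top}$.
Given a new sensor reading $x_*$ and a confidence level $1 - \alpha$, the confidence interval is $\hat{y}_* \pm z_{\alpha/2}\,\sigma$.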
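The predictive mean, variance, and confidence interval can be sketched in plain Python for small training sets. This is a didactic sketch, not the implementation evaluated in the paper: it assumes a zero-mean process (readings should be centered first), a squared exponential kernel with hypothetical hyperparameters `sigma_f`, `length`, and `sigma_n`, noise folded into the diagonal, and a Cholesky factorization in place of an explicit matrix inverse.

```python
import math


def se_kernel(a, b, sigma_f, length):
    """Squared exponential covariance between two input vectors."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sigma_f ** 2 * math.exp(-d2 / (2 * length ** 2))


def cholesky(A):
    """Lower-triangular L with A = L L^T for a positive definite matrix A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][m] * z[m] for m in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[m][i] * x[m] for m in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(X, y, x_star, sigma_f=1.0, length=1.0, sigma_n=0.1):
    """Return the predictive mean and variance of y_* at input x_star."""
    n = len(X)
    K = [[se_kernel(X[i], X[j], sigma_f, length) + (sigma_n ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [se_kernel(x_star, X[i], sigma_f, length) for i in range(n)]
    k_ss = se_kernel(x_star, x_star, sigma_f, length) + sigma_n ** 2
    L = cholesky(K)
    alpha = solve_cholesky(L, y)       # K^{-1} y
    v = solve_cholesky(L, k_star)      # K^{-1} K_*^T
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    var = k_ss - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, var


# Hypothetical centered readings; 1.96 is the z-value for a 95% interval.
X = [[0.0], [1.0], [2.0]]
y = [0.0, 1.0, 0.5]
mean, var = gp_predict(X, y, [1.0], sigma_n=1e-3)
half_width = 1.96 * math.sqrt(max(var, 0.0))
interval = (mean - half_width, mean + half_width)
```

Far from the training inputs the kernel values vanish, so the mean returns to zero and the variance returns to the prior level, which is why predictions are most useful close to recently observed readings.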

Training Gaussian Process Regression Models.
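The choice of the covariance function largely determines the reliability of the regression, and the appropriate choice depends mainly on the characteristics of the training sensor readings. In the Gaussian process literature, a popular choice is the squared exponential (SE) covariance function, which is also adopted in our paper:
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right).$$
When the noise term covNoise $= \sigma_n^2\,\delta(x, x')$ is folded into $k(x, x')$, the SE covariance function becomes
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) + \sigma_n^2\,\delta(x, x').$$
In the above equation, if a separate length scale is used for each input dimension, $\ell = (\ell_1, \ldots, \ell_d)$, the covariance function is named covSEard, that is, the squared exponential covariance function with automatic relevance determination (ARD) distance measure.
For $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K)$, we obtain
$$p(\mathbf{y} \mid \mathbf{x}, \theta) = \frac{1}{(2\pi)^{n/2} |K|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{y}^{\top} K^{-1} \mathbf{y}\right).$$
Taking the logarithm, $\log p(\mathbf{y} \mid \mathbf{x}, \theta) = L(\theta)$ is
$$L(\theta) = -\frac{1}{2} \mathbf{y}^{\top} K^{-1} \mathbf{y} - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi,$$
where $\theta = \{\ell, \sigma_f, \sigma_n\}$ collects the hyperparameters. Running a multivariate optimization algorithm (e.g., conjugate gradients or Nelder-Mead simplex) on this equation yields a good choice for $\theta$.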
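The log marginal likelihood used for training can be evaluated directly from a Cholesky factor of the covariance matrix. The sketch below assumes one-dimensional inputs and a squared exponential kernel with additive noise; the coarse grid search over the length scale is a deliberately simple stand-in for the conjugate gradient or Nelder-Mead optimizers mentioned above, and all function and parameter names are hypothetical.

```python
import math


def log_marginal_likelihood(X, y, sigma_f, length, sigma_n):
    """log p(y | X, theta) for scalar inputs X under an SE kernel with noise."""
    n = len(X)
    K = [[sigma_f ** 2 * math.exp(-(X[i] - X[j]) ** 2 / (2 * length ** 2))
          + (sigma_n ** 2 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Cholesky factorization K = L L^T.
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(K[i][i] - s) if i == j else (K[i][j] - s) / L[j][j]
    # Solve L z = y, so that y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][m] * z[m] for m in range(i))) / L[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))  # log |K|
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2 * math.pi)


def select_length_scale(X, y, candidates, sigma_f=1.0, sigma_n=0.1):
    """Pick the candidate length scale with the highest log marginal likelihood."""
    return max(candidates,
               key=lambda l: log_marginal_likelihood(X, y, sigma_f, l, sigma_n))


# Hypothetical smooth readings sampled at unit spacing.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 0.8, 0.9, 0.1, -0.8]
best = select_length_scale(X, y, [0.1, 0.5, 1.0, 2.0, 5.0])
```

In practice all three hyperparameters would be optimized jointly; fixing two of them here keeps the example short.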

Optimization Strategies of Filter Updating.
If the new reading of a sensor node falls outside its filter, an update is sent to the base station. If the update jumps into another node's filter, the top-k results become undecided at the base station. Therefore, the base station should probe some related sensor node(s) to reevaluate the top-k results. The nodes that have to send their new readings to the base station package the readings of the current and past epochs, and Gaussian process regression is then used to estimate the means and variances of their future readings. For the sensor nodes whose readings are not updated, the most recent means and variances of their future readings are retained.
Assume the new reading of a node $n_i$ at epoch $t + 1$ is $v_i$, the predicted new reading is $v_i^{pre}$, and its variance is $\sigma_i^2$. Let $e_i$ be the average number of recent epochs in which the readings of $n_i$ do not violate its filter. Assume that the network has run for 100 epochs. For convenience of description, we consider the sensor node $n_i$ only from the 80th to the 100th epoch. Assume that the reading of $n_i$ violates its filter at the 84th, 91st, 94th, and 100th epochs. Then the numbers of epochs in which the reading of $n_i$ does not violate its filter are 91 − 84 − 1 = 6 epochs, 94 − 91 − 1 = 2 epochs, and 100 − 94 − 1 = 5 epochs. So the average number of nonviolation epochs is $e_i = (6 + 2 + 5)/3 \approx 4.33$.
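The averaging in the worked example above can be reproduced with a few lines of code. The function name is hypothetical; the input is the sorted list of epochs at which the node's reading violated its filter.

```python
def average_nonviolation_epochs(violation_epochs):
    """Average number of epochs between consecutive filter violations.

    violation_epochs: sorted list of epochs at which a violation occurred.
    """
    if len(violation_epochs) < 2:
        raise ValueError("need at least two violations to form a gap")
    # Epochs strictly between two consecutive violations are violation-free.
    gaps = [later - earlier - 1
            for earlier, later in zip(violation_epochs, violation_epochs[1:])]
    return sum(gaps) / len(gaps)


average = average_nonviolation_epochs([84, 91, 94, 100])
```

For the violations at epochs 84, 91, 94, and 100 the gaps are 6, 2, and 5 epochs, giving an average of 13/3.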
If the base station needs updating the filters, the cost of using the new filters at the future epochs is referred to as cost new . Otherwise, the cost of using the old filters is denoted as cost old . If cost new < cost old , then the base station sends updated filters; otherwise, all the sensor nodes maintain the old filters. cost new and cost old are calculated as follows: In formula (18), | | denotes the number of sensor nodes which need to update their filters; | | denotes the number of nodes in the wireless sensor networks. The result of cost new and cost old for comparison has nothing to do with | | × . Hence, we do not need to calculate | | × and consider the specific value of .

Experiment Settings.
We use two real datasets: Intel Lab data and LEM data to conduct our experiments.
Intel Lab Data. The dataset contains readings collected from 54 sensors deployed in the Intel Berkeley Research Lab [30]. We partition the sensor network into nonoverlapping regions so that sensors located close to each other fall in the same region. In this paper, we do not study in depth how to partition regions more reasonably; we simply partition the sensor network based on location information using the k-means algorithm. For the Intel Berkeley Research Lab wireless sensor network, we partition it into 10 regions. Figure 8 shows the result of region partitioning, where the dark nodes denote the initial center points. We use the temperature, humidity, and voltage data from March 1, 2004, as experimental data; the data are collected every 31 seconds. We evaluate the performance of the proposed algorithm FUGPR by MATLAB simulations.
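A minimal sketch of location-based partitioning with k-means is given below. It uses plain Python and explicit initial centers, in the spirit of the initial center points marked in Figure 8. The coordinates, the iteration limit, and the function name are assumptions made for illustration and are not taken from the experiments.

```python
import math

def kmeans(points, centers, iterations=20):
    """Partition 2-D points into regions, starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
              for p in points]
    return labels, centers

# Example: six sensor positions split into two regions.
positions = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(positions, centers=[(0, 0), (10, 10)])
```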
LEM Data. The dataset is collected from the Live from Earth and Mars project at the University of Washington [31]. We use temperature, dew point, and sea level pressure collected from December 1, 2012, to December 1, 2013, as experimental data. There are 509,174 records in total, so each attribute has 509,174 readings. The dataset has missing values in some epochs, which we fill with the average of the readings at the prior and subsequent epochs. We extract many subsets, each containing 3,000 readings. The subsets are used to simulate the physical phenomena in the immediate surroundings of different sensor nodes.
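The gap-filling rule described above can be sketched as follows. Representing a missing reading as `None` and leaving a gap untouched when either neighbor is also missing are assumptions made for illustration; the text does not say how consecutive gaps or gaps at the ends of the series were handled.

```python
def fill_missing(readings):
    """Replace each isolated missing reading with the mean of its two neighbors."""
    filled = list(readings)
    for i, value in enumerate(readings):
        if value is None and 0 < i < len(readings) - 1:
            prev_value, next_value = readings[i - 1], readings[i + 1]
            if prev_value is not None and next_value is not None:
                filled[i] = (prev_value + next_value) / 2
    return filled

# Example: the missing second reading becomes the average of 1.0 and 3.0.
series = fill_missing([1.0, None, 3.0, 4.0])
```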
We simulated a single hop network and two multihop networks. The single hop network has 12 sensor nodes, as shown in Figure 9(a). The two multihop networks contain 8 × 8 = 64 and 12 × 12 = 144 sensor nodes, respectively. The 8 × 8 network of 64 sensor nodes is shown in Figure 9(b), with the nodes numbered. The layout of the 12 × 12 network is similar to that of the 8 × 8 network and is omitted for simplicity. To simulate the spatial correlation of sensor readings, subsets starting at successive times are assigned to neighboring nodes in the simulated networks.
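One way to assign subsets that start at successive times to neighboring nodes is a sliding window over the full series, sketched below. The row-major node numbering, the `offset` between consecutive start times, and the function name are assumptions made for illustration; the text only fixes the subset length of 3,000 readings.

```python
def assign_subsets(series, rows, cols, subset_len=3000, offset=1):
    """Give each grid node a window of readings; adjacent node numbers get shifted windows."""
    subsets = {}
    for node in range(rows * cols):
        start = node * offset
        window = series[start:start + subset_len]
        if len(window) < subset_len:
            raise ValueError("series too short for the requested grid")
        # Key each subset by the node's (row, column) position in the grid.
        subsets[divmod(node, cols)] = window
    return subsets

# Example: a 2 x 2 grid with windows of 5 readings shifted by one epoch per node.
grid = assign_subsets(list(range(100)), rows=2, cols=2, subset_len=5, offset=1)
```

Because consecutive nodes read overlapping windows, their readings are strongly correlated, which mimics nearby sensors observing the same phenomenon.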

Experimental Results.
We compare FUGPR with several typical methods: TAG, range caching, and FILA. TAG provides several basic aggregation methods and serves as the baseline for energy consumption. In the range caching method, if two filters overlap, the base station may request sensor values to decide the final top-k results. In our experiments, the filter window value for the range caching method varies from 0.3 to 3.2, and the energy consumption under each value is calculated. The value with the minimum energy consumption is chosen as the final filter window value. For the FILA method, we adopt the lazy filter update policy. For FUGPR, we use the GPML MATLAB toolbox (http://www.gaussianprocess.org/gpml/) for prediction with our proposed Gaussian process regression models.
Figure 10 compares the energy consumption of TAG, range caching, FILA, and FUGPR after 1000 runs of top-k queries on the Intel Berkeley Lab data: temperature, humidity, and voltage. FILA and FUGPR both adopt the filter mechanism, which avoids transmitting redundant data to the base station and thus saves much energy. FUGPR additionally optimizes the filter updating procedure, which reduces the energy consumed by frequent filter updating. In Figure 10, as k increases, FUGPR consumes much less energy than FILA. The reason is that in FILA many readings that violate their filters must be sent to the base station, which then requires filter reevaluation. In contrast, FUGPR needs far fewer filter updating operations because it benefits from the predictions of the Gaussian process regression models. FUGPR uses the predicted sensor values to evaluate the costs of using the updated filters and the old ones in the next epochs, namely, $cost_{new}$ and $cost_{old}$. The option with the smaller cost is adopted, so unnecessary filter updates are avoided in FUGPR.
Figure 11 compares the energy consumption of TAG, range caching, FILA, and FUGPR after 1000 runs of top-k queries for the single hop sensor network. FILA and FUGPR can filter many sensor readings, especially when k is small. When k increases (k > 7), FILA consumes much more energy. For the dew point data, when k is 9, the energy consumption of FILA is even larger than that of TAG. The reason is that, because the data fluctuate, the energy spent by the base station on requesting sensor readings and updating the filters exceeds the energy saved by filtering. In contrast, FUGPR performs best in all cases. Figures 12 and 13 compare the energy consumption of TAG, range caching, FILA, and FUGPR after 1000 runs of top-k queries for the two multihop sensor networks, 8 × 8 and 12 × 12, respectively. The same behavior occurs for FILA, and FUGPR outperforms the other three methods.

Conclusion
This paper integrates Gaussian process regression models into the FILA method and proposes the FUGPR method for top-k query processing in wireless sensor networks. The predictions of the Gaussian process regression models reduce the cost of filter updating caused by fluctuation of sensor values. Experimental results show that, on two distinct real datasets, namely Intel Berkeley Research Lab data and LEM data, and on simulated single hop and multihop sensor network structures, FUGPR outperforms TAG, range caching, and FILA, thus reducing energy consumption and prolonging the lifetime of the sensor networks.
International Journal of Distributed Sensor Networks