An Exact Top-k Query Algorithm with Privacy Protection in Wireless Sensor Networks

Aimed at the Top-k query problems in the wireless sensor networks (WSN) where enquirers try to seek the k highest or shortest reported values of the source nodes, using privacy protection technology, we present an exact Top-k query algorithm based on filter and data distribution table (shorted for ETQFD). The algorithm does the query exactly and meanwhile uses conic section privacy function to prevent the disclosure of the real data and then to promise the security of nodes in network. In this proposal, each node in WSN uses data distribution table to reflect the distribution of its own data and keeps exact filter to just return data which is possibly to be the result of the query, so as to reduce the energy cost of network and prolong network lifetime. In addition, data of node is packaged with a privacy protection function based on conic section. The algorithm's performance is examined with respect to a number of parameters using synthetic datasets. Performance analysis and simulation results illustrate that, compared to existing algorithms, ETQFD is energy-efficient and the network lifetime in our algorithm is longer. At the same time, privacy of data in this algorithm is also guaranteed.


Introduction
Wireless sensor networks are widely used in a variety of monitoring applications, including habitat monitoring and environmental observation.In many of these monitoring applications, measurements from sensors are periodically collected at the base station to answer monitoring queries submitted by users.There exist many situations where one is interested in a monitoring extreme and a typical behavior, for example, finding the highest pressure points in a pipeline, the largest temperature values in a building or patch of forest, the points of the most intense vibration in a bridge, and so forth.Oftentimes, one may be interested only in the Top-k observations.Top-k query, as a basic aggregate query, continuously returns the k nodes with the highest (or lowest) sensor readings.This type of queries is often used in practice since they give compact but useful information about the monitored environment.Hence, how to execute Top-k query in datacentric wireless sensor networks is a research hot spot.While the sensor nodes in the network have limited availability of recourses like energy and computational power, we must take energy consumption into consideration when doing queries.At the same time, the sensor network could be easily undermined by adversaries and thus they will obtain privacy information of sensor nodes.Many scholars and experts are dedicated to the research of Top-k query algorithm with highly efficient and low energy consumption.However, we should also execute Top-k query with privacy protection strategy.
Some representative Top-k query algorithms are as follows: an in-network aggregation technique, known as TAG [1], an algorithm called PROSPECTOR-PROOF [2], and a novel filter-based monitoring approach called FILA [3] The algorithm FILA, which firstly introduced the concept of 2 International Journal of Distributed Sensor Networks filter, used a query revaluation algorithm and a filter setting algorithm to ensure correctness and validity.However, it costs a lot in the process of filer setting.Recently, Thanh et al. [4] proposed a distributed query revaluation algorithm and a filter setting algorithm based on linear regression model.Liao and Huang [5] proposed an energy efficient monitoring scheme, in which the network is divided into several grids, to decentralize network traffic.Liang et al. [6] took query response time into consideration for the first time and proposed a union query optimization framework to find the optimal query tree, but it has some limitations.Zhu et al. [7] processed Top-k in k-nearest neighbor network and reduced energy consumption by shutting or waking up nodes, yet it is not applicative in updated network.Malhotra et al. [8] advanced an exact filter algorithm to ensure accuracy of query results.All in all, we need to explore a Top-k query algorithm with low energy consumption and high accuracy.
As WSN is a data-centric network, privacy of location and data appears important.The traditional network data is encrypted by key; however, it is not feasible for WSN for high energy consumption.Chen and Lou [9] proposed four models to protect privacy of nodes and sink: Forward Random Walk, Bidirectional Tree, Dynamic Bidirectional Tree, and Zigzag Bidirectional Tree.Jayashree and Sathish Babu [10] raised a privacy protection mechanism STDMA based on topology tree and dispersion mechanism, in order to protect privacy and conserve energy by minimizing the number of intermediate nodes from one given node to the sink.Dimitriou and Sabouri [11] put forward a privacy protection model based on random location and random orientation.Groat et al. [12] concealed sensitive data into a group of fake tables to protect real data, but the process is too complex.
In this paper, we propose a Top-k query algorithm based on data distribution table and filter which can obtain balance between energy consumption and accurate query results.We aim to execute accurate query with lower energy consumption and longer network lifetime.In most Top-k query algorithms, all nodes in the network need to send their sensing data to the sink node in the initial stage, while nodes in our algorithm only send their data distribution tables which occupy just several bits to their father nodes, and then only those nodes determined by the sink node after the distribution phase will send their data.In our algorithm, the number of nodes sending data to the sink node in the initial phase is much smaller than that of all nodes in the network.To reduce energy consumption, we set more restricted filters.We use several sets to reflect different data ranges and then, according to the sets, each node maintains a threshold value which is more restricted.This will avoid many unnecessary update messages.To obtain accurate query results, nodes in the set "Top-k" will always send their new sensing data to the sink node to ensure that sensing data in the set is always real time and accurate.Although the k nodes may cost more energy to some extent, k is so small compared with the number of all nodes, and nodes in the "Top-k" set are changeable, so this kind of cost is acceptable in order to obtain accurate results.To deal with hidden trouble in security, we propose a privacy protection strategy based on the conic section to protect real data.

Top-k Query Algorithm Based on Data Distribution Table and Filter
While doing query, if all sensor nodes in the network participate in comparing or returning, a large amount of redundant information will be generated and that will also lead to great query delay.While that will be much better if the sink node can accurately locate the area of interest and directly obtain data from those regions, in this paper, we try to exactly locate the nodes having influence on query results and then only send query request message (QRM) to those nodes.To realize that, every node in network maintains a copy of DDT, which reflects the distribution information of sensor data.When receiving QRM, every node sends the request to the influential nodes according to query parameter k and the DDT.Furthermore, we use exact filter to avoid unnecessary data transmission and provide real-time and accurate query results.

Description of DDT.
Every sensor node maintains a copy of DDT; the first row of that DDT expresses the distribution information of itself and other rows show the distribution information of its child nodes.The structure of DDT is shown in Figure 1.For instance, assume that DDT is comprised of  rows and  columns, we divide the range of sensing data into  nonoverlapping intervals and Column L means the Lth interval.If the sensing data is in the Lth interval, then value of Column L, Row 1 of the DDT is set to 1 and other columns of Row 1 is set to 0. Intervals can be divided according to data intensive degree and the higher the degree is, the more the intervals are.

Establishment and Maintenance of DDT.
Before query execution, each node in the network needs to initialize its DDT and after the initialization DDTs are aggregated from the leaf nodes to the sink node along the query tree.

Initialization of DDT.
The structure of the network is hierarchical cluster-layer topology architecture and nodes in the network not only sense data but also forward data.After sensing data, each node determines its DDT.For example, if a node's sensing data is in interval   , then Column k, Row 1 of the node's DDT are set to 1 and other columns of the first row are set to 0. Each node sends its DDT to its father node starting from the leaf node.When receiving DDT from child nodes, the father node fills in other rows of its DDT according to DDTs from its child nodes.However, the leaf nodes only maintain the distribution information of themselves and there is only one row in their DDTs.

Establishment of DDT.
Each node sends its DDT to its father node starting from the leaf node finally to the sink node.There are two kinds of child node; one is the leaf node and another is nonleaf node.The leaf nodes send the onerow DDT to their father nodes.While other nodes firstly accumulate values of the same column in different rows of the DDT and thus new DDT with single row is generated, then they send the new DDT to their father nodes.In the new DDT, the value of Column k means the number of nodes whose data is in the kth interval among the tree rooted by the nonleaf node.Figure 2 describes the establishment of DDT. 3 is a nonleaf node and  1 ,  2 are the child nodes of  3 .Data of  1 is in the 6th interval, and that of  2 ,  3 is in the 5th interval, respectively.After initialization,  1 ,  2 send their DDTs to  3 .And then  3 fills in the second and third row of its DDT according to DDTs from  1 ,  2 .Afterwards it accumulates values of the same column and sends the new DDT to its father node  4 .Other nodes will do the same thing until the sink node receives the information.From the aggregation result of  3 , it can be judged that there are 2 nodes located in the 5th interval and 1 node in the 6th interval among the query tree rooted by  3 .

Update of DDT.
Since WSN is a network continuously sensing data with dynamic topology, updating DDT with appropriate frequency is necessary to exactly reflect real situation of the network.Update of DDT plays an important role in query algorithm.
Update of DDT is described as follows.
A node continuously senses  records.Only if the  records are in the same interval but not in the previous interval whose value is 1 in the first row of the node's DDT, value of the previous interval should be set to 0 and that of current interval should be set to 1.If DDT is changed, the node will send the new information to its father node.

QRM Distribution of Query Algorithm.
Each node in the network maintains a copy of DDT which reflects the data distribution information of itself and its child nodes.Take node   , for example; we can obtain two kinds of information from its DDT: (1) number of nodes in different intervals in the query tree rooted by node   via accumulating values of each column; (2) the distribution information of the subtree rooted by node   from other rows except the first row of node   .
QRM dispatches the query from the sink node to the source node by top-down way.Assume that (  ,   ) represents the accumulated value of the th column in DDT of node   . represents the number of intervals and then ∑  =− (  ,   ) represents the number of nodes located in the  highest intervals.According to ∑  =− (  ,   ) and query parameter k, we calculate the minimal  which satisfies the formula: ∑  =− (  ,   ) > .If each value of the  highest intervals of one node's DDT is 0, then nodes in the tree rooted by this node make no influence on query results.When a node finds that its child node makes no influence on query results, it will not send QRM to the child node and its subtree.While receiving QRM, a node will decide whether to send the request to its children according to DDT and if necessary it will send the request to the leaf node.Each node that receives the request and has effect on the results will send its data to the sink along the query tree.

Update of Query Algorithm.
After distributing QRM for the first time, there are sum records sent to the sink.We generate a set Sum containing the sum records.Then we select the k highest records in set Sum into the set Top-k, and others into another set nTop-k.We use set nSum to describe records not in set Sum and it is not the real set saved in the sink node.We use min to reflect the minimum value in set Top-k and low to reflect the boundary value between set Sum and set nSum. Figure 3 shows data distribution of the sets.
As sensing data changes with the environment, it is necessary for nodes to send update information periodically to  obtain exact records.On one hand, there are so many nodes in the network and it is obviously inadvisable for all nodes having effect on query results to send update information.
On the other hand, recently even if filters are used in nodes, the sink node can only obtain approximate data.In this paper, each node has a threshold , and only when the sensing data is higher than the threshold, the node will send update information to the sink node.In addition, we generate different thresholds for different nodes; thus, we can protect the minimum and maximal value and prevent attacker from analyzing characteristic of the whole network.We set threshold  as follows.
(1) For Nodes in Set Top-k,  = −∞.To ensure that the sink node obtains exact data, nodes in set Top-k should send update information the moment they update.
(2) For Nodes in Set nTop-k,  = min.Nodes in set nTopk do not have effect on the final query results, so it is unnecessary for them to send update information only when their values are higher than min.
(3) For Nodes in Set nSum,  = low.Nodes in set nSum do not have effect on the final query results.However, if a node in set nSum jumps into set Sum, it may make difference to final results and it should send update information.
When receiving data from nodes, the sink node can identify the nodes and type of messages.There are five situations.
(1) Update Message (UM) of Nodes in Set Top-k.The sink node receives message and finds out whether the value of the sending node is lower than min.If the value is lower than min, then the sink node deletes the node from set Top-k, puts it into set nTop-k, and then informs the node to change its threshold  into min.
(2) UM of Nodes in Set nTop-k.The sink node receives message, deletes the node from set nTop-k, puts it into set Top-k, and then informs the node to modify its threshold  = −∞.
(3) UM of Nodes Not in Set Sum.The sink node receives message, records the node, and then compares its value with min.If the value is higher than min, then the sink node puts the node into set Top-k and informs the node to update the threshold  = −∞.
Otherwise, the sink node puts it into set nTop-k and then informs the node to update the threshold  = min.
(4) Query Feedback Message (QFM).We will discuss it later in this section.
(5) Inspection Feedback Message (IFM).We will discuss it later in this section.
After a period of time, there may be less or more than k nodes in set Top-k (a set maintaining query results) or less than k nodes in set Sum. Assume that there are k  nodes in set Top-k and sum nodes in set Sum.The sink node periodically compares k with k  and decides what to do as follows.
(1) If k  > k, which means there are more nodes in set Top-k than necessary, in this case, the sink node performs Operation P: recalculating min; generating new sets; sending the new min to nodes in set nTop-k; and substituting the new min for their threshold , meanwhile notifying nodes in set Top-k to set their threshold  = −∞.
( With the execution of the algorithm, nodes in set Topk may jump into set nTop-k, or contrarily, and nodes in set nSum may jump into set Sum.So the values of some nodes in set nTop-k may be less than low; thus there exists some useless nodes in set Sum which leads to false decision by the sink node.To solve this problem, the sink node periodically sends inspection message to nodes in set nTop-k with a parameter "low".The nodes which receive the message will set the threshold  = low and send IFM to the sink node when its data is less than low.If the sink node receives an IFM, it deletes the node from set Sum and meanwhile puts it into set nTop-k.

Privacy Protection Strategy Based on Conic Section
Unlike traditional network, WSN is a network with limited energy, so that traditional key management is difficult to use in WSN.In this paper, the proposed privacy protection strategy based on conic section abandons key encryption method and simplifies the problem to protect the real data.

Description of Privacy Protection.
Before sending sensing data, the nodes repackage the sensing data using privacy protection strategy to prevent attacker from stealing and analyzing real data.Assume that () is privacy protection function and { 1 ,  2 ,  3 , . . .,   } are records of node   in a period.We use privacy protection function to repackage those real data as {( 1 ), ( 2 ), ( 3 ), . . ., (  )}.
To find privacy protection function is extremely significant.On one hand, the privacy protection function should be difficult to recover real data from it.On the other hand, sometimes the sink node does not need to know the values of the real data but the ordering relationship among them.For an instance, users may just want to know the k highest regions rather than the concrete data from those regions and so that there is no need for sink node to obtain real data but just their relationship of sequence.Therefore, the privacy protection function should be monotonically increasing.
The ordering relationship of data after using privacy protection function is the same as that of real data, and if attackers obtain data in network, they will know the relationship of real data.To solve this problem, each node in the network has a value of ID () distributed by the sink node which maintains information form of all nodes' ID.Each node multiplies its  and the packaged data.Take node   as an example; data after transformation is   * {( 1 ), ( 2 ), ( 3 ), . . ., (  )}.
In this paper, we present a normalization method on real data; for example, data   after normalization is    = 2 * ((  − MinValue)/(MaxValue − MinValue)) − 1; MaxValue and MinValue are the given maximal and minimum value of sensing data, respectively.
The specific procedure of a node (take   as an example) sending data to the sink node based on the conic section privacy protection strategy is as follows.
(1) Node   takes normalization on its real records { 1 ,   The sink node keeps a conic volume table, which is composed of all kinds of different volumes obtained from, respectively, cutting the cone by all kinds of different straight lines whose vertical distances to conical bottom center are distinguished from each other.When the sink node receives data from nodes, it firstly computes the privacy protection value using the corresponding ID saved in it and then calculates the normalized data via simulating the straight line cutting the cone; finally it concludes real data through reverse normalization.

Privacy Protection Function. Assume that 𝑝𝑓(𝑥) is privacy protection function and it should satisfies the fact that if
On the basis of that, we propose a privacy function based on conic section.If we use a straight line to cut a cone, the larger the vertical distance between the line and the center of conic bottom is, the smaller the subvolume will be (as shown in Figure 4).That is, along the arrow in Figure 4, the larger the vertical distance between the line and point A, the bigger the volume of the rest part after the line cutting the cone.We assume that the radius of this conic bottom is 1 unit.
We firstly compute the value of the subvolume.Assume that () is the value of the cone's subvolume and  is the vertical distance between center of conic bottom  and the straight line.In Figure 5, ℎ 0 is height of subvolume and ℎ and  are integration variables.
According to geometry and calculus, we can deduce International Journal of Distributed Sensor Networks We may set  =  = 1, and because ℎ 0 =  * ( − )/, so We set 1 − ℎ = ; then  = −ℎ: Furthermore, we can obtain Then, the privacy protection function () can be defined as follows if we suppose the volume of this cone is

Simulation Initialization.
We evaluate the performance of our approach through MATLAB simulations and compare our approach with EXTOK algorithm and FILA algorithm.
In the simulation, all nodes are evenly distributed in a grid area (400 m * 400 m) and the sink node is located in the center.Each node has the given original energy and, when sending data, it subtracts the sending energy from its remaining energy.We use average remaining energy of each node in the network at specific time points to indicate energy consumption and we use the time when half the number of nodes in the network have died to indicate the network lifetime.We use data of artificial sensing temperature, the value scope of which is [−1, 45].

Energy Consumption of Query Algorithm.
Assume that a node may send  bits data in its cycle and transmission distance is ; then energy consumption of sending data is And energy consumption of receiving data is The simulation parameters are set as follows: To evaluate performance of the proposed algorithm in many aspects, we change total number of nodes in network (defined as ) and the probability of data variety (defined as ) and we, respectively, execute Top-k (k = 5, 10, 15, 20, and 25) queries.We assume that  changes from at least 20% to at most 60%.Average results of experiments for 200 times are shown in Figures 6 and 7.
Figures 6(a) and 6(b) show that the Top-k query algorithm in this paper consumes less energy than FILA and EXTOL in the same circumstance.Figure 7(a) displays that, when data of nodes changes frequently, the average remaining energy in our algorithm declines more slowly, which indicates that our proposal performs better than FILA and EXTOK even in the network with frequent update.Figure 7(b) reveals that our algorithm consumes less energy than FILA and EXTOK when executing Top-k with different values of k. Figure 7(c) shows average energy cost of nodes in different algorithms when changing total number of nodes.Our algorithm also performs better than FILA and EXTOK in large-scale network.Those results indicate that our algorithm performs well in energy consumption.

Network
Lifetime of Query Algorithm.The simulation parameters in this section are the same as those in above experiments except for the simulation time.If the remaining energy of a node is less than 0.1% of its original energy, we mark the node as death node.We define the network lifetime as the time when half the number of nodes in the network have died and we define the network size as the total number of sensor nodes.Average results of experiments for 200 times are shown in Figures 8 and 9.
Figure 8 summarizes the lifetime performance of different algorithms.Each point in Figure 8 represents an average network lifetime with different network size.In general, the network lifetime increases linearly as a function of the network size.Figure 8 shows that our algorithm has longer network lifetime than FILA and EXTOK when executing Top-k query in the same network.In addition, the larger the network size is, the better our algorithm performs.The results  imply that our algorithm can efficiently prolong network lifetime.
We calculate the number of nodes remaining in network to illustrate the network lifetime in another way.Average results of experiments for 200 times are shown in Figures 9  and 10.From the figures, we can judge the network lifetime from the downward trend of residual nodes to some degree.
Figure 9 proves that, at the beginning, nodes have enough energy to execute queries, and then more and more nodes become death nodes, while the rate of death of our algorithm is slower than that of FILA and EXTOK.Figure 10(a) shows that our algorithm performs better than FILA and EXTOK when executing Top-k with different values of k. Figure 10(b) displays that our algorithm is also superior to FILA and EXTOK in the network with frequent update.

Security Analysis of the Privacy Protection Strategy
Based on Conic Section.The security of our strategy can be concluded as follows.(1) Even if attackers obtain data transmitting in network, it is difficult for them to compute the real data.The difficulty to decipher the privacy protection function to get real data relies on the complexity of the function.However, the privacy protection function based on conic section is complicated as its formula contains logarithmic function and antitrigonometric function.
(2) The privacy protection function is monotonically increasing and attackers may get the ordering relationship of the real data even if they do not gain them.To solve the problem, ID is used and every node only keeps ID of itself, so the attackers may encounter obstacle by analyzing data packaged with ID value to achieve the sorting sequence.
Energy consumption produced by privacy calculation is negligible compared with that caused by data transmission.In order to analyze the performance of our Top-k query algorithm with privacy protection, we take the quantities of data into consideration.We assume that there are two types of data: the first is simple data that occupy 1 data unit and the second is sensing data that occupy  data units.We assume that there are row child nodes of each nonleaf node in the query tree and so there are row + 1 rows in a DDT.Suppose that there are col columns in a copy of DDT and there are  nodes in the network.And then the depth of the tree is depth = ⌊log (row−1) row + 1⌋.Each node in the network is updated once every  minutes.Table 1 shows the quantities of data in different phases.To simplify the problem, we assume that there are depth hops from each node to the sink node and in fact there are less than depth hops.
In Table 1, we can deduce At the beginning, each node creates its DDT and sends DDT to its father node, where it only sends col (col ≪ ) data units.And then each node sends 2 data units to its child nodes in QRM distribution phase (2 ≪ ).Meanwhile, the number of nodes in set Sum (defined as sum) is much smaller than that in set nSum (defined as -sum) and the bigger  is, the smaller sum is.Obviously, we can conclude that sum *  * depth + 2 * ( − 1) + col *  <  * depth ∑ =1 row −1 ( − 1) (11) according to (10), so data amount of initialization in our algorithm is much less than that in EXTOK.In the update phase, in our algorithm not all nodes need to send update message.And meanwhile, each node has a filter so that only International Journal of Distributed Sensor Networks k nodes have to send data all the time, thus reducing network traffic.Even in the worst situation, only sum *  * depth + 2 * ( − 1) data units are transmitted in network; however, in other algorithms like EXTOK, all nodes in set nSum need to send update message to the sink node in this situation.Therefore, our algorithm greatly reduces amount of data in network on the basis of obtaining exact results.

Conclusions
In this paper, we propose a Top-k query algorithm based on DDT and filter with privacy protection.The sink node distributes QRM according to DDT and only sends the request to nodes having effect to the results.Each node has a filter to avoid unnecessary transmission.In addition, we use privacy protection function to conceal real data.In the future work, we are absorbed on improving the security of privacy protection scheme while satisfactory efficiency guaranteed.

Figure 3 :
Figure 3: Description of sets that sink keeps.

Figure 6 :
Figure 6: Energy cost in network of different algorithms.

Figure 9 :
Figure 9: Number of residual nodes in different algorithms under the same conditions.

Figure 10 : 1 >
Figure 10: Number of residual nodes in different algorithms with different parameters.
When receiving the query value, the node compares it with its sensing data.If the sensing data is higher than or equal to the query value, then it sends QFM to sink node.The sink node receives QFM, deletes the node from set nTop-k, and puts it into set Top-k.Finally, the sink node regenerates set nTop-k and executes Operation P.(b) Consider kk  > m, which means there are not enough nodes in both set Sum and set nSum.While nodes in set nSum are not in order, under this circumstance, the sink node will distribute QRM again and recreate new sets.
If k  > k, which means there are less nodes in set Topk than necessary and kk  nodes are needed, assume  = sumk  , where m means the number of nodes in set nTop-k.There are two cases as follows.(a) Consider kk  =< m, which means there are enough nodes in set Sum to be added into set Top-k.In this situation, kk  nodes in set nTopk are needed.However, data of the kk  highest nodes in set nTop-k is not real time and it is not exact to supplement set Top-k using those nodes.While it is also not efficient for all nodes in set nTop-k to send update data to the sink node.In this paper, we use the value of the (kk  )th node in set nTop-k as a query value.The sink node sends the query value to nodes in set nTop-k.

Table 1 :
Comparison of data amount in network between our algorithm and EXTOK.Sending data to the sink node: less than sum *  * depth Update phase (every  minutes):