A Three-Phase Top-|k| Query Based Distributed Data Collection Scheme in Wireless Sensor Networks

We propose a three-phase top-|k| query based distributed data collection scheme designed for clustered or multisink wireless sensor networks. The proposed scheme consists of a distributed iterative hard thresholding algorithm and a three-phase top-|k| query algorithm. In the distributed iterative hard thresholding algorithm, the cluster heads or sink nodes reconstruct the compressed data in a distributed and cooperative manner. The top-|k| query operation required by this algorithm is realized by the three-phase top-|k| query algorithm, which prunes unnecessary elements among cluster heads or sink nodes. Simulation results show that there is no obvious difference in data reconstruction performance between our proposed scheme and existing compressive sensing theory based data collection schemes. However, both the number of interactions and the amount of transmitted data among cluster heads or sink nodes are effectively reduced in the proposed scheme. The performance of the proposed scheme is analyzed in detail to support these claims.


Introduction
A wireless sensor network (WSN) is a cooperative network consisting of small, battery-operated, energy-constrained wireless sensor nodes. The main applications of WSNs involve monitoring the areas covered by sensor nodes: local surveillance data are collected and transmitted over multiple hops to one or more sink nodes for processing and analysis [1]. WSNs can be deployed to support a range of new applications such as wildlife monitoring, disaster response, military surveillance, smart buildings, and industrial quality control [2].
However, the limited computation, storage, communication bandwidth, and energy resources of wireless sensor nodes make data collection a very challenging issue. Two representative types of data collection schemes that have been proposed for WSNs are spatial-temporal correlation based data prediction schemes [3][4][5] and distributed source coding schemes [6,7]. In the first type of scheme, the sensed data are derived in the server using a data prediction model based on preconstructed spatial and/or temporal correlation, without being transmitted to the sink node through one or more hops. In the second type of scheme, a coding algorithm based on distributed Slepian-Wolf theory is designed to reduce the number of messages transmitted among sensor nodes. However, both types of schemes still suffer from high computational cost in the sensor nodes as well as high communication cost. For example, revised model parameters and coded sensed data need to be transmitted within the network from time to time.
Recently, some new ideas based on the compressive sampling theory have been introduced to solve the data collection problem in WSNs. In the classical Shannon sampling theory, only the limited bandwidth of the signal is utilized, resulting in a high degree of redundancy in the sampled data. The compressive sampling theory aims to lift this limitation by suggesting that a few random linear projections of a sparse or compressible signal contain enough information to reconstruct the original signal [8]. The characteristics of the compressive sampling theory, such as compressibility, asymmetry, versatility, and robustness, make it feasible to design sampling algorithms for WSNs [9]. The compressive sampling theory has only recently started to be applied to the problems of data collection [10], localization [11], channel estimation [12], video streaming [13], and coding [14] in WSNs.

International Journal of Distributed Sensor Networks

In the area of data collection based on the compressive sampling theory, Luo et al. proposed the compressive data gathering (CDG) scheme [15], in which the sensed data are compressed among wireless sensor nodes and reconstructed in the sink node by adopting a chain-type topology and a distributed random coefficient projection method. Wang et al. proposed a data collection scheme based on the adaptive compressive sampling theory [16] by taking advantage of the spatial-temporal inconsistency of the sparsity in the sensed data and by introducing an autoregressive model into the data reconstruction process. Xu et al. proposed a group of compressive sparse functions by utilizing the symmetric and orthogonal properties of the discrete cosine transform (DCT) functions so that sensed data can be reconstructed after the coefficients of the compressive sparse functions are recovered from the partial sensed data [17]. Motivated by chaos technology and the compressive sampling theory, Lu et al. proposed a distributed, secure, and efficient data collection scheme by designing a chaotic sequence based sensing matrix generation algorithm and an active node matrix algorithm [18]. Wu et al. proposed a sparsest random scheduling scheme for compressive data gathering in wireless sensor networks based on a specially designed sparsest measurement matrix [19]. Although the above schemes can perform energy efficient data collection in WSNs, the compressed projections in each sensor node must be transmitted to the sink node in a multihop fashion, and the sensed data can be reconstructed only in the server in a centralized manner. In a clustered or multisink WSN, the cluster heads or sink nodes are not able to obtain the sensed data unless the server retransmits the reconstructed results to them. Such a compression-reconstruction-retransmission process is time consuming and less energy efficient.
In this paper, we propose a three-phase top-|k| query based distributed data collection scheme for clustered or multisink WSNs. A key characteristic of our scheme is that, by collecting the local set of sensed data and exchanging the compressed projections of these data, the cluster heads or sink nodes are able to obtain the sensed data of the whole network without the need for retransmission from the server. Our proposed data collection scheme consists of a distributed iterative hard thresholding algorithm and a three-phase top-|k| query algorithm. With the distributed iterative hard thresholding algorithm, the cluster heads or sink nodes can reconstruct the compressed data in a distributed and cooperative manner. With the three-phase top-|k| query algorithm, the top-|k| query operation in the above algorithm is realized in a distributed manner among cluster heads or sink nodes by pruning unnecessary elements in the data set. We show through experimental results that the proposed data collection scheme requires fewer interactions and a smaller amount of data transmission among cluster heads or sink nodes than existing compressive sensing theory based data collection schemes while achieving the same data reconstruction performance.
The main contributions of this paper are summarized as follows.
(1) We propose a distributed compressive sensing theory based data collection scheme for WSNs that exploits the spatial sparsity of the sensed data. By decomposing the tasks of the iterative hard thresholding algorithm into several correlated parts, the cluster heads or sink nodes can reconstruct the original sensed data in a distributed, energy efficient manner.
(2) We propose a three-phase top-|k| query algorithm to implement the hard thresholding operation $H_k(\cdot)$ in the iterative hard thresholding algorithm in a distributed manner. Unlike the traditional top-k query algorithm, which is only suitable for monotonic aggregation functions such as the sum of nonnegative numbers, the proposed three-phase top-|k| query algorithm can be applied to nonmonotonic aggregation functions. In particular, we use the sum of all numbers as the nonmonotonic aggregation function in this paper.
(3) We compare the performance of the proposed data collection scheme with some current compressive sampling theory based data collection schemes. We also perform a thorough analysis of the overall performance of the proposed scheme, including data reconstruction performance, the amount of transmitted data, and the amount of computed data.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the compressive sampling theory and the iterative hard thresholding algorithm. In Section 3, we introduce the system model. In Section 4, we describe the three-phase top-|k| query based distributed data collection scheme. In Section 5, we evaluate the performance of the proposed scheme. Finally, in Section 6, we conclude this paper and discuss our future research directions.

Compressive Sampling
2.1. The Basics. In the compressive sampling theory, the signal measurement process can be described as

$$y = \Phi x, \tag{1}$$

where $x \in \mathbb{R}^{N}$ is the original signal, $y \in \mathbb{R}^{M}$ is the measurement signal, and $\Phi \in \mathbb{R}^{M \times N}$ ($M < N$) is the measurement matrix. Because $M < N$, the underdetermined system (1) has an infinite number of solutions. In other words, the original signal $x$ cannot be uniquely reconstructed from the measurement signal $y$ alone. However, natural signals are usually sparse in an appropriate transform basis even when they are not sparse in the time domain. Therefore, the original signal $x$ can be expressed as

$$x = \Psi s, \tag{2}$$

where $\Psi \in \mathbb{R}^{N \times N}$ is an orthonormal basis matrix and $s \in \mathbb{R}^{N}$ is a $k$-sparse vector; that is, $\|s\|_0 = k$, where $\|s\|_0$ denotes the number of nonzero elements in $s$.

By plugging (2) into (1), the signal measurement process can be described as

$$y = \Phi \Psi s = A s, \tag{3}$$

where the matrix $A = \Phi \Psi \in \mathbb{R}^{M \times N}$ is usually referred to as the sensing matrix. Although the underdetermined system (1) is ill-posed, its sparsest solution can be obtained by solving the following $\ell_0$ optimization problem:

$$\min_{s} \|s\|_0 \quad \text{s.t.} \quad y = A s. \tag{4}$$

However, the $\ell_0$ optimization problem (4) is NP-hard [20]. In practice, it is usually relaxed to the following $\ell_1$ optimization problem:

$$\min_{s} \|s\|_1 \quad \text{s.t.} \quad y = A s. \tag{5}$$

A key result in the compressive sampling theory states that if the sensing matrix $A$ satisfies the restricted isometry property (RIP) of order $2k$ with restricted isometry constant (RIC) $\delta_{2k} \le \sqrt{2} - 1$, then, for all $k$-sparse signals, the solutions of (4) and (5) are equal as long as the number of measurements satisfies

$$M \ge C \, \mu^{2}(\Phi, \Psi) \, k \log N, \tag{6}$$

where $C$ is a positive constant and $\mu(\Phi, \Psi)$ stands for the mutual coherence between the matrices $\Phi$ and $\Psi$ [21].

The commonly used algorithms for solving the compressive sampling problem include optimization algorithms, greedy algorithms, and iterative thresholding algorithms. In the optimization algorithms, convex optimization methods are adopted to search for the global optimal solution of the optimization problem (5). Representative optimization algorithms include the basis pursuit algorithm [22], the gradient projection algorithm [23], and the interior point algorithm [24]. In the greedy algorithms, the support set of the signal is selected based on a greedy selection criterion. This category includes the matching pursuit algorithm [25], the orthogonal matching pursuit algorithm [26], the regularized orthogonal matching pursuit algorithm [27], the compressive sampling matching pursuit algorithm [28], and the subspace matching pursuit algorithm [29]. In the iterative thresholding algorithms, the optimal sparse solution is obtained through hard or soft thresholding operations over a number of iterations. Typical iterative thresholding algorithms include the iterative hard thresholding algorithm [30], the hard thresholding pursuit algorithm [31], and the soft thresholding algorithm [32].

The Iterative Hard Thresholding Algorithm.
The iterative hard thresholding algorithm belongs to the family of iterative thresholding algorithms. Its core idea is to approximate the inverse of the sensing matrix $A$ by the transpose of that matrix; that is, let

$$A^{-1} \approx A^{T}. \tag{7}$$

By multiplying both sides of the constraint in the $\ell_1$ optimization problem (5) by $A^{T}$, we get

$$A^{T} y = A^{T} A s. \tag{8}$$

Rearranging (8) gives $s = s + A^{T}(y - A s)$, and thus the following iterative equation:

$$s^{(t+1)} = s^{(t)} + A^{T}\left(y - A s^{(t)}\right). \tag{9}$$

In each round of the above iteration, we keep the $k$ largest (in magnitude) elements of the current iterative vector and set the other elements to zero, because the original signal $x$ is $k$-sparse in the orthonormal basis $\Psi$. This selection is performed by the following hard thresholding operator:

$$[H_k(s)]_i = \begin{cases} s_i, & i \in \Gamma_{\text{Top-}|k|}, \\ 0, & \text{otherwise}, \end{cases} \tag{10}$$

where $\Gamma_{\text{Top-}|k|}$ represents the index set of the $k$ largest elements of $s$ in magnitude. Therefore, the iterative equation (9) becomes

$$s^{(t+1)} = H_k\left(s^{(t)} + A^{T}\left(y - A s^{(t)}\right)\right). \tag{11}$$

After a number of iterations, the results of the iterative hard thresholding algorithm converge to the optimized solution of the $\ell_1$ optimization problem (5).
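The iteration described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `hard_threshold` and `iht`, the unit step size, the fixed iteration count, and the use of plain Python lists are all assumptions made here for readability.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def hard_threshold(v: Vector, k: int) -> Vector:
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def iht(A: Matrix, y: Vector, k: int, iterations: int = 50) -> Vector:
    """Iterate s <- H_k(s + A^T (y - A s)), starting from the zero vector."""
    n = len(A[0])
    s = [0.0] * n
    for _ in range(iterations):
        # Residual of the current estimate: r = y - A s
        r = [y_m - sum(a * b for a, b in zip(row, s)) for row, y_m in zip(A, y)]
        # Back-projection through the transpose: g = A^T r
        g = [sum(A[m][i] * r[m] for m in range(len(A))) for i in range(n)]
        s = hard_threshold([s_i + g_i for s_i, g_i in zip(s, g)], k)
    return s


# Hypothetical toy input: with an identity sensing matrix the iteration
# simply keeps the largest-magnitude measurement.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
estimate = iht(identity, [0.0, 5.0, -1.0], k=1)
```

With a unit step size, the iteration is usually applied to a sensing matrix scaled so that its largest singular value is below one; for other matrices a step size would have to be added to the update.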

The System Model
Consider a static WSN in a two-dimensional space as shown in Figure 1, where a number of wireless sensor nodes are distributed uniformly at random in the surveillance area. An agent node in the figure can be either a cluster head or a sink node. We assume that $J$ agent nodes and $N$ sensor nodes have been deployed in the WSN and that all agent nodes know their own locations. The agent node located closest to the center of the network is appointed as the administrator agent node, and the other agent nodes are appointed as member agent nodes.
The sensed data in a WSN at a certain point in time constitute a $k$-sparse vector $x \in \mathbb{R}^{N}$ because there is spatial correlation among the surveillance data in the monitored area. Each agent node $j \in \{1, \ldots, J\}$ takes $M_j$ linear measurements $y_j \in \mathbb{R}^{M_j \times 1}$ of $x$ using its subsensing matrix $A_j \in \mathbb{R}^{M_j \times N}$. Thus, the set of all agent nodes equivalently obtains $M = M_1 + M_2 + \cdots + M_J$ linear measurements $y \in \mathbb{R}^{M \times 1}$ using the global sensing matrix $A \in \mathbb{R}^{M \times N}$, where

$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_J \end{bmatrix}, \qquad A = \begin{bmatrix} A_1 \\ \vdots \\ A_J \end{bmatrix}. \tag{12}$$

Therefore, the problem of sensed data collection in the set of all agent nodes can be ascribed to the $\ell_1$ optimization problem (5), with $x$ in the role of the sparse vector. By reconstructing the original sensed data vector in a distributed manner, all agent nodes can obtain the sensed data of the whole sensor network.
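The equivalence between the per-agent measurements and one global measurement can be checked with a short sketch. The helper names `measure` and `stack`, the toy data vector, and the small subsensing matrices below are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def measure(A: Matrix, x: Vector) -> Vector:
    """Linear measurement y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def stack(blocks: List[Tuple[Matrix, Vector]]) -> Tuple[Matrix, Vector]:
    """Stack per-agent (A_j, y_j) pairs row-wise into a global (A, y)."""
    A = [row for A_j, _ in blocks for row in A_j]
    y = [value for _, y_j in blocks for value in y_j]
    return A, y


# Hypothetical 2-sparse data vector observed by two agents.
x = [0.0, 2.0, 0.0, -1.0]
sub_matrices = [
    [[1.0, 0.0, 1.0, 0.0]],                        # agent 1: one measurement
    [[0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]],  # agent 2: two measurements
]
blocks = [(A_j, measure(A_j, x)) for A_j in sub_matrices]
A, y = stack(blocks)
```

Here the stacked vector `y` has one entry per local measurement and equals `measure(A, x)`, which is the sense in which the agents jointly hold a single global measurement.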

The Distributed Data Collection Scheme
The three-phase top-|k| query based distributed data collection scheme consists of two correlated algorithms: the distributed iterative hard thresholding algorithm and the three-phase top-|k| query algorithm. The first algorithm implements the iterative hard thresholding algorithm in a distributed and cooperative manner, while the second algorithm provides the distributed version of the top-|k| operator $H_k(\cdot)$ in (10) for the first algorithm.

The Distributed Iterative Hard Thresholding Algorithm.
First, the initial value of the sensed data vector $x^{(0)}$ and the iteration counter $t$ at each agent node are set to the zero vector $\mathbf{0}$ and 0, respectively. Then, in each iteration $t$, each agent node $j \in \{1, \ldots, J\}$ calculates the intermediate result $z_j^{(t)}$ using its own subsensing matrix $A_j$ and measurement projection $y_j$ by executing

$$z_j^{(t)} = \begin{cases} x^{(t)} + A_1^{T}\left(y_1 - A_1 x^{(t)}\right), & j = 1, \\ A_j^{T}\left(y_j - A_j x^{(t)}\right), & j = 2, \ldots, J. \end{cases} \tag{13}$$

In particular, the administrator agent node (indexed as $j = 1$) calculates $x^{(t)} + A_1^{T}(y_1 - A_1 x^{(t)})$ and every member agent node $j$ calculates $A_j^{T}(y_j - A_j x^{(t)})$. Then, all the agent nodes execute the three-phase top-|k| query algorithm described in the next subsection, and the administrator agent node gets

$$x^{(t+1)} = H_k\left(\sum_{j=1}^{J} z_j^{(t)}\right). \tag{14}$$

Finally, the administrator agent node sends the recovered vector $x^{(t+1)}$ to each member agent node, and all the agent nodes continue to the next round of iteration. We show through Theorem 1 that the iterative results (14) in the distributed iterative hard thresholding algorithm are equal to the iterative results (11) in the centralized iterative hard thresholding algorithm. Therefore, after a number of iterations, the intermediate result $x^{(t+1)}$ converges to the optimized solution of the $\ell_1$ optimization problem (5).
Theorem 1. The intermediate result $x^{(t+1)}$ in the centralized iterative hard thresholding algorithm is the same as that in the distributed iterative hard thresholding algorithm.
Proof. According to (13) and (14), the intermediate result $x^{(t+1)}$ in the distributed iterative hard thresholding algorithm is

$$x^{(t+1)} = H_k\left(\sum_{j=1}^{J} z_j^{(t)}\right) = H_k\left(x^{(t)} + \sum_{j=1}^{J} A_j^{T}\left(y_j - A_j x^{(t)}\right)\right) = H_k\left(x^{(t)} + A^{T}\left(y - A x^{(t)}\right)\right),$$

where the last equality follows from the block structure of $y$ and $A$ in (12). This is exactly the centralized iteration (11). Therefore, the iterative results in the distributed iterative hard thresholding algorithm are equal to those in the centralized iterative hard thresholding algorithm.
The pseudocode of the distributed iterative hard thresholding algorithm is presented in Algorithm 1.
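As a complement to the pseudocode, the following Python sketch mirrors the steps described in this subsection and numerically checks the claim of Theorem 1 on toy data. It is an illustration under stated assumptions, not the paper's Algorithm 1: the function names, the treatment of the first block as the administrator agent node, and the toy matrices are hypothetical, and the three-phase top-|k| query is replaced by a plain centralized hard thresholding call.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def hard_threshold(v: Vector, k: int) -> Vector:
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def back_project(A_j: Matrix, y_j: Vector, x: Vector) -> Vector:
    """Return A_j^T (y_j - A_j x) for one block of measurements."""
    r = [y_m - sum(a * b for a, b in zip(row, x)) for row, y_m in zip(A_j, y_j)]
    return [sum(A_j[m][i] * r[m] for m in range(len(A_j))) for i in range(len(x))]


def distributed_iht(blocks: List[Tuple[Matrix, Vector]], k: int, n: int,
                    iterations: int) -> Vector:
    """Each agent computes its intermediate result; the sum is thresholded."""
    x = [0.0] * n
    for _ in range(iterations):
        partials = []
        for j, (A_j, y_j) in enumerate(blocks):
            z_j = back_project(A_j, y_j, x)
            if j == 0:  # assumed administrator: also adds the current estimate
                z_j = [x_i + z_i for x_i, z_i in zip(x, z_j)]
            partials.append(z_j)
        total = [sum(column) for column in zip(*partials)]
        # Stand-in for the distributed three-phase top-|k| query.
        x = hard_threshold(total, k)
    return x


def centralized_iht(A: Matrix, y: Vector, k: int, iterations: int) -> Vector:
    """Reference iteration on the stacked system."""
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        g = back_project(A, y, x)
        x = hard_threshold([x_i + g_i for x_i, g_i in zip(x, g)], k)
    return x


# Hypothetical toy system: two agents measuring the 1-sparse vector [2, 0, 0, 0].
blocks = [
    ([[0.5, 0.5, 0.0, 0.0]], [1.0]),
    ([[0.0, 0.0, 0.5, 0.5], [0.5, -0.5, 0.0, 0.0]], [0.0, 1.0]),
]
A = [row for A_j, _ in blocks for row in A_j]
y = [value for _, y_j in blocks for value in y_j]
x_dist = distributed_iht(blocks, k=1, n=4, iterations=20)
x_cent = centralized_iht(A, y, k=1, iterations=20)
```

On this toy system both variants produce the same iterates, and the single nonzero entry approaches 2 as the iterations proceed.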

Three-Phase Top-|k| Query Algorithm.
The three-phase top-|k| query algorithm is proposed to realize the hard thresholding operator $H_k(\cdot)$ in a distributed manner. Unlike the classical top-k query algorithm, which is only suitable for monotonic aggregation functions, our proposed algorithm can select the $k$ elements with the largest absolute values, which is a nonmonotonic aggregation function, in a distributed way. The basic idea of this algorithm is to prune unnecessary elements in the distributed system by exchanging only partial elements between the administrator agent node and the member agent nodes in three phases.
(1) The initial data set selection phase: in this phase, the initial data set $D$ is selected and distributed among the agent nodes.
(2) The candidate data set selection phase: in this phase, the candidate set $C$ is built by computing the upper bounds and lower bounds of the sums for the top-|k| elements.
(3) The top-|k| elements selection phase: in this phase, the precise sums for the top-|k| elements are finally determined based on the candidate data set $C$.
In each round of iteration in the distributed iterative hard thresholding algorithm, every agent node $j$ obtains its own intermediate result $z_j^{(t)}$ through (13). Before executing the three-phase top-|k| query algorithm, every agent node $j$ sorts $z_j^{(t)}$ by value in descending order and gets $L_j = \{(i, v_j(i)),\ i = 1, \ldots, n_j\}$, where $i$ is the element index, $v_j(i)$ is the corresponding element value, and $n_j$ is the number of elements in agent node $j$.
In the three-phase top-|k| query algorithm, the whole sum $S(i)$, the partial sum $S_p(i)$, the upper bound of the whole sum $S_u(i)$, and the lower bound of the whole sum $S_l(i)$ for each element $i$ are computed according to the following formulas:

$$S(i) = \sum_{j=1}^{J} v_j(i), \qquad S_p(i) = \sum_{j=1}^{J} v'_j(i), \qquad S_u(i) = \sum_{j=1}^{J} u_j(i), \qquad S_l(i) = \sum_{j=1}^{J} l_j(i),$$

with

$$v'_j(i) = \begin{cases} v_j(i), & \text{if agent node } j \text{ has reported element } i, \\ 0, & \text{otherwise}, \end{cases} \quad u_j(i) = \begin{cases} v_j(i), & \text{if reported}, \\ \tau_j^{+}, & \text{otherwise}, \end{cases} \quad l_j(i) = \begin{cases} v_j(i), & \text{if reported}, \\ \tau_j^{-}, & \text{otherwise}, \end{cases}$$

where $\tau_j^{+}$ and $\tau_j^{-}$ are the smallest positive local value and the largest negative local value in $L_j$ among all elements in the initial data set $D$, respectively. If an element in $D$ does not appear in $L_j$, its value is simply set to 0. The partial sum $S_p(i)$ is used as a rough estimate of the sum of a top-|k| element. The upper bound $S_u(i)$ and the lower bound $S_l(i)$ of the whole sum are used to estimate the range of the sum of a top-|k| element. By comparing the partial sum with the upper bound and the lower bound of the whole sum, we can filter out unnecessary elements and construct the candidate set $C$. Thus, the amount of data transmitted among agent nodes can be effectively reduced. Finally, the whole sum $S(i)$ is used to compute the sum of each top-|k| element in the candidate set $C$.
The three-phase top-|k| query algorithm is presented in Algorithm 2, where it is worth noting that the administrator agent node and the member agent nodes all follow the same element selection rules.
The interactive process between the administrator agent node and the member agent nodes in one iteration of the proposed scheme is shown in Figure 2. Although the three-phase top-|k| query algorithm includes three phases, its computational complexity is low. By analyzing the algorithm, we can see that sorting and summation are its two nontrivial parts. If the number of involved elements in each agent is $n$, then the computational complexities of sorting and summation are $O(n \lg n)$ and $O(n)$, respectively. Therefore, the computational complexity of the three-phase top-|k| query algorithm is $O(n \lg n)$. The communication complexity of the three-phase top-|k| query algorithm is analyzed through simulations in Section 5.
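The three phases can be illustrated with the following Python sketch. It is not a transcription of Algorithm 2: the reporting rule (each agent sends its $k$ largest and $k$ smallest signed values), the pruning threshold `tau`, the fallback for elements that no agent reported, and all identifiers are assumptions chosen here so that the pruning is provably safe for signed values.

```python
from typing import Dict, List, Set, Tuple


def three_phase_top_k(local: List[List[float]], k: int) -> Tuple[List[int], Set[int]]:
    """Find the k indices with the largest |sum over agents|, with pruning.

    local[j][i] is agent j's value for element i. Returns the selected
    indices and the candidate set whose exact sums had to be collected.
    Assumes k >= 1.
    """
    n = len(local[0])

    # Phase 1 (initial data set): every agent reports its k largest and its
    # k smallest signed values; anything it did not report lies in [lo, hi].
    reports: List[Dict[int, float]] = []
    hi: List[float] = []
    lo: List[float] = []
    for values in local:
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        top, bottom = order[:k], order[-k:]
        reports.append({i: values[i] for i in top + bottom})
        hi.append(values[top[-1]])
        lo.append(values[bottom[0]])
    seen: Set[int] = set().union(*reports)

    # Phase 2 (candidate set): bound each whole sum, then prune.
    least: Dict[int, float] = {}  # guaranteed minimum of |whole sum|
    most: Dict[int, float] = {}   # guaranteed maximum of |whole sum|
    for i in seen:
        upper = sum(r.get(i, h) for r, h in zip(reports, hi))
        lower = sum(r.get(i, l) for r, l in zip(reports, lo))
        least[i] = max(lower, -upper, 0.0)
        most[i] = max(abs(lower), abs(upper))
    ranked = sorted(least.values(), reverse=True)
    tau = ranked[k - 1] if len(ranked) >= k else 0.0
    candidates = {i for i in seen if most[i] >= tau}
    if max(abs(sum(lo)), abs(sum(hi))) >= tau:
        # Elements reported by no agent cannot be ruled out: keep them all.
        candidates |= set(range(n)) - seen

    # Phase 3 (top-|k| selection): exact sums for the candidates only.
    exact = {i: sum(values[i] for values in local) for i in candidates}
    top_k = sorted(exact, key=lambda i: abs(exact[i]), reverse=True)[:k]
    return top_k, candidates


# Hypothetical intermediate results of three agents over six elements.
agents = [
    [5.0, -4.0, 0.5, 0.0, 0.5, -0.5],
    [4.0, -5.0, -0.5, 0.5, 0.0, 0.0],
    [6.0, -3.0, 0.0, -0.5, 0.5, 0.5],
]
selected, candidate_set = three_phase_top_k(agents, k=2)
```

In this toy run only two of the six elements survive pruning, so exact sums are exchanged for those two elements alone. The sorting step dominates the local cost, in line with the $O(n \lg n)$ estimate above.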

An Example of the Three-Phase Top-|k| Query Algorithm.
We now describe the three-phase top-|k| query algorithm

Experiment and Analysis
We investigate the performance of the proposed three-phase top-|k| query based distributed data collection scheme through simulation under various conditions. The performance metrics that we evaluate include data reconstruction accuracy, communication complexity, and computational complexity. The simulation is carried out using Matlab R2011b on a Dell desktop computer with a dual-core E7500 CPU and 3.0 GB of memory and involves 50 to 200 wireless sensor nodes randomly distributed in a 400 × 400 m² area. Multiple uncorrelated two-dimensional Gaussian distributions are used to simulate the spatially correlated data source. The number of agent nodes is 4 to 6, and a Gaussian random matrix is selected as the sensing matrix. Finally, the simulation is repeated 500 times and the final results are the averages over the 500 runs.

Performance Analysis.
We now analyze the performance of the three-phase top-|| query based distributed data collection scheme.Included in our performance analysis are data reconstruction, the amount of transmitted data, and that of computed data.The amount of transmitted data and that of computed data are the basis for measuring the communication and computational complexity of the scheme in which we count the accumulated amount of transmitted and computed data in all the three phases of the proposed scheme.
Figures 3, 4, and 5 show the signal-to-noise ratio (SNR) of the reconstructed data, the amount of transmitted data, and the amount of computed data of the proposed scheme, respectively, at different data sampling rates. The SNR is computed using the following formula:

$$\mathrm{SNR} = 20 \log_{10} \frac{\|x\|_2}{\|x - \hat{x}\|_2}$$

to measure the data reconstruction error, where $x \in \mathbb{R}^{N}$ is the original data and $\hat{x} \in \mathbb{R}^{N}$ is the reconstructed data. We can conclude from the figures that data reconstruction performance improves as the data sampling rate increases. At the same time, the amount of transmitted data and the amount of computed data also increase. The reason is that a higher data sampling rate causes more data to be transmitted and computed while leading to better data reconstruction.
Figures 6, 7, and 8 show the SNR of the reconstructed data, the amount of transmitted data, and the amount of computed data of the proposed scheme, respectively, in different surveillance environments. In the experiment, the number of random data sources varies between 2 and 4. We can conclude from the figures that the data reconstruction performance of the proposed scheme degrades when the number of data sources increases. At the same time, the amount of transmitted data and the amount of computed data still increase. The reason is that an increase in the number of data sources increases the sparsity level (the number of nonzero elements) of the sensed data set in the network, which degrades data reconstruction performance and causes more data to be transmitted and computed in the agent nodes.
Figures 9, 10, and 11 show the SNR of the reconstructed data, the amount of transmitted data, and the amount of computed data of the proposed scheme, respectively, with different numbers of agent nodes. In the experiment, the number of agent nodes varies between 4 and 6. We can see from the figures that data reconstruction performance shows no obvious difference when the number of agent nodes changes. However, the amount of transmitted data and the amount of computed data increase with the number of agent nodes. The reason is that the number of agent nodes has little influence on data reconstruction performance, whereas more data must be transmitted and computed when more agent nodes are involved in the scheme.
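The reconstruction metric used throughout this subsection can be computed with a small helper. The sketch below assumes the common definition of SNR as $20 \log_{10}(\|x\|_2 / \|x - \hat{x}\|_2)$ in decibels; the function name `snr_db` and the convention of returning infinity for a perfect reconstruction are assumptions made here.

```python
import math
from typing import Sequence


def snr_db(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """SNR in dB between original data x and reconstructed data x_hat."""
    signal = math.sqrt(sum(v * v for v in x))
    error = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    if error == 0.0:
        return math.inf  # perfect reconstruction (assumed convention)
    return 20.0 * math.log10(signal / error)


# Hypothetical example: a reconstruction whose error norm is one tenth of
# the signal norm.
example = snr_db([3.0, 4.0], [3.0, 3.5])
```

A higher value indicates a smaller reconstruction error relative to the energy of the original data, which is how the curves in the SNR figures should be read.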

Performance Comparison.
We compare the performance on data reconstruction between our proposed algorithm and two similar schemes, that is, the iterative hard thresholding (IHT) algorithm [30] and the absolute threshold algorithm (ATA) based distributed data recovery algorithm [33], in this subsection.
In the ATA based distributed data recovery algorithm, the top-|k| elements are selected by computing the absolute thresholds for each element in the list $L_j$ in a distributed manner. The algorithm processes the sorted lists from both the top and the bottom in one single instance. In each direction, the administrator agent node requests an (index, value) pair from each member agent node $j$ and updates the corresponding top-|k| sums it has received so far for agent node $j$. Meanwhile, the administrator agent node also records the largest value $m_j$ for each agent node $j$. As soon as the administrator agent node has received $k$ values that are larger than $\sum_{j=1}^{J} m_j$, the algorithm terminates. However, multiple interactions among agent nodes are needed, and the exact number of interactions cannot be determined in advance. In addition, interactions among agent nodes are not very efficient because only the values that correspond to the same index are transmitted in an interaction.
Figure 12 shows the performances on data reconstruction of the three algorithms in which the number of agent nodes is 5, the number of data sources is 3, and the data sampling rate is 50%.We can see from the figure that the three algorithms exhibit little difference in the performance of data reconstruction.
Figures 13 and 14 show the number of interactions and the amount of transmitted data of the three algorithms, respectively. Since IHT is not a distributed algorithm, all the data in the member agent nodes need to be transmitted to the administrator agent node for processing. Thus, the number of interactions is 1. However, since no data pruning is performed before data are transmitted to the administrator agent node, the amount of transmitted data is significantly larger than that of the other two algorithms. Conversely, since ATA and our proposed algorithm are distributed, multiple interactions are needed. We can see clearly from Figure 13 that the number of interactions in ATA increases with the number of sensor nodes. However, the number of interactions in our proposed algorithm always stays at 3, a fixed number that is generally far smaller than that in ATA. Therefore, the time delay of executing our proposed algorithm is less than that of ATA. Meanwhile, we can see from Figure 14 that the total amount of transmitted data of our proposed algorithm is the smallest among the three algorithms, so it consumes the least energy.
In summary, the above experiment and performance analysis show that our proposed three-phase top-|k| query based distributed data collection scheme can effectively reduce the amount of transmitted data by limiting the number of interactions while maintaining data reconstruction performance similar to that of comparable schemes.

Conclusion
In this paper, we proposed the three-phase top-|k| query based distributed data collection scheme designed for clustered or multisink WSNs. In the scheme, the cluster heads or sink nodes can reconstruct the sensed data of the whole network by collecting local information and exchanging intermediate results among them. The distributed hard thresholding operator is realized by pruning unnecessary elements between the administrator agent node and the member agent nodes in three phases.
The data reconstruction performance and the overhead incurred by the proposed scheme were analyzed through experiment and compared with some existing compressive sampling theory based data collection schemes to demonstrate its advantages. The computation and communication overheads of the proposed scheme were also analyzed in detail. Our future research includes the design of compressive sampling theory based data collection schemes that utilize the inherent spatial and temporal correlations in the sensed data.

Figure 2: The process of interaction.

Figure 3: The performance on data reconstruction at different data sampling rates.

Figure 4: The amount of transmitted data at different data sampling rates.

Figure 5: The amount of computed data at different data sampling rates.

Figure 6: The performance on data reconstruction with different numbers of data sources.

Figure 7: The amount of transmitted data with different numbers of data sources.

Figure 8: The amount of computed data with different numbers of data sources.

Figure 9: The performance on data reconstruction with different numbers of agent nodes.

Figure 10: The amount of transmitted data with different numbers of agent nodes.

Figure 11: The amount of computed data with different numbers of agent nodes.

Figure 12: The comparison of data reconstruction performance.

Figure 13: The number of interactions of the three algorithms.

Figure 14: The amount of transmitted data of the three algorithms.

Table 2: The computed results.