Unbalanced Threshold Based Distributed Data Collection Scheme in Multisink Wireless Sensor Networks

In multisink wireless sensor networks, synchronized data collection among multiple sinks is a significant and challenging task. In this paper, we propose an unbalanced threshold based distributed data collection scheme to reconstruct the synchronized sensed data of the whole sensor network in all sinks. The proposed scheme includes the unbalanced threshold based distributed top-K query algorithm and the distributed iterative hard thresholding algorithm. By computing unbalanced thresholds and pruning unnecessary element exchanges, each sink can synchronize the top-K aggregated values efficiently via the unbalanced threshold based distributed top-K query algorithm. After that, the synchronized sensed data of the whole sensor network can be reconstructed through the distributed iterative hard thresholding algorithm in a distributed and cooperative manner. We show through experiments that the proposed scheme reduces the interaction times, the number of transmitted data, and the number of computed data compared to the existing schemes while maintaining similar data reconstruction accuracy. The communication and computational performances of the proposed scheme are also analyzed in detail in the paper.


Introduction
A wireless sensor network (WSN) is an autonomous wireless network consisting of a large number of tiny, inexpensive, and spatially distributed wireless sensor nodes. It has emerged as one of the most promising technologies for the future [1]. The typical applications of WSN include environment monitoring, industrial process control, intelligent transportation, military surveillance, and health care monitoring [2].
However, wireless sensor nodes have severe resource constraints, which include limited computational, memory, and communication capacities and a nonrenewable energy supply. Therefore, data collection schemes in wireless sensor networks should be lightweight and energy efficient. Two representative types of data collection schemes in wireless sensor networks are spatial-temporal correlation based data prediction schemes [3][4][5] and distributed source coding schemes [6][7][8].
In the first type of data collection scheme, a series of spatial-temporal correlation based data prediction algorithms are adopted to prolong the system lifetime by enabling the sink to predict the sensed data based on some historical samples.
In the second type of data collection scheme, distributed source coding techniques, such as Slepian-Wolf coding and Wyner-Ziv coding, are designed to compress multiple correlated sensor data without the need for communication among sensor nodes. Nevertheless, the computational and communication costs of the above two types of data collection schemes are still high. Meanwhile, the parameters of the prediction model and the distributed coded data are still required to be computed and transmitted within the network frequently.
Recently, compressive sensing (CS) theory has attracted considerable attention in areas of signal processing, computer science, and applied mathematics [9]. It is based on the principle that a sparse signal can be reconstructed from far fewer samples than required by the classic Shannon-Nyquist sampling theory. Through nonlinear optimization, we can recover many natural signals which can be represented by a few nonzero coefficients under a suitable basis [10]. The inherent characteristics, such as compressibility, robustness, versatility, and stability, have made compressive sensing theory widely applied to communication [11], photography [12], magnetic resonance imaging [13], astronomy [14], and so forth.
As for the data collection problem in wireless sensor networks, a number of compressive sensing based schemes have been proposed in the literature. In 2009, Luo et al. proposed the first complete compressive sensing based data gathering (CDG) scheme for large scale wireless sensor networks [15]. By adopting a weighted sum of all the readings along the routing tree, the CDG scheme can achieve substantial communication cost reduction and balanced energy consumption simultaneously. In order to achieve both energy efficiency and recovery fidelity, Xiang et al. proposed the Compressive Data Aggregation (CDA) scheme in WSN [16]. After abstracting the minimum energy compressed data aggregation problem as an NP-complete problem, they solved it using a mixed integer programming formulation. Moreover, they also proposed the Dual-lEvel Compressed Aggregation (DECA) framework to recover the in-network aggregated data in the widespread monitoring area by utilizing both the low-rank nature of real-world events and the redundancy in sensed data [17]. By exploiting the spatial and temporal correlation among sensed data, Xu et al. proposed the Spatio-Temporal Hierarchical Data Aggregation Scheme Using Compressive Sensing (ST-HDACS) in [18]. After collecting the sensed data from a subset of randomly selected sensor nodes through a hierarchical routing structure, the fusion center recovers the whole sensed data by solving a matrix completion problem. Furthermore, Tang et al. proposed a robust compressive data gathering scheme to identify outlying sensor readings, derive the corresponding accurate values, and further infer broken links in the sensor network [19]. Motivated by the fact that the variance among all the solutions of the blind iterative hard thresholding algorithm with different sparsity levels can indicate the best sparsity level, Xiong and Tang proposed the blind 1-bit compressive sensing reconstruction algorithm for wireless sensor networks [20]. Similar research works include the sparse random scheduling based data gathering scheme [21], the Resource Efficient Data Gathering (REDG) scheme [22], the Compressive Sensing Based Data Aggregation Scheme (CSDAS) [23], and the distributed multichain based compressive sensing scheme [24].
Although the above schemes can achieve energy-efficient data collection based on compressive sensing theory, they can only be applied to single-sink wireless sensor networks in a centralized manner. In the scenario of multisink wireless sensor networks, compressed data should be transmitted to one specific sink before executing the data reconstruction algorithm. Then, the recovered data should be distributed back to the other sinks through unicast or broadcast messages. Obviously, this kind of centralized pattern is less energy efficient and more time consuming. In this paper, we focus on designing a compressive sensing based distributed data collection scheme for multisink wireless sensor networks. The proposed scheme includes two correlated algorithms: the unbalanced threshold based distributed top-|K| query algorithm and the distributed iterative hard thresholding algorithm. In the first algorithm, the largest $K$ absolute aggregated values of a series of object-value pairs are queried in a distributed and energy efficient way. By computing unbalanced thresholds, a number of unnecessary object-value pair exchanges are pruned in the query process. In the second algorithm, the iterative hard thresholding algorithm [25] is modified to fit the distributed environment. Meanwhile, the first algorithm is integrated into the second one as a subroutine to realize the $H_K(\cdot)$ operator. Our experimental results indicate that the proposed scheme reduces the interaction times, the number of transmitted data, and the number of computed data compared to the existing schemes while maintaining similar data reconstruction accuracy.
The contributions of this paper are threefold.
(1) We propose an unbalanced threshold based distributed data collection scheme in multisink wireless sensor networks based on compressive sensing theory. By decomposing the task of sparse data reconstruction into several correlated parts, all sinks can obtain the synchronized sensed data of the whole network in a distributed and cooperative manner.
(2) We propose an unbalanced threshold based distributed top-|K| query algorithm to query the top-|K| aggregated values required by the iterative hard thresholding algorithm in the distributed environment. By adopting unbalanced thresholds, the algorithm can avoid transmitting unnecessary object-value pairs among sinks. Furthermore, the correctness of the proposed algorithm is also proved in this paper.
(3) We analyze the data reconstruction performance, communication complexity, and computational complexity of the proposed scheme through experiments. Furthermore, we also compare the performance of the proposed scheme with existing compressive sensing based data collection schemes.
The rest of this paper is organized as follows. Section 2 introduces the basic principles of compressive sensing theory and the iterative hard thresholding algorithm. Sections 3 and 4 describe the unbalanced threshold based distributed top-|K| query algorithm and the distributed iterative hard thresholding algorithm, respectively. Section 5 presents the experimental results and analysis. Finally, Section 6 concludes this paper.

Preliminaries
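Compressive sensing theory asserts that a sparse signal can be recovered from fewer linear and nonadaptive measurements than the number of samples required by the Shannon-Nyquist sampling theorem. The compressive sampling process can be presented as
$$y = \Phi x + e,$$
where $x$ is the sparse unknown signal, $\Phi$ is the measurement matrix, $e$ is the noise, and $y$ is the measurement signal. When a signal is sparse only under a basis $\Psi$, the matrix $A = \Phi\Psi$ is usually referred to as the sensing matrix.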
One of the most important properties of the sensing matrix is the Restricted Isometry Property (RIP). Formally, a matrix $A$ is said to satisfy the RIP of order $K$ with a Restricted Isometry Constant (RIC) $\delta_K$ if
$$(1-\delta_K)\|x\|_2^2 \le \|Ax\|_2^2 \le (1+\delta_K)\|x\|_2^2$$
holds for all $K$-sparse signals $x$. Baraniuk et al. have proved that a random matrix $A \in \mathbb{R}^{M \times N}$ with i.i.d. entries drawn from the normal distribution $\mathcal{N}(0, 1/M)$ satisfies the RIP with high probability when $M \ge cK\log(N/K)$, where $c$ is a constant [26].
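As a rough numerical illustration of the property above (not a verification of the RIP, which must hold for all $K$-sparse signals at once), the following sketch draws a Gaussian matrix with entries from $\mathcal{N}(0, 1/M)$ and measures the energy ratio $\|Ax\|_2^2 / \|x\|_2^2$ for a single randomly chosen $K$-sparse signal. The function names, the dimensions, and the fixed seed are illustrative assumptions and do not come from the paper.

```python
import math
import random


def gaussian_sensing_matrix(m, n, rng):
    """Return an m x n matrix (list of rows) with i.i.d. N(0, 1/m) entries."""
    sigma = math.sqrt(1.0 / m)
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(m)]


def random_sparse_signal(n, k, rng):
    """Return a length-n signal with exactly k nonzero entries."""
    x = [0.0] * n
    for idx in rng.sample(range(n), k):
        # Nonzero magnitudes in [1, 2] with a random sign (arbitrary choice).
        x[idx] = rng.choice([-1.0, 1.0]) * rng.uniform(1.0, 2.0)
    return x


def energy_ratio(a, x):
    """Compute ||Ax||_2^2 / ||x||_2^2."""
    ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]
    return sum(v * v for v in ax) / sum(v * v for v in x)


rng = random.Random(0)      # fixed seed (assumption) for repeatability
M, N, K = 200, 400, 5       # illustrative sizes only
A = gaussian_sensing_matrix(M, N, rng)
x = random_sparse_signal(N, K, rng)
ratio = energy_ratio(A, x)  # stays near 1 when A roughly preserves energy
```

A ratio near 1 for one signal is only consistent with a small $\delta_K$; it says nothing about the worst case over all sparse signals.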
The Basis Pursuit De-Noise (BPDN) [27] and the Least Absolute Shrinkage and Selection Operator (LASSO) [28] are two typical signal recovery frameworks which are widely used in compressive sensing theory. In the BPDN framework, the $\ell_1$ norm of the unknown signal is minimized under the constraint of limited energy of reconstruction error; that is,
$$\min_x \|x\|_1 \quad \text{s.t.} \quad \|y - Ax\|_2^2 \le \epsilon,$$
where $\epsilon$ bounds the energy of the reconstruction error. In contrast, the energy of reconstruction error is minimized in the LASSO framework under the constraint of limited $\ell_1$ norm of the unknown signal $x$; that is,
$$\min_x \|y - Ax\|_2^2 \quad \text{s.t.} \quad \|x\|_1 \le \tau,$$
where $\tau$ bounds the $\ell_1$ norm. Essentially, the BPDN framework and the LASSO framework can both be transformed into the following unconstrained regularized optimization framework:
$$\min_x \frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1, \tag{6}$$
where $\lambda$ is the regularization parameter.
A number of data reconstruction algorithms have been proposed in the literature. In general, they can be classified into two categories: optimization algorithms and greedy algorithms. In the optimization algorithms, a number of convex optimization methods have been adopted to search for the optimal solution of (6). The Basis Pursuit (BP) algorithm [27], the Interior Point (IP) algorithm [29], and the homotopy algorithm [30] are representatives of this type of reconstruction algorithm. In contrast, greedy algorithms iteratively select the support sets or the elements of the reconstructed signal based on a certain greedy selection strategy. The Orthogonal Matching Pursuit (OMP) algorithm [31], the Regularized Orthogonal Matching Pursuit (ROMP) algorithm [32], the Compressive Sampling Matching Pursuit (CoSaMP) algorithm [33], the Subspace Pursuit (SP) algorithm [34], the iterative hard thresholding (IHT) algorithm [25], and the Iterative Soft Thresholding (IST) algorithm [35] are representatives of greedy algorithms.
Among all proposed data reconstruction algorithms, the iterative hard thresholding algorithm is very easy to implement and can be relatively fast. Meanwhile, it also possesses strong theoretical recovery performance guarantees comparable to those of convex optimization based algorithms. Essentially, it can be regarded as the iterative method
$$x^{(t+1)} = H_K\left(x^{(t)} + \mu A^{T}\left(y - Ax^{(t)}\right)\right), \tag{7}$$
where $H_K(\cdot)$ is the hard thresholding operator that sets all but the $K$ largest elements in magnitude to zero and $\mu$ is the step size. Formally, we can represent the hard thresholding operator $H_K(x)$ as
$$\left(H_K(x)\right)_i = \begin{cases} x_i, & x_i \in x_{\text{top-}|K|}, \\ 0, & \text{otherwise}, \end{cases}$$
where $x_{\text{top-}|K|}$ represents the largest $K$ elements of signal $x$ in magnitude. We will propose a distributed implementation of the hard thresholding operator $H_K(\cdot)$ in the next section and integrate it into the distributed iterative hard thresholding algorithm, which is described in Section 4.
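The centralized iteration described above can be sketched in a few lines. This is a minimal illustration in pure Python, assuming dense matrices stored as lists of rows; the function names, the stopping tolerance `tol`, and the iteration cap `max_iter` are assumptions of this sketch rather than details taken from the paper.

```python
def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    if k <= 0:
        return [0.0] * len(x)
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def rmatvec(a, r):
    """Compute A^T r for a matrix stored as a list of rows."""
    return [sum(a[i][j] * r[i] for i in range(len(a))) for j in range(len(a[0]))]


def iht(a, y, k, mu, max_iter=100, tol=1e-9):
    """Iterate x <- H_K(x + mu * A^T (y - A x)) starting from the zero vector."""
    x = [0.0] * len(a[0])
    for _ in range(max_iter):
        residual = [y_i - v for y_i, v in zip(y, matvec(a, x))]
        grad = rmatvec(a, residual)
        x_new = hard_threshold([x_j + mu * g for x_j, g in zip(x, grad)], k)
        # Stop when two consecutive estimates are (almost) identical.
        if sum((p - q) ** 2 for p, q in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x


# Toy usage: with an identity sensing matrix the 1-sparse signal is recovered.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
recovered = iht(identity, [0.0, 2.0, 0.0], k=1, mu=1.0)
```

Ties between equal magnitudes are broken by index order here, which is one arbitrary but deterministic choice.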

The Unbalanced Threshold Based Distributed Top-|𝐾| Query Algorithm
In recent years, the top-K query, also known as the ranking aware query, has received much attention in the areas of relational database systems, content distribution networks, multimedia retrieval systems, and so forth. However, only monotonic aggregation functions can be applied to the top-K query. The query for the $K$ largest absolute elements in the hard thresholding operator is nonmonotonic. Therefore, the existing top-K query algorithms, such as the threshold algorithm (TA) [36] and the three-phase uniform threshold (TPUT) algorithm [37], cannot be applied to the distributed IHT algorithm directly. We will describe our proposed unbalanced threshold based distributed top-|K| query algorithm in detail in Section 3.1 and illustrate it with an example in Section 3.2.

3.1. Algorithm Design.
We assume there are $P$ nodes in the distributed system and each node is assigned an ID. The node with the lowest ID is designated as the administrator node and the other nodes are designated as the member nodes. Each node $i$ maintains a descending ordered list $\mathcal{L}_i$ of its object-value pairs $(O_j, V_i(O_j))$, where the index $j$ ranges from 1 to the element number $n_i$. Note that if an object does not appear in the list, its value is set to zero by default. We will select the $K$ largest sums in magnitude of all the objects in the distributed system. The core idea of the unbalanced threshold based distributed top-|K| query algorithm is to filter out unnecessary exchanges of elements among distributed nodes by adopting unbalanced thresholds. It includes three phases, that is, the unbalanced threshold computing phase, the candidate set computing phase, and the top-|K| elements computing phase.
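To make the target of the query concrete, the sketch below computes the top-|K| aggregated sums in the naive centralized way, that is, by collecting every object-value pair first. It serves only as a reference result that a distributed query should reproduce; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def aggregate(node_lists):
    """Sum the values of each object over all nodes; absent objects count as 0."""
    totals = defaultdict(float)
    for pairs in node_lists:
        for obj, value in pairs:
            totals[obj] += value
    return dict(totals)


def top_abs_k(totals, k):
    """Return the k (object, sum) pairs with the largest |sum|."""
    ranked = sorted(totals.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical object-value pairs; each list is in descending order of value.
nodes = [
    [("a", 5.0), ("b", 1.0), ("c", -4.0)],   # administrator node
    [("b", 3.0), ("a", -4.5), ("c", -3.0)],  # member node
]
result = top_abs_k(aggregate(nodes), 2)
```

In this toy data, object `a` holds the largest local value at the administrator node but almost cancels out in the aggregate, which illustrates why ranking by magnitude of a signed sum is nonmonotonic.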
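In the unbalanced threshold computing phase, each member node sends its first $K$ positive elements and last $K$ negative elements to the administrator node. Then, the administrator node can compute the partial sum $S(O)$ for every received element $O$ according to
$$S(O) = \sum_{i=1}^{P} V_i'(O), \qquad V_i'(O) = \begin{cases} V_i(O), & \text{if } O \text{ has been received from node } i, \\ 0, & \text{otherwise}, \end{cases} \tag{9}$$
and select the $K$th largest positive partial sum $\tau_1^{+}$ and the $K$th smallest negative partial sum $\tau_1^{-}$. From these two partial sums, it derives a positive threshold $T_i^{+}$ and a negative threshold $T_i^{-}$ for each node $i$.
In the candidate set computing phase, each member node $i$ sends its unsent object-value pairs $(O_j, V_i(O_j))$ to the administrator node when $V_i(O_j) \ge T_i^{+}$ or $V_i(O_j) \le T_i^{-}$. Then, the administrator node computes the partial sums according to (9) and selects the $K$th largest positive partial sum $\tau_2^{+}$ and the $K$th smallest negative partial sum $\tau_2^{-}$ once again. Furthermore, it computes the upper bound $S^{\mathrm{up}}(O)$ and the lower bound $S^{\mathrm{low}}(O)$ of the whole sum for every received object $O$ according to
$$S^{\mathrm{up}}(O) = \sum_{i=1}^{P} U_i(O), \qquad U_i(O) = \begin{cases} V_i(O), & \text{if } O \text{ has been received from node } i, \\ T_i^{+}, & \text{otherwise}, \end{cases}$$
$$S^{\mathrm{low}}(O) = \sum_{i=1}^{P} D_i(O), \qquad D_i(O) = \begin{cases} V_i(O), & \text{if } O \text{ has been received from node } i, \\ T_i^{-}, & \text{otherwise}, \end{cases}$$
respectively. The upper bound and lower bound are used to estimate the maximal range of each object. If $S^{\mathrm{up}}(O) < \tau_2^{+}$ or $S^{\mathrm{low}}(O) > \tau_2^{-}$, we can guarantee that the whole sum of object $O$ is not in the set of top-|K| aggregated sums. Hence, we can exclude object $O$ from the candidate set $C$ with confidence.
In the top-|K| elements computing phase, each member node sends its unsent object-value pairs $(O_j, V_i(O_j))$ in the candidate set $C$ to the administrator node. Then, the administrator node can compute the whole sum for each object according to
$$V(O) = \sum_{i=1}^{P} V_i(O) \tag{12}$$
and select the top-|K| aggregated sums.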
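The bookkeeping performed by the administrator node in the phases above can be sketched as follows. This is an illustrative fragment, not the paper's Algorithm 1: it covers only the element selection of phase I, the partial sums, the $K$th extreme partial sums, and the bounds of the whole sum. The per-node thresholds are passed in as given numbers because the weighting rule that produces them is not reproduced here. All function names, the fallback value 0.0 when fewer than $K$ sums have the required sign, and the sample data are assumptions.

```python
def phase_one_elements(pairs, k):
    """First k positive and last k negative pairs of a descending list (k >= 1)."""
    positives = [p for p in pairs if p[1] > 0][:k]
    negatives = [p for p in pairs if p[1] < 0][-k:]
    return positives + negatives


def partial_sums(received):
    """received: {node_id: {object: value}} -> {object: partial sum}."""
    sums = {}
    for values in received.values():
        for obj, value in values.items():
            sums[obj] = sums.get(obj, 0.0) + value
    return sums


def kth_extremes(sums, k):
    """Return (k-th largest positive sum, k-th smallest negative sum).

    Falls back to 0.0 on a side with fewer than k sums (sketch assumption).
    """
    pos = sorted((s for s in sums.values() if s > 0), reverse=True)
    neg = sorted(s for s in sums.values() if s < 0)
    tau_pos = pos[k - 1] if len(pos) >= k else 0.0
    tau_neg = neg[k - 1] if len(neg) >= k else 0.0
    return tau_pos, tau_neg


def sum_bounds(obj, received, thresholds):
    """Upper and lower bound of the whole sum of obj.

    thresholds: {node_id: (t_pos, t_neg)}. A node that has not sent obj
    contributes t_pos to the upper bound and t_neg to the lower bound.
    """
    upper = lower = 0.0
    for node_id, (t_pos, t_neg) in thresholds.items():
        values = received.get(node_id, {})
        if obj in values:
            upper += values[obj]
            lower += values[obj]
        else:
            upper += t_pos
            lower += t_neg
    return upper, lower


# Hypothetical state at the administrator node after phase I with K = 1.
received = {1: {"a": 5.0, "c": -4.0}, 2: {"b": 3.0, "c": -3.0}}
sums = partial_sums(received)
tau_pos, tau_neg = kth_extremes(sums, 1)
bounds_a = sum_bounds("a", received, {1: (2.0, -2.0), 2: (3.0, -5.0)})
```

Keeping the thresholds as an explicit input makes it easy to compare one uniform threshold against node-specific ones on the same data.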
The pseudocode of the unbalanced threshold based distributed top-|K| query algorithm is presented in Algorithm 1.
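Lemma 1. The unbalanced thresholds computed in phase I can guarantee that the true top-|K| objects are among the objects in phase II.
Proof. Assume that $O$ is an object which is not sent to the administrator node in phase II. Then its value $V_i(O)$ in node $i$ is less than the positive threshold $T_i^{+}$ and greater than the negative threshold $T_i^{-}$ of that node at the same time; that is, $T_i^{-} < V_i(O) < T_i^{+}$. Then the sum of the values of object $O$ can be bounded by
$$\sum_{i=1}^{P} T_i^{-} < \sum_{i=1}^{P} V_i(O) < \sum_{i=1}^{P} T_i^{+}. \tag{13}$$
Since the thresholds are obtained by dividing the $K$th largest positive partial sum $\tau_1^{+}$ and the $K$th smallest negative partial sum $\tau_1^{-}$ of phase I among the $P$ nodes, that is, $\sum_{i=1}^{P} T_i^{+} = \tau_1^{+}$ and $\sum_{i=1}^{P} T_i^{-} = \tau_1^{-}$, (13) can be rewritten as
$$\tau_1^{-} < \sum_{i=1}^{P} V_i(O) < \tau_1^{+}.$$
That is, object $O$ cannot be a top-|K| object. Therefore, the true top-|K| objects must be among the objects in phase II.
3.2. Example. We illustrate the proposed algorithm with an example in this subsection. There are 3 nodes in the distributed system, one administrator node and two member nodes. The object-value pairs in each node are presented in Table 1.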
The total number of transmitted object-value pairs in our proposed unbalanced threshold based distributed top-|K| query algorithm is 11, and two interactions are required between each member node and the administrator node. As for the modified TA based algorithm [38] and the modified TPUT based algorithm [39], the total number of transmitted object-value pairs is 16 and 14, respectively. Meanwhile, the interaction times of these two algorithms are 8 and 3, respectively. We can see that both the interaction times and the number of transmitted data of our proposed scheme are the lowest among these three algorithms. The performance comparison of these algorithms will be discussed in detail in Section 5.

The Distributed Iterative Hard Thresholding Algorithm
4.1. Network Model. We assume that $P$ sinks and $N$ sensor nodes are deployed randomly and uniformly in the surveillance area in the multisink wireless sensor network. Each sink is assigned an ID ranging from 1 to $P$ and each sensor node is assigned an ID ranging from 1 to $N$. The sink with the lowest ID 1 is designated as the administrator sink and the other sinks are designated as member sinks. We assume that an effective multisink routing protocol, such as the dynamic traffic-aware routing protocol [40] or the scalable gradient-based routing protocol [41], is deployed in the network. Each sensor node can transmit its sensed data to the nearest sink in a multihop manner. The network model is shown in Figure 1.
In wireless sensor networks, the spatial correlation among sensed data in the surveillance area implies that any sensed data $x \in \mathbb{R}^{N}$ at a certain time point is of $K$ sparsity. Each sink $i$ takes $M_i$ samples of $x$ using its subsensing matrix $A_i \in \mathbb{R}^{M_i \times N}$ and gets $y_i \in \mathbb{R}^{M_i \times 1}$; that is, $y_i = A_i x + e_i$, where $e_i \in \mathbb{R}^{M_i \times 1}$ is noise. Hence, the set of $P$ sinks will obtain $M = \sum_{i=1}^{P} M_i$ samples $y = [y_1^{T}, \ldots, y_P^{T}]^{T} \in \mathbb{R}^{M \times 1}$ using the global sensing matrix $A = [A_1^{T}, \ldots, A_P^{T}]^{T} \in \mathbb{R}^{M \times N}$. Therefore, the data collection and synchronization among multiple sinks can be ascribed to the compressive sensing problem (6). We will design a distributed iterative hard thresholding algorithm to recover the sensed data $x$ at all sinks in a cooperative manner.
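The relation between the local samples of each sink and the stacked global system can be checked with a small sketch. It assumes noise-free measurements by default and uses tiny hand-written matrices; the helper names and all numbers are illustrative assumptions.

```python
def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def local_samples(sub_matrices, x, noises=None):
    """Return the list of local sample vectors y_i = A_i x + e_i, one per sink."""
    samples = []
    for i, a_i in enumerate(sub_matrices):
        y_i = matvec(a_i, x)
        if noises is not None:
            y_i = [v + e for v, e in zip(y_i, noises[i])]
        samples.append(y_i)
    return samples


def stack_rows(blocks):
    """Stack the blocks vertically: [A_1; ...; A_P]."""
    return [row for block in blocks for row in block]


# Two hypothetical sinks with M_1 = 1 and M_2 = 2 samples of a length-3 signal.
sub_matrices = [
    [[1.0, 0.0, 2.0]],
    [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]],
]
x = [1.0, 0.0, -1.0]
sub_samples = local_samples(sub_matrices, x)
A = stack_rows(sub_matrices)                  # global sensing matrix
y = [v for y_i in sub_samples for v in y_i]   # global sample vector
```

Stacking the local samples gives the same vector as applying the stacked matrix to the signal, which is what allows the reconstruction to be split across sinks.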

4.2. Algorithm Design.
After setting the initial value of the sensed data $x^{(0)}$ and the iteration counter $t$ to the $N \times 1$ zero vector $\mathbf{0}$ and 0, respectively, each sink computes the step size $\mu$ from a prescribed quantile of the Tracy-Widom law of order 1, whose cumulative distribution function is denoted by $F$, together with two auxiliary parameters. According to random matrix theory, the step size $\mu$ is tight when the entries of the sensing matrix $A$ are drawn from the normal distribution $\mathcal{N}(0, 1/M)$ [39].
After that, each sink computes its intermediate result and executes the top-|K| query by running the unbalanced threshold based distributed top-|K| query algorithm. The administrator sink will obtain the $(t+1)$th reconstructed result $x^{(t+1)}$ and distribute it to the other sinks. The above iteration continues until the difference between two consecutive reconstructed results is less than a predefined error threshold.
The pseudocode of the distributed iterative hard thresholding algorithm is presented in Algorithm 2.

Since the distributed top-|K| query algorithm returns the same top-|K| aggregated values as the centralized hard thresholding operator, we can assure that the iterative computation in the distributed iterative hard thresholding algorithm equals that in the centralized iterative hard thresholding algorithm (7). Therefore, the recovered sensed data of these two algorithms are identical.
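The equivalence claimed above rests on the identity $A^{T}(y - Ax) = \sum_{i} A_i^{T}(y_i - A_i x)$ for row-stacked $A$ and $y$. The sketch below compares one distributed step with one centralized step on toy data. It assumes that each sink contributes the local term $\mu A_i^{T}(y_i - A_i x^{(t)})$, which is one natural splitting consistent with (7) but is not confirmed as the paper's exact intermediate result, and it performs the top-|K| selection centrally as a stand-in for the distributed query. All names and numbers are hypothetical.

```python
def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(p * q for p, q in zip(row, x)) for row in a]


def rmatvec(a, r):
    """Compute A^T r for a matrix stored as a list of rows."""
    return [sum(a[i][j] * r[i] for i in range(len(a))) for j in range(len(a[0]))]


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def local_update(a_i, y_i, x, mu):
    """Local term of one sink: mu * A_i^T (y_i - A_i x)."""
    residual = [y - v for y, v in zip(y_i, matvec(a_i, x))]
    return [mu * g for g in rmatvec(a_i, residual)]


def distributed_step(sub_matrices, sub_samples, x, mu, k):
    """Sum the local terms onto x, then select the top-|K| entries.

    The selection is done centrally here; in the scheme it would be carried
    out by the distributed top-|K| query.
    """
    total = list(x)
    for a_i, y_i in zip(sub_matrices, sub_samples):
        z_i = local_update(a_i, y_i, x, mu)
        total = [t + z for t, z in zip(total, z_i)]
    return hard_threshold(total, k)


def centralized_step(a, y, x, mu, k):
    """One step of the centralized iteration x <- H_K(x + mu * A^T (y - A x))."""
    residual = [y_j - v for y_j, v in zip(y, matvec(a, x))]
    grad = rmatvec(a, residual)
    return hard_threshold([x_j + mu * g for x_j, g in zip(x, grad)], k)


sub_matrices = [[[1.0, 0.0, 2.0]], [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]]
sub_samples = [[-1.0], [0.0, 0.0]]
A = [row for block in sub_matrices for row in block]
y = [v for y_i in sub_samples for v in y_i]
x0 = [0.0, 0.0, 0.0]
x_dist = distributed_step(sub_matrices, sub_samples, x0, mu=0.5, k=1)
x_cent = centralized_step(A, y, x0, mu=0.5, k=1)
```

On this data both steps give the same estimate; in floating point the two summation orders can differ by rounding in general, so a tolerance is advisable when comparing them on larger problems.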

Experiment and Analysis
We evaluate the performance of our proposed unbalanced threshold based distributed data collection scheme in this section. The simulation is carried out using Matlab R2014b and Wislab simulator on a MacBook Pro laptop computer with a dual core i7 CPU and 8 GB memory. There are 50 to 200 wireless sensor nodes and 4 to 6 sinks randomly deployed in a 400 m × 400 m surveillance area. Multiple uncorrelated two-dimensional Gaussian distributions have been superposed to simulate the spatially correlated data sources. Note that the Gaussian random matrix is used as the measurement matrix. We will investigate the data reconstruction accuracy, communication complexity, and computational complexity of our proposed scheme. The signal-to-noise ratio (SNR) is used to measure the data reconstruction accuracy,
$$\mathrm{SNR} = 20\log_{10}\frac{\|x\|_2}{\|x - \hat{x}\|_2},$$
where $x$ and $\hat{x}$ are the sensed data and the reconstructed data, respectively. The number of transmitted data is the number of accumulated data involved in the communication between the administrator sink and the member sinks. The number of computed data is the number of accumulated data involved in the computation in the administrator sink and the member sinks. These two indicators are used to measure the communication complexity and computational complexity, respectively.
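The accuracy metric can be computed as in the sketch below, which assumes the usual definition of the reconstruction SNR in decibels, $20\log_{10}(\|x\|_2 / \|x - \hat{x}\|_2)$. The function name and the handling of a perfect reconstruction (returning infinity) are choices of this sketch.

```python
import math


def snr_db(x, x_hat):
    """Reconstruction SNR in dB: 20 * log10(||x||_2 / ||x - x_hat||_2)."""
    signal = math.sqrt(sum(v * v for v in x))
    error = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    if error == 0.0:
        return math.inf  # perfect reconstruction
    return 20.0 * math.log10(signal / error)


# A signal of norm 5 reconstructed with an error of norm 0.5.
example_snr = snr_db([3.0, 4.0], [3.0, 3.5])
```

A tenfold ratio between the signal norm and the error norm corresponds to 20 dB.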
5.1. Performance Analysis. Figures 2, 3, and 4 show the SNR of reconstructed data, the number of transmitted data, and the number of computed data, respectively, at different data sampling rates. We can conclude that the data reconstruction performance improves as the data sampling rate increases. Meanwhile, the number of transmitted data and that of computed data increase too. The reason is that a higher data sampling rate causes more data to be transmitted and computed. Therefore, the recovered sensed data will be more accurate.
Figures 5, 6, and 7 show the SNR of reconstructed data, the number of transmitted data, and the number of computed data, respectively, in different surveillance scenarios. The number of uncorrelated two-dimensional Gaussian data sources ranges from 2 to 4. We can conclude that the data reconstruction performance degrades as the number of uncorrelated data sources increases. Meanwhile, the number of transmitted data and that of computed data increase at the same time. The reason is that a larger number of uncorrelated data sources increases the sparsity level (the number of nonzero coefficients) of the sensed data within the network. Therefore, the data reconstruction performance degrades and more data are required to be transmitted and computed.
Figures 8, 9, and 10 show the SNR of reconstructed data, the number of transmitted data, and the number of computed data, respectively, for different numbers of sinks. We can conclude that there are no remarkable differences in the data reconstruction performance for different numbers of sinks. However, the number of transmitted data and that of computed data increase with the number of sinks. The reason is that a larger number of sinks has no influence on the data reconstruction performance and only requires more data to be transmitted within the network and computed in the sinks.

5.2. Performance Comparison. We will compare the performance of our proposed scheme with the original iterative hard thresholding algorithm [25], the modified threshold algorithm based iterative hard thresholding scheme [38], and the modified three-phase uniform threshold based iterative hard thresholding scheme [39] in this subsection. Here, the last two schemes are referred to as the modified TA based DIHT scheme and the modified TPUT based DIHT scheme, respectively.
Figure 11 shows the comparison of the SNR of these four schemes when the data sampling rate is 50%, the number of data sources is 3, and the number of sinks is 5. We can conclude that there are no remarkable differences in the data reconstruction performance of these schemes. In other words, they provide similar data recovery accuracy regardless of whether the reconstruction pattern is centralized or distributed.
International Journal of Distributed Sensor Networks
Figures 12 and 13 show the number of transmitted data and interaction times of these four schemes in the same experiment settings as in Figure 11. Since the iterative hard thresholding algorithm is a centralized data reconstruction algorithm, all data in the member sinks should be transmitted to the administrator sink for reconstruction and then the recovered data would be sent back to the member sinks for synchronization. Therefore, the number of transmitted data of the IHT algorithm is much higher than that of the other three schemes although only one interaction is required.
The interaction times of the modified TA based DIHT scheme are higher than those of the other three algorithms since only one object-value pair is queried in each interaction during the top-|K| query process. Meanwhile, the number of transmitted data is relatively high since this algorithm has no efficient data pruning technique.
Our proposed unbalanced threshold based distributed data collection scheme requires the fewest data transmissions among these four schemes. By adopting unbalanced thresholds, it can adjust the query thresholds adaptively and avoid redundant transmissions between the administrator sink and the member sinks. Sometimes, all elements in the candidate set $C$ can be obtained in advance in phase II instead of in phase III since reasonable thresholds are computed in the unbalanced threshold based distributed top-|K| query algorithm. The example illustrated in Section 3.2 has shown this property. Therefore, the interaction times of our proposed scheme are slightly lower than those of the modified TPUT based DIHT scheme. The price for the decrease in the number of transmitted data and interaction times is only the computation of unbalanced thresholds, which is negligible compared with the communication energy consumption in wireless sensor nodes.

Conclusion
In this paper, we proposed the unbalanced threshold based distributed data collection scheme designed for multisink wireless sensor networks. All sinks can obtain the synchronized sensed data of the whole network through distributed sparse signal reconstruction. Meanwhile, we also designed a distributed top-|K| query algorithm to reduce the number of transmitted data between the administrator sink and the member sinks by pruning unnecessary element exchanges.
The data reconstruction accuracy, communication complexity, and computational complexity of the proposed scheme were analyzed in detail through experiments. Furthermore, we also compared the performance of the proposed scheme with existing centralized and distributed data collection schemes in wireless sensor networks. In the future, we would like to design efficient data collection schemes by utilizing the spatial and temporal correlations among sensed data simultaneously.

Figure 2: The SNR of reconstructed data at different data sampling rates.

Figure 4: The number of computed data at different data sampling rates.

Figure 6: The number of transmitted data at different number of data sources.

Figure 7: The number of computed data at different number of data sources.

Figure 8: The SNR of reconstructed data at different number of sinks.

Figure 10: The number of computed data at different number of sinks.

Figure 11: The SNR of reconstructed data of different schemes.

Figure 12: The number of transmitted data of different schemes (IHT algorithm, modified TA based DIHT scheme, modified TPUT based DIHT scheme, and the proposed scheme).

Figure 13: The interaction times of different schemes.
where $x \in \mathbb{R}^{N}$ is a $K$-sparse unknown signal; that is, there are at most $K$ nonzero elements in $x$. $\Phi \in \mathbb{R}^{M \times N}$ is the measurement matrix, $e \in \mathbb{R}^{M}$ is the unknown noise, and $y \in \mathbb{R}^{M}$ is the measurement signal. Although most natural signals are not sparse directly, they can be transformed into sparse signals. In other words, a signal can be represented
The partial sum $S(O)$ is used to estimate the sum of top-|K| elements approximately. In order to deal with unbalanced value distribution among different nodes and avoid redundant element exchanges between the administrator node and the member nodes, we compute different thresholds for each node instead of fixing one identical threshold for all nodes. Based on the computed positive weight $w_i^{+}$ and negative weight $w_i^{-}$, the partial sums $\tau_1^{+}$ and $\tau_1^{-}$ are divided into $P$ different thresholds $T_i^{+}$ and $T_i^{-}$, respectively.

Table 1: An example with 3 nodes.