ID List Forwarding Free Confidentiality Preserving Data Aggregation for Wireless Sensor Networks

Wireless sensor networks (WSNs) are composed of sensor nodes with limited energy that is difficult to replenish. Data aggregation reduces communication overhead through in-network processing, thus minimizing energy consumption and maximizing network lifetime. However, it also makes data confidentiality harder to protect. Many existing confidentiality preserving aggregation protocols have to transfer a list of sensor IDs so that the base station can tell explicitly which sensor nodes have actually provided measurements. Forwarding a large number of node IDs brings substantial extra communication overhead. In this paper, we propose a provably secure aggregation scheme, the perturbation-based efficient confidentiality preserving protocol (PEC2P), which allows efficient aggregation of perturbed data without transferring any ID information. In general, environmental data is confined to a certain range; we exploit this feature and design an algorithm that lets the powerful base station retrieve the IDs of the reporting nodes. We analyze the accuracy of PEC2P and conclude that the base station can retrieve the sum of environmental data with overwhelming probability. We also prove that PEC2P is CPA secure by security reduction. Experimental results demonstrate that PEC2P significantly reduces node congestion (especially for the root node) during the aggregation process in comparison with existing protocols.


Introduction
Wireless sensor networks (WSNs) integrate microelectromechanical systems (MEMS) technology, sensor technology, and communication technology. A WSN can sense, transport, and process different environmental data in its deployment area using hundreds of sensor nodes with limited computation and energy capacities. WSNs have been extensively used in military surveillance, environmental monitoring, production control, and real-time traffic monitoring [1].
Because WSNs are usually deployed in remote, unattended, or even hostile environments, the energy of sensor nodes is not easy to replenish. Hence, reducing energy cost and prolonging network lifetime have become key issues for WSNs [2,3]. It is generally believed that the power consumption of each sensor node tends to be dominated by data transmission. According to [4], the energy cost of transmitting a single bit of data is equivalent to that of executing 800 instructions. Data aggregation [2,5] mechanisms avoid transmitting raw environmental data by summarizing and combining sensor data in-network, thus reducing the amount of data transmitted and effectively extending network lifetime.
Data confidentiality [6][7][8][9][10][11] is crucial in many WSN applications, such as military surveillance. If data confidentiality is compromised, the sensitive information collected will be leaked to the adversary. However, there is a conflict between data aggregation and data confidentiality protocols [12]: data aggregation prefers to operate on plain data, whereas confidentiality protection requires data to be encrypted. Extensive research on secure data aggregation [6][7][8][9][10][11][13][14][15] has been conducted. Data aggregation protocols usually cannot operate on encrypted data, so an intermediate node has to decrypt the packets received from downstream, aggregate the plaintext data with its own, encrypt the aggregated result, and forward it upstream.
Two common approaches to preserving data confidentiality without per-hop decryption/encryption are homomorphic encryption [7,8,10] and secret perturbation [6,9,11]. Homomorphic encryption is an encryption transformation that allows direct computation on encrypted data. However, the end-to-end security of symmetric homomorphism [7] is easily compromised if any node is corrupted, and the computational cost and communication overhead of asymmetric homomorphism [8,10] are undesirable. In comparison, secret perturbation-based schemes add a perturbation to the value of each reporting node using a secret key shared with the base station (BS). BS retrieves the final aggregation result by removing all these perturbations. Since the key shared between each node and BS is unique, an adversary cannot compute other nodes' sensed data or intermediate aggregation results if one key is compromised.
BS has to know which sensor nodes have provided measurements before it can correctly remove the perturbations added by these sensor nodes. A straightforward solution is to require every node that participates (or every node that does not participate) in the aggregation process to report its ID, depending on the proportion of nodes satisfying BS's query.
However, this approach may bring high extra overhead. Feng et al. [9] proposed a family of secret perturbation-based schemes that protect sensor data confidentiality while trying to minimize the number of IDs to be transferred. In the FSP scheme, every sensor node must reply with a perturbed actual or dummy data item, whether or not the node has satisfying data. BS simply subtracts the hash value of every sensor node to compute the final aggregation result, so the communication overhead caused by ID transmission is avoided. However, FSP requires all sensor nodes to report data regardless of whether they have data satisfying the query. This may result in high extra communication overhead when only a small number of sensor nodes have data to report, and the overhead caused by the extra perturbed data can be much larger than that of forwarding IDs. Hence, in their ideal scheme O-ASP, an aggregating node first has to determine which is larger: the overhead of transmitting IDs together with perturbed data, or the overhead of transmitting all perturbed data. Either way, O-ASP incurs high communication overhead, and it is unrealistic to assume that each sensor node knows the membership and topology of the whole network and knows whether each of these nodes has data satisfying each particular query.
Moreover, the transmission of node IDs makes [9,11] unsuitable for the scenario shown in Figure 1, where we want to monitor the activities of tanks and battleships and the aggregation result has to travel a long path before it reaches BS. To achieve this goal, a cluster or a tree of sensor nodes is deployed in the battlefield, while BS sits in a secure location far away and collects the data reported by the sensors. All data has to be forwarded along a long path from the targeted area to BS. In [9,11], the ID list is transmitted along with the data, so energy is wasted on the long path, and a "single point of failure" could occur if there are not enough nodes on such a path. Application scenarios in military surveillance also include the case in which the US Army uses REMBASS to collect data (such as ground motion, sound, infrared, and magnetic fields) and forward the aggregation result to a command center. PEC2P fits this scenario and places no requirement on the type of data.
In this paper, we present the perturbation-based efficient confidentiality preserving protocol (PEC2P), which protects data confidentiality without transmitting any ID information. We use the output of a one-way hash function as the perturbation added to the environmental data. Since BS usually has powerful computational capability in WSNs, we propose to trade computation at BS for the energy cost of sensors, and we introduce a new approach that lets BS compute which nodes have actually sensed data and contributed to the aggregation process after it receives the final aggregation result. Our approach is particularly suited to scenarios where the aggregation result has to travel a long path before arriving at BS. In summary, the contributions of this paper include the following.
(1) We draw attention to the ID-list transmission problem in WSNs and propose the first approach that does not require forwarding any node ID; instead, the reporting nodes are computed and selected by BS. As a result, communication overhead is reduced and the identity of the reporting nodes is further hidden.
(2) We avoid using a random number verified by the commonly applied authenticated broadcasting, thus reducing network delay. Instead, we update the secret key of all reporting nodes after each data aggregation to maintain indistinguishability against the adversary.
(3) We prove that our protocol is CPA secure by security reduction.
(4) We measure the performance of our protocol through both theoretical analysis and experiments on TinyOS [16]. We analyze the accuracy of PEC2P and compare its communication overhead with that of existing protocols.

Related Work
Girao et al. proposed CDA [7] using symmetric key-based privacy homomorphic encryption. In their approach, sensor nodes share a common symmetric key with BS, which is hidden from aggregators, and aggregators can perform aggregation functions directly on the ciphertext instead of carrying out costly decryption and encryption operations. Symmetric homomorphism has the advantage of fast computation. However, the secret key is shared among all nodes, so data confidentiality is lost once a single sensor node and its shared key are compromised. Mykletun et al. [8] investigated several additive homomorphic public-key encryption schemes and their applicability to WSNs. In general, these schemes preload a public key in sensor nodes and aggregate encrypted data. BS can then decrypt the aggregation result with its secret key. Albath and Madria [10] proposed an ECC-ElGamal based homomorphic encryption scheme to achieve confidentiality for in-network aggregation in wireless sensor networks. Even if the adversary compromises a node and obtains the public key, it cannot obtain the plaintext of intermediate aggregation results. Hence, public key-based homomorphic encryption schemes are resilient to node compromise attacks. However, the computational cost and communication overhead of public key encryption schemes are hard to tolerate in WSNs, especially when sensors are collecting diverse statistics (such as temperature, humidity, and pressure).
Castelluccia et al. [6] first proposed an additively homomorphic encryption scheme which simply adds a secret key $k$ to the environmental data $m$ to form the ciphertext $c = m + k$. Each node has a unique secret key, so the corruption of one node does not affect the data confidentiality of other nodes. Castelluccia et al. [11] improved their scheme in [6] by proposing a simple and provably secure encryption scheme that allows efficient additive aggregation of encrypted data. Each reporting node $i$ encrypts its plaintext data $m_i$ as $c_i = m_i + h(f_{k_i}(r))$. The security of their scheme is based on the indistinguishability property of a pseudorandom function (PRF). However, the ID list of the sensors has to be transferred and cannot be aggregated.
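To make the additive perturbation idea concrete, the following is a minimal sketch of key-based additive masking in the spirit of [6]. The modulus `M`, the function names, and the sample readings are assumptions made for illustration; they are not taken from [6] or from this paper. The sketch also shows why the base station needs to know which keys to subtract, which is the origin of the ID-list problem.

```python
import secrets

# Assumed modulus; it must be larger than the largest possible sum of readings.
M = 2**32


def encrypt(m: int, k: int) -> int:
    """Mask a reading m with the node's secret key k: c = (m + k) mod M."""
    return (m + k) % M


def aggregate(ciphertexts) -> int:
    """Add ciphertexts in-network without decrypting them."""
    total = 0
    for c in ciphertexts:
        total = (total + c) % M
    return total


def decrypt_sum(agg: int, keys) -> int:
    """Base station removes the keys of exactly the nodes that reported."""
    return (agg - sum(keys)) % M


# Hypothetical readings from three reporting nodes.
readings = {1: 21, 2: 25, 3: 19}
keys = {node_id: secrets.randbelow(M) for node_id in readings}
agg = aggregate(encrypt(readings[i], keys[i]) for i in readings)
recovered = decrypt_sum(agg, [keys[i] for i in readings])
print(recovered)  # sum of the three readings
```

If the base station subtracted the wrong set of keys, the recovered value would be an essentially random number modulo `M`, which is why schemes of this kind normally ship the reporting IDs along with the aggregate.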
Feng et al. [9] tried to alleviate the ID-list problem and proposed a family of secret perturbation-based schemes that protect sensor data confidentiality without disrupting the additive data aggregation result. BSP and FSP are two basic schemes which take the nonredundant reporting approach and the fully reporting approach, respectively. The ideal scheme O-ASP assumes that each sensor node knows the membership and topology of the whole network and knows whether each of these nodes has data satisfying each query. A decision node then computes the aggregation and communication cost of the two approaches for each cell before selecting one. To overcome this unrealistic assumption, D-ASP was proposed to enable nodes to make decisions based only on their locally available information, with interactions taking place only within a cell or between neighboring cells. However, it is difficult for nodes to decide whether to report their IDs using only locally available information, and the choice makes no difference when the number of reporting nodes equals the number of nonreporting nodes. It also causes extra communication cost and network delay for waiting and deciding.
PRDA [15] pointed out that the transmission of sensor node IDs along with aggregated data packets increases the communication overhead of the network. Therefore, it keeps, in each data aggregator, a table that maps sensor node IDs to small index numbers. After cluster formation, the data aggregator generates the index table and sends it to BS. During data aggregation, instead of sending 2-byte sensor node IDs, data aggregators send the corresponding index numbers. BS can look up the IDs of the sensor nodes in the index table. However, index numbers are only used within clusters.
Although existing schemes tried to reduce the number of IDs transmitted, they still suffer from the related communication cost, and dropping an ID or sending a false ID will lead BS to compute a false aggregation result.
Our work requires no ID to be forwarded and achieves a good trade-off between confidentiality and efficiency by adopting perturbation.With this improvement, we manage to simultaneously preserve data confidentiality and significantly reduce overall communication overhead, avoiding high energy consumption in aggregation phase.

System Model
3.1. Network Assumption. We assume a multilevel sensor network tree that consists of $n$ (less than 1000) sensor nodes and a certain number of relay nodes. Sensor nodes are deployed in areas of interest, and they can sense and aggregate data. Both tree and cluster topologies can be used as the aggregation structure. In this paper, we use an aggregation tree to illustrate our protocol. The aggregation tree could be formed as in TAG [4]. Relay nodes just forward messages, and they form a long path from the targeted areas to BS. The powerful BS, with a transmission range covering the whole network, is capable of broadcasting messages to all nodes directly. Each sensor node has a unique ID picked from the set $\{0, 1, \ldots, n-1\}$. After the aggregation tree is formed, each sensor node monitors its surrounding environment to generate environmental data, which is an integer in the range $[V_{\min}, V_{\max}]$. Environmental data (e.g., temperature) can be converted to integers if necessary. Each reporting node and aggregator sends its messages up the aggregation tree. The message has the following format:

$$\langle c, h \rangle, \tag{1}$$

where $c$ is the number of reporting nodes in the network and $h$ is the sum of environmental data and perturbation.

Design Goals.
When designing confidentiality protection schemes, we aim to achieve the following goals.
(1) Data accuracy: BS can correctly retrieve the sum of environmental data with overwhelming probability.
(2) Data confidentiality: the aggregation result should be known only to BS, and PEC2P should be CPA secure.
(3) Efficiency: the protocol should help to reduce communication overhead and prolong the network lifetime.
Definition 1 (Chosen Plaintext Attack). In this attack, the adversary has the ability to obtain the encryption of plaintexts of its choice. It then attempts to determine the plaintext that was encrypted in some other ciphertext [17].
Definition 2 (Negligible Function). A function $f$ is negligible if for every polynomial $p(\cdot)$ there exists an $N$ such that for all integers $n > N$ it holds that $f(n) < 1/p(n)$. An equivalent formulation is to require that for all constants $c$ there exists an $N$ such that for all $n > N$ it holds that $f(n) < n^{-c}$.
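We define an experiment for any private-key encryption scheme $\Pi = (\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$, any adversary $\mathcal{A}$, and any value $n$ of the security parameter. The CPA indistinguishability experiment $\mathsf{PrivK}^{\mathsf{cpa}}_{\mathcal{A},\Pi}(n)$ proceeds as follows.
(1) A random key $k$ is generated by running $\mathsf{Gen}(1^n)$.
(2) The adversary $\mathcal{A}$ is given input $1^n$ and oracle access to $\mathsf{Enc}_k(\cdot)$, and outputs a pair of messages $m_0$, $m_1$ of the same length.
(3) A random bit $b \leftarrow \{0,1\}$ is chosen, and then a ciphertext $c \leftarrow \mathsf{Enc}_k(m_b)$ is computed and given to $\mathcal{A}$. We call $c$ the challenge ciphertext.
(4) $\mathcal{A}$ continues to have oracle access to $\mathsf{Enc}_k(\cdot)$, and outputs a bit $b'$. The output of the experiment is 1 if $b' = b$ and 0 otherwise.
Definition 3 (CPA secure). A private-key encryption scheme $\Pi = (\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$ has indistinguishable encryptions under a chosen-plaintext attack (or is CPA secure) if for all probabilistic polynomial-time adversaries $\mathcal{A}$ there exists a negligible function $\mathsf{negl}$ such that

$$\Pr\left[\mathsf{PrivK}^{\mathsf{cpa}}_{\mathcal{A},\Pi}(n) = 1\right] \le \frac{1}{2} + \mathsf{negl}(n). \tag{2}$$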
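The experiment above can be sketched as a small test harness. This is only an illustration under stated assumptions: the toy scheme (XOR with a fixed key) and the adversary class are hypothetical and are not part of PEC2P. The toy scheme is deterministic, so an adversary that simply re-queries the oracle wins every run, which is the kind of behavior Definition 3 rules out.

```python
import secrets


def cpa_experiment(gen, enc, adversary, n: int) -> int:
    """Run one CPA indistinguishability experiment; return 1 if the adversary wins."""
    k = gen(n)                                   # step (1): key generation

    def oracle(m):                               # encryption oracle Enc_k(.)
        return enc(k, m)

    m0, m1, state = adversary.choose(oracle, n)  # step (2): adversary picks m0, m1
    assert len(m0) == len(m1)
    b = secrets.randbelow(2)                     # step (3): hidden random bit
    challenge = enc(k, (m0, m1)[b])
    b_guess = adversary.guess(oracle, challenge, state)  # step (4)
    return int(b_guess == b)


# Hypothetical toy scheme: deterministic XOR with the key (NOT CPA secure).
def toy_gen(n: int) -> bytes:
    return secrets.token_bytes(n)


def toy_enc(k: bytes, m: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(m, k))


class ReplayAdversary:
    """Wins against any deterministic scheme by re-encrypting m0 via the oracle."""

    def choose(self, oracle, n):
        m0, m1 = bytes(n), bytes([255]) * n
        return m0, m1, m0

    def guess(self, oracle, challenge, state):
        return 0 if oracle(state) == challenge else 1


wins = sum(cpa_experiment(toy_gen, toy_enc, ReplayAdversary(), 16) for _ in range(200))
print(wins)  # the replay adversary wins every run against the deterministic toy scheme
```

A scheme that refreshes its key material after each encryption, as PEC2P does for reporting nodes, is not defeated by this particular replay strategy, because the oracle answer changes between queries.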

Attacker Model.
We assume the existence of a global probabilistic polynomial time (PPT) adversary, which can choose to compromise a small subset of nodes and obtain all secrets of these nodes. With oracle access, it can also obtain the ciphertext for any chosen plaintext from any of the uncompromised nodes. Once the adversary compromises a sensor node, it will obtain its secret key and may modify, forge, or discard messages, or simply transmit false aggregation results.
In this paper, we do not consider stealthy attacks [18], where the attacker's goal is to make BS accept false aggregation results without being detected. We also do not consider denial-of-service (DoS) attacks in various protocol layers [19,20], where the adversary prevents the querier from getting any aggregation result at all. However, if a node does not respond to queries, it is clear that something is wrong, and solutions can be implemented to remedy this situation. Sybil/node replication attacks [21] and "wormhole" formation [22,23] are beyond the scope of this paper.

PEC2P
The proposed scheme PEC2P consists mainly of a bootstrapping phase, a data aggregation phase, and a result retrieving phase.
We further assume that BS first runs Algorithm 1 such that a unique initialization vector $IV_i$ is generated for each node $i$, and the secret key $k_i = IV_i$ is stored in BS's local record and in node $i$.

Data Aggregation Phase.
Each sensor node in the targeted area may behave as a sensing node, as an aggregator, or as both. To simplify the discussion, we assume without loss of generality that each node performs exactly one role, sensing or aggregating. Any node with a combined role can be logically split into a sensing node and an aggregating node. As shown in Figure 2, an aggregator that both senses data and aggregates data from downstream is divided into a sensing node and an aggregating node. After the transformation, only leaf nodes sense environmental data.
In the aggregation phase, when a targeted event happens or BS disseminates a query, each leaf sensor node $i$ with environmental data $d_i$ runs Algorithm 2 to compute its individual aggregation result $\langle c_i, h_i \rangle$. First, $i$ inputs the environmental data $d_i$ and then sets $c_i = 1$ and $h_i = d_i + H(k_i)$, since it has no children nodes. Second, $i$ forwards the result to its parent node for data aggregation. Finally, $i$ updates its secret key as $k_i = H(k_i)$. Other leaf sensor nodes remain hibernated.
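The leaf-node steps can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: it assumes SHA-1 as the hash function $H$ (the hash used in the later evaluation), a modulus $M = 2^{160}$, byte-string keys, and a reduction modulo $M$ at the leaf; the function names are hypothetical. Both the perturbation and the key update are modeled with the same hash, following the text literally.

```python
import hashlib

L_BITS = 160          # assumed hash output length (SHA-1)
M = 2**L_BITS         # assumed modulus


def H(key: bytes) -> bytes:
    """One-way hash function; SHA-1 is an assumption made for this sketch."""
    return hashlib.sha1(key).digest()


def leaf_report(d: int, key: bytes):
    """Return the leaf message (c, h) and the updated key.

    c is the count (1 for a leaf) and h is the reading plus the perturbation.
    """
    perturbation = int.from_bytes(H(key), "big")
    h = (d + perturbation) % M
    new_key = H(key)  # key update after reporting, as described in the text
    return (1, h), new_key


# Hypothetical node with key b"node-7-key" reporting the reading 30.
(c, h), next_key = leaf_report(30, b"node-7-key")
```

The message `(c, h)` is what travels up the tree; the node ID never appears in it.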
During each aggregation, upon receiving a message from one of its children nodes for the first time, each aggregator $j$ starts a timer $T$ and collects other messages before $T$ fires. It then aggregates the collected messages and forwards the result to its parent node. Aggregators receiving no messages from downstream just remain hibernated. Note that we use the count to let BS trace the contributing nodes; hence, synchronization among sensors is not needed.
Definition 4. $S_j$ is the set of IDs of the reporting nodes that are children nodes of node $j$.
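The aggregator step amounts to adding the counts and adding the perturbed values modulo $M$. The sketch below illustrates this under the assumption that $M = 2^{160}$ and that the messages have already been collected before the timer fired; the timer itself and the function name are not modeled from the paper.

```python
M = 2**160  # assumed modulus, matching a 160-bit hash output


def aggregate(messages):
    """Combine (c_i, h_i) messages from reporting children into (c_j, h_j)."""
    c_j = 0
    h_j = 0
    for c_i, h_i in messages:
        c_j += c_i                 # total number of contributing nodes
        h_j = (h_j + h_i) % M      # sum of perturbed data, reduced modulo M
    return c_j, h_j


# Two hypothetical child messages: one leaf and one sub-aggregate of two nodes.
result = aggregate([(1, 5), (2, M - 3)])
print(result)  # (3, 2)
```

The output message has the same fixed size as each input message, regardless of how many nodes contributed below the aggregator.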
To show how our scheme works, we take Figure 2

Accuracy Analysis
Theorem 5. PEC2P has a probability of at least (4) of finding the correct combination in the result retrieving phase, given that the environmental data is in the range $[V_{\min}, V_{\max}]$, the hash value is in the range $[0, 2^{l} - 1]$, and the modulus $M$ is $2^{l}$.
Proof.We assume that { ← {0, 1}  : ()} is the uniform distribution over {0, 1}  , and then () is independent of Thus, adding   environmental data together will result in a number in the range [  * V min ,   * V max ], and the aggregation result is in the range Then, the probability that  accepts a false aggregation result is at most Hence, (4) holds.
Suppose that we have 1024 nodes in the network, that the data sensed from the environment is in the range $[0, 2^{32} - 1]$, and that we use SHA-1 as our hash function, so the output is in the range $[0, 2^{160} - 1]$. Then the probability that BS accepts a false aggregation result is $2^{-118}$, which can be ignored.
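The order of magnitude of this figure can be checked with a short calculation. The sketch below assumes that a wrong candidate yields a value that is uniform over $[0, 2^{l} - 1]$ and is accepted only if it falls into the admissible window $[c \cdot V_{\min}, c \cdot V_{\max}]$; this is an illustrative reading of the example, not a reproduction of the bound in the theorem, and the function name is hypothetical.

```python
import math
from fractions import Fraction


def false_accept_bound(c: int, v_min: int, v_max: int, l_bits: int) -> Fraction:
    """Chance that a uniform l-bit value lands in [c*v_min, c*v_max]."""
    window = c * (v_max - v_min) + 1      # number of admissible sums
    return Fraction(window, 2**l_bits)


# Parameters from the example: 1024 nodes, 32-bit readings, 160-bit hash.
p = false_accept_bound(1024, 0, 2**32 - 1, 160)
exponent = round(math.log2(float(p)))
print(exponent)  # -118
```

With these parameters the window holds just under $2^{42}$ values out of $2^{160}$, which gives the $2^{-118}$ quoted above.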
We have implemented PEC2P on a SimpleWSN experimental system to sense temperature in a lab. The characteristics of the SimpleWSN node are shown in Table 1.
Results are shown in Table 2.

Security Analysis.
We assume that each sensor node shares a unique key with BS and that a common one-way hash function $H$ is used. When an event happens, every node that is collecting environmental data adds the hash value $H(k)$ to its environmental data $d$. Intuitively, since the key $k_i$ is shared only between node $i$ and BS, another node $j$ ($j \neq i$) cannot successfully compute $H(k_i)$ with a probability that is not negligible. It is also difficult for adversaries to compute the correct hash value of any given key. Hence, both privacy and confidentiality are achieved. We will prove this by security reduction. First, we construct an encryption scheme (Algorithm 4).
Theorem 7. PEC2P is secure against CPA if, for the hash function, the following distributions are identical:
Proof. Proof for the nonhashed scheme. We assume that adversary $\mathcal{A}$ attacks (CPA) PEC2P with success probability $1/2 + \varepsilon(n)$. Now, we can construct a fast algorithm $\mathcal{A}'$ to "break" Construction $\Pi^{*}$, and $\mathcal{A}'$ tries to achieve its goal by running $\mathcal{A}$ as in Algorithm 5.
(3)   forwards the queries to the network and return ((  )) to .

Security of the Hashed Version.
Only a few modifications to this security proof are needed in order to prove the security of the hashed variant. First, in Algorithm 5, all ciphertexts are now generated using the hashed key values. Second, the security proof of the hashed scheme relies on the fact that $\{t \leftarrow \{0,1\}^{\lambda} : H(t) + m_0\}$ and $\{t \leftarrow \{0,1\}^{\lambda} : H(t) + m_1\}$ are identical distributions. If $H$ fulfills the requirement, then $\{t \leftarrow \{0,1\}^{\lambda} : H(t)\}$ is the uniform distribution over $\{0,1\}^{l}$. Consequently, the two distributions are identical. This concludes the proof that the hashed scheme is semantically secure. Thus, PEC2P is CPA secure.

Efficiency Analysis.
For a reporting leaf node, the computational cost consists of only one hash computation and one modular addition. For an aggregator, the computational cost consists of summing the counts and summing the perturbed data. If an aggregator also has data to report, it performs one additional hash computation.
We assume that there are $n$ sensor nodes in the reporting area and that the aggregation tree has a branching factor of 3. Perturbed data Per = header + data + append. We choose the packet format used in TinyOS [16], in which the packet header is 56 bits. Data is in the range $[0, 127]$. Let the count length, ID length, and append length each be $\log_2 n$ bits. We consider two different scenarios: (1) only nodes at the lowest level may have data satisfying BS's query and (2) nodes at each level may have data satisfying BS's query.
O-ASP [9] is designed based on the ideal and unrealistic assumption that each sensor node knows the membership and topology of the whole network and knows whether each of these nodes has data satisfying each particular query. In each aggregation, a decision node first compares the communication cost of the all-reporting strategy and the nonredundant-reporting strategy for each cell and then decides which strategy to use.
In Claude 09 [11], in the data aggregation phase, for scenario (1), each reporting node sends $|\mathrm{Per}|$ bits of message to its parent node, and nodes at the second lowest level decide which group of IDs to send: the reporting nodes' IDs or the nonreporting nodes' IDs. For scenario (2), each reporting node sends $(|\mathrm{ID}| + |\mathrm{Per}|)$ bits of message.
For PEC2P, in the data aggregation phase, for scenarios (1) and (2), each reporting node sends $(|\mathrm{count}| + |\mathrm{Per}|)$ bits of message to its parent node, and a message of the same length is also sent from aggregators. No ID is transmitted in the aggregation tree.
We show the number of bits sent by a leaf node in Table 3. Then, we calculate the average/maximum/minimum communication overhead CO in the aggregation phase for Claude 09 and PEC2P in Table 4. In the minimum case, reporting nodes are located in the high levels of the aggregation tree, and we can find them through breadth-first search. In the maximum case, reporting nodes are located from the lowest level up to higher levels. Tables 5 and 6 list the number of bits sent per node for each level with Claude 09 and PEC2P. Figures 3 and 4 show the trend of communication overhead in the two scenarios.
We assume that only the nodes in the lowest level have a probability $p$ (= 0.1, 0.5, 0.9) of sensing environmental data. Results are shown in Figure 5. We further assume that all nodes in the aggregation tree have a probability $p$ (= 0.1, 0.5, 0.9) of sensing environmental data. Results are shown in Figure 6. The results show that, compared with existing protocols, PEC2P can greatly reduce communication overhead in the aggregation phase. We notice that the major communication overhead is caused by transferring the hash value, which was computed by SHA-1 in the comparison. Performance can be further optimized by choosing a hash function with shorter output when a lower security level is acceptable.

Result Retrieving Algorithm Test. We used a computer with a Pentium(R) D CPU at 3.40 GHz and 2.00 GB of memory to test Algorithm 7. Since sensor nodes are relatively uniformly distributed and their communication range is from 50 meters to 100 meters, a local event will be detected by a small group of sensor nodes. Therefore, we use small values in the test. Results show that choosing 5 nodes from 10 nodes needs only 8 milliseconds and choosing 10 nodes from 20 nodes needs only approximately 2 seconds. In WSNs, BS is more powerful than our experimental computer; thus, the searching time will be shorter in real applications. To make the search efficient, we can first divide the network into clusters of trees.
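The search performed by BS in the result retrieving phase can be sketched as an exhaustive scan over combinations of candidate nodes. This is a minimal sketch rather than the paper's Algorithm 7: it assumes SHA-1 as the hash, $M = 2^{160}$, lexicographic enumeration of ID combinations, and hypothetical key values; the key update for the found nodes is left out.

```python
import hashlib
from itertools import combinations

M = 2**160  # assumed modulus, matching the SHA-1 output length


def perturbation(key: bytes) -> int:
    """Perturbation derived from a node key (SHA-1 is an assumption here)."""
    return int.from_bytes(hashlib.sha1(key).digest(), "big")


def retrieve(c: int, h: int, keys: dict, v_min: int, v_max: int):
    """Search for the c reporting nodes; return (id_list, data_sum) or None.

    keys maps node ID -> shared secret key. A candidate combination is
    accepted when the unmasked value lies in [c * v_min, c * v_max].
    """
    for id_list in combinations(sorted(keys), c):
        d = (h - sum(perturbation(keys[i]) for i in id_list)) % M
        if c * v_min <= d <= c * v_max:
            return list(id_list), d
    return None


# Hypothetical network of 10 nodes; nodes 2, 5, and 7 report 20, 25, and 30.
keys = {i: f"key-{i}".encode() for i in range(10)}
reports = {2: 20, 5: 25, 7: 30}
h = sum(d + perturbation(keys[i]) for i, d in reports.items()) % M
found = retrieve(len(reports), h, keys, 0, 127)
print(found)  # expected: the reporting IDs and the sum of their readings
```

The number of candidates grows as the binomial coefficient of $n$ over $c$, which is consistent with the jump from milliseconds to seconds reported above and is the reason for splitting the network into smaller clusters before searching.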

Conclusion
Confidentiality protection and energy efficiency are two conflicting but equally crucial requirements in WSNs. Achieving a trade-off between these two goals remains a challenge. We propose PEC2P to protect data confidentiality while also achieving energy efficiency. Specifically, we need no ID list and use the output of a one-way hash function as the perturbation added to the environmental data. Since BS usually has powerful computation capacity, we utilize BS to the fullest and let it compute which nodes have actually contributed to the aggregation process after it receives the final perturbed aggregation result. Consequently, we manage to preserve data confidentiality, avoid high energy consumption, and obtain lower overall communication overhead. Analysis and experiments have also been conducted to evaluate the proposed protocol.
The results show that our protocol provides confidentiality protection for both raw and aggregated data with an overhead lower than that of the existing related protocols.PEC2P can be adopted to tree/cluster-based aggregation and any protocol using ID-list transmission.We focus on collecting the number of contributing nodes and its perturbed data, instead of how the information is gathered.For uniformity,

Figure 1: An example of environmental surveillance system in battlefield.

Figure 5: Bandwidth consumption in data aggregation phase when only nodes in the lowest level may have data satisfying BS's query.

Figure 6: Bandwidth consumption in data aggregation phase when nodes at each level may have data satisfying BS's query.
as an example. The two leaf sensor nodes each hold their own environmental data. The node with the combined role is divided into a sensing node and an aggregating node, which run Algorithm 2 and Algorithm 3, respectively. The upstream aggregator just forwards messages after aggregating the data received from its children. BS obtains the final aggregation result: $c = 3$ and $h = 066916$.

Algorithm 3: Aggregation algorithm.
begin
  $c_j \leftarrow \sum_{i \in S_j} c_i$;
  $h_j \leftarrow \sum_{i \in S_j} h_i \bmod M$;
  return $\langle c_j, h_j \rangle$;
end

4.3. Result Retrieving Phase. In the result retrieving phase, after receiving the final aggregation result $\langle c, h \rangle$, BS runs Algorithm 7 to retrieve the ID list and the actual aggregation result. First, BS selects, in order, a list IDL of $c$ nodes and the corresponding shared keys $k_i$ from the $n$ nodes, and computes $d = (h - \sum_{i \in \mathrm{IDL}} H(k_i)) \bmod M$. If $d \in [c \cdot V_{\min}, c \cdot V_{\max}]$, then BS accepts $d$ as the actual aggregation result $\sum_{i \in \mathrm{IDL}} d_i$ and updates the secret keys of the $c$ nodes found. If not, BS continues searching. To improve the searching efficiency of BS, we can first divide the network into clusters of trees, each containing part of the $n$ nodes. Further analysis is in Section 5.3.

Algorithm 4: Construction $\Pi^{*}$.
/* Define a private-key encryption scheme for fixed-length messages and keys as follows: */
(i) Gen: on input $1^{n}$, choose $k \leftarrow \{0,1\}^{n}$ uniformly at random and output it as the key.
(ii) Enc: on input a key $k \leftarrow \{0,1\}^{n}$ and a message
Addition of Ciphertext: given two ciphertexts $\langle c_1, h_1 \rangle$ and $\langle c_2, h_2 \rangle$, output $\langle c_3, h_3 \rangle$ as the aggregation ciphertext, where $c_3 = c_1 + c_2$ and $h_3 = (h_1 + h_2) \bmod M$.

Table 1: Characteristics of simple WSN node.

Column 1 displays the number of participating nodes $c$ from the aggregation result $\langle c, h \rangle$. Column 2 displays the perturbed data $h$ from $\langle c, h \rangle$. Column 3 displays the sum of hash values computed by BS. Column 4 displays the sum of environmental data after BS searches and subtracts the sum of hash values from $h$. The IDs of the found nodes are shown in column 5. The sensed temperature is a hexadecimal integer. We use Temperature (°C) = ((x/4096) * 1.5 − 0.986)/(0.00355), where x is the sensed integer, as provided by the SimpleWSN experimental platform, to transform the environmental data into a floating-point number representing degrees Celsius. The average temperature is about 30 degrees Celsius in our experiment. The results confirm the accuracy of PEC2P: subtracting the data in column 3 from the data in column 2 yields the data in column 4. The results verify that both the exact IDs and the actual aggregation result are retrieved correctly.

Table 2: Results of BS running selection algorithm after receiving aggregation results.
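The temperature conversion used for the readings in Table 2 can be sketched as a one-line function. The function name and the sample reading `0BA7` are hypothetical and chosen only so that the result lands near the 30 degrees Celsius average mentioned in the text; the constants are those of the conversion formula.

```python
def raw_to_celsius(raw: int) -> float:
    """Convert a raw sensed integer into degrees Celsius."""
    return ((raw / 4096) * 1.5 - 0.986) / 0.00355


# Readings are hexadecimal integers, so parse the string before converting.
reading = int("0BA7", 16)   # hypothetical sample value
temp_c = raw_to_celsius(reading)
print(round(temp_c, 1))
```

Because BS recovers the sum of the raw integers, the conversion would be applied after retrieval, for example to the sum divided by the number of reporting nodes when an average is wanted.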
Proof. If we replace the hash function $H$ in Algorithm 4 with a truly random function $f$, we obtain a new construction $\tilde{\Pi}$. It is obvious that

Table 3: Number of bits sent per node for leaf node.
ID: node ID; h: header; Per: perturbed data; n: number of nodes in network.

Table 5: Number of bits sent per node for each level with Claude 09 scheme.
Note: only the nodes in the lowest level may have data satisfying BS's query.