The Optimal Noise Distribution for Privacy Preserving in Mobile Aggregation Applications

In emerging mobile aggregation applications (e.g., large-scale mobile surveys), individual privacy is a crucial factor in determining effectiveness, and the noise-addition method (i.e., adding a random noise value to the true value) is a simple yet powerful way to protect it. However, improper additive noise can bias the aggregate result, which calls for an optimal noise distribution that limits the deviation. In this paper, we develop a mathematical framework to derive the optimal noise distribution that provides privacy protection under the constraint of a limited value deviation. Specifically, we first derive a generic system dynamic function that the optimal noise distribution must satisfy and then investigate two special cases for the distribution of the original value (i.e., Gaussian and truncated Gaussian). Our theoretical and numerical analysis suggests that the Gaussian distribution is the optimal solution for the Gaussian input and the asymptotically optimal solution for the truncated Gaussian input.


Introduction
With the advance of the information age, data aggregation has been widely used in daily life and commercial applications; some companies, such as Canalys, even make a living by providing all kinds of statistics. In aggregation applications the server wishes to distill valuable aggregate statistics from a mass of individual data. For example, CarTel [1] learns traffic conditions from road information collected by mobile phones. BikeNet [2] measures air and road conditions to guide cyclists, where all the data is contributed by users' devices.
However, individual privacy may be violated during the aggregation. The server is able to obtain the individual data of participants from their inputs, yet much of this information, such as health conditions and income, is private, especially in the presence of a curious server or data abuse. In fact, the server only needs the aggregate result, not the individual data. Thus, in aggregation applications, calculating aggregate statistics without compromising individual privacy is an important challenge.
Secure Multiparty Computation (SMC) is one way to solve this problem. It usually uses cryptographic methods, performing operations in the ciphertext domain. However, it has many limitations. Firstly, because of its huge overhead, SMC is not suitable for large-scale systems. Secondly, most SMC methods need the collaboration of parties, which is not feasible in some circumstances (e.g., in a wireless network, a node may not be connected with others at all times). Thirdly, both encryption/decryption and communication are high-power operations, which limits the deployment of SMC on energy-sensitive devices (e.g., sensors or phones). Therefore, SMC is not suitable for large-scale, energy-constrained environments such as large-scale mobile survey applications. Furthermore, brute-force attacks are effective against cryptographic methods if the plaintext space is small; some methods such as [3] obtain the aggregate result based on this property. In contrast, noise addition, which prevents the adversary from getting the accurate individual data values, is a simple but effective method. Compared to SMC, it is much simpler and more efficient, especially in this environment: without collaborating with others, each participant independently adds noise to his data before uploading it. However, in this method, choosing the noise distribution is difficult. Improper additive noise can bias the aggregate result. What is needed is an optimal noise distribution that best protects individual privacy while keeping the bias of the aggregate result tolerable. However, the optimal noise distribution is not evident; usually a noise distribution (typically homogeneous or Gaussian) is proposed directly, without any justification.
In aggregation applications, the accuracy of the result and the privacy of individuals are the two main concerns. Our goal is to find the optimal noise distribution for the noise-addition method, under which individual privacy is best protected while the given accuracy requirement is met. The main contributions of this paper are as follows.
(1) We formulate the accuracy metric by mean and variance and the privacy metric by mutual information, which are the foundations for choosing a proper noise distribution.
(2) Based on the accuracy and privacy metrics, we develop a mathematical framework to derive the optimal noise distribution.
(3) We derive the generic system dynamic function that the optimal noise distribution must satisfy, where the input is the distribution of the original individual data.
(4) We solve the problem for two input cases. For the Gaussian input, we obtain the theoretical optimal solution. For the truncated Gaussian input, we first point out when it can be approximated by a Gaussian distribution, so that the solution for the Gaussian input can be employed directly; then, for the arbitrary truncated Gaussian input, we show that the Gaussian distribution is the asymptotically optimal solution.
The rest of the paper is organized as follows. Related work is introduced in Section 2. We formulate the problem in Section 3. In Section 4 we give the general solution and investigate the Gaussian input. In Section 5 the truncated Gaussian input is analyzed in detail. In Section 6 we numerically verify the conclusions and compare the privacy-preserving capability of three proposed noise distributions. Finally, the paper is concluded in Section 7.

Related Work
SMC enables parties to calculate a result collaboratively from their own data without compromising one another's privacy, but it has many limitations. In [4] a secure sum protocol was described, where the summation is calculated serially, which takes too much time in large-scale systems. Another protocol [3] allows an untrusted server to calculate the summation; it requires that the keys of the parties sum to 0, so if one of the parties leaves during the process, a common case in large-scale systems, the summation cannot be calculated. Jung et al. [5] proposed a linear-time protocol without a secure channel, but it still needs a lot of communication among parties. Meanwhile, in these methods each party has to communicate with the others and perform many mathematical operations, both of which consume a lot of power. Although CPDA [6] reduces the communication overhead of SMC for wireless sensor networks, it is still much more complex than other methods. So SMC is not suitable for energy-constrained devices.
In ad hoc networks, other methods have been exploited to protect privacy during data aggregation while trying to reduce the energy cost. A cryptology-based aggregation approach is proposed in [7], which leverages a simple, secure, additively homomorphic stream cipher, but it requires that all the nodes share their keys with the sink node so that the sink node can decrypt the encrypted aggregate result. In SMART [6], the original data is sliced into several pieces and recombined randomly. This method calculates the summation securely, but the communication cost rises severalfold. GP2S [8] is based on data generalization: it replaces the original data by an integer range, from which the data collector plots a histogram without the accurate original values. However, the summation calculated from the histogram is not accurate.
Noise addition has been studied for many years in secure data mining [9]. It prevents the adversary from getting the accurate individual data values, and plenty of schemes have been proposed to preserve the privacy of individual records. Most of them, such as [10, 11], do not claim whether their methods are optimal. Furthermore, they utilize the covariance of the data in the database, which requires the noise-adding party to know global information about the data. In some schemes the noise is added without considering the covariance of the data, but a uniform or Gaussian distribution is simply declared [12, 13]. In [14] the authors considered the optimal randomization given the bias of the results, but they did not solve for it. Meanwhile, some researchers [12, 15] found that the original data distribution can be reconstructed from the perturbed values, yet without violating individual privacy. To the best of our knowledge, there is no work completely focusing on the optimization of the noise-addition scheme.
There are several different measures of privacy. In [12], privacy is measured by a "confidence interval": if the concerned data lies in an interval (a, b) with at least a certain probability c%, the length of the interval |b − a| is treated as the privacy measure. However, this measure is not accurate. Mutual information or differential entropy from Shannon's information theory is another, much more popular privacy metric [15]; it indicates the average privacy and is supported by mathematical theory. Rényi entropy (an extension of Shannon entropy) has also been used to measure privacy [16], but it is complex and lacks an obvious physical meaning.
In recent years, differential privacy [17] has become a popular noise-addition technique for protecting individual privacy in data mining. It guarantees the accuracy of the statistical result while avoiding the disclosure of individual records. Ghosh et al. [18] derived the optimal noise distribution that provides the most accurate result under a given privacy requirement. However, differential privacy targets an adversary who infers individual records from multiple statistical results.
In our situation, the adversary can get the individual records directly, and we only focus on one aggregation process.

Problem Formulation
In this section, we first introduce some aggregation applications in which the violation of individual privacy potentially exists and the noise-addition method is appropriate. Then we quantify the accuracy and privacy requirements. Finally, based on these measurements, the optimization problem is presented.

Applications.
Individual privacy is potentially threatened in statistics aggregation applications. Examples include (i) sensor network aggregation: in sensor network applications, many energy-constrained sensors are widely deployed to monitor the surrounding environment and send data to a central server for aggregation; however, the data from individual sensors may contain privacy-sensitive information, especially if the sensors are deployed in personal spaces, confidential institutions, or across multiple companies, so energy-efficient privacy protection in aggregation is an important issue; (ii) mobile survey applications: in these applications, tens of thousands of participants exist and the phones are energy-constrained; the overall results are distilled from a large amount of individual information collected by mobile phones, and individual privacy may be violated during the collection.
In these large-scale, energy-constrained applications, the server should know the aggregate results, which are distilled from the information of individuals, yet individual privacy may be violated during the collection. Noise-addition technology, which protects individual privacy by adding noise to the individual data, is a simple but efficient method in these applications: it can be employed independently by individual devices without collaboration, and its operations are energy-efficient compared with SMC. To describe the problem more precisely, we formulate it mathematically in the following.

Accuracy and Privacy Measurement.
Suppose that there are N users with values v_i, i = 1, 2, . . . , N, and a server calculating aggregate statistics. In this paper, we mainly focus on a simple but common statistic, the summation: the server computes the aggregation function sum(V) = ∑_{i=1}^{N} v_i. Of course, there are several other aggregation types. Besides summation, Popa et al. [19] list other classes such as average, standard deviation, and count. All of them can be constructed from summation, as outlined in Table 1.
There are two parties threatening individual privacy. One is the server, which obtains the individual data through the aggregation. The other is the eavesdropper, who can capture the packets sent from the participants to the server. Both of them (called the attacker below) can get the individual value, which is regarded as the individual privacy.

Table 1: Aggregation function list.

Aggregation function    Construction with summation sum(V)
Count: count(V)         The value of each individual is 1
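To make the constructions of Table 1 concrete, here is a minimal Python sketch (the helper names are ours, not the paper's) showing how other common aggregates reduce to summations:

```python
def agg_sum(values):
    return sum(values)

def agg_count(values):
    # count(V): each individual contributes the value 1
    return agg_sum(1 for _ in values)

def agg_average(values):
    vals = list(values)
    return agg_sum(vals) / agg_count(vals)

def agg_std_dev(values):
    # population standard deviation from two summations: sum(v) and sum(v^2)
    vals = list(values)
    n = agg_count(vals)
    mean = agg_sum(vals) / n
    return (agg_sum(v * v for v in vals) / n - mean * mean) ** 0.5

print(agg_average([2, 4, 6]))  # 4.0
```

Since every aggregate above is built from sums of per-user quantities, perturbing each contribution with additive noise perturbs all of them in a controlled way.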
To protect individual privacy in the process of aggregating statistics, user i adds random noise n_i to his/her true value v_i. Instead of v_i, user i contributes the perturbed value p_i = v_i + n_i to the server. At most, the attacker knows all the perturbed values and the scheme by which the noise is generated. So we suppose that the attacker knows p_i and the distributions of V and N, and tries to recover v_i from this information. The aim of the noise is to prevent the attacker from obtaining the accurate true value.
Obviously, different noise distributions have different privacy protection capabilities, so choosing a good noise distribution is the key issue for protecting the true value. The noise N is a random variable with probability density function (pdf) f_N. To meet the requirements of accuracy and privacy, f_N should satisfy (1) the accuracy requirement: the difference between ∑ p_i and ∑ v_i is small; (2) the privacy requirement: the confusion of the true value is evident.
The first requirement guarantees that the aggregate result does not deviate from the true result too much. The second one guarantees the individual privacy is not violated. If the attacker gets the user's value, he still doubts it because of the existence of noise.
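As an illustration of the scheme (a sketch under the assumption of zero-mean Gaussian noise, one of the candidate distributions analyzed later; the values and sigma are made up):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def perturb(v, sigma):
    # Each participant independently adds zero-mean noise; a zero mean
    # keeps the aggregate sum unbiased (accuracy requirement (1)).
    return v + random.gauss(0.0, sigma)

true_values = [170.0, 165.0, 182.0, 158.0]   # e.g., heights in cm
perturbed = [perturb(v, sigma=5.0) for v in true_values]

print(sum(true_values))  # true aggregate
print(sum(perturbed))    # close to it, while each individual value is masked
```

The server sees only the perturbed list; each individual value carries the noise, while the sum deviates by the sum of the noises only.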

Accuracy Measurement.
For the accuracy requirement, we define the difference Δ = ∑_{i=1}^{N} p_i − ∑_{i=1}^{N} v_i = ∑_{i=1}^{N} n_i, where N is the number of participants. Ideally Δ is constantly equal to zero, but this is impossible. Since the n_i are random variables, Δ is also a random variable, with expectation E(Δ) and variance D(Δ).
The n_i are independent, where n_i has expectation μ_i and variance σ_i², respectively. If they satisfy Lindeberg's condition [20], Δ obeys a Gaussian distribution regardless of the distributions of the individual noises, determined only by its expectation and variance; that is, E(Δ) = ∑_{i=1}^{N} μ_i and D(Δ) = ∑_{i=1}^{N} σ_i². We try to keep Δ small with high probability, which requires E(Δ) = 0 and a small D(Δ). Therefore, we quantify the accuracy requirement as σ² ≤ δ² with μ = 0: δ² measures the average deviation tolerance of the perturbed result from the true result via the variance of the noise distribution. If two zero-mean noise distributions f₁ and f₂ satisfy σ₁² < σ₂², then f₁ guarantees the accuracy of the result better.
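The claim above can be checked with a small Monte-Carlo experiment (ours, not from the paper). We deliberately use uniform rather than Gaussian per-user noise to stress that the distribution of Δ is governed only by the mean and variance, via the CLT:

```python
import math
import random
import statistics

# With N independent zero-mean noises n_i of common variance sigma^2,
# Delta = sum(n_i) has E(Delta) = 0 and D(Delta) = N * sigma^2,
# approximately Gaussian whatever the individual noise shape.
random.seed(42)
N, sigma, trials = 100, 2.0, 5000
c = sigma * math.sqrt(3.0)  # uniform on [-c, c] has variance c^2 / 3 = sigma^2

deltas = [sum(random.uniform(-c, c) for _ in range(N)) for _ in range(trials)]

print(statistics.mean(deltas))      # near 0
print(statistics.variance(deltas))  # near N * sigma^2 = 400
```

The sample mean and variance of Δ match E(Δ) = 0 and D(Δ) = Nσ² up to Monte-Carlo error.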

Privacy Measurement.

Consider P = V + N, where P, V, and N are random variables denoting the perturbed value, the true value, and the noise, respectively, and N is independent of V. Suppose that the adversary knows the distribution of N; this is reasonable, since any user, including a malicious one, knows it in order to generate noise. Because of the perturbation by the noise N, the adversary is uncertain about V when he observes P. We use Shannon's information entropy to measure this uncertainty. Suppose the adversary observes P = p; the uncertainty of V is measured by H(V | P = p) = −∑_v f_{V|P}(v | p) log f_{V|P}(v | p). The larger H(V | P = p) is, the better the privacy protection provided at P = p.
For different p, H(V | P = p) differs. We use the average of H(V | P = p) over p to quantify the privacy protection strength of the noise, denoted H(V | P): it is the average uncertainty about the true value when the perturbed value is captured. The larger H(V | P) is, the higher the average uncertainty.
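A toy discrete example (ours; the binary distributions are made up) of this measure: for a binary true value and binary noise, H(V | P) averages the attacker's residual uncertainty over the observed perturbed values:

```python
import math

pV = {0: 0.5, 1: 0.5}   # true value distribution (assumed)
pN = {0: 0.5, 1: 0.5}   # noise distribution (assumed)

# Joint pmf of (V, P) with P = V + N and N independent of V
joint = {}
for v, pv in pV.items():
    for n, pn in pN.items():
        joint[(v, v + n)] = joint.get((v, v + n), 0.0) + pv * pn

# Marginal of P
pP = {}
for (v, p), pr in joint.items():
    pP[p] = pP.get(p, 0.0) + pr

# H(V | P) = -sum_{v,p} Pr(v, p) * log2 Pr(v | p)
H_V_given_P = 0.0
for (v, p), pr in joint.items():
    cond = pr / pP[p]
    H_V_given_P -= pr * math.log2(cond)

print(H_V_given_P)  # 0.5 bits
```

Here P ∈ {0, 2} reveals V exactly (zero uncertainty), while P = 1, seen half the time, leaves one full bit of uncertainty, so the average is 0.5 bits.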
Generally speaking, for noise-addition technology, accuracy and privacy are in contradiction: high accuracy leads to low privacy protection strength, and vice versa. However, for a given accuracy level, different f_N usually provide different privacy protection capabilities. Thus, finding the noise distribution that provides the best privacy protection under the accuracy constraint is the key problem.

Optimization Problem Formulation.
For convenience, in the following we consider continuous distributions; a discrete distribution can be regarded as an approximation of the corresponding continuous one. Consider the formulation P = V + N, where V and N are random variables with pdfs f_V(v) and f_N(n), respectively. We seek the optimal f_N providing the best privacy protection while guaranteeing that the result has an acceptable deviation; that is, we maximize H(V | P) subject to σ_N² ≤ δ² and E(N) = 0, where δ² is the accuracy requirement bound required by the application. Since f_V(v) is given, H(V) is a constant, and because H(V | P) = H(V) − I(V; P), the optimization problem is translated to minimizing the mutual information I(V; P).
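For intuition about the reformulated objective: when both V and N are Gaussian, the mutual information has the standard closed form I(V; P) = (1/2) ln(1 + σ_V²/σ_N²), so loosening the accuracy bound (a larger admissible noise variance) monotonically reduces the leakage. A small sketch (our illustration, not the paper's derivation):

```python
import math

def mi_gaussian(sigma_v, sigma_n):
    # Standard identity for P = V + N with independent Gaussians:
    # I(V; P) = 0.5 * ln(1 + sigma_v^2 / sigma_n^2), in nats.
    return 0.5 * math.log(1.0 + sigma_v ** 2 / sigma_n ** 2)

# Larger noise variance => smaller I(V; P) => larger residual
# uncertainty H(V | P): the accuracy/privacy trade-off in one line.
for sigma_n in (1.0, 2.0, 4.0):
    print(sigma_n, mi_gaussian(10.0, sigma_n))
```

This is exactly the trade-off the constrained problem formalizes: the bound δ² caps σ_N², and hence floors I(V; P).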

Problem Solution
To solve the problem posed in the previous section, we first investigate the general solution; then, for the special case in which the original data obeys a Gaussian distribution, a further result is shown.
Based on the theorem, the corresponding system diagram is constructed in Figure 1, where f_V is the input and f_N is the output. The system contains two operations: one is the convolution of the input and the output; the other is the multiplication of the convolution result by a factor determined by the Lagrange multipliers of the constraints. From the theorem and the system diagram, the optimal noise distribution is determined by the distribution of the original value. Therefore, for different aggregation applications, the optimal noise distribution may differ.
I(V; P) is a convex function of f_{P|V}(p | v) in problem (9) [21], so problem (8) is also a convex optimization problem. The constraints satisfy the sufficient conditions of the KKT approach (the inequality constraint is a continuously differentiable convex function; the equality constraints are affine [22]). Hence, if we find an f_N satisfying (18), it is the global optimal solution. So, given the distribution of the original value V, we only need to find a solution of (18), that is, the optimal noise distribution.

Gaussian Distribution Input.

Generally, for different inputs f_V(v), the output f_N(n) differs. We consider a special but popular case in which V follows a Gaussian distribution.

Truncated Gaussian Distribution Input
In practice, V usually has maximum and minimum bounds. For example, a person's height has a maximum bound and is not less than 0, and an examination score usually lies in [0, 100]. So we consider the truncated Gaussian distribution on a range [a, b]. In Section 5.1 we investigate the condition under which the truncated Gaussian distribution can be approximated by a Gaussian distribution, so that Theorem 2 can be applied directly. In Section 5.2 we revise the condition, making it more accurate. However, not every truncated Gaussian distribution can be approximated by a Gaussian distribution. In Section 5.3, for the arbitrary truncated Gaussian distribution, we show that the Gaussian distribution is still a nearly optimal noise distribution.

Approximation Condition.
We use the metric I(f, f̂) [15] to measure the difference between two distributions. This metric measures the overlap of the two distributions and lies in the interval [0, 1]: the smaller I(f, f̂) is, the more the two distributions overlap and the more similar they are. I(f, f̂) = 0 implies that the two distributions are exactly the same, while I(f, f̂) = 1 means there is no overlap between them. Suppose f(v) = (1/σ)φ((v − μ)/σ), and let f̂(v) be the pdf of the corresponding truncated Gaussian distribution over [a, b]; that is, f̂(v) = φ((v − μ)/σ) / (σ[Φ((b − μ)/σ) − Φ((a − μ)/σ)]) for v ∈ [a, b] and 0 otherwise, where φ(·) and Φ(·) are the pdf and cumulative distribution function of the standard normal distribution.
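A quick numerical sketch (ours; as a stand-in for the paper's metric I(f, f̂) we use half the L1 distance, which shares its 0-to-1, smaller-is-more-similar behavior) showing that a widely truncated Gaussian is nearly indistinguishable from the untruncated one, while a tight truncation is not:

```python
import math

def phi(x):  # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):  # standard normal cdf via erf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncnorm_pdf(v, mu, sigma, a, b):
    # pdf of Gaussian(mu, sigma) truncated to [a, b]
    if v < a or v > b:
        return 0.0
    z = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)
    return phi((v - mu) / sigma) / (sigma * z)

def l1_half_distance(mu, sigma, a, b, steps=20000):
    # 0.5 * integral |f - f_hat|, by a midpoint Riemann sum
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    dv = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        v = lo + (i + 0.5) * dv
        f = phi((v - mu) / sigma) / sigma
        total += abs(f - truncnorm_pdf(v, mu, sigma, a, b)) * dv
    return 0.5 * total

d_wide = l1_half_distance(0.0, 80.0, -300.0, 300.0)   # (b-a)/2sigma = 3.75
d_tight = l1_half_distance(0.0, 80.0, -80.0, 80.0)    # (b-a)/2sigma = 1
print(d_wide)   # tiny: wide truncation, Gaussian approximation is good
print(d_tight)  # large: tight truncation, approximation fails
```

The wide case corresponds to the regime where the approximation condition of this section holds; the tight case is why Sections 5.2 and 5.3 are needed.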
By the analysis in this section, for an aggregation application where the original value obeys a truncated Gaussian distribution satisfying condition (27) or (30), the distribution can be approximated by a Gaussian distribution and Theorem 2 can be used directly.

Approximation Amendment.
From the analysis of Section 5.1, when the minimum is reached, |a − μ| = |b − μ| = (1/2)|b − a|; the exact min(b − a) is then calculated by (24). Equation (29) gives the approximate min(b − a), from which min(b − a)/2σ = |b − μ|/σ. Figure 2 shows the relationship between the minimum length of [a, b] and ε: the smaller ε is, the larger min(b − a)/2σ is. However, the figure also shows a bias between the approximate method and the exact one. The reason is that (29) is based on (26), which uses the approximation (min(b − a)/2σ)² ≫ log(min(b − a)/2σ); in practice the difference is not large enough, so a bias remains. To reduce the bias, we revise the condition: based on (32) without the above approximation, we obtain from (25) the exact value (min(b − a)/2σ)₁, while (29) gives the approximate value (min(b − a)/2σ)₂. Consider ε ∈ [0.0001, 0.05], which is reasonable: if ε is too large, the approximation by a normal distribution is useless, while if ε is too small, it is hard to find such an f(v) in practice. Then log(2/ε²) is bounded accordingly. The effect of the revision is illustrated in Figure 3, where the approximate result is almost the same as the exact one.
From Theorem 3, the optimal solution is determined only by (b − a), σ, and δ. To see the performance when the noise obeys a Gaussian distribution, the difference between min I(V; V + N) and I(V; V + N) with Gaussian N is shown in Figure 4. From the figure, we find that although the Gaussian distribution is not the optimal noise distribution, its deviation from the optimal solution is small; in particular, when (b − a)/2σ is larger than 3, the bias is very close to 0. Thus, although the Gaussian distribution is not the optimal noise distribution when V follows the truncated Gaussian distribution, it is a nearly optimal noise distribution.

Numerical Simulation
From Theorem 2, when the original value obeys a Gaussian distribution, the optimal noise distribution is also Gaussian. Besides the Gaussian distribution, the homogeneous (uniform) distribution (e.g., [12]) and the Laplace distribution (e.g., [17]) are also used in noise-addition methods. Figure 5 shows the privacy-preserving capabilities of these three noise distributions when V is Gaussian. From the figure, the mutual information, which measures the privacy leakage, is smallest with Gaussian noise; that is, by adding Gaussian noise, the attacker gains the least information about the true value from the perturbed value. Meanwhile, the figure shows that less information leaks as the accuracy requirement bound δ² increases. Figure 6 illustrates the privacy-preserving capabilities of the Gaussian, homogeneous, and Laplace distributions when V follows a truncated Gaussian distribution. Here we choose f_V(v) = (1/80)φ(v/80) / (Φ(300/80) − Φ(−300/80)) on the range [−300, 300] as an example. From the figure, Gaussian noise is still the best at protecting individual privacy. Its difference from the optimal noise distribution was shown in Figure 4, from which we find that when the original value obeys a truncated Gaussian distribution, the Gaussian distribution is still a good noise distribution.
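The comparison of Figure 5 can be reproduced in spirit with a rough numerical estimate (ours; it uses the identity I(V; P) = h(P) − h(N), which holds because N is independent of V, and ad hoc grid sizes and ranges):

```python
import math

def h_numeric(pdf, lo, hi, steps):
    # differential entropy (nats) by a midpoint Riemann sum
    dv = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(lo + (i + 0.5) * dv)
        if p > 0.0:
            total -= p * math.log(p) * dv
    return total

def conv_pdf(f_v, f_n, lo, hi, steps):
    # density of P = V + N by numerical convolution over a truncated grid
    dv = (hi - lo) / steps
    grid = [lo + (i + 0.5) * dv for i in range(steps)]
    weights = [f_v(t) * dv for t in grid]
    return lambda p: sum(w * f_n(p - t) for t, w in zip(grid, weights))

def gauss(s):
    return lambda x: math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2 * math.pi))

def uniform(s):   # uniform on [-c, c], variance s^2  =>  c = s * sqrt(3)
    c = s * math.sqrt(3.0)
    return lambda x: 1.0 / (2.0 * c) if -c <= x <= c else 0.0

def laplace(s):   # Laplace, variance s^2  =>  scale b = s / sqrt(2)
    b = s / math.sqrt(2.0)
    return lambda x: math.exp(-abs(x) / b) / (2.0 * b)

sv, sn = 10.0, 5.0   # std of V and of every noise candidate (equal accuracy)
f_v = gauss(sv)

results = {}
for name, f_n in (("gaussian", gauss(sn)),
                  ("uniform", uniform(sn)),
                  ("laplace", laplace(sn))):
    f_p = conv_pdf(f_v, f_n, -60.0, 60.0, 600)
    results[name] = (h_numeric(f_p, -80.0, 80.0, 1200)
                     - h_numeric(f_n, -40.0, 40.0, 4000))
    print(name, round(results[name], 3))
```

With equal noise variance (i.e., the same accuracy bound), the Gaussian candidate yields the smallest I(V; P), consistent with Theorem 2; the Gaussian value also matches the closed form 0.5 ln(1 + σ_V²/σ_N²) up to discretization error.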

Conclusion
In this paper, we quantify the accuracy of the result and the privacy of individuals. Based on these metrics, we pose the optimization problem of finding the noise distribution that provides the best privacy protection while maintaining an acceptable deviation from the accurate result. For the special cases in which the original data of individuals follows a Gaussian distribution or a truncated Gaussian distribution, the Gaussian distribution is the optimal noise distribution and the asymptotically optimal one, respectively.