Secure data deduplication for Internet-of-things sensor networks based on threshold dynamic adjustment

Large amounts of data are being produced by Internet-of-things sensor networks and applications. Secure and efficient deduplication of Internet-of-things data in the cloud is vital to the prevalence of Internet-of-things applications. In order to ensure data security for deduplication, different data should be assigned different privacy levels. We propose a deduplication scheme based on threshold dynamic adjustment to ensure the security of data uploading and related operations. The concept of the ideal threshold is introduced for the first time, which can be used to eliminate the drawbacks of the fixed threshold in traditional schemes. The item response theory is adopted to determine the sensitivity of different data and their privacy score, which ensures the applicability of the data privacy score and addresses the problem that some users care little about privacy issues. We propose a privacy score query and response mechanism based on data encryption. On this basis, the dynamic adjustment method of the popularity threshold is designed for data uploading. Experiment results and analysis show that the proposed scheme based on threshold dynamic adjustment has decent scalability and practicability.


Introduction
With the rapid development of Internet-of-things (IoT) sensor networks and applications, an increasing amount of data is generated and stored in cloud services. Cloud storage not only evolves into a major storage scheme but also provides IoT applications with abundant if not limitless storage capability. Duplicate data are almost unavoidable with thousands of IoT sensors working all day long. 1 Data sharing among users or devices has become a common requirement. These challenges are new to cloud storage providers (CSPs). As the amount of uploaded data increases, so does the extent of data redundancy. Statistics show that up to 60% of the data stored in cloud storage are redundant data, 2 and a large amount of cloud storage resources are consumed, which greatly increases the cost of storage and maintenance of the CSP, especially for IoT-sensor network-based applications. 3 In order to solve the above problems, the CSP generally adopts deduplication technology, 4 which detects identical data objects in the upload stream based on data redundancy. Deduplication-enabled systems store only a single copy of the data and create links for other users (or IoT devices) who upload the same data. Deduplication schemes can be classified into block-level 5 and file-level deduplication depending on the size of the objects. Compared with traditional data compression technology, deduplication eliminates not only data redundancy within files but also redundancy between files in shared data sets. 6,7 However, some users are unaware of data security, resulting in a large amount of private data being shared without the user's consent. In recent years, large-scale data leakage events have triggered great concern about privacy issues. 8 Therefore, how to protect user privacy while improving the efficiency of cloud deduplication for IoT applications has become a key issue. 9 Harnik et al. 10 first discussed the security problem of client-side deduplication.
Since then, the subject has been extensively explored, and it is still under investigation in both methodological aspects and concrete applications. A deduplication scheme that encrypts upload data was first proposed in Bolosky et al., 11 known as convergent encryption. In this scheme, the hash value of the data is used as the encryption key. However, the direct relation between the key and the data reduces the security of the scheme. In Xu et al., 12 the multi-client cross-deduplication scheme Xu-CDE was first applied to the encrypted ciphertext deduplication problem. 13 The scheme protects the security of private data in the scenario where external attackers coexist with honest-but-curious servers. However, in terms of applicability, this scheme has the disadvantage of low encryption efficiency, and it lacks a real-time authentication mechanism. In view of the above shortcomings, MRN-CDE (MLE based and random number modified client-side deduplication of encrypted data in cloud storage) was proposed, 14 which applies random numbers to ensure the instantaneity of the authentication credentials. In order to reduce the amount of computation in the encryption and decryption processes and to ensure data security, the scheme extracts the key from the original data using the KP algorithm 15 in the message locked encryption (MLE) scheme. In addition, some CSPs provide users with client-side encryption options, allowing users or IoT sensor networks to encrypt the data before uploading them. This method can effectively protect data privacy. However, even identical plaintext can be encrypted into different ciphertexts by different users, which makes it difficult for the CSP to perform deduplication. Therefore, although the above schemes improve the security of cloud storage, the storage efficiency is still unsatisfactory. 16 For the efficiency of deduplication, Stanek et al. proposed a scheme based on popularity partition.
Data of different popularity are encrypted with different encryption methods, which can effectively improve the efficiency of deduplication. 17 The scheme assigns a fixed popularity threshold (T) to all data. When the number of copies of certain data in the cloud reaches T, the data are considered to be popular; otherwise, they are regarded as nonpopular data. The cloud server only performs deduplication on popular data, which better protects data privacy while improving efficiency. The PerfectDedup scheme was proposed by Puzio et al. They used a perfect hash function 18 to query the popularity of data with the assistance of a trusted third party. However, the introduction of a trusted third party increases the communication overhead of the CSP, and it causes new security risks. In response to the above problems, Liu et al. 19 proposed a secure deduplication scheme that does not require any third-party server. This scheme uses password-authenticated key exchange (PAKE) to achieve cross-user key delivery and thus cross-user data deduplication. This solution eliminates the dependence on third-party servers and improves its practicability. However, for some popular data, users also need to encrypt them and perform the PAKE protocol with other users, resulting in additional computational overhead. Existing popularity-based deduplication schemes do not perform deduplication operations until the total number of data copies reaches the threshold T. 17,18 However, in real-world applications, the amount of private data that users upload to the cloud is massive, and as a result, nonpopular data also occupy a large amount of storage space in the cloud. To further save cloud storage space, deduplication schemes for nonpopular data were proposed, such as the elliptic curve-based encrypted data deduplication schemes in Zhang et al. 20 and Singh et al. 21 The scheme in Zhang et al. 20 adopts an elliptic curve encryption algorithm, which is more secure and less computationally intensive.
Different encryption methods are used for popular data and nonpopular data. Client-side deduplication is used for popular data, which significantly reduces storage and bandwidth consumption.
In current practical cloud storage applications, the CSP simply sets a fixed threshold for all uploaded data, which leads to many problems. If the threshold is set too high, for data with low privacy, all copies need to be stored before the threshold is reached. If the threshold is set too low, data with higher privacy will be prematurely deduplicated, which may increase the security risk. Therefore, different thresholds should be set according to the privacy level of the data, and the user's understanding of the privacy level should be considered at the same time. For example, a frequently used software installation package should have a low threshold, so that it can be quickly processed for deduplication, which minimizes the storage overhead without compromising the user's privacy. When internal confidential files of a company are uploaded, according to the user's understanding of the privacy level, the CSP could set a relatively large threshold for them, thereby effectively avoiding premature execution of deduplication and better protecting the user data. However, how to recognize the privacy level of each uploaded data item and assign it a reasonable threshold according to its privacy level is still a difficult and open problem.
In summary, our work makes the following contributions:
We propose a deduplication scheme suitable for IoT sensor networks based on threshold dynamic adjustment to ensure the security of upload data and related operations.
We design the threshold dynamic adjustment mechanism, using item response theory (IRT) to dynamically adjust the threshold, and use a query feedback mechanism to collect the privacy attitudes of most users to determine a reasonable threshold for each uploaded data item.
We propose the concept of the ideal threshold for the first time, which eliminates the disadvantages of the unified threshold in traditional schemes.
The remainder of this article is organized as follows. Section ''System model and design goals'' introduces the system model and design goals. Section ''Preliminary knowledge'' gives the preliminary knowledge of the scheme and its related formulas. Section ''Deduplication scheme'' elaborates on the design of our scheme in three parts: privacy score query, data uploading, and privacy score calculation. Section ''Security analysis'' details the security analysis. Section ''Performance evaluation and experiments'' gives the experimental comparison and analysis. Section ''Conclusion'' summarizes the work and draws conclusions.

System model
The system design is based on the IoT sensor network. The system model of this scheme only involves two types of entities: the upload user and the CSP. The upload user is an abstraction of an IoT device, though sometimes it is an actual user in the IoT sensor network. When the system is established, the upload user can interact with the CSP. During the interaction, the upload user can play two roles: data uploader or data observer, and the CSP can only provide data storage and data sharing services for the upload user, without knowing the exact content of the data. The system model is shown in Figure 1.
This model introduces the concept of privacy scores (PR). 22 The PR of the data M is an indicator of privacy risk: the larger the PR, the higher the privacy of the data. In the IoT sensor network, privacy risk should be measured for data generated by various sensors or devices. In the stage of data uploading, the user sends an upload request to the CSP and calculates the query label of the data using the elliptic curve encryption algorithm. After receiving the upload request, the CSP performs data query and ciphertext comparison, and detects whether M is uploaded for the first time without leaking the data content. If the CSP finds that the data are already stored in the cloud, it returns a suggested PR to the user. The user uploads the encrypted M and its PR rating to the CSP together. After each uploading operation, the CSP adjusts and updates the data's PR to serve as feedback information for subsequent upload users. After consecutive uploads and adjustments, each data item acquires a corresponding PR that gradually stabilizes and meets the anticipation of most upload users. The CSP calculates the popularity threshold T for M based on the PR and performs the deduplication operation according to the actual value of T. This not only reduces the consumption of storage space but also avoids the leakage of private data. In the data observation stage, only a user who has uploaded the data can make a query request to the CSP, and the CSP returns the encrypted data. Furthermore, although we assume that the CSP is honest but curious, it can perform offline analysis to infer additional information. Therefore, the CSP is not fully trusted, since users do not expect their private information to be known by any third party.

Design goals
In order to better protect data privacy, the scheme should have the following characteristics:
Confidentiality of uploaded data: uploaded data require a certain encryption operation.
Queryability of privacy scores: when users upload data, they can query and get a reasonable privacy score from the CSP as a reference.
Updateability of privacy scores and thresholds: the privacy scores and thresholds of data can be updated in real time according to the specific upload situation.

Preliminary knowledge
IRT and its characteristics
IRT 23 is a well-known psychological theory, which is often used to analyze questionnaire results and test data. This theory can infer the probability that a tested user will correctly answer a given question by measuring the ability of the tested user and the difficulty of the specific test item. Moreover, it has been proved that IRT can be applied in cloud computing scenarios. 24 The Rasch model 25 is one of the most common IRT models. It assumes that the probability function of a correct response is only related to f_j, u_i, and g_i. The item Q_i is represented by a pair of parameters v_i = (u_i, g_i). f_j represents the capability level of the tested user j, u_i is the extent to which an item can be distinguished (its discrimination), and g_i represents the difficulty of the test item. Because of its few computational parameters and simple structure, the model has the advantage of requiring fewer samples. Invite user j to answer an item; if the two-valued ''correct'' or ''error'' is used to indicate the answer to item Q_i, then the probability that item Q_i is correctly answered by user j is

Re_ij = P{x(i, j) = 1} = 1 / (1 + e^(-u_i(f_j - g_i)))  (1)

Therefore, IRT has two notable features:
Group stableness: the difficulty of an item is a natural attribute of the item, which is independent of the tested user's response. In other words, the parameters of an individual item are applicable not only to the users currently being tested but also to other types of users. 26
User independence: one tested user will not affect another tested user's answer to a question, and the answer only depends on the question itself.
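As a concrete illustration, the logistic response probability of equation (1) can be sketched in a few lines of Python (the function name is ours; the model itself is the standard logistic IRT form):

```python
import math

def response_probability(f_j: float, u_i: float, g_i: float) -> float:
    """Probability that user j with capability level f_j answers item i
    'correctly', given the item's discrimination u_i and difficulty g_i,
    as in equation (1)."""
    return 1.0 / (1.0 + math.exp(-u_i * (f_j - g_i)))
```

When the user's level equals the item's difficulty (f_j = g_i), the probability is exactly 0.5, and it rises toward 1 as f_j exceeds g_i.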

General sensitivity calculation method
Generally, the more sensitive the data are, the less likely the user is to disclose them. As shown in equation (2), |x_i| (i.e. the number of responses with x(i, j) = 1) indicates the number of tested users who wish to expose data item i.
Then, the number of tested users who refuse to disclose data item i is proportional to g_i (the sensitivity of data item i). The more sensitive data item i is, the higher the value of g_i

g_i = (n - |x_i|) / n  (2)

where n is the number of tested users.

General visibility calculation method
In the case where the answer to the question is a binary value, we usually estimate the probability to calculate the visibility of the data. 27 Assume that the test items and the tested users are independent of each other, that is, in a test survey, the chance of a tested user answering each question is the same. We can then calculate the value of Re_ij by multiplying the occurrence probability of 1 in row x_i of the binary matrix and the occurrence probability of 1 in column x_j. In other words, if |x_j| represents the number of data items with x(i, j) = 1 set by user j, then the probability Re_ij will increase with the increment of the user's tendency to share information, and it will also increase with the decrement of the information item's sensitivity. The visibility formula is as shown in equation (3)

Re_ij = (|x_i| / n) × (|x_j| / N)  (3)

where N is the number of data items and n is the number of users under test. The method of calculating the visibility above is to sample all possible response matrices according to the probability distribution statistically. In fact, the visibility is calculated as the expected value

v(i, j) = Re_ij × 1 + (1 - Re_ij) × 0 = Re_ij  (4)

Data privacy score
The privacy score (PR) of the data is a numerical representation of the overall privacy of the data, which is calculated from the sensitivity and the visibility of the data. The privacy score generated for data i with user j is represented as PR(i, j), and its calculation formula is

PR(i, j) = g_i ⊗ v(i, j)

where the operator ⊗ represents any monotonically increasing combination function of sensitivity and visibility. The details of the calculation process are described in equations (2) and (3).
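A minimal sketch of the naive computation in equations (2)-(4), assuming the combination operator ⊗ is plain multiplication (the operator choice and the function name are our assumptions):

```python
def privacy_scores(x):
    """Naive privacy-score computation from an items-by-users binary
    matrix x (x[i][j] = 1 if user j is willing to disclose item i),
    following equations (2)-(4) with multiplication as the combination
    operator."""
    N = len(x)          # number of data items
    n = len(x[0])       # number of tested users
    row_ones = [sum(row) for row in x]                              # |x_i|
    col_ones = [sum(x[i][j] for i in range(N)) for j in range(n)]   # |x_j|
    sensitivity = [(n - r) / n for r in row_ones]                   # eq. (2)
    pr = [[0.0] * n for _ in range(N)]
    for i in range(N):
        for j in range(n):
            visibility = (row_ones[i] / n) * (col_ones[j] / N)      # eq. (3)-(4)
            pr[i][j] = sensitivity[i] * visibility                  # PR(i, j)
    return sensitivity, pr
```

For a 2-item, 2-user matrix the arithmetic is easy to check by hand: an item disclosed by every user has sensitivity 0 and thus privacy score 0 for all users.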

Bilinear mapping
Let (G_0, +) be an additive cyclic group of order p and (G_1, ·) a multiplicative cyclic group of order p, where p is a large prime number and a is the identity element of group G_1. Z_p is the residue class ring of integers modulo p, and Z*_p is the set of invertible elements of Z_p. A bilinear mapping e: G_0 × G_0 → G_1 satisfies three properties: 28 (1) bilinearity: for all P, Q ∈ G_0 and x, y ∈ Z*_p, e(xP, yQ) = e(P, Q)^(xy); (2) non-degeneracy: there exist P, Q ∈ G_0 such that e(P, Q) ≠ a; (3) computability: e(P, Q) can be computed efficiently for all P, Q ∈ G_0.

Deduplication scheme
Overview
When a user uploads encrypted data to the CSP, the CSP can check whether the data have already been stored using the query label generated by the elliptic curve technique, and return a suggested PR to the user according to the data information in the database if it exists. Eventually, the user uploads the data and its privacy score to the CSP, and the CSP reaggregates the privacy score and updates the threshold for it. Based on the threshold and the current number of data holders, the CSP decides whether to store it or not. The scheme consists of three parts: (1) privacy score query, (2) data uploading, and (3) privacy score calculation and threshold update, as shown in Figure 2.

Privacy score query
When a user uploads data to the CSP, the user can query the current PR of the upload data. The context-based privacy score query method is an optional solution. 24 Context refers to a summary and general idea generated from a long list of words or content; even further, it is a representative keyword combination sorted out from the entire upload data, whose general form is (field1 = value1, field2 = value2, ...). For example, if an upload data contains ''Bob shared the final exam score in the class group with a desktop computer,'' the context in the above example would be (sharer = Bob, subject = final exam score, observer = classmate).
In Harkous et al., 24 a scheme of data privacy query using context is introduced in detail. Before querying the degree of privacy, the user forms a query set of multiple virtual contexts and sends them to the CSP, to hide the actual request context. The context-based privacy score query only needs to replace the data privacy with the corresponding privacy score based on the above scheme. When the user queries the CSP for the PR of the data by means of the context, he should avoid sending the real context directly to the CSP, to prevent the CSP from associating the data sharing operation with the context, which reduces the risk of privacy data leakage.
Context-based privacy score query is easy to implement, but we usually assume that the CSP is honest but curious, and it can obtain specific data information uploaded by users through offline analysis and other operations. Therefore, we adopt a privacy score query mechanism over encrypted data. Based on the existing research results of our team, this mechanism uses the elliptic curve-based file label query scheme 20 to facilitate the privacy score query. In Zhang et al., 20 a popularity query protocol without online trusted third parties is proposed. By constructing a bilinear-map query label s_j = e(Y_i, H(M_j))^(X_1) (Y_i is the encryption public key, X_1 and X_2 are the auxiliary keys, and X_3 = X_1 + X_2), the data popularity check can be quickly completed using label comparison without disclosing data privacy. By associating the privacy score with the query label, we can use the query label comparison method to achieve a quick query of the privacy score and avoid the lack of privacy protection that may be caused by the context-based privacy score query method.
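The label-based equality check can be illustrated with a simple keyed-hash stand-in. Note that this deliberately replaces the actual bilinear-map label s_j = e(Y_i, H(M_j))^(X_1) with an HMAC purely to show the workflow (equal data yield equal labels without revealing the data); all names here are ours, not the paper's construction:

```python
import hashlib
import hmac

def query_label(shared_key: bytes, data: bytes) -> bytes:
    """Stand-in for the pairing-based query label: a keyed hash of H(M)
    so that equal plaintexts produce equal labels while the label alone
    reveals nothing about the data to parties without the key."""
    digest = hashlib.sha256(data).digest()   # plays the role of H(M)
    return hmac.new(shared_key, digest, hashlib.sha256).digest()

def is_duplicate(csp_index: dict, label: bytes) -> bool:
    """CSP-side check: label comparison alone decides whether the data
    are already stored, without seeing the plaintext."""
    return label in csp_index
```

The CSP keeps an index from labels to stored ciphertexts; an incoming label that matches an index entry marks the upload as a duplicate.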

Popularity threshold
In order to improve the efficiency of deduplication, the CSP assigns a popularity threshold T to the upload data. When the total number of upload users of certain data M is greater than T, we consider M to be popular data, use the more efficient convergent encryption, and perform deduplication operations on it; otherwise, we consider M to be nonpopular, with a high degree of privacy. In that case, M needs to be protected by semantically secure symmetric encryption.

Data uploading procedure
When user U_j uploads data M_i, U_j first uploads the query label s_i to the CSP. The CSP divides the file upload operation into three cases according to the relationship between count(U_{M_i}), the total number of users who have currently uploaded the file, and the dynamic threshold T_i: count(U_{M_i}) < T_i, count(U_{M_i}) = T_i, and count(U_{M_i}) > T_i, as shown in Figure 3.
When count(U_{M_i}) < T_i, the CSP checks whether the same data already exist. If it is the first uploaded copy, U_j encrypts the data and uploads them to the CSP. The CSP stores the privacy score PR_ij, the encrypted encryption key K'_{M_i} = E(X_3, K_{M_i} - H(M_i)), and the ciphertext c_{K_i} = E(K_{M_i}, M_i), as shown in part (a) of the figure. If it is not the first uploaded copy, the CSP returns the symmetric encryption key K'_{M_i} and the current PR_i to U_j. U_j decrypts and obtains K_{M_i} and encrypts M_i with it. Then, U_j uploads the ciphertext c_{K_i} = E(K_{M_i}, M_i) and the privacy score PR_ij to the CSP. The CSP updates PR_i, deletes the newly uploaded copy, and creates an access link for U_j, as shown in part (b) of the figure. When count(U_{M_i}) = T_i, the CSP informs U_j to upload the convergent encryption ciphertext c_{X_i} = E(X_i, M_i) and stores it, where X_i = H(M_i) + X_2. When count(U_{M_i}) > T_i, the CSP tells U_j to perform client-side deduplication and creates an access link for U_j.
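The three-way dispatch above can be sketched as follows (the action strings and the function name are illustrative, not from the paper; the encryption steps themselves are elided):

```python
def handle_upload(count: int, threshold: int) -> str:
    """Dispatch an upload into the three cases of Figure 3, given the
    current holder count count(U_Mi) and the dynamic threshold T_i."""
    if count < threshold:
        # Nonpopular data: semantically secure symmetric encryption;
        # the CSP keeps the ciphertext (or links to the stored copy).
        return "symmetric-encrypt-and-store"
    elif count == threshold:
        # Data just became popular: switch to convergent encryption.
        return "upload-convergent-ciphertext"
    else:
        # Already popular: client-side deduplication, link only.
        return "client-side-dedup-link"
```

Only the boundary case count = T_i triggers the re-encryption under the convergent key; every later upload is served by a link.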

Privacy score calculation
In this section, we design the privacy score calculation method based on IRT, in order to ensure that the data's PR meets the requirements of most upload users.
Because different upload data M_i are independent of each other, the parameters v_i = (u_i, g_i) for the PR can be calculated independently, so the calculation of the PR can be performed in parallel. The IRT-based PR calculation still needs to use formula (1) to estimate the probability Re_ij = prob{x(i, j) = 1}. In the cloud storage environment, we treat upload user U_j as the tested user and upload data M_i as the information item. Thus, the user's propensity to privacy corresponds to the user's own ability: for U_j, we use the privacy preference parameter f_j to evaluate the degree to which U_j cares about privacy. The higher the f_j value, the more extroverted or open U_j is. 29 If we assume that all the problems are easy to understand and have a similar degree of discrimination, then u_i is no longer a variable to consider, and it can be replaced by a constant or simply ignored. Finally, we use the difficulty parameter g_i of the problem to represent the sensitivity of the data, g_i ≥ 0.
For each parameter v_i = (u_i, g_i) of the upload data, we can estimate it by the maximum likelihood function, as shown in formula (5)

L(v_i) = ∏_{j=1}^{N} Re_ij^{x(i, j)} (1 - Re_ij)^{1 - x(i, j)}  (5)

In other words, we need to search for the data parameters v_i = (u_i, g_i) that maximize the likelihood function. N is the total number of users uploading the same data, and x(i, j) is the evaluation of the privacy tendency of data M_i by U_j, x(i, j) ∈ {0, 1}. Specifically, when user U_j uploads M_i, the evaluation parameter x(i, j) of the privacy tendency is uploaded together to represent the evaluation of the privacy level of the uploaded data. When the sensitivity g_i is sought, the privacy tendency parameters f_j are treated as known quantities, and the parameters v_i = (u_i, g_i) are obtained by maximizing the likelihood.
Similarly, on the basis that the sensitivity g_i is a known quantity, the NR_Attitude_Estimation algorithm is used to search the likelihood function (or its corresponding log-likelihood function) for the privacy tendency parameter f_j that maximizes it.
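A minimal sketch of the maximum-likelihood estimation of the difficulty (sensitivity) parameter g_i in equation (5), using a plain grid search with the discrimination u fixed. The grid range, step, and function name are our assumptions; the paper's NR_Attitude_Estimation algorithm uses a Newton-Raphson-style search instead:

```python
import math

def estimate_difficulty(responses, abilities, u=1.0):
    """Grid-search the difficulty g_i that maximizes the log-likelihood
    of equation (5). responses[j] in {0, 1} is x(i, j); abilities[j] is
    the (known) privacy tendency f_j of user j."""
    def loglik(g):
        ll = 0.0
        for x, f in zip(responses, abilities):
            p = 1.0 / (1.0 + math.exp(-u * (f - g)))   # equation (1)
            ll += x * math.log(p) + (1 - x) * math.log(1 - p)
        return ll
    grid = [k / 100.0 for k in range(-300, 301)]       # g in [-3, 3]
    return max(grid, key=loglik)
```

With responses that flip from 0 to 1 around ability 0, the estimated difficulty lands between the last refusal and the first disclosure, as expected.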
Finally, we integrate the results through formula (6) and calculate the PR that meets the public's wishes 22

PR(i) = Σ_{j=1}^{n} PR(i, j)  (6)

At the same time, the current PR of the data is represented by pr_i = PR(i)/n, where pr_i ∈ [0, 1]. Finally, the CSP updates the data threshold with formula (7) according to the change of the privacy score after each upload, in which the threshold is proportional to pr_i with an adjustable coefficient a. As the number of upload users increases, the threshold changes and gradually stabilizes. Here, we propose the concept of the ideal threshold T*. T* is a per-data threshold value that meets the users' wishes. As the number of users increases, the dynamic threshold T_i will gradually approach T*, that is, lim_{i→∞} |T_i - T*| ≤ s, where s is a negligible value. In summary, the IRT-based PR calculation has the following advantages:
Because different upload data are independent of each other, the CSP can compute the PR in parallel, so the scheme is efficient and practically feasible.
The parameters used in the IRT-based privacy score calculation are estimated by the likelihood function, which satisfies the property requirement of group invariance. This makes the PRs corresponding to different uploaded data directly comparable.
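The incremental score aggregation and threshold update can be sketched as below. Since formula (7) is not reproduced verbatim in this text, the update rule here only encodes the stated proportionality between T_i and pr_i with the coefficient a = 7 used in the experiments; the ceiling, the scale factor, and both function names are our assumptions:

```python
import math

def update_pr(pr_total: float, n: int, new_score: float):
    """Aggregate privacy score after one more upload; new_score is the
    user's rating normalized to [0, 1]. Returns the running total, the
    new user count, and pr_i = PR(i)/n."""
    pr_total += new_score
    n += 1
    return pr_total, n, pr_total / n

def update_threshold(pr_i: float, a: int = 7, scale: int = 10, t_min: int = 1) -> int:
    """Hypothetical stand-in for formula (7): a threshold proportional
    to the current privacy score pr_i (coefficient a; scale assumed)."""
    return max(t_min, math.ceil(a * scale * pr_i))
```

Under this rule a higher aggregate privacy score always yields a higher (or equal) threshold, which is the monotonicity the scheme relies on.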

Security analysis
The scheme is designed to better protect the security of private data through dynamic threshold adjustment. Our proposed deduplication scheme makes it impossible for the CSP to be spoofed, and the data can be located only through their query labels.
Here, we mainly discuss the authenticity and differentiability of the query labels. The security theorem is as follows.
Lemma 1. For a secure hash function H: {0, 1}* → G_1 and any D_1, D_2 ∈ {0, 1}*, the probability P[H(D_1) = H(D_2) | D_1 ≠ D_2] ≤ ε is negligible.

Theorem 1. Authenticity of data query labels. Let the initial upload user U_0 upload data M; then the query label of M is s_M = e(Y, H(M))^X. When user U_j uploads data M', the query label of M' is s_{M'} = e(Y, H(M'))^X. If s_M = s_{M'}, then M = M' holds with overwhelming probability.

Proof. If M ≠ M', we discuss the attacking of data query labels from the following two aspects: 1. Assuming that adversary A is a malicious user, then A can get parameter X.

2. Assuming that adversary A is the CSP, the parameters H(M) and X are unpredictable for A.
In both cases, adversary A cannot construct a query label satisfying the equation s_M = s_{M'}. According to Lemma 1, P[M ≠ M' | e(Y, H(M))^X = e(Y, H(M'))^X] ≤ ε. Therefore, the query label of the data cannot be forged, and from s_M = s_{M'} it can be deduced that M = M'.
Theorem 2. Differentiability of query labels. Let the initial upload user U_0 of data M upload the query label s_M = e(Y, H(M))^X. When user U_j uploads M', the uploaded query label is s_{M'} = e(Y, H(M'))^X. If M ≠ M', the probability of s_M = s_{M'} is negligible, that is, P[s_M = s_{M'} | M ≠ M'] ≤ ε.

Proof. Assume that s_M = s_{M'} with M ≠ M'. According to the properties of bilinear mappings, the equation e(Y, H(M))^X = e(Y, H(M'))^X can be obtained. If this formula is valid, H(M) = H(M') must be valid. According to Lemma 1, M = M' can then be obtained, which contradicts the hypothesis. Therefore, the hypothesis is not valid; that is, P[s_M = s_{M'} | M ≠ M'] ≤ ε. It can be seen that s_M = s_{M'} holds if and only if M = M'. That is to say, the query labels of the data are distinguishable.
Lemma 2. Computational Diffie-Hellman (CDH) problem. Suppose (G_0, ·) is a multiplicative cyclic group of order p with generator g. For given g, g^a, h ∈ G_0, it is difficult to calculate Q = h^a ∈ G_0, where a ∈ Z*_p.
Theorem 3. Security of data labels. CSP cannot attack the query label offline and get any plaintext information.
Proof. Let the CSP execute an offline brute-force attack on the query label s_M = e(Y, H(M))^X of the data. The CSP exhausts a large amount of data {M_i}, trying to find data M_1 that satisfies s_M = s_{M_1}. The CSP can calculate e(Y, H(M_1)), but because it is not an authorized user, it cannot obtain the security parameter X from the broadcasting center. Lemma 2 shows that it is still difficult to calculate e(Y, H(M_1))^X even if e(Y, H(M))^X and e(Y, H(M_1)) are known. Therefore, the CSP cannot obtain plaintext information from query labels by brute-force attack.
In addition, we also considered the situation of malicious scoring. We made simulations, and detailed results can be found in the next section.

Performance evaluation and experiments
The experiment uses the PBC, 30 GMP, 31 PBC_bce, 32 and OPENSSL 33 function libraries and is implemented in C++. It is deployed on a Tencent cloud storage server equipped with 4 GB memory, a 4-core CPU, 1 Mbps bandwidth, and 1 TB storage. In order to make it easier for users to understand and operate, our scheme adopts a more user-friendly design in the implementation of the PR. When the upload user scores some data for the PR, it is only necessary to choose a value between 1 and 100 as the score. The system automatically converts the data and updates the PR and T. In accordance with the percentile scoring habit, users can understand the privacy of upload data more intuitively. Considering the difficulty of sample selection, we generate random numbers to simulate the PRs of different users on a certain upload data.

Data set
In view of the problem that different data have different ideal thresholds, we carry out a comparative experiment on the overall PR and T with various data. We use three sets of random numbers to simulate the PRs of different data. Each data set consists of 100 random numbers. The first data set consists of random numbers from 80 to 100, which simulates the users' PRs for data with a high privacy level. The second data set consists of random numbers from 0 to 20, which simulates the users' privacy scores for certain data with a low degree of privacy. The third data set is composed of random numbers from 1 to 100, which simulates a random distribution of upload users' privacy scores on given data.
In the performance comparison experiment, we choose 1000 files of 10 MB as the upload data, in which the ratio of data with lower privacy to data with higher privacy is about 3/2. Other schemes adopt the unified popularity threshold and set it to T = 7.
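The three simulated data sets can be reproduced with a short script; the running mean of the random ratings plays the role of the overall PR curve in Figures 4, 7, and 10 (the seed and function name are ours):

```python
import random

def simulate_pr(low, high, n_users=100, seed=42):
    """Simulate n_users uploads whose privacy ratings are uniform random
    integers in [low, high] (percentile scale), and return the running
    overall PR after each upload, as in the paper's three data sets."""
    rng = random.Random(seed)
    total, history = 0, []
    for k in range(1, n_users + 1):
        total += rng.randint(low, high)
        history.append(total / k)   # running mean = current overall PR
    return history

high_privacy = simulate_pr(80, 100)   # stabilizes near 90 per the paper
low_privacy = simulate_pr(0, 20)      # stabilizes near 11 per the paper
mixed = simulate_pr(1, 100)           # stabilizes near the middle
```

The early entries of each history fluctuate strongly while the later entries flatten out, matching the qualitative behavior reported for the figures.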

Experimental analysis
The three groups of experimental data sets above are used to simulate uploading with dynamic threshold adjustment, and the changes of the overall privacy score and the T value of the data are compared and analyzed. Figures 4-6 are derived from the data set with the interval (80-100). Figure 4 shows the change of the PR with the number of upload users. The curve in Figure 4 is connected by 100 data points. The ordinate of each point is the result of the PR adjustment according to the feedback of the user, and the horizontal coordinate is the number of users currently uploading the data. Figure 5 shows the relationship between the dynamic adjustment value of T and the PR. The curve in the figure shows how the T of the data changes with the PR. The meaning of the ordinate of each point in the figure is the same as that in Figure 4, and the abscissa is the dynamic threshold of the data calculated according to the privacy score. The dynamic threshold T of the data is calculated as in equation (7). T is proportional to the PR of the data, where the coefficient a can be adjusted as needed. In this experiment, a = 7. Figure 6 is the result of overlapping Figures 4 and 5 in the same coordinate system. The minimum abscissa value among all intersections is the T for actual deduplication. However, there is no intersection in Figure 6, which means there is no deduplication operation. Similarly, Figures 7-9 are derived from the data set with the interval (0-20). In this simulated scenario, the actual deduplication threshold is T = 10. Figures 10-12 are derived from the data set with the interval (1-100), and the value of T at the actual deduplication is 31.
In Figure 4, all users are uploading data with a high degree of privacy, but there are still small differences in the specific PR values. We assume that each user eventually chooses a number from (80-100) as its PR. When the number of upload users is small, the PR chosen by an individual user has a greater impact on the overall PR. As the number of upload users increases, the impact of a single user's choice on the overall PR decreases, and the PR finally stabilizes at about 90. Similarly, Figure 7 represents a situation where the overall PR is low, stabilizing at about 11. The curve in Figure 10 represents data on which users' privacy opinions are not unified. It can be seen that, under this premise, when the number of upload users is small the privacy score adjustment fluctuates greatly; as the number of upload users increases, the fluctuation gets smaller, and the overall PR of the data eventually stabilizes.
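The stabilization behavior described above follows if the overall PR is maintained as a running mean of the uploaders' scores, so that each new score is weighted by 1/n. The incremental-mean update below is an assumed adjustment rule for illustration; the paper's exact feedback formula may differ.

```python
import random


def running_pr(scores):
    """Fold each new upload user's score into a running mean; a single
    user's influence on the overall PR shrinks as the user count grows."""
    avg, curve = 0.0, []
    for n, s in enumerate(scores, start=1):
        avg += (s - avg) / n  # incremental mean update
        curve.append(avg)
    return curve


random.seed(1)
scores = [random.randint(80, 100) for _ in range(100)]  # high-privacy set
curve = running_pr(scores)
```

Early points on the curve can jump by up to (range width)/n, while late points barely move, matching the fluctuation pattern seen in Figures 4, 7, and 10.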
In Figures 4, 7, and 10, when the number of samples is large enough, the PR (and hence T) tends to stabilize as the number of users increases. This value is related only to the nature of the data and the users' concern for privacy. The results in Figure 10 further illustrate that the final PR of the data is determined by the users' attitude: with a large user population, the attitude of an individual user (reflected by the score) has little impact on it. As can be seen from Figure 6, when the privacy level of the data is high, there is no intersection point between the two curves, that is, the data are not subjected to the deduplication operation, which protects the security of private data more effectively. According to Figure 9, when the data privacy level is very low, the actual deduplication threshold is very small, which effectively conserves cloud storage space. From Figures 10-12, it can be seen that when upload users have different attitudes toward the privacy of certain data, the system determines an appropriate threshold for the data according to the attitude of the majority. Therefore, on the whole, the scheme is feasible and applicable.

Anti-abuse experiment
The anti-abuse experiment mainly tests the scheme's resistance to malicious scoring. We assume that a malicious user deliberately sets the privacy score to 100 when users' privacy scores are generally low (0-20), or sets it to 0 when users' privacy scores are generally high (80-100). We tested the impact of malicious scoring in three settings containing 2%, 3%, and 4% malicious users, respectively. The experimental results are shown in Figures 13 and 14. Figure 13 reflects the situation where malicious users intentionally choose a larger privacy score when uploading data with a lower privacy level, and Figure 14 the situation where malicious users intentionally choose a smaller privacy score when uploading data with a higher privacy level. For easier observation, we assume that malicious users submit their malicious ratings at the beginning. As can be seen from Figures 13 and 14, the higher the proportion of malicious users, the greater the impact on the privacy score; and the larger the population of upload users, the smaller the effect a malicious score causes. When the number of upload users exceeds 70, the four curves in the figure show little difference, which indicates that the scheme has anti-abuse capability.
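The diminishing effect of malicious ratings can be illustrated with a simple simulation: a few users submit a fixed extreme score of 100 among otherwise low honest scores, and the overall PR is taken as a plain mean. The mean-based aggregation and the score ranges are assumptions mirroring the experimental setup, not the paper's exact mechanism.

```python
import random


def overall_pr(honest_scores, n_malicious, malicious_score=100):
    """Overall PR when n_malicious users each submit a fixed extreme score
    alongside the honest scores; aggregation by plain mean is assumed."""
    scores = [malicious_score] * n_malicious + honest_scores
    return sum(scores) / len(scores)


random.seed(7)
honest = [random.randint(0, 20) for _ in range(100)]  # generally low scores

# Overall PR with 2%, 3%, and 4% malicious users out of 100 uploaders.
impact = {p: overall_pr(honest[: 100 - p], p) for p in (2, 3, 4)}
```

With 100 uploaders, even 4 malicious scores of 100 shift the mean of a (0-20) population by only a few points, consistent with the curves converging once the user count grows.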

Performance comparison
By uploading 1000 files of 10 MB, the total time consumption of our scheme is measured and compared with that of other schemes, namely the PerfectDedup scheme, the common popularity-threshold-based deduplication scheme, and the Xu-CDE scheme. The experiment is repeated 10 times, and the averaged result, shown in Figure 15, is taken as the final result. In the data encryption phase, the time consumption of the four schemes is similar. In the query stage, our scheme has advantages over the other schemes that recognize data popularity. Overall, compared with the other schemes, our design improves the security of the deduplication operation without incurring additional time overhead.

Conclusion
In this article, we address the issue of the deduplication threshold in the cloud storage scenario and propose a secure deduplication scheme for IoT sensor networks based on threshold dynamic adjustment. The concept of the ideal threshold is proposed for the first time, and the IRT is applied. By collecting upload users' feedback on the data privacy level, the privacy score can be dynamically adjusted, and the deduplication threshold calculated and adjusted accordingly. This scheme speeds up the deduplication stage for data with lower privacy, while data with a higher degree of privacy are better protected. Experiments show that our scheme not only improves the security of the deduplication operation but also avoids additional time overhead. Compared with other schemes, our scheme is more practical for real-world applications.
How to improve the efficiency of data deduplication for IoT applications while ensuring data security will be studied in future work.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.