A reversible database watermarking method non-redundancy shifting-based histogram gaps

In relational databases, embedding watermarks in integer data using traditional histogram shifting method has the problem of large data distortion. To solve this problem, a reversible database watermarking method without redundant shifting distortion is proposed, taking advantage of a large number of gaps in the integer histogram. This method embeds the watermark bit by bit on the basis of grouping. First, an integer data histogram is constructed with the absolute value of the prediction error of the data as a variable. Second, the positional relationship between each column and the gap in the histogram is analyzed to find out all the columns adjacent to the gap. Third, the highest column is selected as the embedded point. Finally, a watermark bit is embedded on the group by the histogram non-redundant shifting method. Experimental results show that compared with existing reversible database watermarking methods, such as genetic algo-rithm and histogram shift watermarking and histogram gap–based watermarking, the proposed method has no data distortion caused by the shifting redundant histogram columns after embedding watermarks on forest cover type data set and effectively reduces the data distortion rate after embedding watermarks.


Introduction
In the era of big data, how to protect data security in the database is an important issue. Therefore, researchers introduced digital watermarking technology [1][2][3][4] into the database as an effective means and then the database watermarking technology 5,6 came into being. The technology protects the rights of the database owner by modifying the data in the database and embedding the watermark containing the copyright information. In 2002, IBM Almaden research center introduced digital watermarking technology in the relational database for the first time. 7 Thereafter, a variety of database watermarking technologies successively appeared, such as watermark technology based on special marker tuples, 8 fingerprint technology based on block idea, 9 and optimization technology based on verification. 10 In order to further protect the authenticity and integrity of data, reversible database watermarking technology emerges as the times require. 11,12 This technique can not only hide the watermark information in the original carrier data but also recover the original carrier data without loss after extracting the watermark. In 2006, a reversible database watermarking method histogram shifting watermarking (HSW) based on histogram 1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China shifting was first proposed by Zhang et al. 13 Since then, researchers have successively proposed different types of reversible watermarking technologies, such as the difference expansion-based watermarking (DEW), 14 the differential expansion watermark technology combined with genetic algorithm (GADEW), 15 the differential expansion watermark technology combined with firefly algorithm (FFADEW), 16 and the predictive error extended watermarking (PEEW). 17 Although these techniques have solved the problems of copyright protection and data recovery in different degrees, they have some shortcomings in data distortion rate (DDR) and watermark robustness.
In 2018, DH Hu et al. 18 proposed a reversible database watermarking method called genetic algorithm and histogram shift watermarking (GAHSW) based on numerical relational data distortion control. This method uses a genetic algorithm to group tuples and then uses the HSW method to embed watermarks. This method improves the robustness of reversible database watermarking while reducing data distortion. Li et al. 19 proposed a low-distortion reversible database watermarking method called histogram gap-based watermarking (HGW). Compared with GAHSW, this method reduces data distortion without weakening the robustness of watermark.
However, through the analysis, we found that the HGW is unnecessary to modify part of the data when embedding watermarks, which is caused by the shifting of redundant columns. Therefore, we propose a reversible database watermarking method without redundant shifting distortion, which is abbreviated as NSHGW (nonshifting histogram gaps based on watermarking). The traditional histogram shifting method is improved using the characteristics of a large number of gaps in the database histogram, the positional relationship between each column and gap in the histogram is analyzed, and all adjacent columns are found; then choose the highest one of these columns as the embedding point to embed watermark information, so as to reduce the amount of modification to the carrier database data, reduce the redundant shifting of histogram columns, so as to reduce data distortion.
The structure of the article is as follows: Section ''Related work'' describes the related work in the early stage of the paper; section ''The proposed method'' elaborates the basic ideas and main contents of the proposed method; experimental verification and result analysis were carried out in section ''Results and analysis''; and section ''Conclusion'' gives conclusions.

Related work
In this section, histogram shifting method and HGW reversible database watermarking method will be introduced in detail.

Histogram shifting method
The histogram shifting method (HS, histogram shifting) is mainly used in the early stage to process the reversible watermarking of digital images, which hides the watermarking information with the vertex value of the histogram. In Zhang et al.'s study, 13 the histogram shifting method was introduced into the database. This method obtains non-zero prediction errors from two adjacent local raw values in the database, uses these error values as the horizontal axis, and uses the frequency of each value of prediction error as the vertical axis to construct a histogram. The prediction errors are calculated by equation (1.1), where p e is the prediction error, and x i , x i + 1 are two adjacent attribute values A non-zero frequency peak bit is found in the histogram, and the corresponding prediction error is marked as p. Then shift all the bits except p by one cell to create a vacancy near p. Finally, each prediction error is scanned one by one, and when p will be encountered, a watermark bit w(w = 0 or w = 1) is embedded. After embedding the watermark, the prediction error changes and the new prediction error value p 0 e is calculated by equation ( i and x 0 i + 1 are the watermark values of attributes x i and x i + 1 after embedding, and x m is the mean value of x i and x i + 1 , which are calculated by equation (1.5) :3Þ For example, x i = 102 and x i + 1 =108 are adjacent properties, prediction error p e =108À102=6, x m = (102+108)=2 b c = 105. If the peak bit p=4, embedded watermark bit w=1, then the new prediction error value p 0 e = 4 + 1 = 5, the values of x m and p 0 e are respectively substituted into equations (1.3) and (1.4) to obtain:

HGW reversible database watermarking method
Li et al. 19 propose a low-distortion reversible database watermarking method which is HGW. This method improves the traditional histogram shifting method. It uses histogram gaps to reduce the number of columns shifted when embedding a watermark to reduce data distortion. HGW first divides the database tuples into N g groups according to the watermark length N g and then inserts a watermark bit into each group by histogram translation method. The basic steps of HGW embedding watermark are as follows: Step 1. Determine the relevant parameters: for the first i grouping, found in the histogram peak location p i , were about to search, to find the first frequency to zero position, respectively for p iL , p iR , and p iL to p i all column height of Hutchison for hs iL and between p i and p iR all column height of Hutchison for hs iR , then respectively between p i and p iL , p iR , the number of columns for d iL , d iR . The position relationship between p i , p iL , and p iR is shown in Figure  1.
Step 2. Determining the shifting direction: this method judges the sizes of hs iL and hs iR and distinguishes between different cases for shifting and watermark embedding according to the judgment results: if hs iL ø hs iR According to the reference, when p e ø 0, p h = p e ; when p e \0, p h = À p e ; during watermark embedding, we should consider not only the conditional comparison of the sum of heights but also the distinction between p e ø 0 and p e \0. Finally, histogram translation and watermark embedding will be processed according to the differential formula.

The proposed method
The proposed NSHGW mainly consists of three sections: preprocessing, watermark embedding, and watermark extraction. This method is applicable to the integer numerical data of relational database. The algorithm principle is shown in Figure 2. Among them, D O is the original database, D W is the database with embedded watermarks, D AW is the watermarked database after suffered attacks, and D R is the recovered database.

Preprocessing
In the preprocessing stage, the step should be completed before watermark embedding and watermark extraction. The specific pretreatment process is as follows: Step 1.1 Determining candidate attribute columns: selecting multiple integer columns from the database that can identify the features of things as Candidate attribute columns for embedding watermarks. The data in the candidate attribute column usually indicate the characteristics of things (such as hair color). When watermarks are embedded in these data, the data distortion caused by the data has less impact on the data quality.
Step 1.2. Sorting candidate attribute columns: This method sorts the column names of the candidate attribute columns in ascending order, in order to enhance the robustness of the algorithm when it is attacked by the attribute columns.
Step 1.3. Determining the range of semantic distortion and calculating tolerance: the value range of the candidate attribute column is usually the semantic distortion range of the attribute column. In the watermark embedding process, in order to ensure that the modification of the data does not exceed the range of semantic distortion defined by each attribute column, the method uses the minimum and maximum values in the data of each column as the lower and upper bounds of the value range, respectively. This method defines the range of semantic distortion of the jth attribute column in ½min½j, max½j. In the process of embedding the watermark, the value of the modified data cannot exceed the range of semantic distortion of the column. The tolerance of the jth column can be calculated using equation (2.1) In equation (2.1), max½j and min½j individually represent the maximum and minimum of the jth column.
Step 1.4 Grouping tuples: in the method, Ks, a secret grouping key, is set by a random method, and the tuples in the relational database are divided into Several non-overlapping groups fG i gi = i, 2, . . . , N g according to equation (2.2) n u = H(KsjH(Ksjt u :PK)) mod N g ð2:2Þ In equation (2.2), n u is a serial number of some group, ''|'' is represents the concatenation operation, H() represents a hash function, Ks represents the secret grouping key, and t u : PK represents the primary key of the tuple as parameters. N g is determined by the number of bits of the watermark to be embedded. For example, if the watermark to be embedded contains 48 watermark bits, then N g = 48.

Watermark embedding
The method requires embedding a watermark 1 bit in each grouping. Therefore, the process of embedding a watermark is an iterative process of performing steps 2.1-2.4 for each grouping. For the convenience of expression, group i is used here to represent any grouping. The process of embedding watermark is shown in Figure 3.  Step 2. In equation (2.3), y represents the value located in jth column and in certain tuple,ŷ is the tolerance of the jth column. p e represents the prediction error value corresponding to y. In equation (2.4), p h is the absolute value of the prediction error of the original database.
Step 2.2. Selecting eligible columns in group i: it searches from the very left side of the histogram to the right in turn to find each column HC j which meets the following conditions.
If HC j .0, then HC jÀ1 = 0 or HC j + 1 = 0 HC j is the height of the j column in the histogram.
Step 2.3. Determining the position of the watermark embedding: the method selects the highest column from all eligible HC j , records it as HC i , and uses HC i as the position of watermark embedding. Record the p h value of the position where the column HC i is located as p i , and use GP i to mark the relationship between the blank position and HC i . If the blank position is on the left of HC i , make GP i = À 1; if the blank position is on the right side of HC i , make GP i = 1.
It performs the watermark embedding operation based on the value of GP i which indicates the relative relationship between HC i and the adjacent blank position. It can embed watermarks according to the following two formulas, respectively. if It can be concluded from equation (2.4) that when p e ø 0, p h = p e ; when p e \0, p h = À p e ; Therefore, equations (2.5) and (2.6) can be further distinguished into two cases, respectively. By substituting p e ø 0 and p e \0 into formulas (3.5) and (3.6) respectively, the following four formulas can be obtained.
if GP i = À 1 and p e ø 0 In summary, the histogram shifting and watermark embedding performed by this method can be finally processed according to the formulas (3.7)-formula (3.10). In addition, in order to extract the watermark and recover data from the database, an array needs to be used here to store the p i and GP i obtained in each group. After the watermark embedding process is completed, the pa will be combined with the Ks generated in step 1.4 as parameters for the process of watermark extraction and data recovery.

Watermark extracting
Detailed algorithm for watermark extraction and data recovery is reported below. The following steps can be used to extract the watermarks and restore the database after the database is attacked by insertion, deletion, and alteration. Before the extraction watermarking, it is necessary to preprocess the database according to the pretreatment Steps 1.1-1.4, and then perform the procedure of Step 3.1-3.3. Among them, y 0 is any attribute value in the D W , and y is the tolerance of the attribute column of y 0 , p e 0 which can be calculated by equation (2.11) is the prediction error corresponding to y 0 .
Step 3.2. Extracting watermark: extracting watermark with the p i and GP i stored in array pa corresponding to group i. Perform the following detection operation on all tuples in group i, and record the number of all watermark bits 0 and 1 detected in this group, then the use majority voting mechanism to determine the final watermark information of this group, and regard the watermark bits with a large number as the final detected watermark bits.
The specific test operation method is as follows: when jp 0 e j = p i , the property value has not been altered, the watermark bit is 0. When GP i = À 1, if p 0 e .0 and p 0 e = p i À 1, the watermark bit detected is 1, if p 0 e \0 and p 0 e = p i + 1, the watermark bit detected is 1; when GP i = 1, if p 0 e .0 and p 0 e = p i + 1, the watermark bit detected is 1, if p 0 e \0 and p 0 e = p i À 1, the watermark bit detected is 1.
After repeated experiments, the watermark extracted by the majority voting mechanism can be used to prove the copyright owner. If the extracted watermark is the same as the current one, it means that the database is real and that the extraction process does not modify it.
Step 3.3. Recovering data: Figure 1 shows watermark extraction and data recovery. According to the positive and negative values of p e 0 and GP i , and the size relationship of sum, the following data recovery is carried out: if GP i = À 1 and p 0 e ø 0 y r = y 0 + 1, p 0 e = p i À 1 y 0 , otherwise ð2:13Þ if GP i = À 1 and p 0 e \0 y r = y 0 À 1, p 0 e = À p i + 1 y 0 , otherwise ð2:14Þ if GP i = 1 and p 0 e ø 0 y r = y 0 À 1, p 0 e = p i + 1 y 0 , otherwise ð2:15Þ if GP i = 1 and p 0 e \0 y r = y 0 + 1, p 0 e = À p i À 1 y 0 , otherwise ð2:16Þ Among them, y r is the restored attribute value. Here is a summary of formula symbols (Table 1).

Experimental results and analysis
In this article, in order to evaluate the performance of the watermark and validation method, we conduct experiments in an Intel Core i5 environment with a 2.40 GHz CPU and 8GB RAM. The selected database was the forest cover type dataset provided by the University of California as the database for testing the watermark. This data set contains 581,012 tuples and 54 attributes. According to the experimental requirements, we selected 10 integer numeric attributes for the experiment and compared them with the existing HGW reversible watermarking method.
The experiment was set up in three parts. The first part is DDR analysis, which compares this method with similar methods reported by Hu et al. 18 and Li et al. 19 The second part is the quantitative analysis of histogram distortion, which shows the influence of this parameter on the performance of the algorithm. The third part is the robustness analysis, which shows the performance of the method under some well-known attacks in the database to test the effectiveness of the watermark. The method used is set as the HGW experiment by Li et al., 19 which generates a composite column as the primary key, and the length (number of packets) of the watermark is 48. For the convenience of comparison, all the following experiments are embedded with watermarks under the same conditions.

DDR analysis
The loss and harm caused by data distortion are huge, so data distortion must be strictly prevented. In this experimental study, in order to better understand the effect of DDR on data quality, we not only evaluated and calculated the watermark length of 48-bit but also selected the watermark length of 24-bit and 72-bit to compare the distortion rate of NSHGW and HGW watermarking methods.
We use the DDR to assess NSHGW and HGW embedded watermark after the distortion of the effects on the database. The larger the DDR, the greater the distortion of the data will be and vice versa. The calculation formula of DDR is as follows In the formula above T dis is the total data amount of distortion, and TD is the total amount of data in the database. Experiments show that the distortion rate of NSHGW is much smaller than that of HGW.
In the experiment, the distortion rates of NSHGW and HGW were verified when the total tuples were 1000, 1200, 1400, 1600, 1800, and 2000, respectively. We use different lengths of watermarks (the number of packets) and get different results. Figure 4 shows the DDRs of the two methods when the watermark length is 24, 48, and 72, respectively.
The horizontal axis represents the number of different tuples in the database, while the vertical axis is DDR, representing the rate of distortion. When DDR is 0, the data in the database is not distorted. Because both NSHGW and HGW rely on stochastic optimization, the distortion rate is the average of the 10 runs of the two methods, respectively.
As can be seen from Figure 4, while embedding the same watermark information into the same database in the experiment, the DDR generated by NSHGW method is much lower than that of HGW, both less than 0.42%.

Quantitative analysis of histogram distortion
In the experiment, in order to identify redundancy shifting brought about by the distortion effect more clearly, we not only for 48 to evaluate watermark length calculation, but also with the above DDR analysis experiments, selected the additional 24 and 72-bit watermark value, the length of the two watermarking methods and NSHGW HGW distortion of histograms multi-faceted contrast. It is important to note that it is necessary to evaluate the distortion effect of NSHGW-and HGW-embedded watermarks on the database by the amount of distorted data caused by shifting (excluding the distortion caused by embedding watermark bits).
In the experiment, we compare the two methods of NSHGW and HGW each time, and verifying the total amount of distorted data caused by the shifting of the two methods of NSHGW and HGW when the total number of tuples is 1000, 1200, 1400, 1600, 1800, and 2000, respectively, under the condition of the same embedded watermark. We use different lengths of watermarks (the number of packets) and get different results. Figure 5 shows the total amount of distorted data caused by the shifting of the two methods when the watermark length is 24, 48, and 72, respectively.
In the figure, the horizontal axis represents the number of different tuples in the database, while the vertical axis represents the total amount of distorted data and represents the number of distorted data. When the total amount of shifting distortion data is 0, it means that the data of the database is not distorted. Since both NSHGW and HGW depend on random optimization, the total amount of distortion data is the average of the results of 10 runs of the two methods. As can be seen from the above figure, when the same watermark information is embedded in the same database, the total amount of redundant translation distortion data generated by NSHGW method is close to 0 and far lower than that of HGW, with a minimum of 12 and a maximum of 29, almost overlapping with the horizontal axis.

Robustness analysis
In this experiment, we mainly study the robustness of NSHGW watermarking method under malicious attack of well-known database. The watermarked data in the experiment will be attacked by different types, such as insertion attack, altered attack, and deleted attack. In order to facilitate the comparison with other methods, NSHGW watermarking method is consistent with the experiments of HGW, GAHSW, and other watermarking methods by Gupta and Pieprzyk, 14 Jawad and Khan, 15 Farfoura et al., 17 Hu et al., 18 and Li et al., 19 and the length of the same watermark is set to 48 for this experimental test. The value of bit error rate (BER) is used to evaluate the robustness of the watermarking method. From the following equation, we can easily see that the lower the BER value is, the higher the watermark robustness will be. BER is the ratio of the number of error-extracted bits to the number of Table 1. Notations used in the article.

Symbol
Description Value calculated by the distance from the position of P i to P iL d iR Value calculated by the distance from the position of P i to P iR hs iL Sum of heights of rectangles from P i to P iL hs iR Sum of heights of rectangles from P i to P iR pa One-dimensional array for storing peak points y Attribute value y^Tolerance of the jth column K S Secret grouping key t u :PK where w i is the embedded watermark bit and w det i the detected watermark bit. Through the analysis of different types of attack data in the experiment, it is verified that the robustness of NSHGW is equal to that of HGW and GAHSW, and obviously higher than that of other watermarking methods. Next, we will carry out the attack experiment in the best case and the worst case, respectively, demonstrate the watermark detection and data recovery results, and compare the current popular watermark methods.
We present the results as a graph, with the horizontal axis representing the percentage (%) of attacks based on the size of the database. This means we need to change the specific rate of tuples which exist in the data set. Zero (0)% attack means no detection of any attack on the data set. We simulate the attacker trying to insert, delete, and modify 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% of the data in the data set. If the data set contains 1000 tuples, changing 10% means 100 tuples will be attacked. In the figure, the vertical axis is BER, indicating the rate of unsuccessful detection of watermark. When BER approaches 0, the watermark is correctly detected from the database. Because NSHGW relies on random optimization, the BER extracted from the watermark is the average of the 10 runs of the method. The results of the deletion and alteration are drawn as shown in Figures 6 and 7, respectively.
Insertion attack is an attack by inserting a new tuple to affect the watermark detection process. In the database, the attacker adds new tuples to increase the number of false positives and affect the process of detecting data. But the experimental results show that the BER value is 0 before insertion attack. After the insertion attack, BER remains 0 even if 60% of the new tuples are added to the watermarked data set. It is verified that insertion attack will not cause the problem of watermark robustness because our technology only involves tuples with watermark, and tuples without watermark or newly added tuples will not affect the process of watermark detection. Therefore, the three watermarking methods of NSHGW, HGW, and GAHSW are robust for insertion attack. No matter how many tuples are inserted, the BER under the insertion attack is always 0, so the comparison diagram of insertion attack will not be drawn here.
Deletion attack is a process in which an attacker destroys the watermark by deleting tuples randomly. Attackers randomly remove some tuples from the watermarked data set in the hope that some of the watermarked tuples will also be deleted, resulting in the loss of watermark bits. Deleted attack is a process in which an attacker destroys the watermark by deleting tuples randomly. Attackers randomly remove some tuples from the watermarked data set in the hope that some of the watermarked tuples will also be deleted; resulting in the loss of watermark bits. Deleted attack is more damaging than other attack because there is a chance that watermark bits will not be detected   correctly when removing a tuple with a watermark. As the percentage of delete attack increases, the rate of correct detection plummets and is more damaging than other attacks because there is a chance that watermark bits will not be detected correctly when removing a tuple with a watermark. As the percentage of delete attack increases, the rate of correct detection plummets. Figure 6 shows the BER of the watermark extracted by NSHGW, HGW, GAHSW, PEEW, DEW, and GADEW methods after removing the attack. GAHSW, DEW, GADEW, and PEEW extracted watermarks with BER values of 0.477, 0.814, 0.9550, and 0.915, respectively, when the number of deleted tuples in the database increased by 90% when the database was severely attacked by deletion. However, the BER extracted from NSHGW watermark is 0.388 and that from HGW watermark is 0.472. NSHGW can also recover at least half of the watermark. If most of the watermarks contained in the terra-tuple are removed from the database, the watermark cannot be recovered. The harm of deleting attack to database watermark is obvious. Alteration attack is a process in which an attacker randomly altered the tuples of a database to destroy the watermark. Attackers randomly or permanently change some tuple values from a watermarked data set in the hope that some watermarked attributes will be changed altogether, never increasing the difficulty of detecting watermark bits. As the percentage of alteration attack increases, the rate of correct detection plummets. Figure 7 shows the BER of the watermark extracted by NSHGW, HGW, GAHSW, PEEW, DEW, and GADEW methods after alteration attack. The BER of extracted watermarks will increase with the increase of data altered. amount, and the number of altered attributes in the database reaches 90%. GAHSW, DEW, GADEW and PEEW extracted watermarks with BER values of 0.411, 0.434, 0.584 and 0.526, respectively. The BER of NSHGW watermark extraction is 0.356, and that of HGW watermark extraction is 0.383. We can see that NSHGW can recover at least half of the watermark at this point. However, if a large number of watermark attributes are altered, it is difficult to extract the watermark tuples affected by this attack from the remaining unaffected data.
In general, in the deletion attack and altered attack, Figures 6 and 7 show that the robustness of NSHGW, HGW, and GAHSW methods is roughly the same. With the increase of attack intensity, the BER of extracting watermark by these three methods also increases. But compared with other methods, NSHGW has better watermark detection rate under this attack. Experiments also show that the NSHGW method produces a lower BER when embedding the same watermark into the same database.

Conclusions
We propose a new method of reversible database watermarking with low distortion and the originality of the method lies in the grate reduction of the distortion rate of the carrier data. NSHGW method is to use the characteristics of database histogram gap to improve the histogram shifting method, find the best embedding point in the gap, and realize watermark embedding. While embedding watermark, the quantity of redundant shifting of histogram column is avoided, the data quality is guaranteed (the amount of changed data is greatly reduced), and the data distortion caused by watermark embedding is effectively reduced. Experiments show that NSHGW method is better than the existing HGW method in distortion rate analysis and histogram shifting quantitative analysis. The robustness analysis of NSHGW method is similar to that of HGW and GAHSW. Based on the current histogram embedding method and on the basis of reducing data distortion, the future work is to find better and smaller data distortion methods, which is also my research direction.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.