An Optimized Data Obtaining Strategy for Large-Scale Sensor Monitoring Networks

As Internet of Things (IoT) technology becomes more widely used in large-scale monitoring networks, this paper proposes an optimized obtaining strategy (OFS) for large-scale sensor monitoring networks. First, because sensor node networks are large in scale, this paper proposes an area clustering optimization strategy for large-scale monitoring networks. Second, based on the regular changes in the sensed data in large-scale monitoring networks, this paper proposes a strategy for acquiring sensor data based on adaptive frequency conversion. The OFS optimization strategy can prolong network lifetime, reduce the transmission bandwidth required, and reduce both the average energy consumption of the cluster heads and the energy consumption of the network as a whole.


Introduction
In recent years, especially in the big data age [1], the technology of the Internet of Things and the prospects for building applications on this platform have become research hotspots for governments, academia, and industry. A wireless sensor network (WSN) [2], as an important technical aspect of the Internet of Things, can monitor, sense, and sample a wide range of information types from the environment or from monitored objects. A WSN can also process this information in real time [3]. Therefore, WSNs are widely used in large-scale network monitoring. With the development of wireless communication, sensor technology, and embedded computing technology [4], there is an urgent need for applications involving large-scale wireless sensor networks in various fields including the military, intelligent transportation, environmental monitoring, earthquake monitoring, weather disasters, and modern agriculture [1]. However, in these large-scale complex environments [5], wireless monitoring networks pose a series of new problems: the areas that need to be monitored are too large, the number of sensors required is too great, and the time overhead of the sensor nodes, the required bandwidth resources, and the energy consumption of signal transmission are too high. Because monitoring nodes are limited in computing power and storage space, obtaining high-quality sensor data samples and optimizing transmissions to reduce energy consumption [2] and lengthen the network life cycle have been the core research problems facing the field of large-scale monitoring networks [4,6].
After analysing the existing research results, this paper proposes an optimized obtaining strategy (OFS) to address the issues facing large-scale monitoring sensor data in the Internet of Things. This strategy can effectively improve the overall operating efficiency of the monitoring network, balance energy consumption, and prolong the network life cycle.
The rest of this paper is organized as follows. Section 2 reviews optimization strategies for wireless sensor monitoring networks in current domestic and international research. Section 3 deals with large-scale sensor networks. Because the number of nodes is large and their distribution is uneven, this paper proposes an area clustering optimization strategy for large-scale wireless sensor networks in which a large-scale monitoring network is divided into smaller areas to balance the distribution of cluster heads. It adopts uneven clustering in parallel to alleviate the problem of energy holes [7] in a given area. Section 4 discusses a monitoring network data acquisition strategy based on adaptive frequency conversion. This strategy optimizes sensor data sampling using a linear regression model and offers a model compensation mechanism. Section 5 analyses the effectiveness of the proposed optimization strategy through experiments and data comparisons. Finally, the last section provides conclusions.
International Journal of Distributed Sensor Networks

Related Work
Numerous domestic and foreign experts and scholars have carried out in-depth studies of the existing problems of large-scale sensor monitoring networks for the Internet of Things. Younis et al. proposed the hybrid clustering protocol HEED [8], which first selects preliminary cluster heads based on the residual energy of nodes and then selects a final cluster head through a competition based on the clusters' internal communication costs. The communication overhead of this protocol is significant because it needs to carry out multiple message iterations within the cluster radius. A solution that uses uneven clustering to resolve energy hole problems was proposed in [5,9,10]. However, this solution uses a heterogeneous network [2] in which the cluster head is a super node, and it calculates the deployment location of the node in advance, so there are no dynamically constructed clusters. Researchers in [9,11] proposed the EECS clustering scheme, which constructs uneven clusters to balance the load by considering the distance between the candidate cluster head and the sink node; in this scheme, however, residual energy is compared only among local nodes. It does not coordinate node energy consumption overall, and intercluster communication is single-hop, which limits the scalability of the algorithm and makes it unsuitable for large-scale networks. In [8,12], the uneven clustering ant colony-based AC-EBUC routing algorithm inherits the advantages of the uneven clustering structure. On this basis, in combination with the ant colony algorithm, it introduces a link reliability parameter and can search multiple paths in real time, but this strategy can easily become trapped in local optima. A hierarchical chained network topology was proposed in [13,14]. This strategy can add extra cluster head nodes to solve energy hole problems based on certain rules, and it significantly prolongs network survival time; however, because of cost, transmission distance, time delays, and so on, this strategy is not feasible in large-scale sensor networks.
In [15], a VA-DSC compression algorithm is proposed that adopts Slepian-Wolf [16] coding theory and achieves independent encoding and joint decoding of data. The data error rate is small, but all the data must still be transmitted after compression. Consequently, the network energy consumption is still high. The TCDCP algorithm proposed in [17-28] can adaptively adjust the acquisition time based on the error between the data and the predicted value of a linear regression model. However, as the sampling time interval grows, the absolute value of the error also increases. Therefore, this algorithm is not applicable in an actual monitoring environment.
The linear regression strategy proposed in [18-31] can accurately measure data, adjust the sampling frequency, and reduce the transmission quantity. However, the algorithm is complex, and its requirements are too difficult for sensor nodes to meet. In addition, this algorithm spends too much time constructing the model. In this scheme, if the cluster head node does not receive data for a long period, the model updating process will result in data loss.

Area Clustering Optimization Strategy for Large-Scale Monitoring Networks
Most of the above optimization strategies are relatively complex, so they cannot adapt well to large-scale sensor monitoring networks. In networks with large numbers of sensing nodes, the message volume of the entire network can increase abruptly, reducing efficiency. Therefore, OFS first adopts an area clustering optimization strategy for large-scale sensor monitoring networks; it then utilizes distributed processing to monitor network sensor data [22].

Network Energy Consumption Model.
Assume that $N$ sensors are arranged randomly in a monitored area. These sensors periodically monitor the environment to collect data. The sink node is located in the centre of the area, so the network covers the entire monitoring area. If $s_i$ denotes the $i$th sensor node, the collection of nodes is $S = \{s_i \mid 1 \le i \le N\}$. This paper uses the typical wireless energy consumption model, as shown in formula (1). When a node transmits $l$ bits of data to another node over a distance $d$, the energy consumption is the sum of the losses of the transmitter circuit and the power amplifier:

$$E_{Tx}(l, d) = \begin{cases} l E_{elec} + l \varepsilon_{fs} d^{2}, & d < d_0, \\ l E_{elec} + l \varepsilon_{mp} d^{4}, & d \ge d_0. \end{cases} \tag{1}$$

In formula (1), $E_{elec}$ denotes the energy consumption of the transmitter circuit, and the symbols $\varepsilon_{fs}$ and $\varepsilon_{mp}$ denote the energy needed for power amplification in the two models. When the transmission distance $d$ is less than the threshold $d_0$, the power amplification loss follows the free space model, and the energy needed for signal transmission is proportional to the square of the distance. Conversely, when the transmission distance $d$ is greater than or equal to the threshold $d_0$, the multipath fading model applies, and the energy is proportional to the fourth power of the distance. A receiving node consumes only the circuit loss, so the energy consumption of a node receiving $l$ bits of data is

$$E_{Rx}(l) = l E_{elec}. \tag{2}$$
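The energy model described above (a per-bit circuit loss plus an amplifier term that grows with the square of the distance below the threshold and with the fourth power at or above it) can be sketched in Python. This is a minimal illustration, not the simulation code behind the experiments: the numeric parameter values are placeholders rather than the entries of Table 1, the choice of the threshold as the distance at which the two amplifier terms coincide is an assumption, and the names `tx_energy` and `rx_energy` are hypothetical.

```python
import math

# Placeholder parameter values (assumptions, not taken from Table 1).
E_ELEC = 50e-9       # J/bit, circuit energy per bit
EPS_FS = 10e-12      # J/bit/m^2, free-space amplifier coefficient
EPS_MP = 0.0013e-12  # J/bit/m^4, multipath amplifier coefficient

# Assumed threshold: the distance at which both amplifier terms are equal.
D0 = math.sqrt(EPS_FS / EPS_MP)


def tx_energy(bits: int, distance: float) -> float:
    """Energy in joules to transmit `bits` over `distance` metres."""
    if distance < D0:
        amplifier = EPS_FS * distance ** 2   # free space model
    else:
        amplifier = EPS_MP * distance ** 4   # multipath fading model
    return bits * (E_ELEC + amplifier)


def rx_energy(bits: int) -> float:
    """Energy in joules to receive `bits`: circuit loss only."""
    return bits * E_ELEC
```

With these placeholder values the threshold is roughly 88 m, so a 100 m link already falls under the fourth-power term, which is why long single-hop links to the sink are expensive.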

Network Partition Strategy.
As mentioned above, in the existing strategies, the cluster head selection requires all the nodes in the network to make a global judgement. When the number of nodes is large and they are unevenly distributed, all nodes are involved in the comparison, which reduces the efficiency of the whole system [23]. Therefore, this paper proposes a network partition strategy; Figure 1 shows the network partition topology schematic. Sensor nodes are randomly distributed in the monitoring area. The sink node is located in the centre of the area. As shown in Figure 1, sensor data in the monitoring area are transmitted to the sink node using multihop transmission [22]. This can easily lead to an energy hole around the sink node, after which sensor data can no longer be transmitted to the sink node, which seriously affects the network lifetime. Therefore, the OFS strategy adopts the hierarchical clustering algorithm AGNES [19-27]. First, it divides the large-scale network into several subareas, selecting the cluster head and cluster in parallel in each area to boost efficiency. This scheme reduces the energy consumption requirements for all the nodes.
According to formula (1), the energy consumption of data transmission among nodes is closely related to the distance [25]. During network partition, the nodes send their location information to the sink node, which then divides the entire network into several subareas based on distance. Each node can belong to only one area. The distribution of nodes in each subarea is relatively uniform. When this division is complete, the sink node broadcasts the relevant information about the subarea partitions. Using this broadcast information and its own location, each node can then find the subarea to which it belongs. The divided subareas are fixed over the entire network life cycle to avoid the energy consumption of repeated clustering. Meanwhile, to prevent overfitting, the clustering operation uses a stopping threshold in the interval (0, 1). When the ratio of the number of clustered nodes to the total number of nodes reaches this threshold, the operation halts clustering. In this way, nodes that are evenly distributed are grouped into one area. The nodes in each area elect the cluster head via local communication. This solves the problem of unevenly distributed cluster heads and reduces the communication cost. Figure 2 shows a schematic of network partition.
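The partition step can be sketched as a small bottom-up (AGNES-style) clustering run at the sink. The sketch below is one possible reading, not the authors' procedure: it assumes single-linkage merging on Euclidean distance and interprets the stopping threshold as the fraction of nodes that already belong to a multi-node cluster. The function name `partition` and the parameter `stop_ratio` are hypothetical.

```python
import math
from itertools import combinations


def partition(points, stop_ratio):
    """Merge the closest clusters until `stop_ratio` of nodes are clustered.

    Returns (subareas, outliers): subareas are clusters with at least two
    nodes, outliers are the indices of nodes still left on their own.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage (assumption): distance of the closest member pair.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    def clustered_fraction():
        return sum(len(c) for c in clusters if len(c) > 1) / len(points)

    while len(clusters) > 1 and clustered_fraction() < stop_ratio:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)

    subareas = [sorted(c) for c in clusters if len(c) > 1]
    outliers = sorted(c[0] for c in clusters if len(c) == 1)
    return subareas, outliers
```

The pairwise search is cubic in the number of nodes, which is acceptable for a sketch run once at the sink; a production version would cache the distance matrix.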
As shown in Figure 2, after clustering, the network in Figure 1 will be divided into three subareas of different densities. In the process of data transmission, the sensor nodes in a subarea transfer their data to the selected cluster head nodes. The sensor nodes that are not in any subarea are called outliers. When transferring data, these outlier nodes select the nearest cluster and either transfer their data to the nearest node in that cluster or transfer the data to the sink node directly. In each area, a distributed uneven clustering strategy is used to alleviate the problem of energy holes based on local competition rules that can improve election efficiency and extend the network life cycle.

Distributed Area Clustering Strategy.
An area clustering strategy collects data periodically. The sink node broadcasts a message to perform network initialization, and each node calculates the distance between itself and the sink node according to the strength of the received messages. Candidate nodes participating in the election maintain a neighbour nodes table and elect a cluster head according to certain rules. The following lists the information available for a neighbour node: id, state, Eres, dtosk.
In the above, the id field uniquely identifies one node, and the state field indicates that node's status. The Eres field represents the remaining energy of the neighbour node, and dtosk is the distance between that neighbour node and the closest sink node.
Rule 1. During the election, if a candidate cluster head $s_i$ announces that it has won, then all other candidate cluster heads within $s_i$'s competition radius cannot become the cluster head; they must withdraw from the election.
The neighbour node set of the candidate cluster head $s_i$ contains all the candidate cluster heads that have a competitive relationship with $s_i$ under the constraint of Rule 1. During the election, this set is given by formula (3), and the competitive range $R_{comp}$ [8] of every candidate cluster head is given by formula (4):

$$R_{comp} = \left(1 - c\,\frac{d_{max} - d(s_i, sk)}{d_{max} - d_{min}}\right) R^{0}_{comp}, \tag{4}$$

where $d_{max}$ and $d_{min}$ represent the maximum and minimum distance between nodes and the sink node, respectively, $d(s_i, sk)$ represents the distance between $s_i$ and the sink node, and $R^{0}_{comp}$ is the maximum cluster head competitive radius. The value $c$ is a constant between 0 and 1 used to control the range. The competitive range of candidate cluster heads therefore varies from $(1 - c)R^{0}_{comp}$ to $R^{0}_{comp}$. From formula (4), $R_{comp}$ grows linearly with the distance between the node and the sink node, so the competition radius shrinks as the candidate cluster head gets closer to the sink node. The aim is to make the clusters closer to the sink node smaller, so that their cluster heads require less energy to receive transmissions from the members within the cluster. When this occurs, the problem of energy holes diminishes.
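A short Python sketch makes the behaviour of the competition radius concrete. It assumes the linear form that matches the range stated in the text: the radius equals the maximum radius at the largest distance from the sink and is reduced by the factor (1 - c) at the smallest distance. The function and parameter names are hypothetical.

```python
def competition_radius(d_to_sink, d_min, d_max, r0, c):
    """Competition radius that grows linearly with distance to the sink.

    d_to_sink: distance from the candidate cluster head to the sink
    d_min, d_max: smallest and largest node-to-sink distances
    r0: maximum competition radius
    c: constant in [0, 1] controlling how much the radius may shrink
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    if d_max <= d_min:
        raise ValueError("d_max must exceed d_min")
    return (1.0 - c * (d_max - d_to_sink) / (d_max - d_min)) * r0
```

For example, with a maximum radius of 90 m and c = 0.5, a candidate at the minimum distance from the sink competes within 45 m, while a candidate at the maximum distance competes within the full 90 m.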
After dividing the network, the clustering strategy divides the nodes in each area to control the distribution of cluster heads based on distance. At this point, the nodes in each subarea are relatively concentrated. Using a timed broadcasting mechanism, a time threshold $T$ is set up to control the proportion of candidate cluster heads under uneven clustering, so not every node needs to become a candidate cluster head. The average residual energy within each candidate cluster head's competition radius and the average distance between the nodes and the sink node are given by formulas (5) and (6), respectively. The value of the timer is calculated by formula (7), in which a random number between 0 and 1 reduces the possibility of timing conflicts between broadcast messages, $T_0$ is the time allotted for the election of the cluster head, $E^{i}_{res}$ is the residual energy of node $s_i$, and $\bar{E}_i$ is the average residual energy of node $s_i$'s neighbour nodes. Formula (7) shows that candidate nodes that are closer to the sink node and have more residual energy obtain a shorter waiting time and therefore a greater probability of becoming the cluster head.
Within time $T$, if the candidate cluster head node $s_i$ does not receive a victory message from its neighbour nodes, that node wins the election and becomes the cluster head; otherwise, its election fails and the node withdraws from the election process. After the election of a cluster head, the ordinary nodes wake up from their sleep state when the cluster head broadcasts the victory message CH_ADV_MSG. The ordinary nodes join the cluster based on the received message by sending the JOIN_CLUSTER_MSG message to the cluster head. In summary, when the number of sensor nodes is large and their distribution is uneven, the OFS optimization strategy performs local uneven division in parallel and dynamically sets the time threshold to control the proportion of competing cluster heads. This reduces the quantity of communication and balances cluster head energy consumption, which effectively improves network efficiency and extends the network life cycle.
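The timer-driven election can be simulated in a few lines of Python. The sketch takes each candidate's waiting time as an input instead of computing it, so it makes no assumption about the exact form of formula (7). It applies Rule 1 in order of timer expiry and then attaches each ordinary node to its nearest cluster head; nearest-head joining is an assumption standing in for "based on the received message". All class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    node_id: int
    pos: tuple        # (x, y) position in metres
    radius: float     # competition radius
    wait_time: float  # timer value; the shortest timer fires first


def elect_cluster_heads(candidates):
    """Return the ids of the winners, firing timers in ascending order."""
    heads = []
    withdrawn = set()
    for cand in sorted(candidates, key=lambda c: c.wait_time):
        if cand.node_id in withdrawn:
            continue  # heard a victory message before its own timer expired
        heads.append(cand.node_id)
        # Rule 1: other candidates inside the winner's radius withdraw.
        for other in candidates:
            if other.node_id == cand.node_id:
                continue
            if math.dist(cand.pos, other.pos) <= cand.radius:
                withdrawn.add(other.node_id)
    return heads


def join_clusters(ordinary_nodes, head_positions):
    """Map each ordinary node id to the id of its nearest cluster head."""
    return {
        node_id: min(head_positions,
                     key=lambda h: math.dist(pos, head_positions[h]))
        for node_id, pos in ordinary_nodes.items()
    }
```

In a real deployment the two steps correspond to the CH_ADV_MSG broadcast and the JOIN_CLUSTER_MSG reply; the simulation collapses them into function calls.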

Adaptive Frequency Conversion Data Acquisition Strategy for Large-Scale Sensor Monitoring Networks
Based on the area clustering described in the previous section that optimizes sensor data acquisition and network transmission energy consumption, this paper proposes an adaptive frequency conversion based sensor network optimization strategy.By analysing the regression model, it can adjust the sampling frequency and update the model dynamically through a mechanism of sensed data compensation and reduced data redundancy.

Frequency Conversion Sampling Model.
A clustered wireless sensor network [20-28] has a chain network topology. Figure 3 shows a schematic diagram of the structure of the sensor network. As shown in Figure 3, each sensor node SN can communicate with its next-hop node, forwarding data along a path to the cluster head CHN [21].

Establishment of Acquisition Model.
Time series analysis shows that the sensor readings of a single sensor node are similar across consecutive samples; that is, the data collected at the same node over a given period of time have a high temporal correlation [22-29]. This study therefore creates a linear regression model that approximately estimates the sensor data [23-30]. Figure 4 shows a schematic diagram of the regression model [30].

Fitting a Regression Curve.
Because of the wireless sensor network nodes' limited computing power and storage space, this paper uses a linear regression model to improve the accuracy of prediction and reduce the complexity of the algorithm [24-31]. The time series TS can be regarded as a linear function with the sampling time $t$ as the independent variable and the sampled data value $y$ as the dependent variable [25-31], that is, $\hat{y} = a t + b$. The linear regression model is fitted by the least squares method, using as few samples as possible and minimizing the sum of squared errors of the fitted line over the $n$ samples $(t_k, y_k)$:

$$Q(a, b) = \sum_{k=1}^{n} \left(y_k - a t_k - b\right)^{2}.$$

At the same time, to make the prediction closer to the true values, this paper sets the partial derivatives of $Q$ with respect to $a$ and $b$ to zero, which gives

$$a = \frac{n \sum_{k} t_k y_k - \sum_{k} t_k \sum_{k} y_k}{n \sum_{k} t_k^{2} - \left(\sum_{k} t_k\right)^{2}}, \qquad b = \frac{1}{n}\left(\sum_{k} y_k - a \sum_{k} t_k\right).$$

The values of $a$ and $b$ are the model parameters. The cluster head node utilizes parameters $a$ and $b$ to construct the regression model for an SN node. Then, it can calculate the measurement value of that SN using the model every time the SN would normally take a measurement. This reduces redundant transmissions and the overall energy consumption of the network.
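The least squares fit is cheap enough for a sensor node because it needs only a few running sums. The following Python sketch shows a closed-form fit and the prediction the cluster head would make from the two parameters. The function names `fit_line` and `predict` are hypothetical, and the centred form of the slope used here is algebraically equivalent to the usual normal-equation solution.

```python
def fit_line(times, values):
    """Return slope a and intercept b minimising sum((v - (a*t + b))**2)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) pairs")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    if s_tt == 0:
        raise ValueError("sampling times must not all be equal")
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    a = s_tv / s_tt
    b = mean_v - a * mean_t
    return a, b


def predict(a, b, t):
    """Value the cluster head would estimate for time t from (a, b)."""
    return a * t + b
```

Once the SN has sent `a` and `b`, the cluster head can call `predict` at every nominal sampling instant without receiving further readings, as long as the model remains valid.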

Adaptive Frequency Conversion Acquisition and Optimization Strategy.
Because of the temporal correlation of sensor data [22], sensor data are distributed along the time axis in the prediction model, and the optimization strategy can adaptively adjust the acquisition frequency. Figure 5 shows a schematic diagram of adaptive frequency conversion. Let $E$ be the error range, $y_t$ the true value at acquisition time $t$, $\hat{y}_t$ the value predicted by the model, and $e$ the difference between the predicted value and the true value; that is, $e = |\hat{y}_t - y_t|$. $T_s$ is the time interval between data acquisitions.
As shown in Figure 5, the actual value of the sensor data floats within the error range, and the initial value of the error threshold $\theta$ ($0 < \theta < E$) is $E/2$. The optimization strategy for a given period then follows these rules.
Rule 2. When $e \le \theta$, the model meets the requirements for the time period, so the sampling frequency can be reduced: the sampling interval becomes $T_s = T_s + \Delta T$ ($\Delta T$ is one interval unit). When $T_s = T_{max}$, the threshold decreases exponentially: $\theta = \theta/2$.
Rule 3. When $\theta < e \le E$, the actual monitoring value is drifting away from the trend of the forecast model; therefore, the sampling frequency must be increased. The sampling interval is reduced exponentially: $T_s = T_s/2$. When $T_s = T_{min}$, the threshold increases exponentially: $\theta = 3\theta/2$.
The OFS optimization strategy adjusts the sampling frequency adaptively by using real-time monitoring data. Alternating adjustments of the threshold and the sampling interval prevent the sampling interval from staying at its minimum or maximum for long. Network energy consumption is reduced by avoiding data transmission as long as measurement accuracy is guaranteed.
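Rules 2 and 3 amount to a small state machine over the sampling interval and the error threshold. The Python sketch below is one way to express it, under stated assumptions: an error larger than the error range is treated as model failure, the threshold is adjusted when the interval is already at its bound before the update, and the threshold is capped so that it never exceeds the error range. The class name and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveSampler:
    error_range: float      # maximum tolerated prediction error
    interval: float         # current sampling interval
    step: float             # one interval unit added under Rule 2
    min_interval: float
    max_interval: float
    threshold: Optional[float] = None

    def __post_init__(self):
        if self.threshold is None:
            self.threshold = self.error_range / 2  # initial threshold

    def update(self, predicted: float, actual: float) -> bool:
        """Apply Rule 2 or Rule 3; return False when the model has failed."""
        err = abs(predicted - actual)
        if err > self.error_range:
            return False  # assumption: outside the range means refit
        if err <= self.threshold:
            # Rule 2: the model tracks the data, so sample less often.
            if self.interval >= self.max_interval:
                self.threshold /= 2
            self.interval = min(self.interval + self.step, self.max_interval)
        else:
            # Rule 3: the data drift from the model, so sample more often.
            if self.interval <= self.min_interval:
                # Cap at the error range (assumption).
                self.threshold = min(self.threshold * 1.5, self.error_range)
            self.interval = max(self.interval / 2, self.min_interval)
        return True
```

The additive increase and multiplicative decrease of the interval mean the sampler backs off slowly while the model holds and reacts quickly when readings start to drift.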

Failure Data Compensation Mechanism.
As mentioned earlier, when the regression model fails, the network needs to remeasure and fit a new model, but the inflection point at which the monitoring data cause the model to fail also creates a problem of data loss. Because the point is not predicted and the time of the inflection point is already in the past by the time the deviation is measured, the scheme needs to compensate for the data lost around the inflection point. Figures 6 and 7 are schematic diagrams of the failure data compensation mechanism: EA is Model 1 and CF is the updated Model 2. When the old model is replaced by the new one, data are lost. Depending on the values of the model parameters, the compensation mechanism uses different estimation strategies. As shown in Figure 6, if the slope parameter $a$ has the same sign in Model 1 and Model 2, the extension line of Model 1 is AB, and the extension line of Model 2 is CD, then ABCD represents the estimated range of the lost data. The two final measurement points E and A of Model 1 and the two new starting points C and F of Model 2 are selected. These four data points are synthesized into a new linear regression model GH via the least squares principle. In this way, the measurement at any moment between Point A and Point C can be deduced. The estimated value must fall within the estimated range, so the linear regression model GH is the compensation model for the data lost in the time period $[t_i, t_{i+1}]$.
As shown in Figure 7, if the slope parameter $a$ does not have the same sign in Model 1 and Model 2, AB is the extension of Model 1, CB is the extension of Model 2, and the quadrangle ABCD is the estimated range of the missing data. The two final measurement points E and A of Model 1 and the two new starting points C and F of Model 2 are selected. These four data points are synthesized into a new linear regression model GH via the least squares principle. At this point, the lost data in the interval ending at $t_i$ and in $[t_{i+1}, t_{i+3}]$ are beyond the scope of estimation and need to be calculated again. Using the mean method, the compensation model is $y_t = \frac{1}{3}\left(y^{ab}_t + y^{gh}_t + y^{cb}_t\right)$, where $y^{ab}_t$, $y^{gh}_t$, and $y^{cb}_t$ are the values at which the vertical line at moment $t$ intersects the lines AB, GH, and CB, respectively.
In summary, according to the trend of the data in the model, the OFS optimization strategy adaptively adjusts the frequency and dynamically updates the model in real time based on the error range. Each SN node returns its parameters to the corresponding CHN node. Using the least squares method, the CHN node can use these parameters to calculate a regression model for each SN node in the cluster and then obtain the node's sensor data. Subsequently, unless the model fails, the SN node does not need to transmit sensor data to the CHN node, which effectively reduces the quantity of transmissions and reduces network energy consumption.
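The construction of the compensation line GH can be illustrated with a short Python sketch: the last two measurement points of the old model and the first two points of the new model are pooled, a least squares line is fitted through the four points, and the lost samples are read off that line. The sketch covers only the GH fit, not the three-line mean used when the slopes differ in sign. The names `fit_line` and `compensate` are hypothetical, and the sample values in the usage note are made up for illustration.

```python
def fit_line(points):
    """Least squares line through (t, value) points: (slope, intercept)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = s_tv / s_tt
    return slope, mean_v - slope * mean_t


def compensate(model1_tail, model2_head, missing_times):
    """Estimate lost samples between the old and the new model.

    model1_tail: last two (t, value) points of the failed model (E, A)
    model2_head: first two (t, value) points of the new model (C, F)
    missing_times: instants between A and C whose readings were lost
    """
    slope, intercept = fit_line(list(model1_tail) + list(model2_head))
    return [slope * t + intercept for t in missing_times]
```

For instance, `compensate([(0, 20.0), (60, 21.0)], [(180, 23.0), (240, 24.0)], [120])` estimates the reading lost at t = 120 from the line through the four surrounding points.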

Experiments and Comparison
There are 400 sensor nodes distributed randomly over a 300 m × 200 m area. These sensor nodes monitor changes in temperature during four one-hour time slots distributed throughout a day as follows: 7:00-8:00, 12:00-13:00, 17:00-18:00, and 23:00-24:00. The initial sampling frequency of these sensor nodes is 0.0083 Hz. This experiment tests the feasibility of OFS and gauges its effectiveness using measures of network lifetime, total energy consumption, comparison of node energy balance, error analysis, data acquisition quantity, total quantity of network transmission, and so on. The simulation parameters are shown in Table 1.

Network Lifetime.
Figure 8 compares the network lifetime achieved by different optimization strategies. Network lifetime can be expressed by the number of nodes that survive a given number of rounds. In each round, a cluster head is chosen. The number of rounds between the death of the first node and the death of all nodes shows how well the network balances energy consumption. A greater number of rounds indicates a correspondingly greater efficiency in network energy utilization.
OFS optimizes the residual energy of nodes, the distribution density, and the transmission distance.The network lifetime can be prolonged because weaker nodes can continue to function longer.In Figure 8, compared to LEACH, HEED, and EEUC, OFS prolongs network lifetime by 38%, 15%, and 3.7%, respectively, while also balancing its energy consumption better.

Comparison of Total Network Energy Consumption.
Figure 9 shows the comparison of total network energy consumption for different optimization strategies. For a precise test, the network is considered dead when the number of surviving nodes drops below 20.
The OFS optimization strategy uses a clustering algorithm to divide the network into zones and generates the optimal cluster structure, which distributes cluster head nodes uniformly in the network and reduces energy loss to alleviate energy holes. In Figure 9, when the network has reached 800 rounds, the total remaining network energy under the OFS strategy is 3.456 J, whereas the other strategies have run out of network energy. This result shows that total network energy consumption under the OFS optimization strategy is lower than under the others.

Comparison in Consumption of Node Energy Balance.
Figure 10 shows the curves of node residual energy variance for different optimization strategies. The energy balance performance of all these optimization strategies is tested using 10 random rounds, with the energy variance reported in units of $10^{-3}$ J. Compared to other optimization strategies, Figure 10 indicates that the OFS optimization strategy has a more stable curve with fewer fluctuations in node energy variance; therefore, OFS performs better in node energy consumption and energy balance than the compared optimization strategies.

Error Analysis.
Figure 11 compares the sensor data errors of different optimization strategies. In a fixed time slot, this test randomly selects the absolute errors of sensor data from the 400 sensors.
In Figure 11, VA-DSC simply compresses and transfers sensor data. It has the minimum error, 0.07 °C. The maximum absolute errors of the OFS and linear regression strategies are 0.39 °C and 0.43 °C, respectively. As the sample interval of TCDCP increases, its absolute error also increases, reaching 0.89 °C after one hour. The test indicates that OFS scores slightly lower than VA-DSC on error control while keeping its maximum error below those of the linear regression and TCDCP strategies. The experiment tested the average value of the data collected by the 400 monitoring nodes over 4 time periods.

Data Acquisition Quantity.
OFS utilizes the adaptive frequency conversion optimization strategy, which means it can constantly modify the error threshold and the sampling interval according to the trends in the sensor data. By doing this, OFS substantially reduces the quantity of data acquisition required. Figure 12 shows that the average data acquisition quantity in OFS is 241.5 KB. The values for the linear regression, TCDCP, and VA-DSC methods are all higher: 264 KB, 283.25 KB, and 357.5 KB, respectively. This result indicates that OFS keeps the quantity of sensor data that must be acquired below that of the compared methods.

Network Transmission Quantity.
Figure 13 shows comparisons of network transmission quantity for different optimization strategies. In four time slots, this test selects the average network transmission quantity from 400 sensors.
Because OFS needs to transmit only two regression parameters when building a model, its network transmission quantity in Figure 13 is the smallest, averaging 8.4 MB. In comparison to the linear regression, TCDCP, and VA-DSC models, OFS reduces network transmission quantity by 27%, 80%, and 85%, respectively. These results show that OFS dramatically reduces the quantity of network transmissions required.

Conclusions
With the rapid development of Industry 4.0 and the Internet of Things, large monitoring networks have introduced new problems, and Internet of Things research is still wrestling with them. This paper proposes an optimized obtaining strategy (OFS) for acquiring sensor data in large monitored networks connected to the Internet of Things. OFS uses a hierarchical clustering algorithm to divide the network, generating a better clustering structure and reducing network communication overhead. OFS also builds a one-dimensional linear regression model for sensor data that serves to regulate acquisition frequency adaptively, reducing sensor data acquisition and transmission quantity requirements.
The experimental results indicate that OFS can effectively control the energy consumption of sensor nodes to prolong network lifetime. The results of this study provide an effective path for future development of the Internet of Things and large-scale monitoring networks.

Figure 3: Schematic diagram of the structure of the sensor network.

Figure 6: Schematic diagram of the failure data compensation mechanism.

Figure 7: Schematic diagram of the failure data compensation mechanism.

Figure 10: The curves of node residual energy variance.

Table 1: Simulation parameters.