Adaptive WSN Scheduling for Lifetime Extension in Environmental Monitoring Applications

Wireless sensor networks (WSNs) are often used for environmental monitoring applications in which nodes periodically measure environmental conditions and immediately send the measurements back to the sink for processing. Since WSN nodes are typically battery powered, network lifetime is a major concern. A key research problem is how to determine the data gathering schedule that will maximize network lifetime while meeting the user's application-specific accuracy requirements. In this work, a novel algorithm for determining efficient sampling schedules for data gathering WSNs is proposed. The algorithm differs from previous work in that it dynamically adapts the sampling schedule based on the observed internode data correlation as well as the temporal correlation. The performance of the algorithm has been assessed using real-world datasets. For two-tier networks, the proposed algorithm outperforms a highly cited previously published algorithm by up to 512% in terms of lifetime and by up to 30% in terms of prediction accuracy. For multihop networks, the proposed algorithm improves on the previously published algorithm by up to 553% and 38% in terms of lifetime and accuracy, respectively.


INTRODUCTION
Wireless Sensor Networks (WSNs) consist of nodes which detect and track real-world quantities [1]. Nodes are autonomous and are able to self-organize into intelligent networks. Each node consists of a microcontroller, memory, a radio transceiver, and sensors. Most WSN nodes are battery powered. The limited supply of energy means that power consumption is a major issue in WSNs. In most applications, the radio transceivers are the largest consumers of energy [2]. Consequently, much research has been conducted on reducing the amount of time that the radio is on ([3], [4], [5]).
An important application area for WSNs is environmental monitoring [1]. Environmental monitoring applications require that a physical quantity is periodically measured and the measurements are relayed across the network to the base station, or sink, for processing. In many cases, the base station must maintain an up-to-date (online) view of the physical quantity being measured. Thus measurements must be transferred to the sink as soon as they are available [6] [7] [8]. WSN measurements of data, such as temperature, humidity, air pressure, wind speed, nitrogen dioxide, and light, often exhibit strong spatial correlation between nodes and strong temporal correlations between different sampling times at the same node [9] [10] [11] [12]. Knowledge of these correlations can be exploited to reduce the number of measurements needed to meet the application-specific sensing accuracy requirements. For example, if outdoor temperature varies more slowly at night than during the day, the sampling rate can be scaled back during the night and increased during the day without unduly affecting accuracy. The missing data can then be estimated (imputed) based on the data actually collected. This saves energy by reducing the amount of data transmitted during the night since nodes can be scheduled to enter sleep modes when they are not needed (see Section 2 for more details).
† E-mail: Jong.Lim@ucdconnect.ie
Clearly, there is a tradeoff between sensing accuracy and lifetime [13] [14]. In general, it can be said that improved accuracy requires collection and transmission of a greater number of sensor measurements which, in turn, means shorter network lifetime. The efficiency of a particular data collection schedule depends on the characteristics of the data being collected. These characteristics vary with time. Hence, the natural question arises, for a given environmental monitoring application, how can the data gathering schedule be determined and dynamically adapted so as to maximize network lifetime while still meeting the application accuracy requirements?
In this work, we propose a new adaptive scheduling algorithm for WSNs which can be used in environmental monitoring applications. The algorithm determines the sampling schedule based on user-specified accuracy goals, network connectivity and a preliminary data collection phase. Since most monitoring applications gather data continuously at the sink, running a preliminary data collection phase incurs no additional cost. During preliminary data collection, data is collected from all nodes at the full rate. The preliminary data is divided into training and evaluation data sets. The training data is used to build spatial and temporal models of the data relationships. The evaluation data is used to assess the performance of various candidate scheduling strategies. The models developed in the training phase are used to impute data which is not scheduled for collection according to the candidate strategy. The results of the imputation are compared with the measured data. The schedule which meets the user's accuracy requirements and maximizes network lifetime is deemed to be the most efficient and is applied to the network during the operational phase.
The algorithm supports schedule adaptation to allow for the time-varying nature of the data relationships. Firstly, the algorithm divides the day into a number of time periods or slots. A different sub-schedule is allowed in each slot. This allows the algorithm to adapt to the differing degrees of correlation present in the data at different times of the day, e.g. midnight versus midday. Secondly, the accuracy of imputation is assessed during the operational phase. If the accuracy drops below the user-specified accuracy requirements, the slot is re-trained and the sub-schedule updated. This allows the overall schedule to track long-term changes, such as the lengthening of daytime during spring.
The algorithm differs from previous work in that it supports dynamic adaptation of schedules. The algorithm supports sub-sampling and round-robin sub-setting scheduling strategies. Variants of the algorithm are proposed for two-tier and multi-hop networks. The performance of the algorithm is assessed by simulation using real-world data sets. The algorithm is shown to significantly extend network lifetime when compared with a previously published scheduling algorithm. The round-robin sub-setting algorithm proposed herein differs from coverage-based sub-setting algorithms [15] [16] [17] in that it uses a data similarity metric rather than physical distance to measure correlation when forming subsets. The benefit of doing this is explained in Section 2.
The remainder of this paper consists of five sections. Section 2 describes related work. This is followed by an explanation of the problem in Section 3. In Section 4, the proposed algorithm is described. In Section 5, the experimental method is described. In Section 6, the results and their implications are provided. Finally, the paper ends with conclusions.

RELATED WORK
Two network topologies are commonly used for WSN applications: two-tier and multi-hop. The accompanying figure shows an example of a two-tier network and a multi-hop network. In the two-tier case, all battery-powered nodes have direct communication links with mains-powered nodes (master nodes) which can communicate data to the sink. In the multi-hop case, only the sink is mains powered and all communication must be routed to it via battery-powered nodes. In the two-tier case, power consumption per node is proportional to the number of measurements per unit time. In the conventional multi-hop case, power consumption per node is not proportional to the number of measurements per unit time, since the routing nodes must be on all of the time. However, in recent research, a number of authors have proposed cross-layer network protocols in which network availability is optimized so that it closely matches the application data transmission requirements [18] [19]. This approach, assumed herein, significantly reduces energy consumption and means that the power consumption per node is proportional to the number of measurements per unit time in the multi-hop case as well.
The scheduling algorithm proposed herein is targeted at environmental monitoring applications in which all of the data is immediately sent back to the sink. Since all of the data is sent to the sink for data gathering purposes, it makes sense to use this data for centralized scheduling as well. This obviates the need for energy-inefficient inter-node schedule negotiation and allows for exploitation of multi-hop data correlations. In addition, much more computationally complex scheduling algorithms can be used at the sink than can be performed on the nodes, further improving performance.
Reducing the amount of data gathered in a WSN can be done by sub-sampling or sub-setting. Sub-sampling is the process of making measurements less frequently, e.g. a sub-sampling ratio of 2 would increase node sampling periods from 1 minute to 2 minutes. Round-robin sub-setting is the process of using only a proportion of the nodes at any one time in a round-robin fashion, e.g. a sub-setting ratio of 2 would mean that half the nodes are sampled in even-numbered minutes (0, 2, 4, ...) and the other half are sampled in odd-numbered minutes (1, 3, 5, ...). Both of the examples halve the energy consumption of the network, but the level of accuracy in imputing the missing data varies depending on how strongly the data is temporally or spatially correlated. The algorithm proposed in this work uses both sub-sampling and round-robin sub-setting.
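The two reduction strategies can be illustrated with a minimal sketch. The function names and the node-to-subset allocation below are hypothetical and chosen only for illustration; collection rounds are assumed to be numbered from 0.

```python
def subsample_rounds(num_rounds, ratio):
    """Rounds in which a single node samples under sub-sampling."""
    return [t for t in range(num_rounds) if t % ratio == 0]


def round_robin_active(nodes, subset_of, t, ratio):
    """Nodes that sample in round t when `ratio` disjoint subsets take turns."""
    return [n for n in nodes if subset_of[n] == t % ratio]


nodes = ["a", "b", "c", "d"]
subset_of = {"a": 0, "b": 1, "c": 0, "d": 1}  # hypothetical allocation

even_round = round_robin_active(nodes, subset_of, 0, 2)
odd_round = round_robin_active(nodes, subset_of, 1, 2)
```

In both cases each node transmits in half of the rounds; what differs is whether the missing values are filled in from the node's own history (sub-sampling) or from the nodes that are active in that round (sub-setting).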
A number of publications have dealt with sub-sampling [20] [21] [22]. In all cases, measurements are suppressed, i.e. not transmitted, if they can be accurately predicted based on previous measurements. The suppression can either be a priori, before the measurement is taken, or a posteriori, after the measurement is taken. As will be seen, depending on the data set, sometimes sub-setting outperforms sub-sampling and sometimes vice versa. Hence the proposed approach supports both sub-setting and sub-sampling.
Several publications have proposed algorithms for sub-setting. These algorithms can be classified according to whether the sub-setting decision is made based on the geographical coverage of the nodes or based on the data sensed by the nodes. Coverage-based schemes attempt to schedule nodes such that the entire area of interest is covered by the fewest sensor nodes [15] [16] [17]. The difficulty with this approach is that when obstacles are present within the area being monitored, sensor readings will not be well correlated with location [23]. In such cases, the predominantly assumed disc-shaped sensing region no longer holds true. For example, two sensors may be close together but be on different sides of a wall. In addition, node location information may not be readily available. Hence, in this work, we focus on data similarity-based approaches. Another benefit of using a data similarity/correlation approach is that it can detect correlation changes in the environment over a long period of time. In this paper it is shown that, as spatial correlations change, remodeling/retraining has to be done to maintain a high quality of data gathering service.
A number of methods have been proposed for sub-setting based on data similarity. These methods can be grouped according to whether they use a centralized or distributed approach. In the centralized approach, the sink determines the sampling schedule whereas in the distributed approach, the nodes themselves decide on the subsets. The disadvantage of the distributed approach is that, if subsets are large, initializing and maintaining them requires a significant amount of inter-node communication, as in KEN [24]. As a consequence, Contour Maps and CAG [19] limit the range of subsets to one hop. The disadvantage of this is that long-distance correlations cannot be exploited. Furthermore, these sub-setting algorithms do not use a round-robin scheme and thus achieve poor load balancing.
Herein we compare the proposed approach with the algorithm described in [18], which is named GUPTA in this paper. The GUPTA algorithm uses a data-driven approach, and two-tier and multi-hop versions are described. Unlike the algorithm proposed herein, the GUPTA method does not consider temporal correlations, adaptive scheduling, load balancing or slotted scheduling. In the multi-hop version, the GUPTA algorithm is semi-distributed: even though nodes make individual decisions on whether to join a subset, a centralized data gathering phase is required so that all the nodes can gather training data from their neighbors.
In order to achieve load balancing for two-tier networks, two systems which incorporate round-robin sub-setting have been previously proposed [25] [26]. The system proposed in [25] converges slowly, forming multiple clusters before finding a satisfactory solution. This means that the system produces a significantly higher number of schedules, making it difficult to maintain. The system described in [26] was developed by the authors of this paper as a prototype. The version described in this paper has a number of improvements. In addition, we propose a novel network-optimized, load-balanced sub-setting method for multi-hop networks.
Two systems have been previously described which use both sub-setting and sub-sampling: KEN [24] and Contour Maps [27]. Unlike the proposal described herein, these algorithms do not perform any network-level optimization, in the sense that nodes still have to switch on their radios periodically to listen for packets and to relay packets even when they have no readings to send. Furthermore, round-robin sub-setting is not used.
Combining statistical WSN data models with probabilistic queries to improve the cost-effectiveness of WSN queries was investigated in the BBQ system [28]. However, BBQ focuses on multiple one-shot queries over the current state of the network, rather than continuous data gathering. In [29], SeReNe, a scheduling algorithm for answering queries, is proposed. Similar to BBQ and the method proposed herein, it first gathers historical sensor readings. Through clustering, SeReNe builds a subset of representative nodes to answer queries. Its disadvantage is that, for long-term queries, SeReNe does not employ a round-robin scheme to achieve load balancing. In [30], the authors of SeReNe briefly discuss possible ways of adapting the model over a long period of time, but this was not evaluated. KEN also uses data models to answer queries. KEN and SeReNe are similar in the sense that they are push-based methods whereas BBQ is a pull-based method. Herein, the user sets a probabilistic accuracy target a priori and possible schedules are assessed with respect to the target prior to their application.
A comparison of the various data similarity based scheduling algorithms that have been proposed is provided in Table I. The algorithm proposed herein is the first to support schedule adaptation and round-robin sub-setting.

PROBLEM STATEMENT
The goal of the scheduling algorithm is to determine the network sampling schedule which minimizes network communication for the worst-case node while ensuring that application-level accuracy requirements are met. The reason for minimizing communication of the worst-case node is to maintain load balancing, thus enabling the network to gather data continuously from all nodes for a longer period of time. Even though the sensor data of dead nodes can still be spatially imputed, validation and retraining of the spatial correlation cannot be done for a dead node when needed.
The user defines the accuracy requirement by setting a lower limit ($P_{lim}$) on the probability that the error is less than a specified threshold ($E_{lim}$). For example, the user might require that 95% of reported measurements have an error of less than 0.5 °C. In the case of measured values, the error $e$ is equal to zero. In the case of imputed values, the error may be greater than zero. The goal of the algorithm is then to determine the schedule $S_{ch}$ which minimizes the number of packets $N_p$ transmitted by the worst-case node such that the probability $p(e < E_{lim})$ is greater than $P_{lim}$.
As stated previously, data correlations can be exploited in order to impute the missing values. In most previous work, these correlations are assumed to be static. Fig. 3 shows the variation of temperature at three nodes over a day in a real-world dataset. Clearly, the rate of change and the inter-node data correlations depend on the time of day. Thus, the scheduling algorithm should take account of the fact that data correlations drift during the day and, for best performance, should use different sub-schedules at different times of the day. In addition, over long periods of time the temporal and spatial correlations which exist in the data vary, so imputation becomes less accurate. This deterioration in performance should be detected and the models re-trained.
When sub-setting, it is desirable that the subsets are disjoint and operate in a round-robin fashion so that the network is load balanced. Disjoint subsets are subsets such that for any two distinct subsets $C_i$ and $C_j$, $C_i \cap C_j = \emptyset$, i.e. every node belongs to only one subset. In the two-tier case, determining disjoint subsets which provide accurate imputation of environmental conditions at all nodes is nontrivial. In the multi-hop case, the problem is more complex since every disjoint subset must provide a representative node for each correlated region while also ensuring connectivity between all the nodes in the subset and the sink. For example, the three disjoint subsets in Figure 4 allow both load-balanced sub-setting and continuous connectivity while each correlated region is represented by a node.
Table I. Comparison of data similarity-based scheduling algorithms. Rows recoverable from the source: [19] (distributed), GUPTA [18] (semi-distributed), KEN [24] (distributed), SeReNe [29] (centralized), RRC [25] (centralized); the feature columns are not recoverable.
Figure 5 shows the performance of the sub-setting and sub-sampling methods with 75% of the data being predicted. Both methods are explained in detail in the following section. The figure shows that both algorithms perform well in the morning and at night.
During the afternoon, both algorithms experience a significant loss in performance. Thus, on average, even if the accuracy of the method meets the user's requirements initially it does not mean that the requirements are met throughout the day.
To ensure user requirements are met, the amount of data being predicted during the afternoon has to be decreased. This can be done by reducing the sub-sampling/sub-setting ratio.

PROPOSED ALGORITHM
In this section we explain the proposed Slotted-Scheduling algorithm with variants for two-tier (SS-2T) and multi-hop (SS-MH) networks. The following sub-sections provide an overview of the algorithm; explain how schedules are defined; describe how data imputation is performed; explain node to subset allocation for round-robin subsetting in both two-tier and multi-hop networks; explain the schedule selection process and detail the schedule update method.

Overview
Initially, the Slotted-Scheduler gathers training and evaluation data and, in the multi-hop case, connectivity information from the network. During training and evaluation data collection, all nodes collect data at the user-specified maximum collection rate and transmit this data back to the sink. At the sink, the training data is analyzed, on a slot-by-slot basis, to build models for data imputation. The data from the evaluation phase is then used to assess the performance of various candidate scheduling strategies, i.e. various ratios of sub-setting and sub-sampling. The sub-schedule which meets the user's accuracy requirements and minimizes energy consumption is selected for application to the network in that slot during the operational phase. The selected data collection schedule is transmitted from the sink to the nodes. The network then enters the operational mode and data is collected according to the schedule. The collected data is monitored in order to detect changes in temporal/spatial correlation. If changes are detected, the network re-enters the training and evaluation phases in order to update the models and schedule. Figure 6 illustrates how the Slotted-Scheduling algorithm operates. The figure shows a 4-slot schedule with sub-setting, sub-sampling, full-rate collection and sub-setting in the first, second, third and fourth slots, respectively. The figure also shows the temporal sequencing of the establishment, training, evaluation and operational phases. The operational phase is divided into a series of slots which repeats.
Figure 6. Slotted-Scheduler Timeline and Network Activity

Schedule Description
The schedule is based on the user-specified default data collection period. This corresponds to the maximum rate at which data can be collected, i.e. with no sub-sampling or sub-setting applied. The schedule is divided into a number of slots, or time periods, which span the day. A different sub-schedule can be specified for each slot. This allows the scheduler to adjust the data collection rate depending on the time of day. For example, in a schedule with six slots, each slot would last for four hours: slot 0 from midnight to 4 a.m., slot 1 from 4 a.m. to 8 a.m. and so on. Within each slot, the node sampling sub-schedule is specified as the rate at which the node samples relative to the default collection rate. For example, a sampling rate of 100% means that a node collects data at the default collection rate. The node sampling offset indicates the data collection round in which the node starts to sample, relative to the start of the operational phase.
A node schedule consists of:
• node id
• eight bits indicating the default data collection period (in minutes)
• eight bits indicating the slot length
• for each slot:
  • three bits indicating the sampling rate
  • five bits indicating the sampling offset
The duration of a slot equals the slot length multiplied by the default data collection period. There are eight different sampling rates which can be used, with the maximum being 100% and the minimum 12.5%.
For a twenty four slot schedule, a single node's schedule (excluding node id) is 26 bytes. In TinyOS (which has a default data packet payload of 28 bytes [31]) the cost of sending the schedules from the sink to the nodes is roughly equivalent in terms of energy consumption to sending one data measurement from all of the nodes to the sink. Piggybacking and compression schemes can be used to reduce this overhead. Data collection timing can be maintained using node wake-up synchronization [32].
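The 26-byte figure follows directly from the field widths: 1 byte for the collection period, 1 byte for the slot length, and 24 × (3 + 5) bits = 24 bytes for the slots. The sketch below packs such a schedule. The bit layout (rate code in the three high bits of each slot byte) and the mapping from rate code to sampling rate, (code + 1)/8, are assumptions made for illustration; the text only fixes the field widths and the range 12.5% to 100%.

```python
def rate_fraction(rate_code):
    """Assumed mapping: codes 0..7 stand for rates 12.5%, 25%, ..., 100%."""
    return (rate_code + 1) / 8


def pack_node_schedule(period_min, slot_len, slots):
    """Pack a node schedule (excluding node id) into bytes.

    slots is a list of (rate_code, offset) pairs, with rate_code in 0..7
    (three bits) and offset in 0..31 (five bits).
    """
    if not (0 <= period_min <= 255 and 0 <= slot_len <= 255):
        raise ValueError("period and slot length must each fit in eight bits")
    out = bytearray([period_min, slot_len])
    for rate_code, offset in slots:
        if not (0 <= rate_code <= 7 and 0 <= offset <= 31):
            raise ValueError("rate code needs 3 bits, offset needs 5 bits")
        out.append((rate_code << 5) | offset)  # assumed layout: rate in high bits
    return bytes(out)


# 1-minute collection period, 60 rounds per slot, 24 slots at the full rate.
packed = pack_node_schedule(1, 60, [(7, 0)] * 24)
```

With these assumptions, a 24-slot schedule fits in a single 28-byte payload with two bytes to spare for the node id.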
Herein, we refer to data which is scheduled for collection as collected data and data which is not scheduled for collection as non-collected data. Non-collected data must be imputed based on collected data.

Data Imputation
In the case of sub-sampling, imputation is performed using Linear Prediction (LP). The linear predictor determines the coefficients of a forward linear predictor by minimizing the prediction error in the least-squares sense based on the training data. During the operational phase, LP is used to estimate the non-collected data as a weighted sum of previous measurements obtained at the same node. Here, $x_i(i, t)$ is the current imputed sample at node $i$ at time $t$, $x_o(i, t - r)$ is the observed (measured) data at node $i$ at time $t - r$, $a(i)$ are the coefficients of the linear predictor, $r$ is the sub-sampling ratio and $p$ is the length of the predictor.
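A minimal sketch of least-squares linear prediction for a single node is given below. It assumes that the predictors are the observations at lags r, 2r, ..., pr, which is one reading of the definitions above rather than a confirmed form of the original equation. The helper names are hypothetical, and the normal equations are solved with a small Gaussian elimination so that the example needs no external libraries.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]  # fails if the training data is degenerate
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_lp(series, p, r):
    """Least-squares coefficients a_1..a_p for lags r, 2r, ..., pr."""
    times = range(p * r, len(series))
    rows = [[series[t - k * r] for k in range(1, p + 1)] for t in times]
    targets = [series[t] for t in times]
    AtA = [[sum(row[i] * row[j] for row in rows) for j in range(p)]
           for i in range(p)]
    Atb = [sum(row[i] * y for row, y in zip(rows, targets)) for i in range(p)]
    return solve(AtA, Atb)


def impute(coeffs, collected):
    """Estimate the next value from collected samples (most recent last)."""
    return sum(a * collected[-k] for k, a in enumerate(coeffs, start=1))


training = [float(t) for t in range(20)]      # full-rate training data (a ramp)
coeffs = fit_lp(training, p=2, r=2)           # close to [2, -1] for a ramp
estimate = impute(coeffs, [14.0, 16.0, 18.0])  # samples collected every 2 rounds
```

For the ramp used here the fitted predictor extrapolates linearly, so the estimate for the round after 18 at lag 2 is close to 20.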
In the case of sub-setting, only one subset of the network is collected in each data collection round. Given that subset $C_i$ is the operating subset consisting of the nodes $s_1, s_2, \ldots, s_L$, the value at a non-collected node is predicted as a weighted combination of the measurements collected from the nodes in $C_i$. The weighting coefficients are computed from the training data, where $o$ denotes the training data for the node being imputed and $O$ the training data for the remaining nodes.
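The spatial predictor and its coefficients are not reproduced in this copy of the text. One form that is consistent with the description, offered here purely as an assumption and not as the authors' confirmed equations, is an ordinary least-squares fit, where $\hat{x}(j, t)$ is the imputed value at a non-collected node $j$, $\mathbf{w}$ is the weight vector, $O$ is the matrix whose columns hold the training data of the nodes $s_1, \ldots, s_L$, and $\mathbf{o}$ is the training data of node $j$:

```latex
\hat{x}(j, t) = \sum_{l=1}^{L} w_l \, x_o(s_l, t),
\qquad
\mathbf{w} = \left( O^{\top} O \right)^{-1} O^{\top} \mathbf{o}
```

Under this assumption, one weight vector is fitted per non-collected node and per operating subset, using training data only.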

Round-Robin Sub-setting
To achieve load balancing, every node in the network is allocated to a subset and the number of nodes per subset is constant. The key to accuracy is in allocating the nodes such that every subset contains a set of nodes which accurately represents environmental conditions over the whole network. Novel algorithms have been developed to solve the node allocation problem for sub-setting in two-tier and multi-hop networks.

Two-Tier Networks
In the two-tier case, node to sub-set allocation is achieved by node clustering, followed by sub-set allocation, and allocation optimization.
Initially, nodes are clustered based on data similarity. Nodes are clustered using a Normalized Cut (N-cut) clustering algorithm [33] based on an entropy metric $S(i, j)$, which is computed from $\Sigma$, the covariance matrix of the data obtained from nodes $i$ and $j$. In this way, nodes with strong data relationships are put in the same cluster.
After clustering, node allocation is performed. The first node subset is formed by selecting one representative node from each cluster. In this way, the subset consists of nodes which represent the measurements in each cluster. The representative node is chosen as the node with the minimum total entropy $S_{min}$ within the cluster,
$S_{min} = \min_{i \in N_c} \sum_{j \in N_c,\, j \neq i} S(i, j)$
where $N_c$ are the nodes within the cluster, $i$ is the current node id and $j$ is the id of the other node. The second subset is found by excluding the already allocated nodes from the set of available nodes and repeating the representative node selection step. This process is repeated until all of the nodes in the network are allocated to a subset.
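Representative selection can be sketched as follows. The exact entropy expression is not reproduced in this copy, so the sketch assumes the joint differential entropy of a bivariate Gaussian, S = 0.5 ln((2πe)² |Σ|), which is one standard entropy measure defined on a covariance matrix. The function names and sample data are hypothetical.

```python
import math


def covariance(x, y):
    """Sample covariance of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pair_entropy(x, y):
    """Assumed metric: joint entropy of a bivariate Gaussian fitted to (x, y)."""
    det = covariance(x, x) * covariance(y, y) - covariance(x, y) ** 2
    det = max(det, 1e-12)  # guard against perfectly correlated series
    return 0.5 * math.log((2 * math.pi * math.e) ** 2 * det)


def representative(cluster, data):
    """Node with the minimum total pairwise entropy within the cluster."""
    def total(i):
        return sum(pair_entropy(data[i], data[j]) for j in cluster if j != i)
    return min(cluster, key=total)


data = {
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [0.0, 1.0, 3.0, 2.0],
    "z": [1.0, 0.0, 3.0, 2.0],
}
chosen = representative(["x", "y", "z"], data)
```

In this toy cluster, node y is more strongly related to both x and z than they are to each other, so it has the lowest total entropy and is selected.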
The sequential subset allocation process can lead to poor results, as the subsets allocated later in the process tend not to perform as well as those allocated earlier. To address this, a Genetic Algorithm (GA) is applied to optimize the node allocation. First, two subsets are picked at random. Second, one node is chosen from each subset and the two nodes are swapped. Third, if the swap causes the sum of the entropy of the two subsets to increase, then the swap is made permanent; otherwise the subsets revert to their original states. The full sub-setting algorithm is described in Algorithm 1.
Subset allocations and models are generated in this way for a range of sub-setting ratios. The allocations are saved for later evaluation, see subsection 4.5.
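The swap-based optimization step can be sketched as below. This is a minimal illustration, not Algorithm 1 itself: the `score` argument is a hypothetical stand-in for the subset entropy, and the iteration count and random seed are arbitrary choices.

```python
import random


def optimize_swaps(subsets, score, iterations, seed=0):
    """Randomly swap nodes between subsets, keeping swaps that raise the score."""
    rng = random.Random(seed)
    subsets = [list(s) for s in subsets]
    for _ in range(iterations):
        a, b = rng.sample(range(len(subsets)), 2)  # two distinct subsets
        i = rng.randrange(len(subsets[a]))
        j = rng.randrange(len(subsets[b]))
        before = score(subsets[a]) + score(subsets[b])
        subsets[a][i], subsets[b][j] = subsets[b][j], subsets[a][i]
        if score(subsets[a]) + score(subsets[b]) <= before:
            # No improvement: revert to the original allocation.
            subsets[a][i], subsets[b][j] = subsets[b][j], subsets[a][i]
    return subsets


def spread(subset):
    """Toy score: range of the node values (stand-in for subset entropy)."""
    return max(subset) - min(subset)


start = [[1, 2], [9, 10]]
result = optimize_swaps(start, spread, iterations=20)
```

Because swaps exchange one node for one node, the subset sizes stay constant, which preserves the load balancing established during allocation.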

Multi-Hop Networks
In the multi-hop case, allocation of nodes to subsets is performed in a different way. This is because, in multi-hop networks, all subsets must provide connectivity between all nodes in the subset and the sink. The algorithm works by growing the maximum number of subsets from the sink based on connectivity information and data similarity.
Using a distance criterion, the algorithm determines which nodes are one hop away from the sink. Nodes which are one hop from the sink each form the root of a new subset. Thus the number of new subsets found depends on the distance criterion: a larger distance criterion will yield a larger number of subsets. The subsets are grown by selecting nodes according to the following criteria:
• the node is 1 hop away from a node currently in the subset
• the node has the highest difference in average entropy with respect to the nodes within the subset
The subsets are grown in a round-robin fashion. If a subset cannot be grown, the method continues growing the other subsets. Once this maximum number of subsets has been formed, the method combines subsets in order to form larger subsets which are better spread over the network. The average difference in entropy between all subset pairs is found, and the subsets with the greatest difference are combined. This step is repeated until all subsets have been combined. At each step, the subset allocation is saved for later evaluation, as described in the next sub-section.

Algorithm 2: Pseudocode for multi-hop round-robin subset allocation
  X = all sensor nodes
  TL = transmission range limit
  C = subsets, one formed for each node xn within the TL of the sink
  Nc = number of subsets (equivalent to the number of 1-hop nodes from the sink)
  while unallocated nodes can still be added, for each subset Ci in turn:
    pick the node xn which is 1 hop from Ci and has the highest average entropy with Ci
  while subsets remain to be combined:
    combine subsets based on entropy
    Nc = new number of subsets
    save C
  end
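The growth phase described above can be sketched as follows. Only subset growth is shown; the subsequent merging of subsets is omitted. The `affinity` callback is a hypothetical stand-in for the average entropy between a candidate node and the subset, and the small topology is invented for illustration.

```python
def grow_subsets(roots, neighbors, affinity):
    """Grow one subset per root in round-robin order.

    roots: nodes one hop from the sink.
    neighbors: dict mapping each node to the set of nodes within one hop.
    affinity(node, subset): score of adding node to subset (higher is preferred).
    """
    subsets = [[r] for r in roots]
    allocated = set(roots)
    progress = True
    while progress:
        progress = False
        for subset in subsets:  # one growth step per subset per pass
            frontier = {n for m in subset for n in neighbors[m]} - allocated
            if not frontier:
                continue  # this subset cannot grow; keep growing the others
            best = max(sorted(frontier), key=lambda n: affinity(n, subset))
            subset.append(best)
            allocated.add(best)
            progress = True
    return subsets


# Hypothetical topology: the sink reaches r1 and r2 directly.
neighbors = {
    "r1": {"a"}, "a": {"r1", "b"}, "b": {"a"},
    "r2": {"c"}, "c": {"r2"},
}
grown = grow_subsets(["r1", "r2"], neighbors, lambda node, subset: 0.0)
```

Since a node is only added when it is one hop from a node already in the subset, every subset remains connected to the sink through its root.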

Selecting the Best Schedule
The performance of all possible sub-sampling and sub-setting strategies is assessed for each slot. The sub-sampling or sub-setting sub-schedule giving the best performance is selected for application to the network in that slot during the operational phase. The various sub-scheduling options are assessed using the evaluation data. In each case, the non-collected data is imputed and the result compared to the measured data to give the imputation error $e$:
$e(i, t) = |x_i(i, t) - x_o(i, t)|$
The standard deviation of the error, $\sigma$, is calculated over the whole network during the evaluation period. This is compared to the error target specified by the user. The target standard deviation of the error is calculated by projecting the target error limits (percentage of errors greater than threshold) onto a Gaussian probability distribution and finding the equivalent standard deviation $\sigma_{lim}$. Sub-schedules which lead to error standard deviations in excess of the target, $\sigma > \sigma_{lim}$, are rejected. Since the schedules are load balanced by construction, the energy consumption of routing is equal in all cases. Thus, the energy consumption is proportional to the number of collected measurements. Therefore, the remaining sub-schedule with the least number of measurements is selected for application to the network. The final schedule is determined by concatenation of the selected sub-schedules. If appropriate, the schedule can be compacted by merging consecutive sub-schedules that are the same, provided that the slot lengths remain equal.
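The projection onto a Gaussian and the selection rule can be sketched as below. The sketch assumes a zero-mean, two-sided interpretation of the target, i.e. P(|e| < E_lim) = P_lim, and computes the standard deviation from signed errors. Both are assumptions, since the text does not spell out the projection. The candidate names and error values are invented.

```python
from statistics import NormalDist, pstdev


def sigma_limit(e_lim, p_lim):
    """Standard deviation for which P(|e| < e_lim) = p_lim under N(0, sigma)."""
    z = NormalDist().inv_cdf((1 + p_lim) / 2)
    return e_lim / z


def select_subschedule(candidates, e_lim, p_lim):
    """Pick the admissible candidate with the fewest collected measurements.

    candidates: list of (name, num_measurements, signed_errors).
    Returns None if no candidate meets the accuracy target.
    """
    limit = sigma_limit(e_lim, p_lim)
    admissible = [c for c in candidates if pstdev(c[2]) <= limit]
    return min(admissible, key=lambda c: c[1])[0] if admissible else None


candidates = [
    ("full rate", 100, [0.0, 0.0, 0.0, 0.0]),
    ("ratio 2", 50, [0.1, -0.1, 0.1, -0.1]),
    ("ratio 4", 25, [0.5, -0.5, 0.5, -0.5]),
]
limit = sigma_limit(0.5, 0.95)
best = select_subschedule(candidates, 0.5, 0.95)
```

For the example target of 95% of errors below 0.5 °C, the limit under these assumptions is roughly 0.255 °C, so the ratio-4 candidate is rejected and the ratio-2 candidate is the cheapest admissible one.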

Schedule Update
During the operational phase, the algorithm monitors the accuracy of the spatial and temporal data imputation models. This allows the system to determine if the data characteristics have drifted since the models were last trained. This is done by comparing the prediction accuracy obtained on the evaluation data with the prediction accuracy obtained on the currently received samples.
In the case of the temporal model, the model is tested by predicting the current received sample from the last received sample (which is y samples away). This prediction is done using equation 2. The error between the current received sample and the predicted sample is found. Next, using only evaluation data, the data from the same time slot is predicted from data which is y samples away. The error found using the current received sample is compared with the error found using the evaluation data. A node is marked when the error difference is above a threshold limit. When the percentage of marked nodes is above $T_{lim}$ for a duration of $D_{lim}$ days, retraining is triggered.
The spatial model is first tested using the current samples received from the nodes of the current operational subset $C_i$. Using equation 3, each received sample is imputed using the other samples received in that particular time slot. The error between the predicted value and the actual value of each sample is found. The error results are then compared with the results obtained when the same test is repeated on the evaluation data using the same time slot and the same subset of nodes $C_i$. A node is marked when the difference between the prediction error using current received samples and the prediction error using evaluation data is above a certain threshold. As in the temporal model test, retraining is commenced when the limits $T_{lim}$ and $D_{lim}$ are exceeded.
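The marking and triggering logic can be sketched as follows. The sketch assumes that the $D_{lim}$ days must be consecutive, which the text does not state explicitly. The function names, thresholds and error values are hypothetical.

```python
def marked_fraction(current_err, eval_err, threshold):
    """Fraction of nodes whose current error departs from the evaluation error."""
    marked = [n for n in current_err
              if abs(current_err[n] - eval_err[n]) > threshold]
    return len(marked) / len(current_err)


def retraining_needed(daily_fractions, t_lim, d_lim):
    """True if the marked fraction exceeded t_lim on d_lim consecutive days."""
    run = 0
    for fraction in daily_fractions:
        run = run + 1 if fraction > t_lim else 0
        if run >= d_lim:
            return True
    return False


fraction = marked_fraction({"a": 0.9, "b": 0.2}, {"a": 0.1, "b": 0.1}, 0.5)
sustained = retraining_needed([0.6, 0.1, 0.6, 0.6], t_lim=0.5, d_lim=2)
transient = retraining_needed([0.6, 0.1, 0.6, 0.1], t_lim=0.5, d_lim=2)
```

Requiring the condition to hold for several days prevents a single unusual day from triggering a costly return to full-rate collection.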

EXPERIMENTAL METHOD
The algorithm described in the previous section was implemented in MATLAB and tested on two datasets taken from the Lausanne Urban Canopy Experiment (LUCE) [34]. Table II provides a summary of the datasets. Results were evaluated in terms of mean imputation error (see Eq. 7), percentage of non-collected data, variation of the number of operational nodes with time, and network lifetime. Mean imputation error is the mean error of the imputed non-collected data. The percentage of non-collected data (PND) is related to the amount of data transmitted and thus to the lifetime of the network. The percentage of data collected and transmitted to the sink is 100% − PND. We compare the results for different systems in terms of two definitions of lifetime. The first definition of lifetime is $L_{100\%}$, which is the length of time for which all nodes are alive. This metric is chosen because, when the first node dies, that node can no longer be used for retraining. Thus, if the node's data correlation with other nodes changes, this cannot be corrected, rendering the readings imputed from the other nodes void. The second definition of lifetime is $L_{50\%}$, which is the length of time for which 50% or more of the nodes remain alive.
Scheduling algorithms such as [32], [35] reduce idle listening significantly through the proper use of schedules. Such algorithms make the power consumption of sensor nodes closely proportional to the number of transmitted packets. For each simulation, each sensor node is initialized with a limited amount of battery power. Every transmitted packet is set to consume 1 unit of battery power. We assume the network allows piggybacking, thus ensuring that, even in the multihop case, only a single packet is transmitted by each node during each sampling cycle. Similar assumptions were made in [18].
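The energy model above (one unit of battery per transmitted packet, one packet per node per sampling cycle) can be sketched as a simple simulation. The function name and the representation of a schedule as a list of sets of transmitting node identifiers are assumptions for the example.

```python
def simulate_deaths(schedule, packet_budget):
    """Return {node: cycle index at which its battery is exhausted}.

    `schedule[t]` is the set of node ids transmitting in sampling cycle t
    (hypothetical layout). Each transmission costs one unit of battery;
    nodes that never exhaust their budget are absent from the result.
    """
    remaining = {}
    deaths = {}
    for cycle, active in enumerate(schedule):
        for node in active:
            if node in deaths:
                continue  # a dead node can no longer transmit
            remaining[node] = remaining.get(node, packet_budget) - 1
            if remaining[node] == 0:
                deaths[node] = cycle
    return deaths
```

The resulting death times can then be used to derive the lifetime metrics defined earlier in this section.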
The performance of the proposed algorithm is compared to that of the Default Network and to the GUPTA algorithm. In the Default Network, every node collects data every collection round, i.e. all data is collected.
Two variants of the GUPTA algorithm are used herein: GUPTA-2T for two-tier networks and GUPTA-MH for multi-hop networks. For the GUPTA algorithm, initially, when the correlation structure is unknown, all the network nodes are periodically involved in transmitting data to the data-gathering node using a communication tree. Using this setup, each node then collects data from its d-hop neighbors using a piggyback scheme. In GUPTA-MH simulations, each node collects 3-hop neighborhood information.
The GUPTA algorithm proposed in [18] is used on a multihop network. In [18], during each iteration, the number of nodes which can join the Connected Correlation-Dominating Set (CCDS) is bounded by the number of hops. As the GUPTA-2T algorithm used herein operates on a two-tier network, it is no longer bounded by hop count. The benefit of this is that there is a wider selection of nodes which can be added to the operating dominating set.
The GUPTA-2T algorithm works by adding the nodes which will give the most benefit to the dominating set. This is done repeatedly until there is no more benefit in adding nodes. Given that IM is the group of nodes which can be inferred by M and newIM the nodes which can be inferred by M ∪ si (si is any node not belonging to M ) then the benefit function is
• The node s has not been marked selected
• The connectivity of the communication subgraph is not affected by the deletion of the node s
• There is a correlation edge in the correlation graph such that every node in the set S is either marked selected or has a priority more than p(s).
The score p(s) is the number of nodes which are correlated with the node s. The more nodes that can be predicted by s, the higher p(s) will be.
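The greedy selection described above for GUPTA-2T can be illustrated with the following sketch. It is not the algorithm of [18]: the benefit of a candidate node is assumed here to be the number of additional nodes that become inferable when it is added, and the `infers` mapping (node to the set of nodes it can predict) is a hypothetical input that simplifies the correlation structure.

```python
def greedy_dominating_set(infers):
    """Greedily pick nodes until no candidate adds any benefit.

    `infers[s]` is the set of nodes whose readings can be inferred from
    node s (hypothetical input). Assumed benefit: the number of nodes
    newly covered by adding s, where s always covers itself.
    """
    selected = set()
    covered = set()
    while True:
        best, best_gain = None, 0
        for s in sorted(infers):  # sorted for deterministic tie-breaking
            if s in selected:
                continue
            gain = len((infers[s] | {s}) - covered)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:
            break  # no remaining node provides any benefit
        selected.add(best)
        covered |= infers[best] | {best}
    return selected
```

For instance, if node "a" can infer "b" and "c", and node "d" is correlated with nothing, the sketch selects "a" first and then "d", since "d" can only be covered by itself.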
The number of messages sent during training is not considered in the results as both algorithms require a training phase. For the GUPTA algorithm, the first 14 days were used to build the model. In the case of the proposed algorithm, the first 7 days were used to build the spatial/temporal model (training) while the subsequent 7 days were used to assess the performance of various sub-setting and sub-sampling ratios (evaluation). It is assumed that the underlying network is able to handle packet loss. In the multi-hop case, two nodes are assumed to have connectivity if they are less than 135 meters apart.
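The multi-hop connectivity assumption can be expressed as a simple distance test. In this sketch, node positions are assumed to be planar coordinates in meters, and the function name and data layout are illustrative only.

```python
import math


def connectivity(positions, max_range=135.0):
    """Return the set of node pairs (a, b), a < b, that are connected.

    `positions` maps node id -> (x, y) in meters (hypothetical layout).
    Two nodes are connected if they are strictly less than `max_range`
    meters apart, following the assumption stated in the text.
    """
    links = set()
    ids = sorted(positions)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if math.dist(positions[a], positions[b]) < max_range:
                links.add((a, b))
    return links
```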
In the case of adaptive scheduling, two days of data were used for re-scheduling the nodes. The first day is used for Re-scheduling was triggered if 60% (T_lim) of nodes are less than the user-specified error threshold for two days (D_lim). When testing rescheduling, nodes with more than 6% of missing data were deleted from the dataset, as the missing data impeded rescheduling.

RESULTS
This section is divided into two subsections covering the two-tier and multi-hop cases, respectively.
Figure 9. Performance of scheduling algorithms, ST LUCE dataset, two-tier network (panel title: Variation in the number of errors in excess of the limit with time)

Two-tier Network
First, the performance of the various sub-setting, sub-sampling and imputation methods was assessed using the ST LUCE dataset and a two-tier network. Figure 7 shows the variation in mean imputation error with the percentage of imputed data for various methods. Three methods were compared: sub-sampling (SAMP), two-tier sub-setting (SET-2T), and the full Slotted Scheduler algorithm including both sub-setting and sub-sampling (SS-2T). Rescheduling was switched off. For low imputation percentages, sub-sampling performs better than sub-setting. For high imputation percentages, sub-setting performs better than sub-sampling. The proposed Slotted Scheduling algorithm combines the advantages of sub-setting and sub-sampling and performs best in all cases. Figure 8 shows the schedule created by the Slotted Scheduler for an error limit of 1.4 °C in 80% of cases. The figure shows that the choice between sub-setting and sub-sampling, as well as the ratio, varies during the day depending on the data statistics. Between 00:00 and 09:00, sub-setting is scheduled for use. For the majority of that period, only one seventh of the nodes were scheduled to sample and transmit at each sampling period. From 09:00 till 17:00 (during the day), sub-sampling is used with a sampling ratio of 1:2. From 20:00 onwards, the scheduler reverts to the use of sub-setting.
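To make the difference between the two modes concrete, the sketch below returns the nodes that transmit in a given sampling period under each mode. It is a simplified illustration: the round-robin partition by node index is an assumption (the subsets used by the Slotted Scheduler are chosen from the observed correlations), and the function and parameter names are hypothetical.

```python
def active_nodes(nodes, period, mode, k):
    """Nodes that sample and transmit in sampling period `period`.

    mode="subset":    the nodes are split into k disjoint subsets which
                      take turns (round-robin), so 1/k of the nodes
                      transmit in every period.
    mode="subsample": all nodes transmit, but only in every k-th period
                      (a sampling ratio of 1:k).
    The index-based partition is an assumption made for illustration.
    """
    if mode == "subset":
        return [n for i, n in enumerate(nodes) if i % k == period % k]
    if mode == "subsample":
        return list(nodes) if period % k == 0 else []
    raise ValueError(f"unknown mode: {mode}")
```

With 14 nodes and k = 7, sub-setting activates two nodes per period, while sub-sampling with k = 2 activates all 14 nodes in every second period.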
The GUPTA and Slotted Scheduling algorithms were compared using error limits of 0.25 °C and of 1.2 °C in 80% of cases, respectively. Re-scheduling was switched off. Figure 9(a) shows the mean imputation error of both methods. In terms of prediction accuracy, the Slotted Scheduler performs 29.5% better. A box plot of the prediction error is presented in Figure 9(b). The box plot shows that SS-2T also performs better in terms of the distribution of errors. Figure 9(c) shows the number of operational nodes over the duration of the simulation for both methods and for the Default Network. The packet limit was set to 2,150 packets. Using the GUPTA algorithm, nodes start to die much sooner than when using the proposed algorithm. Table III shows that SS-2T outperforms GUPTA-2T in terms of both prediction accuracy and lifetime. Figure 9(d) shows how the percentage of errors in excess of the error limit varies across the time slots. For the GUPTA algorithm, the number of errors varies significantly over the slots. The proposed algorithm performs within the 80% precision limit (P_lim) for all time slots.

Figure 10 compares the performance of SS-2T with rescheduling on and off. The initial loss in performance in both cases (days 10-28) is due to the large amount of missing data in the dataset. The algorithm signals for rescheduling during the 11th day of operation but, because of the lack of data, rescheduling was not carried out until day 26. Overall, with re-scheduling switched on, an average of 81% of prediction errors are less than the error limit, while without re-scheduling 65% are less than the error limit. The version with re-scheduling on requires 58% more packets than the version without re-scheduling. Even so, SS-2T with re-scheduling on transmits one quarter as many packets as the default network.

Figure 10. Performance of SS-2T with re-scheduling on and off, ST LUCE dataset, two-tier network
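The per-slot precision check discussed above can be sketched as follows. The function names and the input layout (a mapping from time slot to a list of prediction errors) are assumptions for the example. An error is counted as within the limit when its magnitude does not exceed the error limit.

```python
def slot_precision(errors, error_limit):
    """Percentage of prediction errors whose magnitude is within the limit."""
    within = sum(1 for e in errors if abs(e) <= error_limit)
    return 100.0 * within / len(errors)


def meets_precision_limit(errors_by_slot, error_limit, p_lim):
    """True if every time slot reaches the precision limit p_lim (in %).

    `errors_by_slot` maps a time slot to its list of prediction errors
    (hypothetical layout).
    """
    return all(
        slot_precision(errors, error_limit) >= p_lim
        for errors in errors_by_slot.values()
    )
```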
Figure 11(a) shows the results of performance assessment for the two-tier Slotted-Scheduler and the GUPTA algorithms using the RH LUCE dataset. For the GUPTA algorithm the error limit was set to 0.75 °C. For the Slotted-Scheduler the error limit and precision limit were set to 2% and 80% respectively, and re-scheduling was switched off. In both cases the packet limit was 2,150. As can be seen, the Slotted-Scheduler provides greater accuracy: 19% better than GUPTA. Figure 11(b) shows that the nodes running the GUPTA schedule die faster. Table III summarizes the results obtained.

Figure 12 compares the performance of SS-2T with re-scheduling switched on and off. With re-scheduling switched on, the average percentage of prediction errors below the error limit after day 45 is 80%, while with re-scheduling off it is 75%. In terms of transmitted packets, during the operational phase, the version with re-scheduling transmitted 85% more packets than the version without. The re-scheduled version transmits one third as many packets as the default network.

Multi-hop Network

Figure 13(a) shows the performance of the multi-hop algorithms for the ST LUCE dataset. The error limits for the GUPTA and Slotted Scheduler algorithms are 0.1 °C and 0.9 °C in 80% of cases, respectively, and rescheduling was switched off. The packet limit is 3,700. The accuracy of the Slotted Scheduler is 38% better than that of the GUPTA algorithm. In terms of the distribution of prediction error, Figure 13(b) shows that SS-MH performs better than the GUPTA algorithm. The Slotted-Scheduler also performs better than the GUPTA algorithm in improving the lifetime in terms of both L100 and L50. The Slotted-Scheduler performs within the user-specified error limit.

Figure 17. Variation of accuracy with percentage of non-collected data

Figure 14 shows how the accuracy of the sub-sampling (SAMP), multi-hop sub-setting (SET-MH), multi-hop Slotted Scheduler (SS-MH) and two-tier Slotted Scheduler (SS-2T, re-scheduling off) varies with the percentage of imputed data for the ST LUCE dataset. Again, the Slotted Scheduler performs better than the sub-setting and sub-sampling algorithms. The performance of the multi-hop Slotted-Scheduler is similar to that of the two-tier algorithm, even though the subsets are constrained in that they must all provide connectivity to the sink for all nodes.

Figure 16 assesses SS-2T with and without rescheduling. Using re-scheduling, the average percentage of prediction errors less than the threshold increases from 65% to 84%. This was achieved at the cost of an 83% increase in the number of packets. As in the two-tier case, even though the algorithm signaled a retrain on day 11, it was unable to perform the retrain for several days due to the amount of missing data.

Figures 15(a) and 15(b) show the performance of the multi-hop algorithms for the RH LUCE dataset. The error limits are 0.1% for the GUPTA algorithm and 1% in 80% of cases for the Slotted Scheduler algorithm with re-scheduling off. The Slotted Scheduler outperforms the GUPTA algorithm in terms of both accuracy and lifetime. Accuracy and lifetime summaries are provided in Table IV for two cases. Figure 17 compares the performance of the multi-hop sub-sampling, sub-setting and Slotted Scheduling algorithms with the two-tier Slotted Scheduler. The findings are again similar to the two-tier case.

CONCLUSIONS
Environmental monitoring applications require nodes to continuously transmit data back to the sink. In this paper we have proposed a method which uses the initially collected data to find spatial and temporal correlations within the data. It has been shown that the performance of these spatial and temporal models varies across time and between datasets and network densities. A novel adaptive scheduling algorithm has been proposed. The algorithm incorporates novel round-robin sub-set allocation methods for two-tier and multi-hop networks. When compared to the previously proposed GUPTA algorithm, the two-tier Slotted Scheduler provides up to 226% longer lifetime and up to 30% greater imputation accuracy. In a multi-hop network, the Slotted Scheduling algorithm improves lifetime by up to 553% and can improve accuracy by up to 38% when compared with the GUPTA algorithm. It has been shown that re-scheduling can maintain the performance of the system over a long duration at a low additional cost in terms of the number of transmitted packets. The retraining results also show the importance of network load balancing, since once a node dies it can no longer be used for retraining.

ACKNOWLEDGEMENT
This research was funded by Enterprise Ireland under grant CFTD/07/IT/303.