Performance Analysis of Resource-Aware Task Scheduling Methods in Wireless Sensor Networks

Wireless sensor networks (WSNs) are an attractive platform for monitoring and measuring physical phenomena. WSNs typically consist of hundreds or thousands of tiny, battery-operated sensor nodes connected via a low data rate wireless network. A WSN application, such as object tracking or environmental monitoring, is composed of individual tasks which must be scheduled on each node. Naturally, the order of task execution influences the performance of the WSN application. Scheduling the tasks such that the performance is increased while the energy consumption remains low is a key challenge. In this paper we apply online learning to task scheduling in order to explore the trade-off between performance and energy consumption, which helps to dynamically identify effective scheduling policies for the sensor nodes. The energy consumption for computation and communication is represented by a parameter for each application task. We compare resource-aware task scheduling based on three online learning methods: independent reinforcement learning (RL), cooperative reinforcement learning (CRL), and exponential weight for exploration and exploitation (Exp3). Our evaluation is based on the performance and energy consumption of a prototypical target tracking application. We further determine the communication overhead and computational effort of these methods.


I. INTRODUCTION
A wireless sensor network (WSN) is an attractive platform for various applications including target tracking, environmental monitoring, data aggregation and smart environments. The application is composed of tasks which need to be executed on the sensor nodes during operation. The sensor nodes are typically battery-powered and thus pose strong limitations on energy as well as on computation, storage and communication capabilities [1], [2], [3], [4].
The scheduling of the individual tasks has a strong influence on the achievable performance and energy consumption. WSNs operate in a dynamic environment where the need for adaptive and autonomous task scheduling is well recognized [5]. Since it is not possible to schedule the tasks a priori, online and resource-aware task scheduling is required for a WSN. For determining the next task to execute, the sensor nodes need to consider the impact of each available task on the energy budget and the application's performance. There is a trade-off between application performance and resource consumption, and the task scheduler of a node should be able to adapt to changes in the environment. For example, in a target tracking application, sensor nodes should frequently execute the tracking task when objects are within the field of view (FOV). Since tracking is very resource consuming, this task should be avoided when no object to track is nearby. Thus, task scheduling is an important means to improve the energy/performance trade-off, and we investigate scheduling methods which are able to learn effective scheduling strategies in dynamic environments. We also investigate the effect of cooperation among neighboring nodes, i.e., the exchange of local observations, which is typical for a WSN. Cooperation among neighboring nodes improves the knowledge about the overall application state and can further improve the energy/performance trade-off. Since resource-awareness is an important aspect, we consider the energy consumption of the tasks during scheduling and aim for a low resource consumption of the scheduling algorithms themselves.
In this paper we apply online learning to task scheduling in order to explore the trade-off between performance and energy consumption. We compare resource-aware task scheduling based on three online learning methods: independent reinforcement learning (RL), cooperative reinforcement learning (CRL) and exponential weight for exploration and exploitation (Exp3). Our evaluation is based on a simulation study of the performance and energy consumption of a prototypical target tracking application. We further determine the communication overhead and computational effort of these methods.
The rest of this paper is organized as follows. Section II discusses related work, and Section III introduces the problem formulation. Section IV describes the underlying system model for task scheduling based on online learning. In Section V we present the three task scheduling methods. Section VI presents the experimental setup and discusses the simulation results for a target tracking application. Section VII concludes this paper with a summary and a brief discussion of future work.

II. RELATED WORK
In a resource-constrained WSN, effective task scheduling is very important for facilitating the efficient usage of resources [6]. Cooperation among sensor nodes, i.e., exchanging data with neighboring nodes, can be very helpful to schedule the tasks in a way that reduces energy consumption while maintaining considerable performance. Most existing methods for task scheduling in WSNs do not provide online scheduling; they mainly consider static task allocation instead of distributed task scheduling. The main difference between task allocation and task scheduling is that task allocation deals with the problem of determining a set of task assignments in a sensor network that minimizes an objective function such as the total execution time [7], [8]. The objective of task scheduling, on the other hand, is to determine the best temporal order of tasks for each sensor node. In offline scheduling, complete information about the system activities is available a priori, and the schedule can be determined at compile time. Due to the high dynamics of WSNs, complete system information is only available at runtime, which requires online scheduling [9]. In the following paragraphs we discuss related task scheduling approaches for WSNs and stress the key differences to the presented approach.
Guo et al. [10] propose a self-adaptive task allocation strategy in a WSN. They assume that the WSN is composed of a number of sensor nodes and a set of independent tasks which compete for the sensors. They consider neither distributed task scheduling nor the trade-off between energy consumption and performance.
Giannecchini et al. [11] propose an online task scheduling mechanism called collaborative resource allocation to allocate the network resources among the tasks of periodic applications in WSNs. However, this mechanism does not explicitly consider energy consumption either.
Frank et al. [6] propose an algorithm for generic task allocation in wireless sensor networks. They define rules for the task execution and propose a role-rule model for sensor networks, where "role" is used as a synonym for task and serves as a programming abstraction. This distributed approach provides a specification that defines possible roles and rules for how to assign roles to nodes. The specification is distributed to the whole network via a gateway, or alternatively it can be pre-installed on the nodes. A role assignment algorithm takes into account the rules and node properties, which may trigger task execution and in-network data aggregation. This generic role assignment approach does consider the energy consumption but not the temporal ordering of tasks on the sensor nodes.
Krishnamachari et al. [12] examine channel utilization as a resource management problem using a distributed constraint satisfaction method. They consider a wireless sensor network of n nodes placed randomly in a square area with a uniform, independent distribution. This work tests three self-configuration tasks in wireless sensor networks: partition into coordinating cliques, formation of Hamiltonian cycles and conflict-free channel scheduling. They explore the impact of varying the transmission radius on the solvability and complexity of these problems. In the case of partition into cliques and Hamiltonian cycle formation, they observe that the probability that these tasks can be performed undergoes a transition from zero to one as the transmission radius increases. This constraint satisfaction approach neither addresses the mapping of tasks to sensor nodes nor discusses the resource consumption/performance trade-off.
Dhanani et al. [13] compare utility-based information management policies in sensor networks. Here, the considered resource is information or data, and two models are distinguished: the sensor-centric utility-based model (SCUB) and the resource manager (RM) model. SCUB follows a distributed approach that instructs individual sensors to make their own decisions about what sensor information should be reported, based on a utility model for data. RM is a consolidated approach that takes into account knowledge from all sensors before making decisions. They evaluate these policies through simulation in the context of dynamically deployed sensor networks in military scenarios. Both SCUB and RM can extend the lifetime of a network compared to a network without any policy. However, this approach does not address task scheduling to improve the resource consumption/performance trade-off.
Shah et al. [14] introduce a task scheduling approach for WSNs based on an independent reinforcement learning (RL) algorithm for online task scheduling. They use Q learning [15] for the task scheduling. Their approach relies on a simple and fixed network topology consisting of three nodes and a static value for the reward function. They further consider neither cooperation among neighbors nor the energy/performance trade-off.
In our previous work [16] we applied cooperative reinforcement learning (CRL) to online task scheduling. We used SARSA(λ) [17] learning and introduced cooperation among neighboring sensor nodes to further improve the task scheduling. In this paper we introduce exponential bandit solvers to online task scheduling, i.e., we apply Exp3 (exponential weight for exploration and exploitation) [18], which is an adversarial or non-stochastic bandit solver. We compare RL, CRL and Exp3 for task scheduling in a target tracking application and analyze the performance in terms of the tracking quality/energy consumption trade-off. The proposed approach also considers cooperation where each node shares local observations of object trajectories with its neighboring nodes.

III. PROBLEM FORMULATION
In our approach the WSN is composed of N nodes represented by the set N = {n_1, ..., n_N}. We abstract the deployment of the WSN by a simple 2D space where each node n_i has a known position (u_i, v_i) and a given sensing coverage range, which is simply represented by a circle with radius r_i. All nodes within the communication range R_i can directly communicate with n_i and are referred to as neighbors. The number of neighbors of n_i is given as ngh(n_i). The available energy of node n_i is modeled by a scalar value E_i.
The WSN application is composed of A independent tasks (or actions) represented by the set Â = {a_1, ..., a_A}. Once a task is started on a specific node, it executes for a specific (short) period of time and terminates afterwards. Each execution of a task a_j on a node n_i requires some energy Ẽ_j, either for computation or for communication, and contributes to the overall application performance P. Thus, the execution of task a_j on node n_i is only feasible if E_i ≥ Ẽ_j. The set Â is known at the initiation of the application and does not change during operation. The set of available nodes N can change during operation, e.g., due to depletion of a node's energy source. The overall performance P is represented by an application-specific metric (cp. Section IV for more details). On each node, an online task scheduler selects the next task to execute among the A independent tasks. The task execution time is abstracted as a fixed period. Thus, scheduling is required at the end of each period, which is represented by the time instant t_i. We only consider non-preemptive scheduling.
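To make the model concrete, the following minimal Python sketch captures the feasibility rule and the neighbor definition; the names Node, Task, can_execute and neighbors are hypothetical and not part of the paper's notation:

```python
import math
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    energy: float          # energy demand per execution (Ẽ_j)

@dataclass
class Node:
    u: float               # known 2D position (u_i, v_i)
    v: float
    r: float               # sensing radius r_i
    R: float               # communication radius R_i
    energy: float          # residual energy E_i

    def can_execute(self, task: Task) -> bool:
        # Task a_j is only feasible on node n_i if E_i >= Ẽ_j.
        return self.energy >= task.energy

    def neighbors(self, nodes: list["Node"]) -> list["Node"]:
        # All other nodes within communication range R_i are neighbors.
        return [n for n in nodes
                if n is not self
                and math.hypot(n.u - self.u, n.v - self.v) <= self.R]
```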
The ultimate objective for our problem is to determine the order of tasks on each node such that the overall performance is maximized while the energy consumption is minimized.

IV. SYSTEM MODEL
The task scheduler operates in a highly dynamic environment, and the effect of the task ordering on the overall application performance is difficult to model. Figure 1 depicts our scheduling framework; its key components can be described as follows:
• Agent: Each sensor node embeds an agent which is responsible for executing the online learning algorithm.
• Environment: The WSN application represents the environment in our approach. The agent interacts with the environment by executing actions and receiving a reward.
• Action: An agent's action is the currently executed application task on the sensor node. At the end of each time period t_i, each node triggers the scheduler to determine the next action to execute.
• State: A state describes an internal abstraction of the application which is typically specified by some system parameters. In our target tracking application, the states are represented by the number of currently detected targets in the node's FOV and the expected arrival times of targets detected by neighboring nodes. The state transitions depend on the current state and action.
• Policy: An agent's policy determines what action will be selected in a particular state. In our case, this policy determines which task to execute in the perceived state. The policy can focus more on exploration or exploitation depending on the selected setting of the learning algorithm.
• Value function: This function defines what is good for an agent over the long run. It is built upon the reward function values over time and hence its quality totally depends on the reward function [14].

A. Set of actions
We consider the following actions in our target tracking application:
a) Detect Targets: This function scans the field of view (FOV) and returns the number of detected targets in the FOV.
b) Track Targets: This function keeps track of the targets inside the FOV and returns the current 2D positions of all targets. Every target within the FOV is assigned a unique ID number.
c) Send Message: This function sends information about the target's trajectory to neighboring nodes. The trajectory information includes (i) the current position and time of the target and (ii) the estimated speed and direction. This function is executed when the target is about to leave the FOV.
d) Predict Trajectory: This function predicts the speed and direction of the target's trajectory. A simple approach is to use the two most recent target positions, i.e., (x_t, y_t) at time t_t and (x_{t-1}, y_{t-1}) at t_{t-1}. Then the constant target speed can be estimated as

$v = \frac{\sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2}}{t_t - t_{t-1}}$ (1)

e) Intersect Trajectory: This function checks whether the trajectory intersects with the FOV and predicts the expected time of the intersection. It is executed by all nodes which receive the "target trajectory" information from a neighboring node. Trajectory intersection with the FOV of a sensor node is computed by basic algebra. The expected time to intersect the node's FOV is estimated by

$t_{arr} = \frac{D_{P_jP_i}}{v}$ (2)

where D_{P_jP_i} is the distance between the points P_j and P_i. P_j represents the point where the trajectory is predicted at node j, and P_i corresponds to the trajectory's intersection point with the FOV of node i (cp. Figure 2). v is the estimated speed as calculated by Equation 1 (see the sketch after this list).
f) Goto Sleep: This function shuts down the sensor node for a single time period. It consumes the least amount of energy of all available actions.
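A minimal sketch of the computations in Equations 1 and 2; the function names are ours, and the intersection point P_i is assumed to be already computed by the basic-algebra step mentioned above:

```python
import math

def estimate_speed(x_t, y_t, t_t, x_prev, y_prev, t_prev):
    # Eq. 1: constant speed estimated from the two most recent positions.
    return math.hypot(x_t - x_prev, y_t - y_prev) / (t_t - t_prev)

def expected_arrival_time(p_j, p_i, v, t_now):
    # Eq. 2: travel time over the distance D_{PjPi} between the predicted
    # point P_j and the FOV intersection point P_i at estimated speed v.
    d = math.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])
    return t_now + d / v
```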

B. Set of states
We abstract the application by three states at every node. Figure 3 depicts the state transition diagram, where L_c is the local clock value of the sensor node and Th_1 represents the time threshold between L_c and N_ET.
• Idle: This state indicates that there is currently no target detected within the node's FOV and the local clock is too far from the expected arrival time of any target already detected by some neighbor. If the gap between the local clock L_c and the expected arrival time N_ET is greater than or equal to a threshold Th_1 (cp. Figure 3), then the node remains in the idle state. The threshold Th_1 is set to 5 clock ticks based on our simulation studies. In this state, the sensor node performs Detect Targets less frequently to save energy.
• Awareness: In this state there is currently also no detected target in the node's FOV. However, the node has received some relevant trajectory information, and the expected arrival time of at least one target is in less than Th_1 clock ticks. In this state, the sensor node performs Detect Targets more frequently, since at least one target is expected to enter the FOV.
• Tracking: This state indicates that there is currently at least one detected target within the node's FOV. Thus, the sensor node performs tracking frequently to achieve a high tracking performance.
Obviously, the frequency of executing Detect Targets and Track Targets depends on the overall objective, i.e., whether to focus more on tracking performance or on energy consumption. The states can be identified by two application variables, i.e., the number of currently detected targets N_t and the list N_ET of arrival times of targets expected to intersect the node's FOV. Initially each node has no knowledge about which task to perform in which state; it learns this scheduling online over time. For example, Track Targets is the necessary task to keep tracking a target while it is in the FOV. The application learns online about the next task to execute based on our proposed methods. If the sensor node does not perform the Track Targets task while a target is in the FOV, it may miss the target, which implies a lower tracking quality. However, this situation could provide a better energy efficiency, since the Track Targets task consumes the highest amount of energy among all tasks. Hence, the selection of a particular task at each time step, i.e., the scheduling of tasks, has an impact on the overall tracking quality/energy consumption trade-off.
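As an illustration, the state identification can be sketched by the following hypothetical helper, which maps N_t and N_ET to the three states (with Th_1 = 5 as in our simulations):

```python
def identify_state(n_t, n_et, local_clock, th1=5):
    # n_t: number of currently detected targets in the FOV.
    # n_et: expected arrival times of targets reported by neighbors.
    if n_t > 0:
        return "tracking"        # at least one target inside the FOV
    if any(t - local_clock < th1 for t in n_et):
        return "awareness"       # a target is expected within Th_1 ticks
    return "idle"                # no target nearby or expected soon
```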

C. Reward Function
The reward function is a key system component for expressing the effect of the task execution on the system performance and resource consumption. Thus, both aspects should be covered by the reward function. Among the various options, we simply merge energy consumption and system performance using a balancing parameter. In detail, the reward function in our algorithm is defined as

$r = \beta \frac{E_i}{E_{max}} + (1 - \beta) \frac{P_t}{P}$ (3)

where the parameter β balances the conflicting objectives between E_i and P_t. E_i represents the residual energy of the node, and P_t represents the number of tracked positions of the target inside the FOV of the node. E_max is the maximum energy level of the sensor node, and P is the number of all possible detected target positions in the FOV. These two quantities are used for normalizing the energy and performance terms. By modifying the balancing parameter β we can control whether more focus is put on energy efficiency or on system performance.
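For illustration, the reward of Eq. 3 amounts to a single weighted sum; a minimal sketch, assuming E_max = 1000 units as in our simulations and an application-specific normalization constant P:

```python
def reward(residual_energy, tracked_positions, e_max=1000.0, p_max=1.0,
           beta=0.5):
    # Eq. 3: merge normalized residual energy and normalized tracking
    # performance; beta shifts the focus between the two objectives.
    return (beta * (residual_energy / e_max)
            + (1.0 - beta) * (tracked_positions / p_max))
```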

V. ONLINE LEARNING METHODS FOR TASK SCHEDULING
We use RL, CRL and Exp3 for task scheduling in a multi-target tracking application. All three are online machine learning methods; RL and CRL are reinforcement learning methods. In RL, we do not exchange information among neighboring nodes. In CRL and Exp3, we exploit cooperation by exchanging trajectory information among neighboring nodes. In the following subsections we briefly describe the three learning methods and explain their key parameters. For our experiments we abstract the tracking application with the same tasks, states and reward function for all methods.

A. Independent reinforcement learning (RL)
RL task scheduling follows the work of Shah et al. [14], which uses traditional Q learning [15] as the online learning strategy. In Q learning the scheduling policy is represented by a two-dimensional matrix Q(s, a) indexed by state-action pairs. The optimal Q value for a particular action in a particular state is the sum of the reinforcement received when that action is taken and the discounted best Q value for the state that is reached by taking that action [15].
The main idea of RL is to allow each individual sensor node to self-schedule its tasks and allocate its resources by learning their usefulness in any given state, while honoring the application-defined constraints and maximizing the total amount of reward over time.
In Q learning every agent needs to maintain a Q matrix for the value function. Initially all entries of the Q matrix are zero, and the agent of a node may be in any state. Based on the application-defined variables, the system enters a particular state and then performs an action which depends on the status of the node. The agent updates the Q value for this state-action pair as

$Q_{t+1}(s_t, a_t) = (1 - \alpha) Q_t(s_t, a_t) + \alpha \left( r_{t+1} + \gamma V_t(s_{t+1}) \right)$ (4)

with the value function

$V_t(s) = \max_{a \in A} Q_t(s, a)$ (5)

where Q_{t+1}(s_t, a_t) is the updated Q value at time t+1 after executing action a_t at time step t, and r_{t+1} represents the immediate reward received for this action. V_t denotes the value function at time t, i.e., the maximum Q value attainable by performing an action from the action set A. γ is the discount factor, which can be set to a value in [0, 1]; for higher γ values, the agent relies more on future rewards than on the immediate reward. α is the learning rate, which can also be set to a value in [0, 1]. It controls the rate at which an agent learns by giving more or less weight to newly acquired information: when α is close to 1, the agent gives more weight to the new estimate than to the previously learned utility value. Algorithm 1 depicts the RL algorithm; a sketch of the update step follows the listing.
Algorithm 1 Q learning for task scheduling.
1: Initialize Q(s, a) = 0, where s is the set of states and a is the set of actions
2: while residual energy is larger than zero do
3:    Determine current state s by application variables
4:    Select the action a which has the highest Q value
5:    Execute the selected action
6:    Calculate the reward for the executed action (Eq. 3)
7:    Update the Q value (Eq. 4)
8:    Shift to the next state based on the executed action
9: end while
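As an illustration, one update of Eqs. 4 and 5 can be written as follows; this is a sketch under the reconstruction above, with Q assumed to be a dictionary over state-action pairs initialized to zero:

```python
def q_learning_step(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.5):
    # Value of the successor state: best Q value over all actions (Eq. 5).
    v_next = max(Q[(s_next, b)] for b in actions)
    # Blend the old estimate with the reward plus discounted value (Eq. 4).
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * (reward + gamma * v_next)
```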
B. Cooperative reinforcement learning (CRL)
The CRL task scheduling follows a cooperative SARSA(λ) learning algorithm. SARSA(λ) [17], also referred to as State-Action-Reward-State-Action, is an iterative algorithm that approximates the optimal solution without knowledge of the transition probabilities, which is very important for a dynamic system like a WSN. At each state s_{t+1} of iteration t+1, it updates Q_{t+1}(s, a), which is an estimate of the Q function, by computing the estimation error δ_t after receiving the reward of the previous iteration. The SARSA(λ) algorithm has the following update rule for the Q values:

$Q_{t+1}(s, a) = Q_t(s, a) + \alpha \, \delta_t \, e_t(s, a) \quad \text{for all } s, a$ (6)

In Equation 6, α ∈ [0, 1] is the learning rate which decreases with time, and e_t(s, a) is the eligibility trace of the state-action pair (s, a). δ_t is the temporal difference error, which is calculated by the following rule:

$\delta_t = r_{t+1} + \gamma_1 \sum_{j \in ngh(n_i)} f(i, j) \, Q_t^j(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$ (7)

In Equation 7, γ_1 is a discount factor which varies from 0 to 1; the higher the value, the more the agent relies on future rewards than on the immediate reward. r_{t+1} represents the reward received for performing the action. f is the weight factor [19] for the neighbors of agent i and can be defined as

$f(i, j) = \frac{1}{ngh(n_i)} \quad \text{for all neighbors } j \text{ of node } n_i$ (8)

such that the contributions of all neighbors sum up to one:

$\sum_{j \in ngh(n_i)} f(i, j) = 1$ (9)

An important aspect of an RL framework is the trade-off between exploration and exploitation [20]. Exploration deals with randomly selecting actions which may not have a higher utility, in search of better rewarding actions, while exploitation uses the learned utility to maximize the agent's reward.
In our proposed algorithm, we use a simple heuristic where the exploration probability ε at any point in time is given by

$\epsilon = \min\left\{ max, \; min + k \cdot \frac{S_{max} - S}{S_{max}} \right\}$ (10)

where max and min define the upper and lower boundaries for the exploration factor, respectively, and k is a scaling constant. S_max represents the maximum number of states, which is three in our work, and S represents the number of states already known. At each time step, the agent calculates ε and generates a random number in the interval [0, 1]. If this random number is less than or equal to ε, the agent chooses a uniformly random task (exploration); otherwise it chooses the best task according to the Q values (exploitation):

$a_t = \arg\max_{a \in A} Q_t(s_t, a)$ (11)

SARSA(λ) improves learning through eligibility traces e_t(s, a) (cp. Equation 6). Here λ is another learning parameter, similar to α, for guaranteed convergence, and γ_2 is the discount factor of the traces. In general, eligibility traces give a higher update factor to recently revisited states. This means that the eligibility trace for a state-action pair (s, a) is reinforced if s = s_t and a = a_t. Otherwise, if the previous action a_t is not greedy, the eligibility trace is cleared.
The eligibility trace is updated by the following rule:

$e_t(s, a) = \begin{cases} \gamma_2 \lambda \, e_{t-1}(s, a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma_2 \lambda \, e_{t-1}(s, a) & \text{otherwise} \end{cases}$ (12)

Algorithm 2 depicts the cooperative SARSA(λ) learning algorithm.
Algorithm 2 SARSA(λ) learning algorithm for the target tracking application.
1: Initialize Q(s, a) = 0 and e(s, a) = 0
2: while residual energy is larger than zero do
3:    Determine current state s by application variables
4:    Select an action a, using policy ε (Eq. 10)
5:    Execute the selected action
6:    Calculate reward for the executed action (Eq. 3)
7:    Calculate the temporal difference error δ (Eq. 7)
8:    Update the eligibility trace e(s, a) (Eq. 12)
9:    Update the Q values for all state-action pairs (Eq. 6)
10:   Update the learning rate α (Eq. 13)
11:   Shift to next state based on the executed action
12: end while
The learning rate α is decreased slowly in such a way that it reflects the degree to which a state-action pair has been chosen in the recent past. It is calculated as

$\alpha = \frac{\zeta}{visited(s, a)}$ (13)

where ζ is a positive constant and visited(s, a) represents the number of times the state-action pair (s, a) has been visited so far [21].
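The following sketch illustrates one cooperative SARSA(λ) iteration under the reconstruction above; the temporal difference error delta is assumed to be precomputed via Eq. 7 from the reward and the weighted neighbor estimates, and all names are hypothetical:

```python
def sarsa_lambda_step(Q, e, visited, s, a, delta, states, actions,
                      zeta=1.0, lam=0.5, gamma2=0.5):
    # Decay the learning rate with the visit count (Eq. 13).
    visited[(s, a)] = visited.get((s, a), 0) + 1
    alpha = zeta / visited[(s, a)]
    # Reinforce the eligibility trace of the current pair (Eq. 12).
    e[(s, a)] = e.get((s, a), 0.0) + 1.0
    # Update all state-action pairs and decay their traces (Eqs. 6, 12).
    for s_ in states:
        for a_ in actions:
            trace = e.get((s_, a_), 0.0)
            Q[(s_, a_)] = Q.get((s_, a_), 0.0) + alpha * delta * trace
            e[(s_, a_)] = gamma2 * lam * trace
```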

C. Bandit solvers (Exp3)
We use the classical adversarial algorithm Exp3 (Exponential-weight algorithm for Exploration and Exploitation) for the task scheduling [18].

Algorithm 3 Exp3 learning algorithm for task scheduling.
1: Initialize the weights w_i = 1 for all actions
2: while residual energy is larger than zero do
3:    Determine current state s by application variables
4:    Calculate the probability distribution P_t over the actions from the weights
5:    Select an action a by drawing from P_t
6:    Execute the selected action
7:    Calculate reward for the executed action (Eq. 3)
8:    Estimate the reward of the executed action
9:    Update the weight of the executed action
10:   Calculate the updated probability distribution
11:   Shift to next state based on the executed action
12: end while

In Exp3 the parameter κ controls the probability with which arms are explored in each round. At each time step t, Exp3 draws an action a according to the distribution P_{1,t}, P_{2,t}, ..., P_{A,t}. This distribution is a mixture of the uniform distribution and a distribution which assigns to each action a probability mass exponential in the estimated reward for that action. Intuitively, mixing in the uniform distribution ensures that the algorithm tries out all A actions and obtains good estimates of the reward for each action.
Exp3 works by maintaining a list of weights w_i, one for each action, by using these weights to decide which action to take next based on a probability distribution P_t, and by increasing the relevant weights when the reward is positive. The egalitarianism factor κ ∈ [0, 1] tunes the desire to pick an action uniformly at random: if κ = 1, the weights have no effect on the choices at any step.
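To make this concrete, the following sketch follows the standard Exp3 formulation from [18], with κ as the egalitarianism factor; the helper names and the reward callback are ours, and rewards are assumed to be normalized to [0, 1]:

```python
import math
import random

def exp3_distribution(weights, kappa):
    # Mix the weight-proportional distribution with the uniform
    # distribution over the A actions (exploration share kappa).
    total = sum(weights)
    a_count = len(weights)
    return [(1.0 - kappa) * w / total + kappa / a_count for w in weights]

def exp3_step(weights, kappa, execute):
    # One scheduling round: draw a task, execute it, update its weight.
    probs = exp3_distribution(weights, kappa)
    action = random.choices(range(len(weights)), weights=probs)[0]
    reward = execute(action)              # reward in [0, 1], e.g., Eq. 3
    estimate = reward / probs[action]     # importance-weighted estimate
    weights[action] *= math.exp(kappa * estimate / len(weights))
    return action, reward
```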

VI. EXPERIMENTAL RESULTS AND EVALUATION
We evaluate the task scheduling methods using a WSN multi-target tracking scenario implemented in a C# simulation environment. The simulator consists of two stages: the deployment of the nodes and the execution of the tracking application. In our evaluation scenario the sensor nodes are uniformly distributed over a 2D rectangular area. A given number of sensor nodes is placed randomly in this area, which can result in partially overlapping FOVs of the nodes; however, the placement of nodes at the same position is avoided. Network parameters such as the number of nodes, the sensing radius and the transmission radius can be configured in our simulator. Once these network parameters are configured, we can run the simulation with the selected algorithm.
Targets move around in the area based on a Gauss-Markov mobility model [22], which was designed to adapt to different levels of randomness via tuning parameters. Initially, each mobile target is assigned a current speed and direction. At each time step t, the movement parameters of each target are updated based on the following rule:

$S_t = \eta S_{t-1} + (1 - \eta) \bar{S} + \sqrt{1 - \eta^2} \, S^G_{t-1}$ (14)
$D_t = \eta D_{t-1} + (1 - \eta) \bar{D} + \sqrt{1 - \eta^2} \, D^G_{t-1}$ (15)

where S_t and D_t are the current speed and direction of the target at time t. S̄ and D̄ are constants representing the mean values of speed and direction. S^G_{t-1} and D^G_{t-1} are random variables from a Gaussian distribution. η is a parameter in the range [0, 1] and is used to vary the randomness of the motion: random (Brownian) motion is obtained if η = 0, and linear motion is obtained if η = 1. At each time t, the target's position is given by the following equations:

$x_t = x_{t-1} + S_{t-1} \cos(D_{t-1})$ (16)
$y_t = y_{t-1} + S_{t-1} \sin(D_{t-1})$ (17)

In our simulation we limit the number of concurrently available targets to seven. The total energy budget for each sensor node is set to 1000 units. Table I shows the energy consumption for the execution of each action. We set the discount factors γ = 0.5, γ_1 = 0.5 and γ_2 = 0.5 for the online learning algorithms and vary the learning rate according to Equation 13, with ζ = 1. We set k = 0.25, min = 0.1, max = 0.3 and S_max = 3 in Equation 10. We set λ = 0.5 for the eligibility trace calculation in Equation 12 and the egalitarianism factor κ = 0.5 for Exp3. We consider a sensing radius of r_i = 5 and a communication radius of R_i = 8. These fixed parameter values are based on our simulation studies. For each simulation run we aggregate the achieved tracking quality and energy consumption and normalize both to [0, 1].
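For reference, one update step of the mobility model in Eqs. 14-17 can be sketched as follows (the function name is ours):

```python
import math
import random

def gauss_markov_step(speed, direction, x, y, mean_speed, mean_direction,
                      eta):
    # Eqs. 14-15: blend the previous value, the mean and Gaussian noise;
    # eta = 0 yields Brownian motion, eta = 1 linear motion.
    root = math.sqrt(1.0 - eta * eta)
    new_speed = (eta * speed + (1 - eta) * mean_speed
                 + root * random.gauss(0, 1))
    new_direction = (eta * direction + (1 - eta) * mean_direction
                     + root * random.gauss(0, 1))
    # Eqs. 16-17: advance the position with the previous speed/direction.
    new_x = x + speed * math.cos(direction)
    new_y = y + speed * math.sin(direction)
    return new_speed, new_direction, new_x, new_y
```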
For our evaluation we perform the following four experiments with the following parameter settings.
1) To find the trade-off between tracking quality and energy consumption, we vary the balancing factor β of the reward function between [0.1, 0.9] in 0.1 steps, keep the randomness of the moving targets at η = 0.5, set the egalitarianism factor of Exp3 to κ = 0.5 and fix the topology to five nodes.
2) We vary the network size to check the trade-off between tracking quality and energy consumption. We consider three different topologies consisting of 5, 10 and 20 sensor nodes with coverage ratios of 0.0029, 0.0057 and 0.0113, respectively. The coverage ratio is defined as the ratio of the aggregated FOV of all deployed sensor nodes over the entire surveillance area. We keep the balancing factor β = 0.5 and the randomness of the mobility model η = 0.5 constant for this experiment.
3) We vary the randomness of the moving targets η to one of the following values {0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.7, 0.9}, set the balancing factor to β = 0.5 and fix the topology to five nodes.
4) We evaluate RL, CRL and Exp3 in terms of average execution time and average communication effort.
Figure 4 shows the results of our first experiment. Each data point in these figures represents the average of the normalized tracking quality and energy consumption over ten complete simulation runs. The results show the tracking quality/energy consumption trade-off for RL, CRL and Exp3 when varying the balancing factor β between [0.1, 0.9] in 0.1 steps. We observe that CRL and Exp3 provide similar results, i.e., the corresponding data points are closely co-located. RL is more energy aware but is not able to achieve a high tracking quality.
Figure 5 shows the results of our second experiment. Here each data point represents the average of the normalized tracking quality and energy consumption over ten complete simulation runs when varying the network size to one of the values {5, 10, 20} for each method. The same trend can be identified, i.e., CRL and Exp3 achieve almost similar results in terms of the tracking quality/energy consumption trade-off, and RL shows a lower tracking performance with a higher energy efficiency.
Figures 6, 7 and 8 show the results of our third experiment. In this experiment, each data point represents the average of the normalized tracking quality and energy consumption over ten complete simulation runs when varying the randomness of the moving targets η to one of the values {0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.70, 0.90} for each method. From these figures, it can be seen that CRL and Exp3 outperform RL in terms of the achieved tracking performance. For low randomness (η = 0.5, 0.7 and 0.9), RL and Exp3 show very similar tracking performance, but for high randomness (η = 0.1, 0.15 and 0.2) RL performs poorly with regard to tracking performance.
Table II shows the comparison of RL, CRL and Exp3 in terms of average execution time and average communication effort. These values are derived from twenty iterations and represent the mean execution times and the mean number of Send Message task executions. We find that RL is the most resource-aware scheduling algorithm: Exp3 and CRL require 25% and 86% more execution time than RL, respectively. The communication overhead is similar for Exp3 and CRL.

VII. CONCLUSION
In this paper we applied online learning algorithms to resource-aware task scheduling in WSNs. We analyzed and compared the performance of online task scheduling methods based on three learning algorithms: RL, CRL and Exp3. Our evaluation results show that these methods provide different properties concerning the achieved performance and resource-awareness. The selection of a particular algorithm depends on the application requirements and the available resources of the sensor nodes.
Future work includes the application of our resource-aware scheduling approach to different WSN applications, the implementation on our visual sensor network platforms [23] and the comparison of our approach with other variants of reinforcement learning methods.

Fig. 1. General framework for task scheduling using online learning.

Fig. 2. Target prediction and intersection. Node j estimates the target trajectory and sends the trajectory information to its neighbors. Node i checks whether the predicted trajectory intersects its FOV and computes the expected arrival time.

Fig. 3. State transition diagram. States change according to the values of the two application variables N_t and N_ET. L_c represents the local clock value and Th_1 is a time threshold.




Fig. 4. Tracking quality/energy consumption trade-off for RL, CRL and Exp3 by varying the balancing factor β of the reward function.


Fig. 5. Tracking quality/energy consumption trade-off for RL, CRL and Exp3 by varying the network size.