A Distributed Q Learning Spectrum Decision Scheme for Cognitive Radio Sensor Network

Cognitive spectrum management can improve spectrum utilization efficiency, but it also increases the energy consumption of sensor network nodes. Hence, how to balance energy consumption and spectrum efficiency has become a critical challenge in resource-constrained cognitive radio sensor networks. In this paper, by analyzing the channel characteristics and the energy efficiency of the network, a joint channel selection and power control spectrum decision algorithm based on distributed Q learning is proposed. To evaluate the performance of the proposed framework, an optimal Q value subject to a communication efficiency constraint is formulated. Then, a learning strategy selection scheme is designed to solve the optimization problem by establishing a learning model. In this learning model, each node can obtain the strategies of the other nodes and select its own optimal strategy by introducing distributed strategy estimation. Simulation results show that the proposed algorithm outperforms existing methods.


Introduction
With the rapid development of wireless sensor networks, traditional fixed spectrum allocation cannot meet the spectrum requirements of radio sensor networks, which has given rise to the cognitive radio sensor network (CRSN) [1], whose defining characteristic is that cognitive techniques can be used for opportunistic spectrum access. However, dynamic spectrum management increases the energy consumption of nodes while improving network spectrum utilization [2], which is a severe challenge in a CRSN with limited energy, storage, and computing resources. Therefore, how to ensure spectrum efficiency without sacrificing energy efficiency is a critical issue in CRSN.
As a major part of spectrum management [3], spectrum decision is a crucial process in a cognitive radio network [4]: it chooses the best channels for secondary users to transmit data. Spectrum decision is usually divided into three steps [5]: channel characterization, channel selection, and parameter reconfiguration. When a detected spectrum band is available, the cognitive nodes characterize the channel according to the observed local information and the channel statistics of the primary users. Then, the nodes select suitable channels according to these characteristics. Finally, the transmission parameters are reconfigured to adapt to the selected channel.
Due to the characteristics of spectrum holes [6], the behavior of the primary user changes over time, and cognitive nodes need to make spectrum decisions dynamically to ensure communication quality [7]. Therefore, it is very important to seek an efficient spectrum decision method. Current spectrum decision methods can be divided into two categories [8]: non-load-balancing methods and load-balancing methods.
In non-load-balancing spectrum decision methods, cognitive nodes determine the communication channel according to channel conditions such as traffic load [9, 10], channel idle probability [11], expected waiting time [12, 13], expected remaining idle period [13, 14], or expected throughput [15, 16]. Most of these methods do not consider spectrum sharing among cognitive nodes: if all cognitive nodes select the same frequency band for communication, serious channel competition arises [17]. To solve this problem, some scholars have begun to study spectrum decision methods based on load balancing.

International Journal of Distributed Sensor Networks

For example, in [18], a game-based spectrum decision method is proposed to balance load, which uses a game to seek the optimal channel choice probability. In order to reach the Nash equilibrium, each node relates its utility function to the candidate channels and then calculates the selection probability of each channel by the best-response algorithm. In [19], a game theoretic framework is proposed to evaluate spectrum decision functionalities in CRSN. The spectrum decision process is cast as a noncooperative game among secondary users who can opportunistically select the "best" spectrum opportunity, under the tight constraint of not harming primary licensed users. However, because the information of each network node is changeable, the players must change their strategies instantaneously to reach equilibrium, which leads to a slow convergence speed. In this context, some scholars have introduced learning methods to solve the spectrum decision problem.
In [20], a method for channel choice probability based on adaptive learning is proposed. By exploring the uncertainty of cognitive network traffic, cognitive nodes can select the optimal channel, but convergence may be slow if the network scale is large. Shiang et al. [21] assume that cognitive nodes have different priorities and present a dynamic strategy learning (DSL) algorithm that dynamically adapts the channel selection strategy to maximize the private utility function of each node. With this method, the spectrum decisions of the cognitive nodes can reach an equilibrium, but it should be noted that this equilibrium is not the global optimum, because each node chooses its spectrum decision strategy independently.
Energy efficiency has been studied in existing spectrum decision methods, but many difficulties remain, such as how to balance communication performance and energy consumption, how to reduce communication overhead, and how to improve energy efficiency and adaptability; these difficulties limit the application of existing spectrum decision methods. Therefore, it is particularly important to design a spectrum decision method for CRSN that can fully improve the efficiency of spectrum management.
In this paper, we consider current CRSN requirements. By analyzing the network channel characterization and energy efficiency, we design an adaptive spectrum decision framework and propose a joint channel selection and power control spectrum decision algorithm based on distributed Q learning. In this algorithm, each node considers the strategies of the other nodes when selecting its own strategy and then makes the decision jointly with the other nodes. To evaluate the performance of the proposed framework and to balance energy consumption and spectrum efficiency, an optimal Q value subject to a communication efficiency constraint is formulated. Then, a learning strategy selection scheme is designed by establishing a learning model to solve the optimization problem. The effectiveness of the proposed framework is validated by simulations.
The remainder of this paper is organized as follows. Section 2 describes the system model and problem formulation. The learning model and algorithm implementation are discussed in Section 3. Simulation results and analysis are given in Section 4, followed by concluding remarks in Section 5.

System Model and Problem Formulation
In this section, we describe the network architecture and formulate the optimization problem, in which a comprehensive evaluation index is optimized subject to a communication efficiency constraint.

Network Model.
We consider a CRSN environment with a number of cognitive nodes, as shown in Figure 1. The network is based on a cluster structure, and cluster nodes cooperate with other nodes to determine the idle spectrum through spectrum sensing. Then, all the network nodes make the spectrum decision together; the data is passed to the cluster head from network nodes within one hop, and the cluster heads pass the data to the sink node over multiple hops.
In view of the cognitive wireless sensor network (CWSN) and considering the general situation, the following assumptions are made throughout this paper.
(1) When the primary user is communicating, its transmission power is very high and the transmission power of CRSN nodes is relatively small, so in this case the network nodes cannot communicate with other nodes.
(2) Different cognitive wireless sensor nodes can communicate on the same channel but must adjust their own power to avoid interference with other nodes.
(3) In the process of spectrum decision, cognitive nodes do not need to exchange information with each other; each node selects its communication channel and transmission power by itself, which helps achieve the goal of energy conservation.
(4) We assume that the channel state transition probabilities, as well as the channel rewards, are unknown to the secondary nodes at the beginning. They are fixed throughout the learning, unless otherwise noted. Therefore, the secondary nodes need to learn the channel properties.
(5) We model all noise as Gaussian white noise with mean 0 and variance \sigma^2.

Problem Formulation
2.2.1. Channel Characterization. In order to select an appropriate channel, the network nodes must describe the current characteristics of each channel and determine its current status. In this paper, we mainly consider the bandwidth, the signal interference, the false alarm rate of spectrum detection, and the idle time of the band. Whether an idle channel is suitable for communication is evaluated by a comprehensive evaluation index that serves as the current state of the channel. The following factors are considered to construct the comprehensive index:
(1) channel bandwidth B_j: cognitive nodes can detect the whole communication spectrum and find idle channels, but those channels may belong to different frequency bands. The channel division for different frequencies may differ, so the bandwidths differ and the channel capacities of the idle channels differ as well. The network node must take channel selection and power control into account according to the different channel bandwidths;
(2) signal interference I_{j,t}: I_{j,t} is the interference of the received signal on channel j at time t, comprising white noise and the interference of other nodes. Nodes can perform detection according to the current channel interference; a larger I_{j,t} means a worse channel condition;
(3) the last free time of the band T_j^{idle}: T_j^{idle} is the time interval between primary-user appearances on band j; its value predicts how long network nodes can communicate on this band. Because network nodes expect continuous, uninterrupted communication, in the channel selection process a network node tends to select a channel with a large idle time;
(4) the false alarm rate of spectrum sensing P_j^f: since primary-user behavior is unpredictable, spectrum sensing cannot be guaranteed to be perfect. Different communication frequencies exhibit different shadowing and fading characteristics, so P_j^f differs across frequency bands.
In this paper, we let C_{j,t} denote the comprehensive evaluation value of band j at time t. We then propose a multiobjective function as follows:

C_{j,t} = \omega_1 B_j - \omega_2 I_{j,t} + \omega_3 (1 - P_j^f) T_j^{idle},  (1)

where \omega_1, \omega_2, \omega_3 \in (0, 1] are the weighting factors. In the third part of formula (1), 1 - P_j^f is the probability that the secondary user successfully senses the idle band and T_j^{idle} is the time interval between primary-user appearances on band j, so (1 - P_j^f) T_j^{idle} is the effective free time of the idle band that can be used by the secondary user.
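As a concrete sketch, the evaluation index of formula (1) can be computed as below; the linear combination (bandwidth rewarded, interference penalized, effective idle time rewarded) follows the description above, while the function name and the default weight values are illustrative assumptions.

```python
def channel_score(bandwidth, interference, p_false_alarm, t_idle,
                  w1=0.4, w2=0.3, w3=0.3):
    """Comprehensive evaluation value C_{j,t} of formula (1).

    Larger bandwidth and a longer effective idle time
    (1 - P_f) * T_idle raise the score; interference lowers it.
    The weights w1, w2, w3 in (0, 1] are illustrative defaults.
    """
    return (w1 * bandwidth
            - w2 * interference
            + w3 * (1.0 - p_false_alarm) * t_idle)
```

A node would evaluate this score for every detected idle band and feed the results to the learning algorithm as the channel state.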

2.2.2. Energy Efficiency Analysis. Since cognitive wireless sensor nodes can only communicate successfully on an idle channel, the nodes need to adjust their transmission power to optimize energy efficiency. Because multiple nodes communicate on the same band, the receiving end may experience both Gaussian white noise and mutual interference from other nodes. On one hand, a network node needs to increase its transmission power to obtain a higher signal-to-interference-plus-noise ratio (SINR) and a higher transmission rate, and thus better QoS; on the other hand, it must reduce its transmission power to save energy and to reduce the interference to other nodes. Therefore, a communication efficiency index is proposed that accounts for both communication quality and energy consumption; it will be the input of the learning algorithm used to realize this balance.
Compared with the primary user, CRSN nodes transmit data with low power, so their communication range is small. In this paper, we assume that the communication of each cognitive node is entirely line-of-sight; that is, the wireless transmission model is the free-space propagation model, in which the channel gain h is

h = G_t G_r \left( \frac{c}{4\pi f d} \right)^2,  (2)

where G_t is the transmitting antenna gain and G_r is the receiving antenna gain, c is the speed of light, f is the communication frequency, and d is the distance between receiver and transmitter. Let \gamma_i be the signal-to-interference-plus-noise ratio (SINR) at receiver i; then

\gamma_i = \frac{p_i h_{ii}}{\sigma^2 + \sum_{j \ne i} p_j h_{ji}},  (3)

where p_i and p_j (p_{min} < p_i, p_j < p_{max}) are the transmission powers of transmitters i and j, p_{min} and p_{max} are the minimum and maximum thresholds of transmission power, h_{ii} is the channel gain from transmitter i to receiver i, h_{ji} is the channel gain from transmitter j to receiver i, and \sigma^2 is the Gaussian noise power. To guarantee the QoS requirement, every node must ensure that its SINR exceeds a certain threshold \gamma_i^*:

\gamma_i \ge \gamma_i^*.  (4)

In order to achieve an equilibrium between communication ability and energy consumption, this paper defines the average number of bits transmitted per unit energy as the communication efficiency index:

\Phi_i = \frac{B_i \log_2 (1 + \gamma_i)}{p_i},  (5)

where B_i is the communication bandwidth.
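Under the free-space (Friis) reading of formula (2) and a Shannon-rate reading of the efficiency index in formula (5) — both assumptions where the extracted formulas are ambiguous — the three quantities can be sketched as follows; the function names and data layouts are illustrative.

```python
import math

C_LIGHT = 3.0e8  # speed of light (m/s)

def free_space_gain(g_t, g_r, freq_hz, dist_m):
    """Free-space channel gain of formula (2), in the usual Friis form
    h = g_t * g_r * (c / (4*pi*f*d))**2 (assumed)."""
    return g_t * g_r * (C_LIGHT / (4.0 * math.pi * freq_hz * dist_m)) ** 2

def sinr(p_tx, h_own, others, noise_power):
    """SINR of formula (3); `others` is a list of (p_j, h_ji) pairs for
    interfering transmitters j != i."""
    interference = sum(p_j * h_ji for p_j, h_ji in others)
    return p_tx * h_own / (noise_power + interference)

def efficiency_index(bandwidth_hz, p_tx, h_own, others, noise_power):
    """Communication efficiency index of formula (5): bits per unit
    energy, taken here as B * log2(1 + SINR) / p."""
    g = sinr(p_tx, h_own, others, noise_power)
    return bandwidth_hz * math.log2(1.0 + g) / p_tx
```

Raising the transmit power increases the SINR but appears in the denominator of the index, which is exactly the quality-versus-energy trade-off the section describes.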

Joint Spectrum Decision.
In this paper, we propose a joint channel selection and power control spectrum decision, as shown in Figure 2. First, the network nodes must describe the current characteristics of each channel and determine the current status, which is taken as the input of distributed Q learning. In order to guarantee the QoS constraints of network communication and minimize the energy consumption of the network nodes, this paper considers both channel switching and energy efficiency in the design of the return value and then calculates the instant return values for different network conditions. Finally, we realize the joint channel selection and power control spectrum decision by introducing a distributed Q learning algorithm.
In order to balance network communication ability and energy efficiency, we formulate the optimization as follows:

Q^* = \max_{a} Q(s, a)  subject to  \Phi_i \ge \Phi_i^*,  (6)

where Q^* is the optimal Q value and \Phi_i^* is the minimum requirement on the communication efficiency index.

Learning Model and Algorithm Implementation
In order to balance communication quality and energy consumption and to optimize the network communication ability under the prerequisite of the communication efficiency index, this section presents an adaptive spectrum decision method based on distributed Q learning.

Learning Algorithm Analysis.
Reinforcement learning is an online technique [22] that takes environmental feedback as input, learns through constant interaction with the environment, and uses the feedback signal to find the optimal action adapted to the current environment. A reinforcement learning system mainly comprises two parts [23], the environment and the agent; the basic framework is shown in Figure 3. As a model-free learning algorithm, Q learning mainly cares about the evaluation value Q(s, a) and selects the optimal action in each state according to Q(s, a). Usually, we denote by Q^*(s, a) the optimal evaluation value and by \pi^*(s, a) the optimal strategy. Let s_t be the state and a_t the action at time t; the evaluation value at the next time t + 1 can be calculated as follows:

Q_{i,t+1}(s_t, a_t) = (1 - \alpha_t) Q_{i,t}(s_t, a_t) + \alpha_t \left[ r_{t+1} + \delta \max_{a} Q_{i,t}(s_{t+1}, a) \right],  (7)

where \delta is the discount factor, \alpha_t is the learning rate, r_{t+1} is the return value at the next time, and Q_{i,t}(s_t, a_t) is the state-action value function of node i, meaning the expected sum of return values obtained by executing action a_t in state s_t.
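The update of formula (7) is a one-line operation; the sketch below assumes a nested-dictionary Q table, which is an illustrative layout.

```python
def q_update(q, s, a, reward, s_next, alpha=0.1, delta=0.9):
    """One Q-learning step as in formula (7):
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + delta * max_a' Q(s',a')).

    `q` is a dict of dicts: q[state][action] -> value.
    """
    best_next = max(q[s_next].values())
    q[s][a] = (1.0 - alpha) * q[s][a] + alpha * (reward + delta * best_next)
    return q[s][a]
```

Because each node keeps and updates its own table from locally observed rewards, this update needs no message exchange, which is the point of the distributed formulation.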
In the strategy selection of a learning algorithm, there exists a balance problem between exploration and exploitation. Exploration means the agent continuously updates its learned knowledge to find a better strategy; exploitation means the agent selects the optimal action from all actions. To address this balance problem, there are algorithms such as the \epsilon-greedy algorithm and the soft-max algorithm [24]. The \epsilon-greedy algorithm searches in a random way: all actions are chosen equally, meaning the worst action and the optimal action are chosen with the same probability, which reduces the efficiency of the learning algorithm. The soft-max algorithm, in contrast, sets different strategies according to the different Q values and uses the Boltzmann distribution to define the action-selection probabilities:

\pi_{i,t}(s_t, a) = \frac{e^{Q_{i,t}(s_t, a)/\tau}}{\sum_{a' \in A} e^{Q_{i,t}(s_t, a')/\tau}},  (8)

where the factor \tau > 0 specifies how randomly actions should be chosen. High values of \tau mean that the actions are chosen almost uniformly; as \tau is reduced, the highest-valued actions are more likely to be chosen, and in the limit \tau \to 0 the best action is always chosen. Because the cognitive wireless sensor nodes are located in the same network environment, all of the network resources are competed for equally. Therefore, the behavior of each node affects the spectrum decisions of the other nodes, and the other nodes may likewise affect the strategy of this node. So we must consider the strategies of the other nodes in the strategy selection. The formula is as follows:

P_{i,t}(s_t, a) = \frac{\pi_{i,t}(s_t, a)}{\pi_{i,t}(s_t, a) + \sum_{j \ne i} \hat{\pi}_{j,t}(s_t, a)},  (9)

In order to evaluate P_{i,t}(s_t, a), each node must know the strategies of the other nodes, and ordinarily the nodes would need to exchange information with each other to obtain the optimal strategy. In the environment of a cognitive radio sensor network, however, a node can observe and record the actions of the other nodes, use this history to estimate their actions at the next time, and then select its optimal strategy. In this paper, we present a method to estimate the strategies of the other nodes, as
shown in the following formula:

\hat{\pi}_{j,t}(s_t, a) = \frac{1-\rho}{1-\rho^{t}} \sum_{\tau=0}^{t-1} \rho^{\tau} X_{j,t-\tau}(s_t, a),  (10)

where \rho \in (0, 1) is the estimation factor and X_{j,t-\tau}(s_t, a) is an indicator function taking the two values 1 and 0: if node j selects action a in state s_t at time t - \tau, then X_{j,t-\tau}(s_t, a) = 1; otherwise X_{j,t-\tau}(s_t, a) = 0. Formula (10) can be rewritten recursively as follows:

\hat{\pi}_{j,t}(s_t, a) = \frac{1-\rho}{1-\rho^{t}} X_{j,t}(s_t, a) + \rho \frac{1-\rho^{t-1}}{1-\rho^{t}} \hat{\pi}_{j,t-1}(s_t, a),  (11)

As t increases, \rho^{t} \to 0, so we can simplify formula (11) to the following update:

\hat{\pi}_{j,t}(s_t, a) = (1-\rho) X_{j,t}(s_t, a) + \rho \, \hat{\pi}_{j,t-1}(s_t, a).  (12)

State s_t. The first step is to identify what a system state represents. For example, a state can be a combination of the currently active network optimization services. The duration of the learning process depends directly on the number of states; as such, reducing their number will speed up the decision process. In this paper, the network environment state is s_t = [F_t, C_{j,t}], where F_t = \{f_1, f_2, \ldots, f_n\} is the set of detected available frequency bands and C_{j,t} is the communication state of band j, namely, the comprehensive evaluation value in formula (1).
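The soft-max selection of formula (8), the recursive neighbour-strategy estimate of formula (12), and the combined selection weight of formula (9) can be sketched as below; the dictionary layouts and the default value of the estimation factor are assumptions.

```python
import math

def boltzmann_probs(q_row, tau):
    """Soft-max probabilities of formula (8): P(a) proportional to
    exp(Q(s,a)/tau). Large tau -> near-uniform; tau -> 0 -> greedy."""
    m = max(q_row.values())  # subtract the max for numerical stability
    w = {a: math.exp((v - m) / tau) for a, v in q_row.items()}
    z = sum(w.values())
    return {a: x / z for a, x in w.items()}

def update_neighbour_estimate(pi_j, action_seen, rho=0.1):
    """Recursive estimate of a neighbour's strategy, formula (12):
    pi <- (1 - rho) * 1{a was chosen} + rho * pi."""
    for a in pi_j:
        hit = 1.0 if a == action_seen else 0.0
        pi_j[a] = (1.0 - rho) * hit + rho * pi_j[a]

def combined_weight(pi_own, pi_neighbours, action):
    """Combined selection weight of formula (9): the node's own soft-max
    probability normalized against the neighbours' estimated strategies."""
    rivals = sum(pi[action] for pi in pi_neighbours)
    return pi_own[action] / (pi_own[action] + rivals)
```

Note that only locally observed actions enter `update_neighbour_estimate`, so no explicit information exchange between nodes is required.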
Action a_t. Each state has one or more associated actions. Any change of a network property (selecting the transmission channel, changing the communication power, etc.) is considered an action. Consequently, the number of available actions in each state depends on the number of properties that can be modified, the number of distinct values that can be assigned to them, and any constraints defined by the system architect. In this paper, the action of a network node is a_t = [f_j, p_i], where f_j is the available channel selected by the node for communication and p_i is the communication power. Each network node can switch to a better channel or adjust its transmission power on the current channel.
Reward Function R_t. Rewards are assigned with the intention of reinforcing specific state-action pairs and can be positive or negative. Because network nodes adjust their actions immediately according to the received rewards, choosing and defining the rewards can be challenging. In this paper, we design different reward values for different network conditions. Consider

Input:
The initial learning rate \alpha_0, the discount factor \delta, the state set S, and the action set A.
(1) Initialize the Q value table and the strategy estimates of the other nodes;
(2) Each node obtains the network state information of the currently available channels;
(3) Calculate the comprehensive evaluation value according to formula (1) to determine the state input of the learning algorithm;
(4) If the network state changes, skip to Step 5 and select the channel and power again; otherwise skip to Step 8, and the nodes communicate normally;
(5) Calculate the instant reward value and update the learning rate, then use formula (7) to update the Q value table;
(6) Record and estimate the strategies of the other network nodes, and use formula (12) to update the node strategy;
(7) According to the current strategy, select the optimal action a_t, namely, the communication channel and power;
(8) If the quality of the selected channel is poor, namely \Phi_i < \Phi_i^*, switch the channel and return to Step 5; in this case the reward value is -0.1;
(9) If the primary user appears during communication, return to Step 2 and determine the idle spectrum set and network state again.
Algorithm 1: The joint spectrum decision of channel choice and power control with distributed Q learning.
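The control flow of Algorithm 1 can be sketched as a single-node loop; the toy environment and the \epsilon-greedy stand-in for the strategy-estimation step are simplifying assumptions for illustration, not the paper's exact procedure.

```python
import random

def toy_step(state, action):
    """Placeholder environment: with small probability a primary user
    appears (reward -0.5, restart in state 0); otherwise a fixed toy
    efficiency reward is returned."""
    if random.random() < 0.05:
        return -0.5, 0
    return 1.0, state

def run(q, n_actions, steps, alpha=0.1, delta=0.9, eps=0.1):
    """Skeleton of Algorithm 1 for one node; `q` is a list of per-state
    action-value lists."""
    s = 0
    for _ in range(steps):
        # Step 7: choose a (channel, power) action index
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: q[s][i])
        # Steps 5-6: observe the instant reward and update the Q table
        r, s_next = toy_step(s, a)
        q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + delta * max(q[s_next]))
        s = s_next
    return q
```

The real algorithm replaces `toy_step` with the sensed channel state and reward of Section 3 and replaces the \epsilon-greedy choice with the strategy-estimation rule of formulas (8)-(12).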
(1) Collision with the primary user: because the behavior of the primary user is unpredictable and spectrum detection has a certain error, a network node may collide with the primary user when it selects a channel for communication. We therefore define the reward value as the lowest, -0.5. (2) Channel switching: when the interference on the current communication channel increases or the primary user suddenly appears, the node must switch channels. But switching channels frequently leads to excessive energy consumption, so we should limit the number of channel switches and set the reward value to -0.1. (3) Power adjustment: when a network node is working on a normal communication channel, this channel can satisfy the node's communication conditions, and the node only needs to adjust its power to meet the QoS and energy consumption constraints. So we define the communication efficiency index in formula (5) as the reward value, provided the SINR satisfies constraint (4); otherwise, the reward value is 0.
Integrating all these situations, the reward function R_t can be defined piecewise: R_t = -0.5 on collision with the primary user; R_t = -0.1 on channel switching; R_t = \Phi_i for normal communication with \gamma_i \ge \gamma_i^*; and R_t = 0 otherwise.
After the nodes obtain the free frequency bands by cooperation, each node competes with the other nodes for data transmission on the available channels. The CRSN environment is a dynamic and complex network whose state is affected by many factors, so network nodes must adjust their communication parameters adaptively, reduce mutual interference as far as possible, and maximize network energy efficiency while satisfying a given communication demand. The transmission power also differs when nodes work on different channels in a wireless communication environment: good channels need only a little transmission power, while poor channels require increased transmission power to guarantee transmission. Therefore, joint decisions on the communication channel and the transmission power can meet the demands of both communication and energy efficiency.
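The piecewise reward assembled from the three cases above can be sketched as follows (the function signature is illustrative):

```python
def reward(collided, switched, sinr_val, sinr_min, phi):
    """Reward R_t combining the three cases described in the text:
    primary-user collision, channel switching, and power adjustment."""
    if collided:
        return -0.5              # collision with the primary user
    if switched:
        return -0.1              # channel-switch penalty
    if sinr_val >= sinr_min:
        return phi               # communication efficiency index Phi_i
    return 0.0                   # SINR constraint (4) violated
```

The ordering of the branches matters: a collision dominates everything else, and the efficiency reward is only granted when the SINR threshold of constraint (4) is met.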
In this paper, a joint spectrum decision on channel choice and power control based on distributed Q learning is proposed. By considering the energy efficiency and QoS constraints and reducing energy consumption as far as possible, we can extend the network survival time. In order to save energy and reduce the communication overhead, this paper adopts the distributed Q value updating method shown in formula (7).
The joint spectrum decision of channel choice and power control with distributed Q learning is shown in Algorithm 1, and the flow chart of this algorithm is shown in Figure 4.

Analytical and Simulation Results
In this paper, we assume that there are six fixed clusters and a sink node in the CRSN, where each cluster is made up of 10 nodes and each node selects an appropriate channel and transmission power for data transmission. The radius of a cluster is 70 m, and each cluster contains three primary users, whose transmission power is larger than that of the cognitive nodes. We assume that there are 20 available channels, the Gaussian white noise power is \sigma^2 = 10^{-7} mW, and the sending and receiving antenna gains are both 2. The CRSN node transmission power set is {100 mW, 125 mW, 150 mW, 175 mW, 200 mW}. The parameter values are \delta = 0.9, \rho = 0.005, and \alpha_0 = 0.1.
In this section, we use the following performance indicators to evaluate the proposed algorithm and compare it with the game-based algorithm and the dynamic strategy learning algorithm, respectively: energy efficiency, average channel switching times, average throughput, and successful transmission probability. These indices reflect the advantages and disadvantages of each strategy and algorithm.
Figure 5 shows how the energy efficiency of the different algorithms changes over time. Each algorithm converges as time goes on. The dynamic strategy learning algorithm converges quickly, but since it does not consider the other nodes' strategies during strategy selection, it cannot achieve high energy efficiency. The game-based spectrum decision is better than dynamic strategy learning in energy efficiency but performs worse than distributed Q learning, owing to the greater information exchange and larger number of iterations needed in the game framework. Figure 5 shows that the distributed Q learning algorithm achieves the best energy efficiency and the fastest convergence, because it considers the influence of both channel and power selection.
Figure 6 shows the average channel switching times: the proposed algorithm converges to 0.9, while dynamic strategy learning converges to 1.4 and the game-based algorithm converges to 3.6. The game-based spectrum decision requires communication between nodes to track the changes in nodes' information, so it needs the most channel switches. The dynamic strategy learning method does not consider the strategies of the other nodes, so it may not choose the optimal channel and must adjust the channel from time to time. The proposed algorithm performs a comprehensive evaluation and also considers the estimated strategies of the other nodes for channel and power choice. Thus, it can choose the best channel and power, yielding the fewest channel switching times and thereby reducing energy consumption.
The comparison of the average throughput in the CRSN is shown in Figure 7. The proposed algorithm achieves the best network performance, with throughput superior to the other algorithms. Compared with the other algorithms, the proposed algorithm lets each node obtain the other nodes' strategies and select its optimal strategy by introducing distributed strategy estimation. Thus, it can select the optimal channel and transmission power quickly, improving the data transmission rate. Therefore, the proposed algorithm can provide a better QoS guarantee for the CRSN.
As shown in Figure 8, the game-based algorithm reaches a successful transmission probability of 82.8%, the dynamic strategy learning algorithm reaches 87.6%, and the proposed algorithm achieves 93.8%. Because strategy estimation is used among the nodes when selecting channel and power, the proposed algorithm obtains the optimal selection strategy and thus achieves a high success rate for data communication in the CRSN. The dynamic strategy learning and game-based algorithms do not consider the other nodes' strategies, so they cannot guarantee the global optimum and have inferior transmission success rates compared with the proposed algorithm.

Conclusion
In this paper, we consider the requirements of current CRSNs and design an adaptive spectrum decision framework by analyzing the network channel characterization and energy efficiency. To balance the energy consumption and spectrum efficiency of this framework, we adopt a distributed Q learning algorithm to implement channel selection and power control jointly, which takes the channel state as input and the selected channel and transmit power as output. By using this algorithm, the network nodes can obtain the optimal transmission power and communication channel to guarantee energy efficiency and spectrum efficiency simultaneously. Future work will focus on restraining the interference between secondary nodes' data transmissions when selecting idle channels.

Figure 1 :
Figure 1: The network model of CRSN.

Figure 2 :
Figure 2: Distributed Q learning based energy efficiency optimization with joint channel selection and power control spectrum decision.

Figure 3 :
Figure 3: The interaction process of reinforcement learning.

Figure 4 :
Figure 4: The flow chart of the joint spectrum decision algorithm based on distributed Q learning.

Figure 6 :
Figure 6: The average channel switch times of network node.

Figure 7 :
Figure 7: The average throughput of network node.

Figure 8 :
Figure 8: The successful transmission probability of network node.
In this paper, we consider each network node as an agent which can adaptively select its communication channel and transmission power. This dynamic adjustment process can be modeled as a Markov decision process (MDP) given by a triple M = ⟨S, A, R⟩, where S is the finite state set of the network nodes, A is the finite action set of the network nodes, and R : S × A → ℝ is the reward function; R(s, a) signifies the reward of taking action a in state s. According to the current network environment, network nodes select an appropriate communication channel and transmission power to transmit data. We now define the state, action, and reward function, respectively.