A randomized block policy gradient algorithm with differential privacy in Content Centric Networks

Policy gradient methods are an effective means to solve the problems of mobile multimedia data transmission in Content Centric Networks. Current policy gradient algorithms impose a high computational cost when processing high-dimensional data, and the issue of privacy disclosure has not been taken into account, even though privacy protection is important in data training. Therefore, we propose a randomized block policy gradient algorithm with differential privacy. To reduce the computational complexity of processing high-dimensional data, we randomly select one block coordinate to update the gradients at each round. To solve the privacy protection problem, we add a differential privacy mechanism to the algorithm and prove that it preserves the ε-privacy level. We conduct extensive simulations in four environments: CartPole, Walker, HalfCheetah, and Hopper. Compared with methods such as important-sampling momentum-based policy gradient, Hessian-aided momentum-based policy gradient, and REINFORCE, our algorithm shows a faster convergence rate in the same environments.


Introduction
Content Centric Network (CCN) shows great potential in the future development of the Internet; many multimedia applications in CCN are, in essence, policy decision problems, for example, traffic prediction and resource allocation. Reinforcement learning 1-3 is an effective means of dealing with optimization decision problems; it is applied in many network fields, such as congestion control, 4 traffic scheduling, 5 network security, 6 and load balancing. 7 Reinforcement learning is an iterative decision process in which an agent constantly interacts with the environment to strengthen its decision-making ability; it mainly solves sequential decision problems. At each iteration, the agent changes the state of the environment by performing an action, and the environment feeds back a reward based on a reward function. The objective of reinforcement learning is to obtain a policy that maximizes the cumulative reward. Therefore, designing a better policy to achieve this goal is very important.
As an important branch of reinforcement learning, policy gradient methods are widely used in many fields, such as video games, 8 AlphaGo, 9 news recommendation, 10 and network resources. 11,12 Policy gradient methods perform better in continuous (or higher dimensional) action spaces, and they can implement a randomized strategy. Policy gradient methods directly parameterize the policy and learn based on it, defining a policy π as a function with parameter θ; the policy π is a probability distribution, indicating the probability of choosing different actions in a certain state. Once the policy is represented as a continuous function, continuous function optimization methods can be used to update the parameters, so policy gradient optimization algorithms have become a major research area. The most classic optimization algorithms are REINFORCE, 13 the Policy Gradient Theorem (PGT), 14 and the Gradient of a Partially Observable Markov Decision Process (GPOMDP). 15 These methods use the policy π_θ(a|s) with parameter θ to sample, and then they estimate the gradient and update the parameter θ. However, these algorithms have a poor convergence rate because of the large variance in gradient estimation and sampling.
To reduce the variance in gradient estimation, Stochastic Variance Reduced Gradient descent (SVRG) 16 uses a batch of samples to estimate the gradient and periodically computes the full gradient over all samples; although SVRG is very successful in supervised learning, it is difficult to apply to policy gradient. To address this problem, Stochastic Variance Reduced Policy Gradient (SVRPG) 17 is a random variance-reduction algorithm for policy gradient used to solve the Markov Decision Process (MDP). SVRPG uses importance sampling weights to retain unbiased gradient estimation, which ensures convergence under the standard MDP assumptions. But the above algorithms have a high sample complexity and need large training batches; important-sampling momentum-based policy gradient (IS-MBPG) and Hessian-aided momentum-based policy gradient (HA-MBPG) 18 then combine momentum methods [19][20][21] with importance sampling and Hessian-aided techniques, achieving a faster convergence rate with an adaptive learning rate, and IS-MBPG reduces the sample complexity of reaching an ε-approximate stationary point to O(1/ε³).
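The importance-sampling correction that SVRPG and IS-MBPG rely on can be illustrated with a small numerical check (the distributions and returns below are illustrative and not taken from any of the cited algorithms): reweighting by the ratio of target and behaviour probabilities leaves the expected return unchanged.

```python
import numpy as np

# Two discrete "policies" over three actions: the old behaviour policy q
# used for sampling and the new target policy p (toy numbers).
p = np.array([0.5, 0.3, 0.2])        # target policy
q = np.array([0.2, 0.3, 0.5])        # behaviour policy
returns = np.array([1.0, 2.0, 3.0])  # return associated with each action

# Exact expected return under the target policy.
direct = np.sum(p * returns)

# Importance-sampling identity: E_p[R] = E_q[(p/q) * R].
weights = p / q
reweighted = np.sum(q * weights * returns)
```

Because the weights cancel the sampling distribution exactly, the reweighted expectation matches the direct one, which is why the estimator stays unbiased when trajectories from an old policy are reused.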
However, the algorithms mentioned above calculate the full gradient over all dimensions of the data in each iteration, which incurs a large amount of computation. These algorithms also use centralized processing in data training, uploading data to a central node and consuming communication resources. In addition, the data training may involve sensitive data, which can cause a privacy breach.
To solve the above problems, we propose a randomized block policy gradient algorithm with differential privacy (DP-RBPG) in CCN, which combines random block-coordinate methods and differential privacy to meet the demands of high-dimensional data processing and privacy protection. At each iteration, we randomly select one block coordinate to update the gradients, which decreases the variance of the gradient estimator and accelerates the convergence rate. If all block coordinates are used in the gradient update, our algorithm reduces to IS-MBPG. DP-RBPG also attains a sample complexity of O(1/ε³) for reaching an ε-approximate stationary point; the whole training process does not need large batches, and a comparison of sample complexity with other algorithms is shown in Table 1.
Meanwhile, we achieve differential privacy protection by adding a Laplace-distributed noise perturbation in the gradient update, which increases the security of the training process and addresses the privacy leakage problem in data transmission.
The main contributions are listed as follows.
We propose DP-RBPG based on momentum and importance sampling methods. DP-RBPG accelerates gradient descent and decreases the variance of the gradient estimator without sacrificing performance, and we introduce a differential privacy protection mechanism during data processing. We add noise that follows the Laplace distribution into the gradient update, and based on the properties of the Laplace distribution, we prove that it preserves the ε-privacy level.
We implement DP-RBPG in the MuJoCo environment, which contains CartPole, Walker, Hopper, and HalfCheetah. The experimental results show that DP-RBPG improves the convergence rate, and as the data dimension goes up, the algorithm performs better. Meanwhile, as the ε-privacy level increases, the average episode return of our algorithm decreases slightly in the experiments.
The article is organized as follows. We review related works in the ''Related work'' section. The preliminaries of reinforcement learning and policy gradient are described in the ''Preliminaries'' section. Our algorithm is proposed in the ''Differential privacy randomized block policy gradient'' section. The environment deployment and the results of the comparison experiments are described in the ''Experiments'' section. Finally, we conclude the article in the ''Conclusion'' section.

Related work

Policy gradient
Recently, decreasing the variance of the gradient estimator has been the main line of research on policy gradient. Le Roux et al. 23 proposed the stochastic average gradient (SAG) algorithm, which used two gradients, the gradient of the previous iteration and a new gradient, each computed on a randomly selected sample. The convergence rate of SAG was faster than stochastic gradient descent (SGD), whose gradient is estimated from a single sample. Defazio et al. 24 proposed SAGA, an accelerated version of the SAG algorithm, which used an unbiased estimator to update the gradient and reduced the impact of noise. However, both SAG and SAGA required memory to maintain every old gradient. Xu et al. proposed SVRG, which estimated the gradient with a batch of samples over a period of time and then reselected a batch of samples to re-estimate the gradient. SVRG attained a faster convergence rate than SAG, but it could not guarantee an unbiased gradient estimate. Allen-Zhu 25 proposed a new algorithm for nonconvex stochastic optimization problems, which divided the inner loop of SVRG into n sub-epochs and effectively used the structural information of strongly nonconvex functions, making it more efficient than SVRG in stochastic optimization. Fang et al. 26 put forward the stochastic path-integrated differential estimator (SPIDER) algorithm for finding first- and second-order stationary points of nonconvex stochastic optimization, and they proved that SPIDER is optimal among nonconvex stochastic optimization algorithms. Nguyen et al. 27 proposed a new recursive gradient algorithm, the StochAstic Recursive grAdient algoritHm (SARAH), which used a recursive framework to update the gradient.
Different from the SAG and SVRG algorithms, SARAH did not need past gradients to update the gradient estimator, which saved memory space, so its convergence rate was faster than the others. However, all of these algorithms only performed well in supervised learning, not reinforcement learning.
More recently, Papini et al. 17 came up with a new reinforcement learning algorithm named SVRPG, which applied variance reduction to policy gradient. This method decreased the sample complexity and converged faster. Xu et al. proposed a better convergence analysis than SVRPG, reducing the sample complexity of reaching an ε-approximate stationary point to O(1/ε^{5/3}). Shen et al. 22 proposed Hessian-aided policy gradient (HAPG), which combined a Hessian-aided technique with policy gradient and reduced the sample complexity of reaching an ε-approximate stationary point to O(1/ε³); the method can also be applied to existing reinforcement learning techniques, and the performance of HAPG was better than SVRPG. Xu et al. 28 proposed the stochastic recursive variance reduced policy gradient (SRVR-PG), which reduced the sample complexity of reaching an ε-approximate stationary point to O(1/ε^{3/2}). Huang et al. put forward IS-MBPG and HA-MBPG, which are based on importance sampling and momentum methods. IS-MBPG and HA-MBPG improved the sample complexity of reaching an ε-approximate stationary point to O(1/ε³) with small sample batches, and they achieved an adaptive learning rate by adjusting the step size. But the above algorithms converge poorly on high-dimensional data.

Randomized block coordinate
Randomized block methods select one coordinate block to update the gradient in each iteration, which reduces iteration costs and memory requirements and speeds up the convergence rate. 29 To handle large training tasks, Diakonikolas and Orecchia 30 proposed an algorithm named accelerated alternating randomized block coordinate descent (AAR-BCD), which optimized the method of random block selection. Zhao et al. 31 proposed a mini-batch randomized block coordinate descent (MRBCD) algorithm, which updated the gradient with mini-batch samples in each round; the variance of the gradient estimator in MRBCD was reduced and the convergence rate accelerated. Lacoste-Julien et al. 32 proposed a randomized block algorithm to solve problems with block-separable constraints, mainly used in support vector machines (SVMs). Singh et al. 33 improved the Nesterov method with gradient projection, which accelerated the convergence rate. Lin et al. 34 put forward a new method for analyzing asynchronous distributed optimization. Based on the above observations, we select the randomized block-coordinate method to update the gradients, which can decrease the variance of the gradient estimator and accelerate the convergence rate.
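As a minimal sketch of the block-coordinate idea (on a toy quadratic objective, not a policy-gradient problem), updating one randomly chosen block per iteration still drives the full parameter vector toward the optimum while each step touches only a fraction of the coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_blocks = 12, 4                       # dimension and number of blocks
blocks = np.split(np.arange(d), n_blocks) # partition coordinates into blocks
x = rng.normal(size=d)                    # minimise f(x) = 0.5 * ||x||^2

eta = 0.5                                 # step size
for t in range(200):
    k = rng.integers(n_blocks)            # pick one block uniformly at random
    idx = blocks[k]
    x[idx] -= eta * x[idx]                # gradient of f restricted to block k
# x is now very close to the minimiser 0, even though each iteration
# only computed a partial (block) gradient.
```

This per-iteration cost of d/m coordinates instead of d is exactly the saving the randomized block update brings to high-dimensional gradient computation.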

Differential privacy
Differential privacy methods resist differential attacks by adding a random mechanism to achieve privacy protection. 35,36 Gao and Ma 37 proposed an algorithm that combines reinforcement learning with differential privacy for processing dynamic data. Ding et al. 38 put forward an alternating direction method based on differential privacy and reinforcement learning. Cheng et al. 39 proposed a novel stochastic gradient descent algorithm with deep learning and differential privacy. Dai et al. 40 utilized reinforcement learning to solve network security problems. However, these algorithms fall short when processing high-dimensional data.

Preliminaries
In this section, we will introduce some preliminary knowledge about reinforcement learning and policy gradient methods.

Reinforcement learning
The most basic model of reinforcement learning is the MDP, which consists of a set of environmental states $S$, a set of actions $A$, a set of rewards $R$, a discount factor $\gamma \in [0, 1]$, and the transition probability between states $P$, which takes the form

$$P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$$

The MDP can be described as follows: an agent whose initial state is $s_0$ selects an action $a_0$ from $A$ to execute, and the agent randomly transfers to the next state $s_1$ with probability $P$. We define $\pi$ as the action policy; $\pi(a|s)$ is the probability of taking the possible action $a$ in state $s$, which is expressed as

$$\pi(a|s) = \Pr(A_t = a \mid S_t = s)$$

The goal of the MDP is to find an optimal policy $\pi$, a mapping function from states to actions, that obtains the maximum reward $R^\pi_s = \sum_{a \in A} \pi(a|s) R^a_s$ from the environment.
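The single-state expected reward $R^\pi_s = \sum_{a \in A} \pi(a|s) R^a_s$ can be checked numerically (the policy and reward values below are toy numbers chosen only for illustration):

```python
import numpy as np

# A one-state illustration of R^pi_s = sum_a pi(a|s) * R^a_s.
pi = np.array([0.7, 0.3])   # policy: probabilities of the two actions
R = np.array([1.0, 5.0])    # immediate reward R^a_s for each action

# Expected reward of the policy in this state: 0.7*1.0 + 0.3*5.0 = 2.2
expected_reward = np.sum(pi * R)
```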

Policy gradient
Different from value-based methods, policy gradient methods learn based on a policy directly, outputting the action or the probability of the action from the state. Compared with the parameterized representation of the value function, the policy parameterization is simpler and has better convergence properties. The objective of policy gradient is to maximize the cumulative return. Let $\tau = \{s_0, a_0, \ldots, s_N, a_N\}$ represent a state-action trajectory sequence, where the probability of trajectory $\tau$ is $p(\tau|\theta)$ and the trajectory reward is $R(\tau) = \sum_{t=0}^{N} \gamma^t R(s_t, a_t)$. The expected cumulative reward of a parametric policy is defined as

$$f(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}[R(\tau)] \quad (3)$$

During training, the objective of policy gradient is to find the optimal parameter $\theta$, which can be described as

$$\theta^* = \arg\max_{\theta} f(\theta) \quad (4)$$

Based on formula (4), we turn the policy search into an optimization problem. Methods for solving this problem include policy gradient, Newton's and quasi-Newton methods, and interior point methods. Policy gradient is the simplest and most commonly used; it uses $\theta' = \theta + \alpha \nabla_\theta f(\theta)$ to update the parameter $\theta$. We first take the derivative of the target function $f$:

$$\nabla_\theta f(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}\big[\nabla_\theta \log p(\tau|\theta)\, R(\tau)\big] \quad (5)$$

When calculating the policy gradient, the data used are sampled under the new policy, which requires that all samples be resampled according to the new policy after each gradient update. The data utilization of this approach is very low, which slows convergence. Therefore, we bring in the concept of importance sampling, using the old parameter $\theta$ to compute the expected return of the new parameter $\theta'$:

$$\nabla_{\theta'} f(\theta') = \mathbb{E}_{\tau \sim p(\cdot|\theta)}\left[\frac{p(\tau|\theta')}{p(\tau|\theta)}\, \nabla_{\theta'} \log p(\tau|\theta')\, R(\tau)\right] \quad (6)$$

Taking the derivative of this formula, we can get the same result as equation (5). But the trajectory probability $p(\tau|\theta)$ is not known, so the result of equation (6) is difficult to compute directly.
Combining the stochastic gradient descent method, we select a batch of trajectories $M = \{\tau_i\}_{i=1}^{|M|}$ from the trajectory distribution; meanwhile, we define $\hat\nabla f(\theta)$ as an estimate of $\nabla f(\theta)$, which takes the form

$$\hat\nabla f(\theta) = \frac{1}{|M|} \sum_{i=1}^{|M|} \nabla_\theta \log p(\tau_i|\theta)\, R(\tau_i) \quad (7)$$

Based on equation (7), we add the learning rate $\xi_t$ to update the parameter:

$$\theta_{t+1} = \theta_t + \xi_t \hat\nabla f(\theta_t) \quad (8)$$

where the learning rate is greater than zero and $\theta \in \mathbb{R}^d$, with $\mathbb{R}^d$ the $d$-dimensional real space. At the beginning, the algorithm adopts a large learning rate; when the error curve enters the plateau stage, the learning rate is reduced to make more precise adjustments. We adopt an unbiased estimator $v(\tau, \theta)$ based on trajectory $\tau_i$, which satisfies $\mathbb{E}[v(\tau_i, \theta)] = \nabla f(\theta)$. Based on equations (5) and (7), the trajectory log-probability decomposes as $\nabla_\theta \log p(\tau|\theta) = \sum_{h=0}^{N} \nabla_\theta \log \pi_\theta(a_h|s_h)$, so $\hat\nabla f(\theta)$ can be rewritten in the form

$$\hat\nabla f(\theta) = \frac{1}{|M|} \sum_{i=1}^{|M|} \sum_{h=0}^{N} \nabla_\theta \log \pi_\theta(a_h^i|s_h^i)\, R(\tau_i) \quad (9)$$

However, there are some problems with equation (9): a larger cumulative return produces a larger parameter update, so the model fluctuates greatly, which may affect the final model. Although the estimator is unbiased, its variance is very large due to over-dependence on each sampled trajectory, so we bring in the baseline $c$ to reduce the variance as

$$\hat\nabla f(\theta) = \frac{1}{|M|} \sum_{i=1}^{|M|} \sum_{h=0}^{N} \nabla_\theta \log \pi_\theta(a_h^i|s_h^i)\, \big(R(\tau_i) - c\big) \quad (10)$$

We can prove that $\mathbb{E}[\nabla_\theta \log p(\tau|\theta)\, c] = 0$, since

$$\mathbb{E}\big[\nabla_\theta \log p(\tau|\theta)\, c\big] = c \int p(\tau|\theta)\, \nabla_\theta \log p(\tau|\theta)\, d\tau = c \int \nabla_\theta p(\tau|\theta)\, d\tau = c\, \nabla_\theta \int p(\tau|\theta)\, d\tau = 0$$

So the baseline $c$ reduces the variance without changing the expectation.
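The claim that a constant baseline leaves the gradient unbiased can be verified exactly for a small softmax policy (a two-action toy example; the parameter values are illustrative): summing $\pi(a)\, \nabla_\theta \log \pi(a)\, c$ over all actions gives zero.

```python
import numpy as np

# Softmax policy over two actions with parameters theta (toy example).
theta = np.array([0.2, -0.1])

def policy(theta):
    z = np.exp(theta - theta.max())  # numerically stable softmax
    return z / z.sum()

def grad_log_pi(a, theta):
    # For a softmax policy, d/dtheta log pi(a) = one_hot(a) - pi.
    g = -policy(theta)
    g[a] += 1.0
    return g

pi = policy(theta)
c = 3.0  # arbitrary constant baseline
# E[grad log pi(a) * c] = c * sum_a pi(a) * grad_log_pi(a),
# which is c * (pi - pi) = 0: the baseline does not bias the gradient.
expected = sum(pi[a] * grad_log_pi(a, theta) * c for a in range(2))
```

The cancellation holds for any constant c, which is why subtracting a baseline only reduces variance.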
Current policy gradient algorithms have a high computational cost when processing high-dimensional data, and privacy leakage has not been considered, although privacy protection is very important in data training. Therefore, we propose the differential privacy randomized block policy gradient algorithm. To reduce the computational complexity of processing high-dimensional data, one block coordinate is selected randomly to update the gradient in each round. For privacy protection, a differential privacy mechanism is added to the algorithm, and we prove that it maintains the ε-privacy level.

Differential privacy randomized block policy gradient
In this part, we combine the randomized block-coordinate method and differential privacy with the importance-sampling momentum-based policy gradient method (DP-RBPG). The randomized coordinate approach does well on optimization problems, especially in higher dimensions. The implementation of the method is shown in Algorithm 1. Because the trajectory probability is unknown, $v(\tau|\theta)$ in equation (5) cannot be calculated directly; we bring in the randomized coordinate method and importance sampling to reduce the variance. At each round, we randomly select one coordinate block to update the gradients. Meanwhile, we bring differential privacy into our algorithm by adding noise that obeys the Laplace distribution, and we prove that it preserves the ε-privacy level. We now introduce the definitions related to differential privacy.

Algorithm 1: DP-RBPG
Input: total iteration number T, constant parameters {h, n, a}, and initial parameter θ_1.
Definition 1. We first define the concept of adjacency. Consider the data sets $M = \{x_1, \ldots, x_n\}$ and $M' = \{x'_1, \ldots, x'_n\}$. The two data sets are adjacent if they differ in one and only one element, that is, there is exactly one $i$ such that $x_i \neq x'_i$, and $x_{i'} = x'_{i'}$ for every other $i' \in \{1, \ldots, n\}$.

Definition 2. We then bring in the definition of differential privacy. For a randomized algorithm $B$, whose output is not fixed but obeys a certain distribution over $\mathrm{Range}(B)$, we say $B$ preserves differential privacy if, for all adjacent data sets $M, M'$ and every $S \subseteq \mathrm{Range}(B)$,

$$\Pr[B(M) \in S] \le e^{\varepsilon} \Pr[B(M') \in S]$$

where $\varepsilon$ is a small positive number and $\mathrm{Range}(B)$ is the output range of the randomized algorithm $B$. The objective is to prove that the randomized algorithm $B$ preserves the $\varepsilon$-privacy level; the smaller the $\varepsilon$, the higher the degree of privacy protection.
Definition 3. We introduce the global sensitivity of the randomized algorithm $B$:

$$D(t) = \max_{M_t, M'_t} \big\| B(M_t) - B(M'_t) \big\|_1$$

where $\|B(M_t) - B(M'_t)\|_1$ denotes the Manhattan distance between $B(M_t)$ and $B(M'_t)$ at time $t$, and the maximum is taken over pairs of adjacent data sets; the global sensitivity reflects the maximum variation of a randomized algorithm over a pair of adjacent data sets.
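The standard Laplace mechanism calibrates the noise scale to the global sensitivity divided by the privacy budget, which is the same $q(t) = D(t)/\varepsilon$ calibration used later in Theorem 1. Below is a minimal sketch for a scalar query; the query value, sensitivity, and budget are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(value, sensitivity, eps, rng):
    # Adding Lap(sensitivity / eps) noise gives eps-differential privacy
    # for a query with the given L1 global sensitivity.
    return value + rng.laplace(loc=0.0, scale=sensitivity / eps)

true_count = 100.0
# Repeated noisy releases: the zero-mean noise keeps answers unbiased,
# while any single release hides the contribution of one individual.
draws = np.array([laplace_mechanism(true_count, 1.0, 0.5, rng)
                  for _ in range(5000)])
```

With sensitivity 1 and ε = 0.5, the noise scale is 2, so individual releases deviate by a few units while the long-run average stays at the true value.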
Formally, for $t = 1$, the policy gradient takes the form

$$w_{1,k} = v_k(\tau_1|\theta_1) + h_k(1)$$

where $h_k(t)$ obeys the Laplace distribution $\mathrm{Lap}(q(t))$ with parameter $q(t)$, and $h_k(t)$ denotes the $k$th block of the noise vector, which has the same dimension as the gradient $w$. More generally, we define the momentum-based gradient as

$$w_{t,k} = \beta\, v_k(\tau_t|\theta_t) + (1-\beta)\big[w_{t-1,k} + v_k(\tau_t|\theta_t) - \omega(\tau_t|\theta_{t-1}, \theta_t)\, v_k(\tau_t|\theta_{t-1})\big] + h_k(t)$$

where we select the $k$th coordinate block to update the policy gradient, and the block index $k$ is independent of $t$, $\theta$, and $\tau$. This generates the gradient $w_t = [w_{t,1}, w_{t,2}, \ldots, w_{t,n}]$, and we also use the $k$th block of the noise as the perturbation. Here $\omega(\tau_t|\theta_{t-1}, \theta_t)$ is the importance sampling weight; based on the fundamental application of importance sampling to policy gradient in equation (6), we compute it from the known policy probabilities as

$$\omega(\tau|\theta', \theta) = \prod_{n=0}^{N} \frac{\pi_{\theta'}(a_n|s_n)}{\pi_{\theta}(a_n|s_n)} \quad (16)$$

By equation (16), we can achieve $\mathbb{E}_{\tau \sim p(\cdot|\theta)}\big[v(\tau|\theta) - \omega(\tau|\theta', \theta)\, v(\tau|\theta')\big] = \nabla f(\theta) - \nabla f(\theta')$, which decreases the variance while keeping the expectation unchanged in the gradient calculation. As shown in Algorithm 1, we define the norm of the gradient $v(\tau|\theta_t)$ as $G_t$, and the learning rate $\xi_t$ then implements adaptive adjustment.
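A single DP-RBPG-style update can be sketched as follows. This is an illustrative simplification, not the paper's exact procedure: the momentum term is a plain exponential average rather than the variance-reduced importance-sampling estimator, the objective is a toy quadratic, and all names (`dp_rbpg_step`, `grad_fn`) and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_rbpg_step(theta, grad_fn, w_prev, beta, eta, blocks, sens, eps, rng):
    """One sketched update: pick a block, form a momentum gradient on that
    block, perturb it with Laplace noise of scale sens/eps, and ascend."""
    k = rng.integers(len(blocks))            # randomly select one block
    idx = blocks[k]
    g = grad_fn(theta)                       # stochastic gradient estimate
    w = w_prev.copy()
    w[idx] = beta * g[idx] + (1 - beta) * w_prev[idx]   # momentum on block k
    noise = rng.laplace(scale=sens / eps, size=idx.size)  # DP perturbation
    theta = theta.copy()
    theta[idx] += eta * (w[idx] + noise)     # update block k only
    return theta, w

# Toy usage: maximise f(theta) = -0.5 * ||theta||^2 (gradient is -theta).
d = 8
blocks = np.split(np.arange(d), 4)
theta = rng.normal(size=d)
w = np.zeros(d)
for _ in range(400):
    theta, w = dp_rbpg_step(theta, lambda th: -th, w, beta=0.9,
                            eta=0.1, blocks=blocks, sens=0.01, eps=1.0, rng=rng)
```

With a small noise scale the iterates settle near the optimum; a tighter privacy budget (smaller ε) raises the noise floor, which is the accuracy/privacy trade-off the experiments later measure.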
Following the proof method of IS-MBPG, we can obtain the similar conclusion that the sample complexity of our algorithm for reaching an ε-approximate stationary point is O(1/ε³).
We now prove that DP-RBPG preserves the ε-differential privacy level. We define $b_{t,k}$ as the noise-free counterpart of $w_{t,k}$, that is, the momentum-based gradient computed without the Laplace perturbation $h_k(t)$. For the $k$th block, $w^i_{t,k}$ denotes coordinate $i$ of $w_{t,k}$; the gradients $w_{t,k}$ and $b_{t,k}$ have $n_k$ dimensions, with $n_1 + \cdots + n_m = n$. Based on Definition 3, for a pair of adjacent data sets $M_t$ and $M'_t$ we can conclude that

$$\big\| b_{t,k}(M_t) - b_{t,k}(M'_t) \big\|_1 \le D(t) \quad (19)$$

Next, we analyze the privacy property of our algorithm.
Theorem 1. In Algorithm 1, the noise terms $h_k(t)$ are independent and identically distributed variables drawn from the Laplace distribution with parameter $q(t)$. If the parameter $q(t)$ satisfies $q(t) = D(t)/\varepsilon$ for all $t \in \{1, \ldots, T\}$, then our algorithm guarantees the $\varepsilon$-privacy level.
Proof. For any pair of adjacent data sets $M_t$ and $M'_t$ and any output $w$, the ratio of the output densities satisfies

$$\frac{\Pr[B(M_t) = w]}{\Pr[B(M'_t) = w]} = \prod_{i} \frac{\exp\big(-|w^i - b^i_{t,k}(M_t)|/q(t)\big)}{\exp\big(-|w^i - b^i_{t,k}(M'_t)|/q(t)\big)} \le \exp\left(\frac{\sum_i \big|b^i_{t,k}(M_t) - b^i_{t,k}(M'_t)\big|}{q(t)}\right) \le \exp\left(\frac{D(t)}{q(t)}\right)$$

The first inequality uses the triangle inequality, and the second uses equation (19); when the parameter $q(t)$ satisfies $D(t)/q(t) = \varepsilon$, the ratio is bounded by $e^{\varepsilon}$, so the algorithm preserves the $\varepsilon$-privacy level. This completes the proof of Theorem 1.
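The guarantee of Theorem 1 can be sanity-checked numerically for a scalar output: with noise scale $q = D/\varepsilon$, the ratio of the Laplace output densities on adjacent inputs (shifted by the sensitivity) never exceeds $e^{\varepsilon}$ at any point. The grid and constants below are illustrative:

```python
import numpy as np

eps, sens = 0.5, 1.0
scale = sens / eps          # Laplace scale q = D / eps, as in Theorem 1

def laplace_pdf(x, mu, b):
    # Density of the Laplace distribution Lap(mu, b).
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

# Output densities when the noise-free gradients on two adjacent data
# sets differ by exactly the sensitivity (the worst case).
xs = np.linspace(-20, 20, 2001)
ratio = laplace_pdf(xs, 0.0, scale) / laplace_pdf(xs, sens, scale)
# The ratio stays within [e^{-eps}, e^{eps}] everywhere, matching the
# epsilon-DP bound proved above.
```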

Experiments
In this part, considering the multimedia data transmission mechanism in CCN, we train our algorithm offline over many epochs. We mainly present the experimental results of our algorithm in four simulation environments: CartPole, Walker2D, HalfCheetah, and Hopper. We compare our algorithm with IS-MBPG, REINFORCE, 14 and HA-MBPG 18 in these simulation environments. The detailed environmental setup and the analysis of the experimental results are shown below. 45 In the process, we set the same initial values for all algorithms; to address the randomness that may exist in the experiments, we repeated each experiment many times and took the average result as the final result.

Experimental setup
In particular, different from REINFORCE, we rewrite CartPole to use categorical policies, which are typically used in discrete action spaces; the other environments use Gaussian policies, which are typically used in continuous action spaces. The parameters of the experiments are set as follows: the neural network for CartPole is 8 × 8, and the others are 64 × 64. The training horizon of CartPole is 100, Walker2D and HalfCheetah are 500, and Hopper is 1000. For fairness, the learning rate of all algorithms is set to 0.01. The number of timesteps for CartPole is set to 5 × 10^5, and the others are 1 × 10^7. The batch size of CartPole and Hopper is 50, and that of Walker2D and HalfCheetah is 100. Other hyperparameters of IS-MBPG used in our algorithm, such as h, n, and a, are the same as in the original paper, and the specific numerical values can be obtained from its appendix. For ease of reading, the specific parameter settings are shown in Table 2.
Specifically, similar to HA-MBPG and IS-MBPG, we use system probes to measure sample complexity, which is a better measurement standard for training. System probes avoid the problem of return failures caused by different sample lengths in the training process. In the experiments, we achieve a faster convergence rate than the other three algorithms by comparing running time for the same number of system probes; meanwhile, the average episode return is close to that of the latest algorithm, IS-MBPG. In addition, to preserve data privacy during training, we add differential privacy protection when calculating average episode returns, which increases security during data training. We add a stochastic Laplace-distributed factor h_k(t), between 0 and 1, in the gradient update. The stochastic factor h_k(t) simulates interference during network transmission, which increases the security of the data transfer process, and we demonstrate the influence of different ε-privacy levels on our algorithm in the experiments.

Experimental results
As shown in Figure 1, we deploy four algorithms, DP-RBPG, IS-MBPG, HA-MBPG, and REINFORCE, in the same environments. In the CartPole environment, our algorithm is close to IS-MBPG and obviously better than HA-MBPG and REINFORCE; to be more precise, the average training episode rewards of IS-MBPG and DP-RBPG are close to 90, while HA-MBPG and REINFORCE are close to 85 and 78, respectively. As shown in Figure 2, because there is little difference when training low-dimensional data, we record the training time of 500 epochs; the times of these algorithms are similar, with REINFORCE the fastest, followed by DP-RBPG. Because the algorithms have a certain degree of randomness, we choose the best training result of each algorithm as the final result; the running time of DP-RBPG is 99% and 96% of IS-MBPG and HA-MBPG, respectively, in CartPole. Though the convergence rate of DP-RBPG is 1% slower than REINFORCE, its average episode return is much better. In the Walker environment, as shown in Figure 1, the average episode return of our algorithm is close to IS-MBPG and obviously better than HA-MBPG and REINFORCE. In particular, the training results of IS-MBPG and DP-RBPG are close to 350, while HA-MBPG and REINFORCE are close to 290 and 230. Meanwhile, unlike the low-dimensional training in CartPole, as the data dimension goes up, DP-RBPG reaches convergence more quickly than IS-MBPG and HA-MBPG. In Figure 2, we record the training time of 200 epochs; although REINFORCE is again the fastest, its performance is poor. The running time of DP-RBPG is 87% and 82% of IS-MBPG and HA-MBPG, respectively, in the Walker environment.
In the Hopper environment, as shown in Figure 1, all of these algorithms converge to 1000. However, we can plainly see that our algorithm DP-RBPG converges faster than IS-MBPG, HA-MBPG, and REINFORCE from the beginning to the maximum. Because the data training fluctuates greatly, the training result is the average episode return; the batch size is 50,000, and we divide it into five rounds. The results show that our algorithm converges effectively. In Figure 2, the running time of DP-RBPG is lower than IS-MBPG, HA-MBPG, and REINFORCE over 200 training epochs: 89%, 85%, and 95% of IS-MBPG, HA-MBPG, and REINFORCE, respectively. In the HalfCheetah environment, as shown in Figure 1, IS-MBPG is the best in average episode return, which is close to 240. The average episode return of our algorithm DP-RBPG is close to 200, with roughly the same growth trend as IS-MBPG. The training results of HA-MBPG and REINFORCE are close to 120 and 250. In Figure 2, HA-MBPG costs the most time over 200 epochs; although REINFORCE is again the fastest, its performance in Figure 1 is poor. The running time of DP-RBPG is 91% and 85% of IS-MBPG and HA-MBPG, respectively, in the HalfCheetah environment.
In summary, from the above experimental results, the running time of our algorithm is lower than IS-MBPG and HA-MBPG in the same environments. Although the running time of REINFORCE is lower than our algorithm, its performance is poor. The reason is that REINFORCE performs episode-based updates, which have a large variance in gradient estimation. We can conclude that our algorithm iterates more times than the others in the same time period, which accelerates gradient descent. Moreover, the average episode return is close to IS-MBPG in all environments. The differences in CartPole are inconspicuous; as the dimension goes up, the gap in running time between these algorithms becomes apparent.
In addition, we bring in the differential privacy protection mechanism, which adds a stochastic Laplace-distributed factor h_k(t) between 0 and 1. To show the performance of DP-RBPG in different environments, we test it in the CartPole and Walker2D environments. As shown in Figure 3(a), we test our algorithm DP-RBPG with different ε-privacy levels (ε = 0, 0.2, 0.5, 0.8) in the CartPole environment. Due to the randomness in data processing, we use the average result over 20 training sessions; as the ε-privacy level increases, the average episode return decreases gradually. The average episode return of DP-RBPG (ε = 0) outperforms DP-RBPG with ε = 0.2, 0.5, and 0.8 by 3.5%, 4.9%, and 5.7%, respectively. The reason for the degradation of the average episode return is that we add noise to every gradient descent step, which ensures the privacy of the data training. Similarly, as shown in Figure 3(b), we test DP-RBPG with different ε-privacy levels (ε = 0, 0.2, 0.5, 0.8) in the Walker environment. We use the average result over 10 training sessions; DP-RBPG works best when the ε-privacy level is 0. The average episode return of DP-RBPG (ε = 0) exceeds DP-RBPG with ε = 0.2, 0.5, and 0.8 by 2.1%, 3.3%, and 5.5%, respectively.

Conclusion
In this article, we proposed DP-RBPG in CCN, which can solve the problems of mobile multimedia data transmission in CCN. In the process of data packet transmission in CCN, traffic optimization and congestion control can be formulated as policy decision problems based on the current network delay, packet loss rate, and maximum throughput. DP-RBPG can generate a policy to adjust the network state by constantly interacting with the environment. DP-RBPG selects one randomized coordinate block in the gradient update at every round; we compared it with other algorithms, and the experiments show that the training time of DP-RBPG is lower than the others for the same number of training epochs. Moreover, we brought in a differential privacy protection mechanism, which adds a Laplace-distributed factor in the gradient update, and we proved that it preserves the ε-privacy level. DP-RBPG therefore addresses both the heavy computation caused by calculating all data dimensions and the privacy breach in data training. This method can solve the problem of mobile multimedia data transmission in CCN while protecting user privacy and improving network performance and user experience. Meanwhile, DP-RBPG can be applied to practical problems in conventional networks, such as traffic optimization and congestion control, which can be regarded as policy decision problems. We will study these applications in future work.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural