A recurrent reinforcement learning approach applicable to highly uncertain environments

Reinforcement learning has been a promising approach in control and robotics because data-driven learning removes the need for detailed engineering knowledge. However, it usually requires many interactions with the environment to train a controller. This is a practical limitation in some real environments, for example, robots, where interactions with the environment are restricted and time-consuming. Learning is therefore generally conducted in a simulation environment, and the learned policy is then migrated to the real environment. However, differences between the simulation environment and the real environment, for example, friction coefficients at joints or changing loads, may cause undesired results after migration. To solve this problem, most learning approaches concentrate on retraining, system or parameter identification, or adaptive policy training. In this article, we propose an approach in which an adaptive policy is learned by extracting more information from the data. An environmental encoder, which indirectly reflects the parameters of an environment, is trained by explicitly incorporating model uncertainties into long-term planning and policy learning. This approach can identify the differences between environments when migrating the learned policy to a real environment and thus increases the adaptability of the policy. Moreover, its applicability to autonomous learning in control tasks is verified.


Introduction
Recently, reinforcement learning (RL) has shown immense potential for processing complex and large-scale tasks. [1][2][3] In particular, it has become a useful approach to realize optimal control in robotics since data-driven learning removes the need for engineering knowledge, 4 which is usually difficult to obtain. [5][6][7] However, learning is prohibitively slow, that is, the required number of interactions with the environment is impractically large. Even in problems with low-dimensional state spaces or fairly benign dynamics, thousands of trials are usually required. This data inefficiency makes it impractical to apply RL to real robots and prohibits RL approaches in more challenging scenarios. Thus, learning is generally conducted with a simulation model, and after that, a migration process is required from the simulation environment to the real environment. However, the errors (commonly referred to as the reality gap (RG)) between the simulation environment and the real environment make it challenging to apply the learned policy to the real environment. 8 In general, adding measurement sensors can increase the adaptability of a learned policy, 9 but this is both cost inefficient and time inefficient, and furthermore, the difference between the simulation environment and the real environment still cannot be clearly understood. Therefore, it is crucial to train an adaptive policy that can be applied to an environment with high uncertainty.
Suppression of RG caused by model difference and/or uncertainty in policy migration has become a hot research topic. Motor primitives have been introduced to accelerate learning speed and reduce task complexity as well as the number of trials, 10,11 but retraining needs to be conducted after migration. Trials focusing on system identification have also been reported, 12 which provide a framework for solving the problem. System identification can help to generalize the knowledge of the system to unobserved states, thus reducing the number of trials for policy optimization. [13][14][15] However, the learned policy still relies on the number of trials and the quality of the data.
On the other hand, uncertainties can be treated as noise to be handled by a robust control policy. Lee et al. 16 used a Bayesian network to estimate the error among environments and developed a robust policy. However, since uncertainty could not be fully considered, this approach was effective only when the dynamic model was a good approximation of the real environment. Unfortunately, this condition usually cannot be satisfied in a complex dynamic system such as an actual robot. 17 To solve the problem, Yu et al. 18 built an online system identification model that was able to consider all the uncertain factors, but the accuracy of the Q-value decreased because the window of motion history was narrowed, and furthermore, the model took the environment parameters as an explicit training target.
In this article, we propose a recurrent RL approach, which is based on the deep deterministic policy gradient (DDPG) architecture. 19 It achieves an adaptive policy by combining an environmental encoder (EE) with a universal policy. As a recurrent neural network (RNN) can integrate information across time frames, 20 the EE is built with an RNN from the motion history (a time series of state-action pairs). The proposed approach is called recurrent DDPG (RDDPG). The critic network and the EE are trained to estimate Q-values in all possible situations of an uncertain environment. The latent variables between the EE and the critic network are defined as meta-parameters, which are used to identify the environment parameters in continuous state-action domains. Thus, this approach can give an accurate estimation of the Q-values and consequently achieve an adaptive policy.
This article is organized as follows: the related work is described in the second section, and the key ideas of the proposed approach, that is, learning framework, policy improvement, and unsupervised learning of the EE, are given in the third section. The fourth section describes the simulation experiment and discusses the effectiveness of the proposed approach.

Related work
Deep Q-network is a well-known deep RL method proposed by DeepMind. 1 It achieved massive success in higher-dimensional problems with discrete action spaces, such as the Atari games. However, in many tasks of interest, especially physical control tasks, the action space is continuous. To solve this problem, DDPG was proposed. It is an algorithm that concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function and uses the Q-function to learn the policy. DDPG gained great success in observable problems, such as the cart-pole swing-up task and the reaching task. It can learn value functions in a stable and robust way because the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and it is trained with a target Q-network to give consistent targets. 19 However, the task studied in this article can be classified as a partially observable Markov decision process (POMDP) due to the existence of environmental uncertainty. The agent cannot directly observe the parameters of the environment because they are not contained in the observation. An RNN can solve this type of problem since it learns across time series data.
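The off-policy update described above can be sketched with a toy linear critic. This is a hypothetical numpy illustration, not the networks used in this article: the Bellman target y = r + γQ′(s′, μ′(s′)) is computed with the target network, the main critic takes a gradient step on the squared temporal difference error, and the target weights are updated softly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic and its target copy: Q(s, a) = w . [s, a].
w_main = rng.normal(size=5)          # 4 state dims + 1 action dim
w_target = w_main.copy()             # target network starts as a copy

def q_value(w, s, a):
    return float(w @ np.concatenate([s, [a]]))

def policy(s):                       # toy deterministic policy (placeholder)
    return float(np.tanh(s.sum()))

gamma = 0.99
# One transition (s, a, r, s') sampled from the replay buffer.
s, a, r, s_next = rng.normal(size=4), 0.3, 1.0, rng.normal(size=4)

# Bellman target uses the *target* network for consistent targets.
y = r + gamma * q_value(w_target, s_next, policy(s_next))

# Gradient step on the squared TD error (linear critic => gradient is the input).
td_error = q_value(w_main, s, a) - y
w_main -= 0.01 * td_error * np.concatenate([s, [a]])

# Soft update of the target network (tau << 1).
tau = 0.005
w_target = tau * w_main + (1 - tau) * w_target
```

The soft update keeps the target network slowly trailing the main network, which is what stabilizes the bootstrap targets.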
In general, a POMDP is a sequential decision-making model in which the underlying states of an environment are only partially available, that is, the observation received by the agent is an incomplete state. It can be described as a 6-tuple (S, A, P, R, Ω, O), where S, A, P, and R are the states, actions, transitions, and rewards, respectively, whereas Ω and O are the observations and the conditional observation probability, respectively. The agent receives an observation o ∈ Ω instead of the complete system state s ∈ S. The observation is generated from the underlying system state according to the probability distribution o ~ O(s). RL has no explicit mechanism for deciphering the underlying state of a POMDP, and it is effective only when the observation reflects the state of the environment. In our case, the Q-value and the action of the learned policy cannot be accurately generated from the observation of the POMDP. To solve this problem, we narrowed the gaps between the two pairs, that is, between Q(o, a) and Q(s, a) and between a(o) and a(s). The parameters of an environment can be expressed by the time series of state-action pairs. Therefore, it is possible to extract information from the time series with an RNN to refine the observation. Thus, we introduced recurrence to RL to build meta-parameters so that the actual Q-values could be estimated by the value network. In applying an RNN, we used long short-term memory (LSTM), which is designed for learning long-term dependencies in time series and thus addresses the problem of errors propagating back in time. The problems of vanishing/exploding gradients are prevented in LSTM by using constant error carousels (CECs). 21 Figure 1 shows the architecture of LSTM. It adds or deletes information with gates, which selectively allow information to pass through a sigmoid layer σ. LSTM uses three gate structures, that is, forget gates, input gates, and output gates.
Forget gates yield a vector f_t according to the output y_{t-1} at the previous moment and the input x_t at the current moment. Input gates determine the CEC state c_t according to the intermediate information i_t and the candidate state c̃_t. Output gates determine the output y_t according to c_t and o_t. The error between the output prediction y_t(x_t | w_l) and the target y*_t(x_t) is minimized by updating the weight w_l, and this updating is conducted at each time step as

w_l ← w_l − η ∂E_t/∂w_l,  E_t = ||y*_t(x_t) − y_t(x_t | w_l)||²     (1)

Since the physical parameters of the environment change randomly, the Q-value changes in each episode of learning even though the policy is unchanged. This change causes uncertainty in the value network. However, an accurate value network is the premise of an optimal policy. To solve this problem, we use meta-parameters, which are generated by the LSTM to reflect the change of the Q-value. Taking meta-parameters as an additional input in the learning, the value network can be specific to each episode.
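A single LSTM step with the three gates described above can be sketched in plain numpy. The weights below are random placeholders and the layer sizes are arbitrary; this is an illustration of the gate equations, not the trained EE.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM step: forget gate f_t, input gate i_t, output gate o_t."""
    z = np.concatenate([x_t, y_prev])
    f_t = sigmoid(W["f"] @ z + b["f"])     # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])     # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])   # candidate cell state (c-tilde)
    c_t = f_t * c_prev + i_t * c_hat       # CEC: additive cell update
    o_t = sigmoid(W["o"] @ z + b["o"])     # output gate
    y_t = o_t * np.tanh(c_t)               # hidden output
    return y_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = {k: rng.normal(scale=0.5, size=(n_hid, n_in + n_hid)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}

# Unroll over a short input sequence, carrying (y, c) across steps.
y, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):
    y, c = lstm_step(rng.normal(size=n_in), y, c, W, b)
```

The additive form of the cell update c_t = f_t * c_prev + i_t * c_hat is what lets gradients flow over long time spans without vanishing.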

RDDPG architecture
In RL, an agent receives a state s_t and takes an action a_t based on s_t at time t; the environment then transitions to a new state s_{t+1} and returns a reward r_t. Since this article focuses on model-free learning, the agent transits (s_t, a_t) to s_{t+1} and receives a reward r_t from (s_t, a_t, s_{t+1}) regarding the task. The deterministic policy π, which is parameterized by ω_a, takes the state s_t as input and generates the action a_t as output. The value network, which is parameterized by ω_c, takes the state s_t and the action a_t as inputs and yields the discounted future reward as the Q-value 22,23

Q(s_t, a_t) = E[ Σ_{i=0}^∞ γ^i r_{t+i} ]     (2)

where γ ∈ [0, 1] is a discounting factor. The objective of the value network is to predict the expected discounted future reward; that of the policy network is to maximize the Q-value, which is assumed to be the return estimated by the value network.
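As a small numerical illustration, the discounted future reward Σ γ^i r_{t+i} can be computed backwards over a finite reward sequence (the numbers here are arbitrary):

```python
# Discounted return G_t = sum_i gamma^i * r_{t+i}, accumulated backwards.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
g = discounted_return(rewards)   # 1 + 0.9*1 + 0.81*1 = 2.71
```

The backward recursion g ← r + γg is the same Bellman structure the value network is trained to satisfy.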
The objective of RL is to maximize the Q-value shown in equation (2). The key problem here is the lack of parameter identification when the environment is changing, which makes the learned policy weak in adaptability. To address this problem, we introduce recurrence to DDPG for parameter identification and call the improved approach RDDPG. In this approach, the policy determines the action depending on time series data rather than on the current state alone.
DDPG is an actor-critic algorithm, 24 which can learn policies in continuous action spaces. The optimization procedure in RDDPG updates the policy network and the value network alternately. The process is described in Figure 2, where the LSTM, as an EE, yields meta-parameters as an additional input of the value network and the policy network. The value network is trained to estimate the Q-value by minimizing the temporal difference error. The policy network, that is, a nonstationary, meta-conditioned, deterministic policy, 25 then yields a specific action by maximizing the Q-value. The EE is not parameterized by a certain task objective. Instead, it is optimized by gradient back-propagation from the value network. Hence, the value network gives a relatively accurate estimation of the Q-value, and the policy network takes an accurate action even though uncertainties exist.
Compared with a typical training scenario, in which a teacher and a student are deterministic single-task participants, an EE is a processor of time series shared across different environments. It provides meta-parameters to a single value network and a single policy network to deal with different environments. Explicitly, the EE, which is parameterized by ω_p, takes the transition st_t = [s_{t-1}, a_{t-1}, r_{t-1}, s_t] as input, which contains a state-action pair, and yields the meta-parameters

mp_t = EE(st_t | ω_p)     (3)

The update rule of the policy network is

ω_a ← arg max_{ω_a} Q(s_t, π(s_t, mp_t | ω_a), mp_t)     (4)

in which the action is generated by the policy π parameterized by ω_a. In the optimization of ω_a, the gradient of ω_p was ignored since mp_t was taken as a constant.
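The way meta-parameters enter the value network can be sketched as follows. These are toy feed-forward stand-ins with random placeholder weights: in the actual architecture the EE is an LSTM with hidden state and all networks are trained, so the names, sizes, and single-layer forms below are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim, action_dim, mp_dim = 4, 2, 3

# Toy environmental encoder: maps a transition [s_{t-1}, a_{t-1}, r_{t-1}, s_t]
# to meta-parameters mp_t.
W_ee = rng.normal(scale=0.3, size=(mp_dim, 2 * state_dim + action_dim + 1))

def encode(s_prev, a_prev, r_prev, s_t):
    st_t = np.concatenate([s_prev, a_prev, [r_prev], s_t])
    return np.tanh(W_ee @ st_t)

# Toy critic: Q takes [s_t, a_t, mp_t] instead of just [s_t, a_t], so the same
# value network can be specific to each sampled environment.
W_q = rng.normal(scale=0.3, size=state_dim + action_dim + mp_dim)

def q_value(s_t, a_t, mp_t):
    return float(W_q @ np.concatenate([s_t, a_t, mp_t]))

s_prev, a_prev, r_prev = rng.normal(size=state_dim), rng.normal(size=action_dim), 0.5
s_t, a_t = rng.normal(size=state_dim), rng.normal(size=action_dim)

mp_t = encode(s_prev, a_prev, r_prev, s_t)
q = q_value(s_t, a_t, mp_t)
```

Concatenating mp_t with the state-action pair is the whole mechanism: two environments that produce different transition histories yield different mp_t and hence different Q-values for the same (s, a).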

Recurrent update
DDPG updates the parameters of a network from samples of a replay buffer R, a finite-sized cache consisting of a fixed number of transitions st_jt = [s_{jt-1}, a_{jt-1}, r_{jt-1}, s_jt], where 0 ≤ jt < J × T, in which J is a constant number and T is the maximum number of steps in an episode. LSTM generally uses a sequential update method to perform updates. 21 This method has the advantage of carrying the LSTM's hidden state forward from the beginning of the episode. However, it conflicts with the random sampling policy of RL 26 since it samples sequentially, episode by episode, rather than over the whole replay buffer. To overcome this problem, we update on randomly selected segments of the episodes in the replay buffer. In the meantime, we need to determine the hidden state of the EE at the beginning point of each randomly selected segment.
Here, we tested two types of updates. One is the "zero-state update," which initializes the hidden state of the EE to zero at each beginning point, and the other is the "save-state update," which saves the hidden state of each step in the replay buffer. We adopted the "zero-state update" in this research since it decreases the space complexity of RDDPG: it does not need to save the hidden states of every step.
Buffer R stores st at every step of each episode. During training, it randomly builds a set consisting of N subsets, L = [ST_0, ST_1, ..., ST_{N-1}], and each subset consists of K time-sequence transitions sampled randomly from the buffer,

ST_n = [st_{b(n)}, st_{b(n)+1}, ..., st_{b(n)+K-1}]

The symbol b(n) represents the beginning point of ST_n, which is a randomly chosen number satisfying the condition

0 ≤ b(n) ≤ J × T − K

where J is a constant number, and T is the maximum number of steps in an episode. Since st is a complete transition, we can build the set of the current states S_n as well as the sets of the actions A_n and the rewards R_n. The main network in Figure 2 estimates the current Q-value. For instance, if N = 1, the EE takes st_{b(0)+k} (k ∈ [0, K−1]) and the hidden state of the last step as inputs. It yields the meta-parameters mp_{b(0)+k} and a hidden state that is used as an input of the EE at the next step of training. Meanwhile, the meta-parameters mp_{b(0)+k}, the state s_{b(0)+k}, and the action a_{b(0)+k} are inputted to the value network to yield the Q-value of the current step. The target network in Figure 2 estimates the Q-value of the next step in the same way as the main network estimates the Q-value of the current step, but the action set A_n is generated with the policy network. The value network and the EE are then trained by minimizing the temporal difference error over the sampled segments

L(ω_c, ω_p) = (1/NK) Σ_{n,k} [ r_{b(n)+k} + γ Q′(s_{b(n)+k+1}, a′, mp′) − Q(s_{b(n)+k}, a_{b(n)+k}, mp_{b(n)+k}) ]²     (5)

where Q′ denotes the target value network and a′ is generated by the target policy network.
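The segment sampling with the "zero-state update" can be sketched as follows. The buffer contents, names, and sizes are illustrative placeholders; the point is that each of the N sampled segments is contiguous, has length K, and is paired with a freshly zeroed recurrent state.

```python
import numpy as np

rng = np.random.default_rng(3)

# Replay buffer of J episodes with T steps each; every entry is one transition st.
J, T, K, N = 8, 50, 10, 4   # episodes, steps per episode, segment length, batch size
buffer = [{"s": rng.normal(size=4), "a": rng.normal(size=2),
           "r": float(rng.normal())} for _ in range(J * T)]

def sample_segments(buffer, n, k, rng, hidden_dim=8):
    """Sample n random contiguous segments of length k.

    "Zero-state update": the EE hidden state is re-initialized to zero at each
    beginning point b(n), so per-step hidden states never need to be stored.
    """
    segments = []
    for _ in range(n):
        b = int(rng.integers(0, len(buffer) - k + 1))    # beginning point b(n)
        segments.append((np.zeros(hidden_dim), buffer[b:b + k]))
    return segments

batch = sample_segments(buffer, N, K, rng)
# Each (hidden, segment) pair would then be unrolled through the EE and the
# value network for K steps, carrying the hidden state forward within the segment.
```

This keeps sampling random at the level of segments while still giving the LSTM a short contiguous history to integrate.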
On the other hand, to perform an update, we need to calculate the gradients of the EE at each step. LSTM usually calculates the gradients by minimizing the error between the predicted output and the target output. 27 We do not use this process since the EE does not take the parameters of the environment as targets in training. In this study, the gradients were calculated with equation (5), where the weight ω_p of the EE and the weight ω_c of the value network were updated by minimizing the temporal difference error. The weight ω_a of the policy network was updated by maximizing the Q-value.

The algorithm
Errors are inevitable due to the difference between the actual Q-value Q*(s, π(s; p); p) and the calculated Q-value Q(s, π(s)), 28 where p represents the parameter set of the environment, which changes randomly in each episode. The change of parameters makes it difficult for the value network to converge to the actual Q-value and for the policy network to be optimized. In this study, since learning was performed with a simulation model by randomly changing the model parameters, we supposed that the learning was conducted with many randomly distributed models.
There is no feedback in DDPG for distinguishing the models. The policy obtained with DDPG may be optimal for an "averaged" model among all the models, but it cannot provide an optimal policy for tasks with different environments because the Q-value is not related to model changes. In RDDPG, by contrast, owing to the existence of the EE, the learned policy is associated with each specific model because the meta-parameters reflect the model parameters as feedback. The detail of RDDPG is shown in Algorithm 1. The EE generates meta-parameters from the time series of state-action pairs. As a result, a policy can be obtained by incorporating mp_t as an additional input into the value network and the policy network. Using mp_t helps to narrow the gap between Q*(s, π*(s, p_i), p_i) and Q(s, π(s, mp_i), mp_i). This means RDDPG can treat the problem as a POMDP. Here, it should be noted that the meta-parameters mp reflect the model parameters, but they are neither the model parameters themselves nor quantities that directly identify the model parameters.

Experiment
When we apply a policy learned on a simulation model to a real environment, the situation is the same as applying a policy learned in one environment to another environment. To confirm the effectiveness of RDDPG, three types of tasks were constructed, each containing an environment with several uncertain parameters. We conducted the experiments using a low-dimensional state description with joint angles and positions. The characteristic parameters (the weight and length of each link, the damping of each joint, etc.) were changed randomly within a certain band in each episode. Figure 3(a) is a two-degrees-of-freedom (2-DOF) cart-pole model. Figure 3(b) and (c) shows 2-DOF manipulators without and with loads, respectively. The task shown in Figure 3(a) was to control the pole to keep a vertical position. In each training episode, the mass and length of the pole changed randomly. On the other hand, the tasks shown in Figure 3(b) and (c) were to control the manipulators to reach certain positions in their working spaces. As shown in Figure 3(a) and (b), the environments could be considered as having uncertain bounded parameters, while the task shown in Figure 3(c) was a task with an external load. The ranges of the parameters are given in Table 1. Figure 4 is a comparison of the performances of DDPG and RDDPG for the different tasks. The total reward was defined as

R = − Σ_t ( e(t)ᵀ Λ e(t) + a(t)ᵀ Σ a(t) )

where e(t) and a(t) represent the positioning error and the action value at step t, respectively, whereas Λ and Σ are positive definite matrices.
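A per-step reward of this quadratic form can be sketched as follows. The weight matrices (written L and S in the code) are positive definite; their actual values are not given in the article, so the diagonal entries below are illustrative placeholders.

```python
import numpy as np

# Quadratic step reward r_t = -(e^T L e + a^T S a), with L, S positive definite.
# Diagonal matrices are used here for simplicity; the actual weights are unknown.
L = np.diag([10.0, 10.0])   # penalty on positioning error
S = np.diag([0.1, 0.1])     # penalty on action magnitude (torque effort)

def step_reward(e, a):
    return float(-(e @ L @ e + a @ S @ a))

# Example: error (0.1, -0.2), action (0.5, 0.5)
# error term: 10*(0.01 + 0.04) = 0.5; action term: 0.1*(0.25 + 0.25) = 0.05
r = step_reward(np.array([0.1, -0.2]), np.array([0.5, 0.5]))   # -0.55
```

The reward is always nonpositive and is maximized (driven toward zero) by simultaneously reducing the positioning error and the control effort.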
It can be seen from Figure 4 that RDDPG achieved a higher total reward than DDPG during learning. In the cart-pole task, the two algorithms gave almost the same results, as shown in Figure 4(a), but RDDPG was more stable. This result indicates that the meta-parameters can efficiently reflect the parameters of an environment. To further confirm the effectiveness of RDDPG, we compared the control performance of the two algorithms for the model robot shown in Figure 3(c). Figure 5(a) and (b) shows the actions (driving torques) at each step, and Figure 5(c) gives the positioning errors of the end of the manipulator at each step. The positioning errors and residual oscillations with the policy learned by RDDPG were relatively small in comparison with DDPG, and the oscillations of the joint torques were also limited. The two algorithms showed different features in adapting to an environment. Although optimal control and feedback control invoke different philosophies, 29 the results obtained in the experiment demonstrate that RDDPG could provide robust performance like a feedback controller that reduces steady-state errors, whereas DDPG behaved as an open-loop controller.
The physical parameters of the model robots shown in Figure 3 were arbitrarily chosen. The same learned policy could provide almost the same manipulation performance even though the parameters of the environment were changed within an "error band." Figure 6 compares the performance of the policies learned by RDDPG and DDPG for the cart-pole model and the 2-DOF manipulator model; RDDPG provided a higher total reward than DDPG. To further investigate the capability of RDDPG, the learned policy was applied to the model robot shown in Figure 3(c) with a changing external load. The total rewards are shown in Figure 7. RDDPG gave a better performance than DDPG at every load, demonstrating an excellent capability in dealing with an uncertain environment.

Conclusions
To make a policy learned on a simulation model adapt to a real environment with limited uncertainties, an RL approach called RDDPG was proposed. It features the use of an EE and extensive training with environment parameters changed randomly within a limited range. Simulation experiments were conducted, and the results demonstrated that the learned policy could adapt to a dynamic model with high uncertainties and indeed identified the parameters of the model in real time. Simulation experiments on three model robots showed that RDDPG could considerably reduce positioning errors and residual oscillations of both positions and joint torques compared with traditional DDPG. RDDPG differs from DDPG in adapting to an uncertain environment: it relies on the EE, which exploits time series data and enables it to deal with the uncertain characteristics of an environment. It is believed that the high adaptability of RDDPG comes from the precomputing of the possible models and the identification of their parameters with the EE. This process can avoid excessive reliance on the accuracy of the parameters.
In a forthcoming study, we will apply the proposed approach to an actual robot to confirm its effectiveness in real problems.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article:

ORCID iD
Yang Li https://orcid.org/0000-0002-0840-5796

Figure 7. Performance of the learned policies in the load-manipulator task. The mass of the load increased gradually with episodes, while the other parameters were kept constant. At each test episode, the total rewards were obtained by applying the learned policies under the current condition.