Continuous reinforcement learning to adapt multi-objective optimization online for robot motion

This article introduces a continuous reinforcement learning framework to enable online adaptation of multi-objective optimization functions for guiding a mobile robot to move in changing dynamic environments. The robot with this framework can continuously learn from multiple or changing environments where it encounters different numbers of obstacles moving in unknown ways at different times. Using both planned trajectories from a real-time motion planner and already executed trajectories as feedback observations, our reinforcement learning agent enables the robot to adapt motion behaviors to environmental changes. The agent contains a Q network connected to a long short-term memory network. The proposed framework is tested in both simulations and real robot experiments over various, dynamically varied task environments. The results show the efficacy of online continuous reinforcement learning for quick adaptation to different, unknown, and dynamic environments.


Introduction
Real-time motion planning of robots often needs to consider multiple and sometimes conflicting optimization criteria, such as time efficiency (in terms of the shortest distance or time), safety (in terms of the clearance to obstacles), and energy efficiency. 1-3 A common practice is to combine these criteria into a cost function as a weighted sum. However, determining proper values for the coefficients in the cost function is not trivial and is often done manually in an ad hoc manner. It is difficult to determine the coefficient values of a combined optimization function before the robot has performed in an environment (i.e. the set values may not be optimal). Moreover, when the task environment changes, the previously set coefficient values may no longer be suitable.
Hence, there are two related open problems: (1) how to determine values for the coefficients of a compound optimization function automatically and (2) how to make the coefficients self-adapt to environmental changes. There is little work on both problems. Ishigami et al. 4 tried to generate different paths with different sets of coefficients and evaluate these paths based on a metric, but the values of all coefficients still need to be set off-line manually. The main contribution of this article is that we propose to tackle both problems by continuously training a reinforcement learning (RL) agent in different environments, even during the test. The agent is trained to adjust the values of the coefficients of a multi-objective optimization function based on the robot's performance in an environment with unknown dynamic obstacles, and the agent keeps learning by itself to continuously adapt to all kinds of environmental changes. Over this process, the agent becomes increasingly knowledgeable and increasingly effective at learning.
Specifically, we introduce a continuous RL agent that leverages motion planning results and can accommodate real-time robot motion planning methods. The observations of the agent are motion trajectories, whose dimension is far lower than the dimension of the raw sensor data (such as camera images) that is often used in end-to-end learning. Therefore, a convolutional neural network (NN) for extracting observation features can be removed from the agent. This simplifies the formulation of the agent and enables continuous learning. Learning continuously means 5-7 that the agent can accumulate the knowledge learned in past environments to help future learning and problem-solving and that later learning does not much degrade its performance in task environments learned earlier.

Related work
Lifelong or continuous learning has been a long-standing challenge for machine learning and autonomous systems. [8][9][10] Mimicking humans and animals that continuously acquire new knowledge and transfer it to new tasks throughout their lifetime, continuous learning builds an adaptive system that is capable of learning from a continuous stream of information. However, the dilemma between plasticity and catastrophic forgetting 11,12 is the main challenge, because relearning from scratch for new tasks is inefficient and performs poorly. 6 Thus, accommodating new knowledge without forgetting or interfering with the current learning process requires a sophisticated approach to consolidate knowledge. Memorizing past experiences [13][14][15] or allocating NN resources dynamically 16,17 has been introduced to alleviate the forgetting problem. Replaying previous experiences 18,19 improves learning efficiency by reducing the sequential dependence of data and the forgetting of important experiences, but experience replay alone does not provide enough self-adaptability to dynamically changing environments.
Acknowledging the impossibility of providing all the prior knowledge needed to perform well in the real world, continuous learning models have emerged for robots. One aim of continuous learning is to train an agent that can quickly adapt to new tasks (e.g. changing navigation goals or environments). A hierarchical Bayesian model 20 has been used to transfer knowledge learned between different but related tasks. Actor-Mimic 21 conducts training over related source tasks, but generalization to new target tasks needs a sufficient level of similarity between the source and the target tasks. Zhang et al. 22 employ the concept of successor features in RL. Besides, universal value function approximators (UVFAs) 23 handle different goals in one environment by incorporating the goal into the input of the value function. The ability to generalize to new navigation tasks can also be achieved by combining RL with conventional motion planning methods. 24,25 However, the above methods can only transfer the learned knowledge to new tasks slightly changed from the training tasks, for example, by removing or adding one or two static obstacles in the test environment.
A few researchers consider transferring the trained agent to totally different environments with only static or only moving obstacles. For example, one method called reinforcement planning 26 extends RL to automatically learn cost functions for search-based planners such as A* and Dijkstra's algorithm. The method successfully transfers the trained agent to some new static environments without moving obstacles, but it lacks the ability to improve its performance continuously as the learning progresses in different, dynamically changing environments. Everett et al. 27 use a long short-term memory (LSTM) network to handle an arbitrary number of obstacles, but there are only moving obstacles and the number of moving obstacles is fixed in each environment. There is little research on solutions that can continuously train an RL agent with feedback to adapt the robot motion to changing mixed environments, where there are a variable number of static and moving obstacles. This leads us to propose a continuous, incremental knowledge acquisition model, which is essential for a lifelong learning robot that does not lose important knowledge obtained through a motion planner.

Overview of the framework
The proposed framework is illustrated in Figure 1. There are two closed loops: one is among a real-time motion planner, a robot, and an environment, and the other is between the real-time motion planner and an RL agent. In the first closed loop, the real-time motion planner interacts with the task environment based on sensing and generates the best motion trajectory according to the current multi-objective cost function for the robot to execute. Here, the "coefficients" are the weights of the linear combination in the cost function. In the second closed loop, an RL agent develops proper coefficients for the multi-objective cost function to maximize the desired performance in real-time motion planning.
The RL agent has three NNs: an LSTM NN, an online NN, 28 and a target NN. The three networks can be divided into two parts. One includes the LSTM network, which is used to preprocess the motion trajectories; this will be described in detail in the "Observation" section. The other includes the online network and the target network, which are used for RL. The online network is continuously trained based on the continuous observations, the rewards, and the loss calculated using the target value from the target network. The target network is created by copying the online network during initialization. During training, only the online network is trained, and its weights are copied to the target network periodically. Note that directly implementing RL with only the online network is unstable and prone to divergence, because the network being trained is also used to calculate the target value. To solve this problem, the separate target network is used to calculate the target value, which greatly improves the stability of continuous RL and enables the agent to continuously build skills on how to adjust the coefficients to best adapt to environmental changes. Note also that the NN weights in RL and the planner's optimization function coefficients are different: the former are trained through backward propagation, while the latter are tuned by the RL agent.
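As a minimal sketch of the online/target interplay described above (the `QNet` class, its trivial scalar "weight," the gradient-free update, and the copy interval are illustrative stand-ins, not the authors' implementation):

```python
# Sketch of the online/target network interplay. "QNet" is a stand-in for
# the Q network (here a trivial scalar parameter); all names and numeric
# values are illustrative assumptions.

class QNet:
    def __init__(self):
        self.weights = {"w": 0.0}

    def copy_from(self, other):
        self.weights = dict(other.weights)

def train_step(online, target, reward, gamma=0.9):
    # The target value is computed with the *target* network, which is
    # held fixed between copies; this is what stabilizes the regression.
    y = reward + gamma * target.weights["w"]
    # Gradient-free placeholder "update" of the online network.
    online.weights["w"] += 0.1 * (y - online.weights["w"])

online, target = QNet(), QNet()
target.copy_from(online)          # target created by copying at initialization

COPY_INTERVAL = 5                 # hyper-parameter (pilot-tested in the paper)
for step in range(1, 21):
    train_step(online, target, reward=1.0)
    if step % COPY_INTERVAL == 0: # periodic weight copy to the target network
        target.copy_from(online)
```

Only the online network is ever "trained" here; the target network changes only at the periodic copy, mirroring the scheme above.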
The proposed framework is validated by a case study with the real-time adaptive motion planner (RAMP) 29,30 and can be easily extended to benefit other real-time motion planners. The framework has the following characteristics:
1. The skills learned from the earlier task environments are accumulated and used to make the learning in the current task environment more efficient.
2. While the robot benefits from the learned skills provided by the learning agent to perform and adapt well in constantly changing environments, the learning agent itself is trained continuously through the online network to improve its own performance.
3. The proposed framework is general and abstract and, hence, is independent of real-time motion planners, platforms, and task environments.
4. The agent does not learn specific coefficients but learns a mapping from motion trajectories to coefficient changes. This enables the robot to adapt to different numbers of obstacles moving in unknown ways during navigation.
5. The learned skills can be transferred to different types of task environments and motion planning objectives, and from a simulated world to the real world.
6. By using motion trajectories as feedback observations, the formulation of the RL agent is greatly simplified. This makes continuous training much more efficient and hence effective.
7. The learned optimization coefficients have clear meanings, so the resulting robot performance is interpretable. This feature allows us to analyze the relationship between robot motions and coefficient changes.
The rest of this article is organized as follows. The fourth section describes our RL strategy. The fifth section provides a case study with RAMP as the motion planner. The sixth section provides and discusses the training and test results in both simulations and real robot experiments. The last section concludes the article.

RL-based coefficient determination
In this section, we introduce the method of continuously training the RL agent in the proposed framework. Generally, an agent with a discrete action space provides better stability than one with a continuous action space, so we use Q networks to adjust the coefficients discretely. The input of the Q networks must have a fixed length, but the observation input contains a hybrid trajectory with variable length. Therefore, we use an LSTM network to encode the information of the hybrid trajectory into a fixed-length representation.
Besides, when NNs are used in RL, the samples are assumed to be independently and identically distributed. However, when the samples are generated by exploring an environment sequentially, this assumption does not hold. Therefore, to minimize correlations between samples, the agent is trained off-policy with samples from a replay buffer. Using a replay buffer may slow learning, but in practice we found that this drawback is greatly outweighed by the gained stability of learning.
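A minimal replay-buffer sketch of this off-policy sampling; the capacity, batch size, and transition fields are illustrative assumptions:

```python
import random
from collections import deque

# Minimal replay buffer for off-policy training. Capacity and batch size
# are illustrative assumptions, not the paper's values.

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted

    def store(self, observation, action, reward, next_observation):
        self.buffer.append((observation, action, reward, next_observation))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions collected along one trajectory.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=100)
for t in range(200):                          # more transitions than capacity
    buf.store(("obs", t), 0, 0.0, ("obs", t + 1))
batch = buf.sample(32)
```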

Observation
An observation is the input of the RL agent, but it does not contain raw environmental information. First, we are interested in an agent being able to handle different goals, so we follow the approach from UVFA, 23 that is, the observation includes the navigation goal g. Second, the observation should include the hybrid trajectory, that is, the real trajectory concatenated with the planned trajectory from a motion planner, because the real trajectory and the planned trajectory together capture the information of the robot motion state and the surrounding environment, such as the positions of obstacles. Two examples of hybrid trajectories are shown in Figure 2. The switching of tracked trajectories occurs at the points with numeric marks. In Figure 2(a), the robot is going to follow trajectory 0-B. In Figure 2(b), the robot has arrived at point 1, and the robot's real trajectory is 0-A′-1, which differs from 0-A-1 due to robot model and motion uncertainty. In Figure 2(c), the robot switches to a new planned trajectory 1-D. In summary, the hybrid trajectories here are 0-A′-1-B and 1-C′-2-D, which are the real trajectories (0-A′-1 and 1-C′-2) concatenated with the unexecuted parts (1-B and 2-D) of the planned trajectories.
We use g to represent the navigation goal and τ(t) to represent the hybrid trajectory at a time instant t. A navigation goal is defined as g = (x_g, y_g, θ_g), and a trajectory is a sequence of robot motion states. A robot motion state is a collection of a robot pose, a velocity, and an acceleration with a time stamp. The kth motion state in a trajectory is denoted s_k; its components include the position coordinates x and y, the orientation θ, the corresponding velocities and accelerations, and the time stamp t_k of s_k. Therefore, we have τ(t) = (s_0, s_1, ···, s_K), where K is the number of motion states in τ(t).
Note that K is not a constant. Thus, an LSTM NN is used to transfer the variable-length trajectory τ(t) into a fixed-length vector h_K that contains relevant information of all motion states, as in handling the varying number of obstacles with LSTM. 27 As shown in Figure 3, the pertinent information of each motion state in τ(t) is stored in the cell state c_k, which is the "memory" of the LSTM network. The cell state acts like a conveyor belt that transfers relevant information all the way down the chain. The cell also outputs a hidden state h_k based on the current cell state. In mathematical terms, the forward propagation of the LSTM network can be formulated as

f_k = σ(W_f · [h_{k−1}, s_k] + b_f)
i_k = σ(W_i · [h_{k−1}, s_k] + b_i)
c̃_k = tanh(W_c · [h_{k−1}, s_k] + b_c)
c_k = f_k ⊙ c_{k−1} + i_k ⊙ c̃_k
o_k = σ(W_o · [h_{k−1}, s_k] + b_o)
h_k = o_k ⊙ tanh(c_k)

where σ is the sigmoid function, and W_f, W_i, W_c, and W_o (with biases b_f, b_i, b_c, and b_o) are the weights of the forget, input, cell, and output gates. The output h_K summarizes all K motion states in τ(t). The vector h_K has a fixed length and is concatenated with g to form a new vector v_g, and then v_g is fed to a Q network with three fully connected layers. The outputs are the Q values for all possible actions. a is an action vector, which will be introduced in the next section. The learning rate is set to 0.005 initially and reduced to 0.0005 through exponential decay, and other parameters of the LSTM network and the Q network are presented in Table 1.
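The encoding step above can be sketched as follows; the state dimension, hidden size, and random weights are illustrative assumptions (a trained network would supply real weights), and the point is only that trajectories of any length K map to a vector of fixed length:

```python
import numpy as np

# Sketch: an LSTM cell turns a variable-length trajectory into the
# fixed-length vector h_K fed to the Q network. Dimensions and random
# weights are illustrative assumptions.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_encode(trajectory, params, hidden=8):
    W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o = params
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for s_k in trajectory:                      # one motion state per step
        z = np.concatenate([h, s_k])
        f = sigmoid(W_f @ z + b_f)              # forget gate
        i = sigmoid(W_i @ z + b_i)              # input gate
        c = f * c + i * np.tanh(W_c @ z + b_c)  # cell state ("memory")
        o = sigmoid(W_o @ z + b_o)              # output gate
        h = o * np.tanh(c)                      # hidden state
    return h                                    # h_K: same length for any K

rng = np.random.default_rng(0)
state_dim, hidden = 4, 8
params = tuple(rng.normal(size=(hidden, hidden + state_dim)) * 0.1
               for _ in range(4)) + tuple(np.zeros(hidden) for _ in range(4))

short = [rng.normal(size=state_dim) for _ in range(3)]   # K = 3 motion states
long = [rng.normal(size=state_dim) for _ in range(40)]   # K = 40 motion states
h_short, h_long = lstm_encode(short, params), lstm_encode(long, params)
```

Both `h_short` and `h_long` have the same shape, which is what lets a fixed-input Q network consume trajectories of varying length.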
The best action is selected by the ε-greedy strategy, where ε is the ratio of exploration. At the beginning of learning, we know little about the environment, so we must do more exploration by setting ε to a large value. After the network has been trained for some time, we know much more about the environment, so we can do more exploitation by setting ε to a small value. In a new environment, our ε is set to 0.5 at the beginning of learning and reduced to 0.1 through exponential decay once the network has been trained for 30 episodes in this environment.
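The ε schedule described above can be sketched as follows; the decay constant is derived from the two stated endpoints (0.5 at episode 0, 0.1 after 30 episodes), and the exact functional form used by the authors is an assumption:

```python
import math

# Exponential decay of the exploration ratio epsilon: 0.5 at the start of a
# new environment, 0.1 after 30 training episodes. The decay constant is
# derived from those two points; the exact form is an assumption.

EPS_START, EPS_END, DECAY_EPISODES = 0.5, 0.1, 30
decay_rate = math.log(EPS_START / EPS_END) / DECAY_EPISODES

def epsilon(episode):
    # Clamp at EPS_END so exploitation dominates after 30 episodes.
    return max(EPS_END, EPS_START * math.exp(-decay_rate * episode))

schedule = [epsilon(e) for e in range(0, 31, 10)]
```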

Action
Now suppose we have n coefficients W_1, W_2, ···, W_n that need to be adjusted, which come from the same multi-objective cost function. Note that we are only interested in the relative costs of different candidate trajectories, so scaling all coefficients by a common positive factor does not change the trajectory ranking, and the number of degrees of freedom (DOFs) of (W_1, W_2, ···, W_n) is only n − 1. Therefore, an action is represented as an (n − 1)-dimensional vector. Since the action space of the Q network is discrete, the coefficients must be changed discretely at a fixed step size δ_W. In other words, each coefficient can be increased or decreased by δ_W or remain unchanged at a time. The total number of possible actions is N_act = 3^(n−1). If the selected real-time motion planner has multiple optimization criteria whose coefficients need to be adjusted, one of the coefficients can be set to 1.0, which removes one DOF. The other coefficients are adjusted independently.
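Enumerating this discrete action space, assuming n = 3 coefficients with the first fixed to 1.0 and an illustrative step δ_W = 0.05 (both values are assumptions for the example):

```python
from itertools import product

# Enumerate the discrete action space: with n coefficients and one fixed
# to 1.0, each of the remaining n - 1 can be increased by delta_w,
# decreased by delta_w, or left unchanged, giving 3**(n-1) actions.

def action_space(n_coeffs, delta_w=0.05):
    moves = (-delta_w, 0.0, +delta_w)
    return list(product(moves, repeat=n_coeffs - 1))

def apply_action(coeffs, action):
    # coeffs[0] stays at 1.0; each other coefficient changes by one step.
    return [coeffs[0]] + [c + d for c, d in zip(coeffs[1:], action)]

actions = action_space(n_coeffs=3)   # e.g. (W_T, W_A, W_D) with W_T = 1.0
new = apply_action([1.0, 0.4, 0.7], actions[0])
```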

Reward
An RL agent should learn from performance-driven feedback. This feedback can be modeled as a simple delayed reward function used in the training process, such as the time to reach a goal location or the number of collisions. Our reward r is calculated to evaluate the robot's performance as in equation (2), which combines the following quantities: the indicator I_c, which is 1 if the robot has collided with obstacles and 0 if it has not; the constant T_m, the time limit to move from the start location to the goal region; the constant R_a > 0, a fixed reward for arriving at the goal region within T_m; the variable t_a, the actual execution time to the goal region; the variable t_o, the estimated time until a collision when moving along the current best trajectory (t_o is +∞ if no collision is predicted); and the constant T_p, the time penalty applied when a collision occurs in simulation or the robot is forced to stop in real experiments (to avoid actual collisions). Considering the internally estimated collision time t_o in the reward function makes the robot learn to keep a certain distance from obstacles, because feedback information about obstacles is returned before the robot actually collides with them. This feature is a strength of the proposed framework. By increasing the cumulative reward over time, the robot will try to arrive at the goal region as soon as possible while avoiding collisions. Each training episode continues until the robot arrives at the goal region or T_m is reached. According to equation (2), the agent can only receive a positive reward when it succeeds in arriving at the goal region within the time limit T_m. This feature often reduces the stability and the efficiency of the training. We address this problem by following the approach from Hindsight Experience Replay. 31 For failed navigation, we can double the training samples in the replay buffer with disguised successful cases, that is, each transition is stored in the replay buffer twice: once with the navigation goal at which the robot failed to arrive within T_m and once with the goal corresponding to the motion state of the robot at time instant T_m.
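The hindsight relabeling can be sketched as follows; the transition fields and goal values are illustrative assumptions (in a full implementation, the reward of the relabeled copy would also be recomputed against the substituted goal):

```python
# Sketch of hindsight relabeling: a failed episode is stored twice, once
# with the original goal and once with the pose actually reached at the
# time limit treated as if it had been the goal. Fields are illustrative.

def store_with_hindsight(buffer, episode, original_goal, reached_pose,
                         succeeded):
    for (obs, action, reward, next_obs) in episode:
        buffer.append((obs, original_goal, action, reward, next_obs))
        if not succeeded:
            # Disguised success: relabel with the achieved "goal" so the
            # agent still receives a useful learning signal.
            buffer.append((obs, reached_pose, action, reward, next_obs))

buffer = []
episode = [("o%d" % k, 0, 0.0, "o%d" % (k + 1)) for k in range(5)]
store_with_hindsight(buffer, episode, original_goal=(9.0, 9.0),
                     reached_pose=(4.0, 3.0), succeeded=False)
```

A failed 5-step episode thus yields 10 stored transitions, doubling the training samples as described above.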

Updating of weights in NNs
There are three NNs in our RL agent: the LSTM network, the online network, and the target network. The weights of the online network and the LSTM network are updated by the backward propagation of the error e, and the weights of the target network are copied from the weights of the online network periodically. The interval (or number of steps) for this copying is a hyper-parameter, which is determined by a pilot test in the simulation. The error e is

e = y − Q(S, a)    (3)

where

y = r, if at the goal region
y = r + γ · max_{a′} Q′(S′, a′), if not at the goal region

and a is the action vector used to change the coefficients. S is the current observation and S′ is the next observation after applying a. r is the current reward and γ is the discount factor. Note that Q(S, a) is calculated using the online network, and Q′(S′, a′) is calculated using the target network. The backward propagation of e in the online network and the LSTM network is well-known and, hence, is omitted in this article for brevity. 19,32 The weights of the LSTM network are updated to learn how to transfer a variable-length trajectory into a fixed-length vector. The weights of the target network are updated to learn how to adjust the coefficients of a multi-objective cost function in a real-time motion planner, and the weights of the online network are updated to enable the agent to keep learning from different kinds of environments continuously.
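The target value y and the error e of equation (3) can be sketched as follows; the dictionary-based Q functions are toy stand-ins for the online and target networks, and all Q values are made up:

```python
# Sketch of the target value y and error e in equation (3). q_online and
# q_target stand in for the two Q networks; here they are dicts of
# illustrative Q values per action.

def td_error(q_online, q_target, obs, action, reward, next_obs, gamma,
             at_goal):
    if at_goal:
        y = reward
    else:
        # The max over actions a' uses the *target* network evaluated
        # on the next observation S'.
        y = reward + gamma * max(q_target[next_obs].values())
    return y - q_online[obs][action]

q_online = {"S": {"a0": 0.2, "a1": 0.5}}
q_target = {"S'": {"a0": 1.0, "a1": 0.8}}
e_mid = td_error(q_online, q_target, "S", "a1", reward=0.0,
                 next_obs="S'", gamma=0.9, at_goal=False)
e_goal = td_error(q_online, q_target, "S", "a1", reward=1.0,
                  next_obs="S'", gamma=0.9, at_goal=True)
```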

Case study
In this case study, we use RAMP as the motion planner module in the proposed continuous RL framework.
We will first review RAMP and then introduce how to utilize it in the proposed framework.

Overview of RAMP
RAMP enables simultaneous planning and control in dynamic environments. RAMP always maintains multiple trajectories, called a population, through a trajectory generator. At the start of each control cycle, the lowest cost trajectory in the population is selected as the best trajectory. Trajectory costs are calculated through a trajectory evaluator. While the robot moves along the current best trajectory, RAMP keeps modifying the population based on the latest sensing information. This process continues until the robot reaches the goal. There are both feasible (collision-free) and infeasible (not collision-free) trajectories in the population. Sometimes the robot has to follow an infeasible trajectory when there is no feasible one, and the robot will stop if a collision would occur within a short time threshold, called an imminent collision. While the robot is stopped, RAMP continues to modify the population until (1) it finds a better trajectory for the robot to switch to or (2) the obstacle causing the imminent collision moves away. RAMP uses different cost functions to evaluate feasible and infeasible trajectories. The cost functions for feasible trajectories and infeasible trajectories are shown in equations (4) and (5), respectively.
where T, A, and D are the estimated execution time, the orientation change, and the distance to the nearest obstacle of a feasible trajectory, respectively, and T_c and A_c are the estimated time until a collision and the orientation change of an infeasible trajectory. They have the corresponding coefficients W_T, W_A, W_D, W_Tc, and W_Ac and the normalization factors N_T, N_A, N_D, N_Tc, and N_Ac. Note that the cost for any feasible trajectory is lower than that for any infeasible trajectory. In this case study, W_T is set to 1.0, and W_A and W_D are adjusted by RL. W_Tc and W_Ac are set to fixed human-tuned values.
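A hedged sketch of trajectory selection with these coefficients: the exact forms of equations (4) and (5) are given in the RAMP papers, so the weighted-sum expressions below, the 1/D clearance term, and the constant offset that keeps every infeasible cost above every feasible one are assumptions for illustration, and all numeric values are made up:

```python
# Hedged sketch of trajectory evaluation with the coefficients named in
# the text. The exact equations (4) and (5) are not reproduced here; this
# assumes a normalized weighted sum, a 1/D term (larger clearance ->
# lower cost), and a large offset so infeasible always costs more than
# feasible. All numbers are illustrative.

INFEASIBLE_OFFSET = 1e6   # keeps infeasible costs above feasible ones

def feasible_cost(T, A, D, W_T=1.0, W_A=0.4, W_D=0.7,
                  N_T=10.0, N_A=3.14, N_D=2.0):
    return W_T * T / N_T + W_A * A / N_A + W_D * (N_D / max(D, 1e-6))

def infeasible_cost(T_c, A_c, W_Tc=1.0, W_Ac=0.5, N_Tc=10.0, N_Ac=3.14):
    # Shorter predicted time-to-collision -> higher cost.
    return INFEASIBLE_OFFSET + W_Tc * (N_Tc / max(T_c, 1e-6)) \
        + W_Ac * A_c / N_Ac

best = min(
    [("traj1", feasible_cost(T=6.0, A=0.5, D=1.5)),
     ("traj2", feasible_cost(T=5.0, A=1.2, D=0.8)),
     ("traj3", infeasible_cost(T_c=2.0, A_c=0.3))],
    key=lambda item: item[1],
)
```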

Continuous RL with RAMP
The continuous RL framework after incorporating RAMP is shown in Figure 4. Here, the target NN is hidden for clarity. In Figure 4, the best trajectory generated by RAMP is represented as τ_b. The real trajectory of the mobile robot is represented as τ_r. The hybrid trajectory (τ_b concatenated with τ_r) is represented as τ (the operator ‖ denotes concatenation). The goal location is represented as g. At the end of each control cycle, the hybrid trajectory is stored in the replay buffer. Then, the LSTM network takes a hybrid trajectory from the replay buffer through sampling and encodes the motion states of the hybrid trajectory into a fixed-length vector. This vector is concatenated with the goal vector to form a new vector, which is fed to the Q network. Finally, the Q network outputs the coefficient changes back to RAMP.

Results
In this section, we present both the simulation and the real experimental results from the above case study with RAMP.

Simulation experiments
The simulations are conducted on the Gazebo simulator using a four-core i5 CPU.
Training. The training environments are shown in Figure 5 (called Tr. A to Tr. D). There are both static obstacles (red barrels) and moving obstacles (blue cars) in the environments. Note that in Tr. D, the static obstacles are dumpsters with black tops, which are bigger than the red barrels. The moving obstacles in Tr. D are also bigger than those in the other training environments. All moving obstacles move randomly. Specifically, in every 100 ms, each moving obstacle rotates at a random angular velocity for 20 ms and then advances at a random linear velocity for 80 ms. The ranges of the random angular and linear velocities are approximately 0.3-0.6 rad/s and 1.0-1.4 m/s, respectively. Besides, there are doors that randomly open or close at different times in Tr. A and Tr. C. In the experiments, we say that the robot completes one episode when it arrives at the goal from the start. The start and the goal of the robot are marked by "1" and "2", respectively. The working zone of the robot is bounded by a black square. Outside the working zone, there are four walls, which restrict the motions of the moving obstacles. The robot guided by the motion planner can only run in the working zone, but the moving obstacles can move inside or outside the working zone. Therefore, the robot works with different numbers of moving obstacles in one environment at different times. For example, in Tr. C, there are five moving obstacles in the working zone and three moving obstacles outside the working zone. The RL agent was trained for two rounds. In each round, the agent was trained in Tr. A, Tr. B, Tr. C, and Tr. D, sequentially. In each training environment, we trained the agent for 50 episodes and then switched to the next environment. Recall that the coefficients are changed by a fixed step δ_W, and the target NN is updated every T_u control cycles.
We used different values of δ_W and T_u to train the agent, and the results are shown in Tables 2 to 5. According to the results, we obtained the best training performance under the setting of δ_W = 0.05 and T_u = 300, so this setting was used in the following simulations.
The curves of the execution time, the number of collisions, the number of coefficient changes, and the loss during the training process are shown in Figure 6. They were filtered by a moving average with a window size of 10. The points marked by "Tr. A, Tr. B, . . ." indicate that the training was switched to the corresponding environments. We use a value called n_e to measure the skill that the agent has learned, as shown in Table 6. At the start of the training, the agent could not converge (the execution time was still decreasing when we switched the training environment). After the agent was trained in more environments, it started learning with a shorter initial execution time and converged faster. This means our agent was learning continuously to optimize its performance in different types of task environments.
Test. To illustrate the generalization ability of the trained agent, we tested the agent trained with δ_W = 0.05 and T_u = 300 in both the training environments and some previously unseen environments that were not used in the training. Note that our agent is a continuous RL model, so even at test time, the weights of the trained NNs were still updated. We compared the performance of this continuous learning agent with that of fixed human-tuned coefficients and with the performance of a noncontinuous learning agent. The noncontinuous learning agent was trained in the same way as the continuous one, but its NNs were fixed during the test time (it still changed the coefficients dynamically). Figure 7 shows the new test environments.
1. In Te. A, the walls and the doors in the working zone are placed along the diagonal. 2. In Te. B, there is an area similar to a maze in the upper half of the working zone. It is less cluttered than the training environments, but the distance from the start to the goal is longer. 3. In Te. C, the working zone is directly bounded by walls. Both the robot and the moving obstacles can only move within the working zone, so that the robot will always work with 10 moving obstacles in this environment.
The test results are shown in Figure 8 and Tables 7 and 8. The performance measurements are the same as those in the training. We found that both the continuous learning agent and the noncontinuous learning agent outperformed the fixed human-tuned coefficients. Furthermore, the continuous learning agent significantly improved on the performance of the noncontinuous agent (p ≪ 0.05 from analysis of variance tests). The improvements are not only the reduced average navigation time to the goal but, more importantly, the significantly reduced average number of collisions during the navigation. Besides, the results show that our approach naturally transfers the knowledge learned in the training environments to completely different new environments. The differences can be in the numbers of static and unknown moving obstacles or in the structures of the environments.
Figure 9 shows the curves of the coefficients tuned by the continuous learning agent and the snapshots at three different time instants during one of the test episodes in Te. B. The corresponding places of the snapshots in Te. B are marked in Figure 7(b) to (d). The horizontal axis "plan number" denotes the number of planning cycles. At the start of the test episode, the values of the coefficients are set randomly. In snapshot (b), mixed static and dynamic obstacles are being dealt with. In this case, drastic orientation changes can easily result in collisions, and there is no space for the robot to stay far away from the obstacles. Hence, the penalty for the orientation change (W_A) is increased while the penalty for the distance to obstacles (W_D) is decreased. In snapshot (c), the robot is navigating a maze-like area, where keeping away from the walls is the key to ensuring safety, and frequent turns are necessary to pass through the maze. Therefore, the penalty for the distance to obstacles is increased and the penalty for the orientation change is decreased.
In snapshot (d), the robot will arrive at the goal soon. Both the penalty for the orientation change and the penalty for the distance to obstacles are decreased, which means that the relative importance of the time efficiency (W_T) is increased. The robot now moves along a trajectory as short as possible to reach the goal. Moreover, the results show that in the same environment, when using the RL agent to tune the coefficients of the cost functions online, more frequent coefficient changes often result in more efficient and safer navigation. The frequency depends largely on the changes of the environment during navigation, such as the number of moving obstacles. For example, the number of moving obstacles in Te. C remains fixed all the time, so the number of coefficient changes needed in this environment is smaller than that in the other environments, even though the environment itself is cluttered. Note that if the number of moving obstacles in the working zone is fixed and the obstacles move along fixed trajectories, the coefficients in the cost function will converge after training for some time. This means that the number of coefficient changes in such environments will be nearly zero after convergence, because both the number and the motion pattern of the obstacles become unchanged.

Real robot experiments
Real robot experiments were performed with a Turtlebot 2 platform. The experiments were run in a sequence of 3.5 m² environments. As the robot traversed each environment, the continuous learning agent modified the coefficients in real time until the robot reached the goal. For each subsequent environment, the initial coefficients were set to the final coefficients from when the robot reached the goal in the previous environment. The first environment contains no obstacles, and the robot is easily able to move in a straight line to the goal. The second environment contains four static obstacles. The third environment contains two static obstacles and one dynamic obstacle moving back and forth on a straight line. The fourth environment contains two dynamic obstacles that repeatedly move along straight-line trajectories and no static obstacles. An image of each environment containing obstacles is shown in Figure 10. Figure 11 shows how the coefficients change over time while the robot is executing motion. In general, the W_D coefficient (minimum distance to obstacles) changes significantly and often, but the W_A coefficient (orientation change) changes only slightly. While moving in the absence of obstacles, the W_D coefficient decreases rapidly. For each new environment, the W_D coefficient changes significantly throughout the run. The fluctuation of the W_D coefficient is likely due to the robot having to perform more obstacle-avoidance behavior while navigating around obstacles, and then requiring less obstacle avoidance after it passes the obstacles and approaches the goal.
Comparing the results in Tables 9 and 10, under the RL agent, with those in Tables 11 and 12, under manually tuned coefficient values, it is clear that the RL agent results in more efficient and less conservative robot motion in the environment with static obstacles and safer motion in the environment with dynamic obstacles. Note that the RL agent was trained entirely in simulation and in different environments. No further training of the agent was done before running it in the real experiments.
The purpose of this is to verify the agent's ability to transfer the knowledge learned in the simulations to the real world. With further training of the agent in real experiments, we expect that the robot would perform even better.

Conclusions
We have introduced a utility-based multi-objective continuous learning framework that utilizes a real-time motion planner to make a robot continuously learn how to adapt online multi-objective optimization for robot motion. We focus on improving and testing our framework for continued learning and adaptation in changing dynamic environments. Moreover, through simultaneous robot motion and continuous learning of the coefficient-update model, our framework enables the robot to self-tune its behavior online to constantly adapt to unknown changes in task environments with ever-improving performance. The effectiveness and practicality of the proposed framework have been demonstrated by a case study with RAMP, through both the simulations and the real robot experiments in various dynamic environments. In comparison with the human-tuned coefficients, the proposed framework improved the execution time by about 17% on average in the simulations and by about 10% on average in the real robot experiments. The encouraging results verify the performance gain in robot navigation from our framework and the transferability of the trained agent from training environments to unseen environments, and even from simulation to real environments. As such, the trained agent acquires a general ability for effective navigation in different environments. The application of the proposed framework is not restricted to the case study with RAMP shown in this article; our framework can accommodate other real-time motion planners and bring stability and intelligence to robot navigation. In the real-time motion planning problem of mobile robots, the role of RL is to make decisions at a semantic level for robot navigation. One way to make such decisions is to tune the behavioral multi-objective coefficients online based on performance-driven feedback. In this way, the RL agent is able to determine the time-varying preferences among different objectives at different times during navigation.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.