BOLeRo: Behavior optimization and learning for robots

Reinforcement learning and behavior optimization are becoming increasingly popular in robotics because the algorithms are now mature enough to tackle real problems in this domain. However, robust implementations of state-of-the-art algorithms are often not publicly available, and experiments are hard to reproduce because open-source implementations either do not exist or remain at the stage of research code. Consequently, it is often infeasible to deploy these algorithms on robotic systems. BOLeRo closes this gap for policy search and evolutionary algorithms by delivering open-source implementations of behavior learning algorithms for robots. It is easy to integrate into robotic middleware and can be used to compare methods and develop prototypes in simulation.


Background and related work
Designing behaviors for robots is tedious work. Although planning is possible, it usually requires a detailed description of the solution and an accurate model of the world. Machine learning and optimization can mitigate this problem by learning in reality or by exploiting simulations. Behavior optimization and learning for robots (BOLERO) is designed to support research and development of behavior learning and optimization for robots. Its main goals are (1) to be a benchmark framework for behavior search algorithms (learning and optimization), (2) to provide a set of reference implementations for benchmark problems, behavior representations, and well-established, reliable behavior search (learning or optimization) algorithms, and (3) to make it easy to integrate behaviors or behavior search algorithms into real robotic systems.
Some of BOLERO's goals overlap with the goals of other open-source software packages: some focus on behavior learning algorithms and others on benchmark problems. A major problem is that many state-of-the-art experiments are hardly reproducible, particularly if robotic simulations are involved. Small changes in the simulation setup can significantly influence the problem complexity and quite often not all simulation details are published. For example, changing contact parameters for a walking system might influence the preferred walking pattern. In BOLERO, it is easy to create environments with the MARS simulation software (https://github.com/rock-simulation/mars).
Hence, BOLERO simplifies the publication of reference implementations of robotic learning environments. A more recent development that covers the same aspects as BOLERO is OpenAI Gym.1 It provides both very simple environments and complex robotic environments for reinforcement learning (RL) to improve the reproducibility of experiments and enable rigorous comparison of methods. Other works build upon OpenAI Gym, for example, by extending it with various simulated robotic scenarios.2
There are several behavior learning libraries that focus on classical RL,3,4 which often does not work out of the box for robotic problems. Although there are certainly works that apply standard RL5 to real robotic systems, RL has long struggled with the challenging problems posed by the real world: continuous state and action spaces, complex dynamics, noisy sensors, sparse rewards, and above all the demand for sample efficiency. A more recent branch of RL is deep RL.6 Deep RL is very interesting for robotics because it handles complex sensors like cameras easily. Some open-source implementations are available, for example, OpenAI Baselines,7 Dopamine,8 TF-Agents,9 or Garage.10 Deep RL is, however, not yet easily applicable to learning on real robotic systems. Currently, there is no established evaluation procedure in the community,11 and it is hard to tell whether a method is robust enough to be worth the implementation effort for roboticists. Its main problem, however, is sample efficiency. For example, hindsight experience replay (HER), a recently published method,12 needs about 40,000 episodes to learn how to push a puck on a table to multiple goals with approximately 95% success rate. A breakthrough for deep RL in robotics has been achieved by Levine et al.,13 but this achievement relied on additional pretraining procedures, a fully observable state space during the training phase, and additional methods from optimal control, which makes the whole process of generating new behaviors complicated and difficult to engineer. Publicly available implementations are often either research code or designed to learn behaviors of agents in simple virtual worlds, for example, video games. We see a niche here for a library that provides implementations of established behavior search algorithms that work well for robots and that is designed to support the behavior development workflow for a robot, and we propose BOLERO as a solution.
Reliable and successful solutions come from the fields of RL,14-23 optimal control,24 and black-box optimization.25-30 These approaches often rely on special types of policies for robotic problems, for example, dynamical movement primitives.31-35 For details on RL in robotics, please refer to Kober et al.36 BOLERO provides implementations for several of these algorithms.

Design and features
One of the main design decisions of BOLERO is to keep interfaces rather simple, which makes it possible to quickly integrate external methods. Decoupling through interfaces also provides flexibility in combining methods and applications, and it enables us to easily combine Python and C++. There are Python bindings for C++ components and, vice versa, it is possible to run Python components from C++ via the BOLERO interfaces.
BOLERO can be used as a library that provides behaviors, behavior search algorithms (learning and optimization), or simulation environments that represent behavior learning problems. It can also be used as a framework for the comparison of behavior learning algorithms. Figure 1 illustrates how BOLERO can be used as a benchmark framework. An environment defines the learning problem. A behavior is evaluated in the environment in a control loop: in each step, the behavior observes the current state of the environment and computes an action, and the action is executed in the environment. After the control loop completes, which is indicated by the environment, feedback (e.g. reward or fitness) is passed from the environment to the behavior search algorithm. The behavior search uses this feedback to generate new behaviors in a learning loop. This episodic learning process is implemented by a controller in BOLERO. This can be done with a C++ controller that loads benchmark configuration files, or Python scripts can be used to define the benchmark (see Algorithm 1 for an example). In this case, the control flow is defined by BOLERO and it is used as a framework, but this is not the only way to make use of the individual components.
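The episodic learning process described above can be sketched with a toy example. The classes below are hypothetical stand-ins written for illustration, not BOLERO's actual interfaces: a trivial environment, a constant-action behavior, and a random behavior search, wired together by a controller loop.

```python
import random

# Toy stand-ins for the environment/behavior/behavior-search roles
# described in the text (hypothetical names, not BOLERO's real API).

class ToyEnvironment:
    """Reach state 5.0 in 10 steps by choosing per-step increments."""
    def reset(self):
        self.state, self.steps = 0.0, 0

    def get_state(self):
        return self.state

    def execute(self, action):
        self.state += action
        self.steps += 1

    def is_evaluation_done(self):
        return self.steps >= 10

    def get_feedback(self):
        return -abs(self.state - 5.0)  # higher is better

class ConstantBehavior:
    """A trivial parameterized behavior: always output its parameter."""
    def __init__(self, param):
        self.param = param

    def compute_action(self, state):
        return self.param

class RandomSearch:
    """Keep the best parameter seen so far; propose random candidates."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.best_param, self.best_feedback = 0.0, float("-inf")

    def get_next_behavior(self):
        return ConstantBehavior(self.rng.uniform(-1.0, 1.0))

    def set_feedback(self, behavior, feedback):
        if feedback > self.best_feedback:
            self.best_feedback = feedback
            self.best_param = behavior.param

def run_controller(env, search, n_episodes):
    """Episodic learning loop: evaluate one behavior per episode."""
    for _ in range(n_episodes):
        behavior = search.get_next_behavior()
        env.reset()
        while not env.is_evaluation_done():   # control loop
            action = behavior.compute_action(env.get_state())
            env.execute(action)
        search.set_feedback(behavior, env.get_feedback())
    return search.best_param

best = run_controller(ToyEnvironment(), RandomSearch(), n_episodes=200)
```

The controller only talks to the environment and the behavior search through these small interfaces, which is the decoupling that lets BOLERO swap problems and algorithms independently.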
It is not required to change the design of existing software to use BOLERO. It can be used as a library that provides behaviors, behavior search algorithms, or environments. For integration in robotic middleware, we suggest using BOLERO as a library and letting the developer define the control flow based on the robotic framework. The transition from learning in simulation to a robotic system is a major challenge, and the choice of languages, methods, dependencies, and interfaces often complicates this matter. Most robotic frameworks use C++ as their core language (e.g. ROS37). Python, on the other hand, is one of the most important languages for machine learning, with a great ecosystem for scientific programming. We want to make BOLERO compatible with both worlds: it is possible to use it in both C++ and Python. We describe the BOLERO interfaces and available implementations in the next paragraphs.

Environments
Environments in BOLERO define a behavior learning problem. A behavior learning library for robots needs robotic simulations to test the feasibility of approaches and new methods and to train complete behaviors that can be used on real robots. There are various simulations that can be used to define robotic environments, for example, MARS, Bullet (http://bulletphysics.org), Gazebo,38 V-REP (http://www.coppeliarobotics.com), or MuJoCo.39 Communication with simulations is often not easy, and not all simulations are suitable for behavior learning. We prepared a base class for environments that use the MARS simulation software in BOLERO. With graphical tools like Phobos (https://github.com/rock-simulation/phobos), new robotic environments can be defined. An example is shown in Figure 2. Implementing new environments is also possible with the physics engine Bullet, for example, via the convenient pybullet API.
Since the key idea of OpenAI Gym1 of providing reproducible learning environments matches the concept of BOLERO, a coupling with OpenAI Gym is provided: a wrapper for OpenAI Gym environments is available in BOLERO.
Behaviors
BOLERO provides implementations of recent movement primitives such as Cartesian space dynamical movement primitives35 and probabilistic movement primitives.40

Behavior search algorithms
Behavior search algorithms often combine behavior representations (e.g. neural networks in deep RL and neuroevolution or movement primitives in policy search) with behavior optimization (e.g. policy gradient algorithms in deep RL or black-box optimization in episodic policy search).
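This combination of a behavior representation with a black-box optimizer can be illustrated in a few lines. The sketch below is a simplified stand-in, not BOLERO's actual optimizer interface: a minimal (1+1)-style hill climber with an ask/tell interface manipulates a flat parameter vector, while a separate linear behavior representation turns those parameters into actions.

```python
import random

# Sketch of the decoupling described above: the optimizer only sees
# parameter vectors and fitness values; the behavior representation
# turns parameters into actions. (Hypothetical interface names.)

class HillClimber:
    """Minimal (1+1)-style evolution strategy with ask/tell calls."""
    def __init__(self, n_params, sigma=0.3, seed=0):
        self.rng = random.Random(seed)
        self.best = [0.0] * n_params
        self.best_fitness = float("-inf")
        self.sigma = sigma
        self.candidate = None

    def get_next_parameters(self):               # "ask"
        self.candidate = [b + self.rng.gauss(0.0, self.sigma)
                          for b in self.best]
        return self.candidate

    def set_evaluation_feedback(self, fitness):  # "tell"
        if fitness > self.best_fitness:
            self.best_fitness = fitness
            self.best = self.candidate

def linear_behavior(params, state):
    """Behavior representation: action = w * state + b."""
    w, b = params
    return w * state + b

def episode_return(params):
    """Toy episodic problem: the action should equal 2*state + 1."""
    error = 0.0
    for state in [-1.0, 0.0, 1.0, 2.0]:
        action = linear_behavior(params, state)
        error += (action - (2.0 * state + 1.0)) ** 2
    return -error  # the optimizer maximizes

opt = HillClimber(n_params=2)
for _ in range(500):
    params = opt.get_next_parameters()
    opt.set_evaluation_feedback(episode_return(params))
# opt.best now approximates [2.0, 1.0]
```

Replacing the hill climber with CMA-ES or the linear policy with a movement primitive changes neither loop nor interfaces, which is the point of the decomposition.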
Implementations of policy search algorithms are available in BOLERO, for example, episodic16 and contextual relative entropy policy search (REPS20), covariance matrix adaptation evolution strategies (CMA-ES),26 ACM-ES (CMA-ES with a ranking support vector machine as surrogate model),41 contextual CMA-ES,30 and natural evolution strategies (NES).42 A wrapper around scikit-optimize,43 a library for model-based optimization, is integrated. An additional package (https://github.com/rock-learning/bayesian_optimization) for Bayesian optimization27 and Bayesian optimization for contextual policy search (BO-CPS)29 is available.
Step-based RL and deep RL algorithms are planned as future features.

Organization and source code
BOLERO currently has two maintainers. It is open-source software, and we would like to encourage everyone to contribute to the project by reporting issues or submitting pull requests to the public git repository. For example, new environments, behavior search algorithms, behavior representations, improved documentation, and tests would be very helpful contributions. BOLERO is a modular system, and it is easily possible to write extensions without write access to its core repository; it is only required to implement BOLERO's interfaces.
The source code of BOLERO is released at https://github.com/rock-learning/bolero under the 3-clause BSD license. Detailed documentation is available at https://rock-learning.github.io/bolero. We support Ubuntu 18.04 LTS (although it also works with other versions), Mac, and to some extent Windows.

Examples and applications
In this section, we show how BOLERO can be used in practice: we present an experiment in a simple Python script and a more complex study with a simulated walking robot.

Simple example
An episodic learning process (see Figure 1) can be organized by a controller in a simulation. A simple source code example is shown in Listing 1. The behavior in this case is a dynamical movement primitive (DMPBehavior), and the behavior search is a black-box search (BlackBoxSearch) that uses the black-box optimization algorithm CMA-ES (CMAESOptimizer) to modify the weights of the DMPBehavior. The environment is a simple trajectory planning problem in 2D (OptimumTrajectory): three circular obstacles have to be avoided on the path from start to goal while minimizing acceleration. The corresponding learning curve is displayed on the left side of Figure 3. The environment, the final trajectory, and several intermediate solutions are shown on the right side.
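The structure of this experiment can be mimicked in a self-contained script that does not use BOLERO's actual classes: a single via point parameterizes the trajectory, plain random search stands in for CMA-ES, and one circular obstacle stands in for the three in OptimumTrajectory. All names and values below are invented for illustration.

```python
import math
import random

# Self-contained analogue of the experiment described above:
# optimize a 2D trajectory from start to goal that avoids a circular
# obstacle while keeping accelerations small.

START, GOAL = (0.0, 0.0), (1.0, 1.0)
OBSTACLE, RADIUS = (0.5, 0.5), 0.2

def cost(via):
    # Discretize the two-segment path start -> via -> goal.
    points = []
    for a, b in [(START, via), (via, GOAL)]:
        for t in [i / 10.0 for i in range(10)]:
            points.append((a[0] + t * (b[0] - a[0]),
                           a[1] + t * (b[1] - a[1])))
    points.append(GOAL)
    # Penalize points inside the obstacle.
    penalty = 0.0
    for x, y in points:
        d = math.hypot(x - OBSTACLE[0], y - OBSTACLE[1])
        if d < RADIUS:
            penalty += 100.0 * (RADIUS - d)
    # Acceleration proxy: squared second differences along the path.
    accel = sum((points[i + 1][0] - 2 * points[i][0] + points[i - 1][0]) ** 2 +
                (points[i + 1][1] - 2 * points[i][1] + points[i - 1][1]) ** 2
                for i in range(1, len(points) - 1))
    return penalty + accel

rng = random.Random(0)
best_via, best_cost = (0.5, 0.5), cost((0.5, 0.5))
for _ in range(300):
    via = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
    c = cost(via)
    if c < best_cost:
        best_via, best_cost = via, c
# best_via is pushed off the obstacle center, away from the
# straight-line path that would cut through the obstacle.
```

In the real Listing 1 the DMP provides a much richer trajectory parameterization and CMA-ES searches far more efficiently, but the episodic evaluate-and-improve pattern is the same.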

Learning to walk
Another example application is the development of locomotion patterns for a legged robot. The purpose of this section is to describe a benchmark for legged locomotion. The benchmark can be used to compare different approaches with respect to the controller of a legged system, the learning algorithm, and the evaluation function used. This is of interest since legged locomotion is a complex problem and many publications in that field are hard to compare due to the different simulation setups used. For this benchmark, a four-legged walking robot is designed, which is shown in Figure 4. The physical properties of the system are chosen such that it could represent a real robotic system. The motors are controlled by a position controller defining the speed of the motors, while the motor torques required to generate this speed are produced by the MARS simulation. For the benchmark, a general locomotion environment is implemented (https://github.com/rock-learning/locomotion_environment). The configuration of the environment makes it possible to specify evaluation criteria such as motor torques, structure load, velocity, or measured feet slippage. Additionally, a model of the robot to load and a scene defining the simulated environment can be specified. For a first experiment on this benchmark, the genetic algorithm SABRE44 is used to optimize a controller defined in the graph-based programming language BAGEL.45 In BAGEL, a control algorithm is implemented through hierarchical dataflow graphs in which the nodes represent "atomic" computations or a lower level BAGEL graph. Due to this representation of algorithms, a learning method can adapt the algorithm structure together with its parameters. A central pattern generator (CPG)46 is designed with BAGEL. The general concept of the controller is shown in Figure 5. The CPG produces a global phase φ with 0 ≤ φ ≤ 1, multiple phase shifts Δφ_i, and local phases φ_i.
The CPG is used as a base pattern generator, whereby the projection j(φ_i) from the local phases to the angular joint patterns is generated by the genetic algorithm SABRE. A Gaussian pattern approach is used as the representation for joint trajectories. Hence, the genetic algorithm can generate multiple parameterized Gaussian core patterns that are overlaid by a weighted sum to produce a joint trajectory. The Gaussian pattern approach is similar to a radial basis function network, but it is optimized to generate a periodic function and to support asymmetric shapes.
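The Gaussian pattern idea can be illustrated numerically. In the sketch below, Gaussians are evaluated over the wrap-around phase distance so that the summed pattern repeats with the phase period; the centers, widths, and weights are invented for this example and are not taken from the paper.

```python
import math

# Illustration of the Gaussian pattern approach described above:
# a joint trajectory is the weighted sum of Gaussian "core patterns"
# placed over the phase interval [0, 1). Wrapping the phase distance
# makes the resulting function periodic.

def periodic_gaussian(phase, center, width):
    """Gaussian over the circular phase distance, so the pattern
    repeats with period 1."""
    d = abs(phase - center) % 1.0
    d = min(d, 1.0 - d)          # wrap-around distance on the circle
    return math.exp(-(d / width) ** 2)

# (center, width, weight) of each core pattern; asymmetric shapes
# arise from overlapping patterns of different widths and signs.
patterns = [(0.2, 0.05, 0.8), (0.35, 0.15, -0.3), (0.7, 0.1, 0.5)]

def joint_angle(phase):
    """Joint trajectory as a weighted sum of core patterns."""
    return sum(w * periodic_gaussian(phase, c, s)
               for c, s, w in patterns)

# Sample one period of the resulting joint trajectory.
samples = [joint_angle(i / 100.0) for i in range(100)]
```

Unlike a plain radial basis function network, the wrapped distance guarantees periodicity, which is what a cyclic gait pattern needs.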
Each behavior is evaluated in the simulation for a test period of 30 s. The first second is ignored for calculating the fitness value, removing the first acceleration phase of the robot from the stability evaluation.
The evaluation is aborted if parts of the robot other than the feet are in collision (n_gc > 0). In that case, the fitness is defined as a constant value (100,000,000) minus the simulation time in milliseconds until the collision is detected. Otherwise, a feet slippage percentage f_s is combined with two fitness terms f_t1 and f_t2 into the individual fitness f_ind. The first term f_t1 includes the average motor torques τ_J and joint loads τ_r (torques applied to the joints but not on the motor axis) divided by the final forward position of the robot p_x. The robot position is limited to a value of 3 m; thus, once the optimization creates behaviors that reach the 3 m mark, it continues to optimize stability instead of producing faster individuals. Furthermore, to prevent too large fitness values or a division by zero, the robot position is clipped to positive values starting from 0.01 m. To include the starting position at 0 m, an offset of 10 cm is added to the robot position beforehand.

Figure 5. Basic concept for the CPG-based joint trajectory generation. The global Phase Generator module is not depicted, though its output (global phase) is plotted in the first graph. Based on the global phase, the Pulse module generates a pulse pattern that is transferred to a "local" phase in the Slope Generator module. The local phase is aligned by the pulse pattern and transferred to a joint pattern in the Joint Trajectory module.
The second term f_t2 represents the stability fitness and includes the standard deviations of the motor torques σ_τJ, the joint loads σ_τr, the roll σ_α and pitch σ_β angles of the robot, the robot's velocity σ_vx, and the height σ_pz of the torso. The first term represents the general goal of moving the robot forward with minimal effort and minimal stress on the mechanics. The second term increases the focus on steady behaviors to reduce the probability that the optimization converges to local optima with very dynamic, fast, and unstable behaviors. A result generated with the proposed setup is depicted in Figure 6. The evolved controller produces a stable walking behavior and covers 4.5 m in 20 s, including a short acceleration phase in the first 2.5 s, resulting in an average walking speed of 0.25 m/s after the acceleration.
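The bookkeeping parts of this fitness definition that are fully specified in the text can be sketched directly. Only the collision-abort value and the position offset/clipping are taken from the description above; the exact combination of f_s, f_t1, and f_t2 into f_ind is not reproduced here, and the order of offset and clipping is an assumption.

```python
# Sketch of the fitness bookkeeping described above (partial; the
# full fitness formula is not reproduced). The offset-then-clip
# order is an assumption based on the text.

COLLISION_PENALTY = 100_000_000

def collision_fitness(time_until_collision_s):
    """Fitness when a non-foot body part collides: the constant
    penalty minus the simulation time in ms until the collision."""
    return COLLISION_PENALTY - time_until_collision_s * 1000.0

def effective_position(p_x):
    """Forward position used as divisor in f_t1: a 10 cm offset is
    added first, then the value is clipped to [0.01 m, 3 m]."""
    return min(max(p_x + 0.10, 0.01), 3.0)
```

The clipping explains the two-phase optimization behavior described above: beyond 3 m the divisor saturates, so only the effort and stability terms can still improve.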
The robot model used for this experiment can be found at https://github.com/rock-simulation/easy4_robot. The source code of the behavior representation BAGEL, the genetic algorithm SABRE, and the generic locomotion environment is being prepared for publication.
An example configuration file is shown in Listing 2, whereby the configuration of the genetic algorithm is given in a separate configuration file. Langosz47 provides more detailed information on SABRE, BAGEL, the locomotion environment, and how they can be used to generate locomotion solutions.

Reproducible research
Reproducing results from publications is a notoriously hard problem in the domain of robotics and machine learning. One of the goals of BOLERO is to provide well-tested implementations of algorithms and test applications. If a simulation tool and model are used to produce scientific results, the experiments can often not be reproduced if the tools and models are not published in the exact configuration in which they were used. Providing the exact learning configuration used in publications based on BOLERO is one of the main goals of the framework. Another idea that we pursue with BOLERO is to produce reference implementations of existing published algorithms. An example of how we imagine this to be done is the reimplementation of contextual CMA-ES.30 The learning curves for C-REPS and C-CMA-ES in Figure 1(a) and (b) of the original publication could be reproduced with BOLERO's implementation of the algorithm (see Figure 7).

Listing 2. A minimal configuration file for the locomotion learning example.

Figure 7. Reproduction of the learning curves from Figure 1 of the original publication.30 Code and documentation are available at https://github.com/rock-learning/bolero/tree/master/benchmarks/ccmaes.

Conclusion and outlook
BOLERO provides a development and test environment for new learning approaches and scenarios. We separate learning problems, learning algorithms, and behavior representations via defined interfaces, which makes BOLERO open for extensions and its parts reusable in other settings. We provide easy-to-use interfaces that support many of the common learning setups used in RL and evolutionary computation. Additionally, with the learning controller, BOLERO is well suited to perform benchmarks of learning methods.
A benchmark framework that makes it possible to configure a set of learning problems and algorithms is currently in development. The framework will automatically summarize results and evaluations by creating default statistics and plots. Furthermore, it will provide a cloud computing interface to distribute a set of defined benchmarks or single experiments. The latter allows the distributed evaluation of a whole population in an evolutionary application.

Acknowledgments
The authors would like to thank Jan Hendrik Metzen in particular for initiating the conception of BOLERO as a collaboration between researchers in RL and evolutionary algorithms and significant contributions to the software. The authors would like to thank Patrick Draheim for helpful feedback on an earlier version of this manuscript and our anonymous reviewers who significantly improved the quality of this article.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported through four grants of the German Federal