1. Introduction
We see today an increasing proliferation of mobile robotic systems in real-world scenarios. To be able to deploy autonomous machines in our vicinity, there is a clear need for robust control designs that can handle unforeseen situations and perform safely in complex environments. In this work, we introduce a sampling-based control algorithm for a general class of stochastic dynamical systems. Our method highlights a path to fulfill the requirements on modern control designs using a combination of reactive online control and data-driven offline learning.
It is usually intractable to analytically devise a globally stable feedback controller for a system with multiple, non-linearly coupled states and inputs, that is operating in a partially unknown environment. Accordingly, in recent years, almost all sophisticated control designs for complex robotic systems in real-world scenarios can be associated with one of two fundamental design strategies: model-based optimal control (OC) and data-driven reinforcement learning (RL). Both strategies come with their own set of difficulties and compromises.
OC methods require the specification of an underlying model, where the complexity of the model must strike a balance between multiple, conflicting objectives (
Chung et al., 2016): On the one hand, the model must capture the essential dynamic behavior of the system and its interaction with the environment. On the other hand, it needs to be simple enough to create a smooth optimization landscape such that the solution is not sensitive to the initialization and gradient updates do not become trapped in an undesired local minimum. The dependence on gradients limits the types of scenarios that can be modeled and requires that derivatives can be computed. The model also needs to be computationally lean to achieve the fast update rates of the controller that are required to achieve robustness against disturbances. In contrast, data-driven methods, such as RL, in general do not require the manual specification of a model, but require careful attention to the data generation process to find a balance between sample complexity and sufficient realism. This is especially true when applying RL to real robotic systems (
Peters et al., 2016). How to systematically devise a suitable data generation pipeline for learning complex control policies is an open problem and an active field of research.
Furthermore, data-driven methods have only become viable candidates for online control of higher-dimensional systems after introducing function approximation, i.e., fitting a parametrized control law, which may be evaluated rapidly online, to the data generated in a costly offline process (
Bertsekas, 2005;
Sutton and Barto, 1998). The decisive advantage of learning a controller from data is that there are minimal restrictions on modeling the robot and its environment because the model does not need to be in an analytical form. Hence, a general-purpose physical simulator is often used in the robotics context. Consequently, it requires little additional effort to represent complex scenarios such as uneven ground geometries or random disturbances in the problem formulation and prepare the learned policy for these contingencies.
Beyond modeling of the system and environment, a mathematical representation for the task at hand must be formulated when constructing an OC or RL problem. In data-driven algorithms, the task specification is almost exclusively embedded in the cost (or reward) function. For OC the complexity of the cost function is usually limited due to the differentiability requirement and the danger of local minima. However, a powerful feature of OC is its ability to respect constraints. Such conditions can be used to encode a goal, safety boundaries, or to prioritize a hierarchical set of tasks, all of which is typically difficult to guarantee in an RL algorithm.
In this article, we introduce a way to combine the ease of modeling realistic scenarios with the constraint-handling abilities and reactiveness of online OC. Our work is founded on the theory of path integral stochastic optimal control (PISOC) (
Kappen, 2005;
Todorov, 2006): At its core, our method infers the optimal input by drawing trajectory samples from a stochastic system. The underlying dynamics model may thereby be represented by an arbitrarily complex formulation, such as a complete simulator. These noisy rollouts provide a principled way to encode uncertainty, which translates into robustness of the optimized control commands against disturbances, model mismatches, and imperfect actuation. The stochastic model also ensures exploration into unseen states, thereby providing information on favorable locations in the cost landscape without the need for a smooth and differentiable problem description.
While sampling in continuous, high-dimensional spaces is generally prone to failure, we use two mechanisms in our work that reduce the sampling complexity and allow us to execute our algorithm in real-time. First, we extend the path integral (PI) formalism to explicitly handle equality constraints. Our results show that such constraints can be dealt with using a projection approach, which significantly reduces the sampling space and provides a powerful mechanism to introduce domain-specific knowledge. Second, we use importance sampling by integrating a known (sub-optimal) controller into the sampling procedure. This underlying policy steers the sampling distribution and increases the informational content for each sample. If such an ancillary controller is not known or not easily obtainable, e.g., by solving a simpler deterministic OC problem, it is possible to use the exact same PISOC algorithm in an offline setting to train a feedback policy, which eventually becomes a suitable importance sampler.
Compared with deterministic model predictive control (MPC), our algorithm benefits from advantages that are typically attributed to data-driven methods: as no derivatives of the system dynamics are required, we achieve a high degree of flexibility in modeling and can directly integrate complex physics engines into the controller. This allows us to model disturbances beyond uncertain dynamics, for example, arbitrary obstacles and uneven ground geometry. Furthermore, the stochastic problem setting creates an automatic balance between robustness and performance, which would otherwise require careful tuning of the cost function. It also leads to exploration, which is difficult to achieve in deterministic control, particularly if exploration needs to happen across non-smooth events such as contact switches. Finally, the sampling procedure, which consumes the bulk of the computation time, can be parallelized entirely on modern hardware.
We reap these benefits while being able to retain several merits from model-based OC methods. Most importantly, we can respect equality constraints exactly through a projection approach while allowing the flexible integration of other constraints in the dynamics or cost function. Our algorithm is fast enough to be run online and thus lends itself naturally to receding horizon control. Combined with its ability to include any existing control law as an ancillary controller, we can flexibly build on established control schemes and advance them with robust online adaptation through our method. Such a combination can be powerful in the context of robots interacting with their perceived environment, for example, by fusing an existing model-based proprioceptive controller with our sampling procedure to achieve online adaptation to terrain variations and obstacles. In addition, if an ancillary controller is too slow for online execution or even no such feedback policy is known a priori, the same sampling procedure can be used in an imitation learning (IL) or RL setting to bootstrap itself. Therefore, our algorithm can be viewed as an online sampling-based planner or an offline learning algorithm depending on the execution setting.
Our developed method is broadly applicable to a wide range of robotic systems. In this work, we highlight its performance on three exemplary systems of increasing difficulty level: a fixed-base robotic arm, a mobile ball-balancing robot, and a quadrupedal walking robot. In particular, legged robots belong to the most challenging class of robotic systems owing to their strongly non-linear and high-dimensional dynamics and non-smooth contact switching behavior. Moreover, they are intended to be operated in the most challenging and uncertain environments. For this reason, we choose this class of robots as one of the three subjects of our experimental study. Our experiments indicate that our method’s advantages in terms of flexible importance sampling and ability to handle constraints become especially pronounced for walking systems.
1.1 Statement of contributions
This article makes a case for solving stochastic OC problems using a hybrid approach based on classical control theory, receding horizon PI control, and machine learning. We provide a way to reap the modeling flexibilities of sampling-based methods while retaining the constraint-handling and real-time capabilities of model-based MPC. In that respect, the PI formalism provides a natural way to combine policy learning ideas with classical model-based or hand-tuned controllers. The resulting algorithm presented in this article can be applied to the general class of non-linear, control-affine dynamical systems. The computed solution respects state-input equality constraints.
Specifically, this article presents the following contributions.
•
We extend the PI algorithm to handle time-variant state-input equality constraints.
•
We demonstrate that the PI sampling complexity for unstable systems is reduced by introducing constraints and importance sampling.
•
We validate that our measures yield a sampling efficiency that is high enough for online control of a quadrupedal robot.
•
We showcase how the developed algorithm can also be used in an offline setting to (pre-)train a sampling policy.
•
We illustrate how our PISOC algorithm can exploit the freedom in modeling by using a physics simulator to provide state trajectories in complex environments.
•
We conduct experiments on a quadrupedal robot in combination with a stepping stone terrain to demonstrate the effectiveness of our real-time sampling algorithm.
2. Related work
The core of our algorithm is based on PISOC. The theory of PI control (
Kappen, 2005;
Theodorou, 2015;
Todorov, 2006) suggests a mathematically elegant procedure to determine optimal controls while avoiding parametrized representations of the value function and control policy. Under a few regularity assumptions and the application of the Feynman–Kac formula (see
Øksendal, 2003), the theory states that the evaluation of the OC at any given state is given by a conditional expectation value.
In discrete state and action spaces, it is conceivable that the expectation can realistically be evaluated for the finite set of state–action pairs (
Azar et al., 2012). However, for most stochastic control problems computing this expectation in closed form is intractable. To address this problem, different analytical options have been proposed in the literature, for example, the Laplace approximation (
Kappen, 2005) or the method of implicit sampling (
Morzfeld, 2015). These methods presume that an analytical description of the underlying system dynamics is available. If this is not the case, the most frequently used approach for estimating the expectation value is Monte Carlo sampling. Drawing a sample in this context means computing a rollout of the stochastic system dynamics over the specified time horizon. Except for systems with limited complexity, the corresponding sampling space is vast, thus rendering naïve sampling schemes unusable owing to exceedingly high variance of the estimation.
Intuitively, what makes sampling difficult is that we are hoping to discover a sensible behavior from a system that is purely driven by noise. The application of Girsanov’s change-of-measure formula provides a remedy by revealing how the sought expectation value can also be obtained from sampling a stochastic process that is driven by noisy controls from an arbitrary controller (
Doucet et al., 2001). Such a controller can be the solution from the previous iteration of the algorithm or also any externally given (feedback) policy. Variations of this idea have been explored in the literature, for example,
Stulp and Sigaud (2012),
Thijssen and Kappen (2015),
Kappen and Ruiz (2016), and
Williams et al. (2017,
2018b). In our work, we also make extensive use of this form of importance sampling.
Another family of methods that avoid sampling in high-dimensional action spaces are those related to the seminal Policy Improvement with Path Integrals (PI²) method by
Theodorou et al. (2010). Their key idea is to reformulate the stochastic optimal control (SOC) problem in a way such that the control action becomes a set of parameters of a feedback controller. The cardinality of the sampling space thereby reduces to the size of a single parameter vector, which is typically much smaller than that of the original control space (input dimension times number of time steps). Many works have followed up on the idea of optimizing parameters through PI ideas, for example,
Buchli et al. (2011),
Kalakrishnan et al. (2011), and
Pastor et al. (2011).
In a second line of work, an information-theoretic viewpoint on SOC was developed in the literature, which avoids the notion of a PI. From this perspective, SOC can equivalently be formulated as an inference problem (
Kappen et al., 2012;
Theodorou and Todorov, 2012;
Todorov, 2009;
Toussaint, 2009). Therein, the optimal policy is computed through the minimization of the Kullback–Leibler (KL) divergence between the optimal state distribution and the state distribution that arises from applying the current policy. Although still not analytically solvable, this minimization can be attempted iteratively, yielding practical algorithms for optimizing parametrized feedback controllers. In our work, we build on ideas from
Gómez et al. (2014),
Thijssen and Kappen (2015), and
Kappen and Ruiz (2016) to devise such a built-in optimization mechanism for learning a parametrized feedback policy that is subsequently employed as an importance sampling mechanism.
The cross-entropy view of PI naturally suggests a probability-weighted update rule to the learnable parameters of a sampling policy. The idea of using probability-weighted averaging rather than gradient estimation is shared with numerous approaches that derive from covariance matrix adaptation evolution strategy (CMA-ES) and similar cross-entropy arguments (
Stulp and Sigaud, 2012). More interrelations between SOC, online learning, and statistical inference are investigated by
Rawlik et al. (2012) and
Wagener et al. (2019). Interestingly, the learning component of our method also bears similarity to the popular soft actor–critic (SAC) (
Haarnoja et al., 2018), an off-policy RL algorithm. A difference is, however, that we use an online estimate of the value function based on the current set of samples instead of learning a Q-function. Furthermore, we do not employ function approximation except for the learned policy itself. There are many instances of PI control in a robotic context, both for parameter estimation as well as direct control, for example
Sugimoto and Morimoto (2011),
Rombokas et al. (2012),
Theodorou et al. (2010),
Rawlik et al. (2013),
Williams et al. (2017),
Heijmink et al. (2017),
Okada and Taniguchi (2018),
Hoeller et al. (2019), and
Abraham et al. (2020). In typical use cases, the modeled optimization problem is an unconstrained one. Many successful PI control applications have been shown on systems with a purposefully designed input space or where a set of default inputs already yields stable behavior (
Goldfain et al., 2019;
Theodorou et al., 2010). The use of iterative learning and importance sampling allows us to directly apply our method to the system’s continuous input space without an intermediate representation, such as movement primitives.
One disadvantage of using a sampling-based over a gradient-based OC algorithm is an increase in computation cost and, accordingly, a slower algorithm. Fortunately, the nature of the PI algorithm lends itself naturally to parallel computation because the bulk of the computation time is spent in forward-simulating system trajectories under independent noise realizations. Therefore, parallelization of the sampling procedure is the primary mechanism to achieve the required computational efficiency for running PISOC solvers in real-time.
Williams et al. (2017) have proposed an efficient GPU implementation for online PI control, termed model predictive path integral (MPPI). Although our work does not claim to be of the same implementation maturity, we follow the same general idea of using receding horizon PI control with parallel computation to offset the computational burden of the sampling-based approach.
3. Preliminaries: PI control
We begin by recalling the theoretical background of our method and the notation used throughout the article. The underlying theory of PI control resides on linearly solvable Markov decision processes and has been introduced by
Kappen (2005) and
Todorov (2006).
3.1. Problem formulation
PI methods operate on SOC problems defined by the control-affine dynamics
which are driven by an exogenous input vector and normalized white noise of the same dimension. The flow map defines the dynamics of the system’s state in the absence of any inputs and stochasticity. The system’s evolution (1) is a Brownian motion with drift and diffusion terms. This formulation ensures that the noise enters through the input channel only, which is a requirement of the PI formalism and means that only controllable degrees of freedom are perturbed.
The scalar cost function of our problem takes the form of an expectation of path costs over all possible noise realizations
where denotes the expectation over all trajectories that start at and are governed by the dynamics (1). We use the notation to denote the range of times from to . We assume a positive-definite and symmetric control cost matrix . The potential and final cost may be arbitrary functions of the state and can include areas with infinite potential, for example to model obstacles or unsafe regions.
In short, the SOC problem can be stated as
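A generic problem of this class, written here with illustrative notation in the style of the PI literature (Kappen, 2005; Todorov, 2006) rather than the article’s own symbols, combines control-affine stochastic dynamics with an expected path cost:

```latex
% Generic unconstrained PISOC problem (illustrative notation)
\mathrm{d}\mathbf{x} = f(\mathbf{x},t)\,\mathrm{d}t
  + G(\mathbf{x},t)\big(\mathbf{u}\,\mathrm{d}t + B(\mathbf{x},t)\,\mathrm{d}\mathbf{w}\big),
\qquad
\min_{\mathbf{u}}\ \mathbb{E}\Big[\Phi\big(\mathbf{x}(T)\big)
  + \int_{t_0}^{T} \Big(q(\mathbf{x},t) + \tfrac{1}{2}\,\mathbf{u}^{\top} R\,\mathbf{u}\Big)\mathrm{d}t\Big]
```

Here, f is the flow map, G the input matrix, B the noise scaling, q the potential, Φ the final cost, and R the control cost matrix.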
3.2. Optimal cost-to-go
The optimal cost-to-go is a function that represents the minimal cost (in expectation) that can be achieved when starting from state at time :
Solving the OC problem entails finding the optimal cost-to-go. The optimality conditions of this minimization lead to a non-linear partial differential equation (PDE) named Hamilton-Jacobi-Bellman (HJB) equation. Finding a global solution to the HJB is generally intractable and not expedient for control applications. Luckily, under the condition that the noise magnitude is inversely related to the control cost,
i.e., the control effort is cheap in directions where the noise is strong (up to a proportionality constant), the second-order terms in the HJB cancel. Under these conditions, there exists a variable transformation of the form
that linearizes the PDE in terms of . The exponentiated cost-to-go function is commonly known as the desirability function.
The second practical aspect that eases the computational demands is that control applications typically only require the OC at the current state. In combination with the mentioned linearization of the HJB, it can be shown (
Kappen, 2005) that the point-wise optimal cost-to-go can be written as an expectation over all possible paths that the uncontrolled stochastic dynamics (i.e., under the passive control policy) may take
The operator thereby signifies the expectation over all trajectories that start at and are governed by policy and dynamics (1).
The optimal input at the current time and state is related to the derivative of the optimal cost-to-go via the relationship
This formula is generally true for continuous-time OC problems with quadratic control cost and control-affine dynamics (
Bertsekas, 2005).
Thijssen and Kappen (2015) prove that, in our stochastic setting, this derivative may also be written as an expectation operator, which yields a formula for the OC analogous to the optimal cost-to-go:
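For reference, in the standard unconstrained setting of Kappen (2005) and Thijssen and Kappen (2015), and again with illustrative notation, these relations take the well-known form

```latex
% Log transformation, Feynman--Kac expectation, and optimal control
% (standard unconstrained forms; notation illustrative)
V(\mathbf{x},t) = -\lambda \log \Psi(\mathbf{x},t), \qquad
\Psi(\mathbf{x},t) = \mathbb{E}\Big[\exp\!\big(-\tfrac{1}{\lambda} S(\tau)\big)\,\Big|\,\mathbf{x}(t)=\mathbf{x}\Big], \qquad
\mathbf{u}^{*}(\mathbf{x},t) = -R^{-1} G(\mathbf{x},t)^{\top}\,\partial_{\mathbf{x}} V(\mathbf{x},t),
```

where S(τ) collects the state and final costs accumulated along a trajectory τ of the uncontrolled stochastic dynamics, λ is the proportionality constant mentioned above, and V denotes the optimal cost-to-go.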
4. Method
Having established the theoretical foundations, we now present our core method (Section 4.1) and several extensions that make PI control applicable to a wide range of robotic systems. Of these extensions our results show that temperature tuning (Section 4.2) and importance sampling (Section 4.3) are crucial, whereas the learning aspect (Section 4.4) is a useful but optional enhancement.
4.1. Constrained PI control
The first crucial ingredient of our method is the generalization of the PI formalism to constrained problem settings.
Williams et al. (2018a) mentioned that inequality constraints that are purely a function of the state, such as joint limits, can be handled through suitable barrier or indicator functions inside the potential
. Limits on the control commands may be enforced by quadratic penalties to keep the overall path cost quadratic in the control input.
In the following, we focus specifically on the treatment of state-input equality constraints because these conditions can be handled through a projection scheme. This is particularly beneficial for our method because it results in a dimensionality reduction of the sampling space. Detailed derivations for this section can be found in
Appendix A.
We consider problem (4) with a slightly more complex path cost
Similar to the potential, the additional coupling term can be an arbitrary function of the state. Furthermore, our problem includes an additional control-affine equality constraint of the form
under the assumption that the constraint matrix has full row rank for all attainable states, while the remaining term may be an arbitrary function. To keep our notation compact, we henceforth drop the explicit state and time dependence for most terms.
The resulting definition of the cost-to-go is analogous to (5), except that the minimization is constrained by the additional condition (12), which leads to the expression
We can now use the method of Lagrange multipliers and rules from Itô calculus to derive optimality conditions that respect the dynamics (1) and the constraint (12). For the input this procedure results in the expression
whereas the constrained HJB equation reads
Here, we have used the following projection operators and shorthand symbols to keep the notation concise:
When comparing (14) with (9), one can observe that the control is constraint-satisfactory by design due to the projection operator . Note that we denote open-loop controls by and distinguish them from feedback policies that generally depend on the state. Here is the constrained equivalent to the passive policy above, i.e., it minimizes the control cost subject to the constraints but is ignorant of the state cost .
As indicated in the preliminaries, there exists a log transformation for the optimal cost-to-go, which defines the desirability function . This transformation, together with a structural assumption about the noise characteristics
linearizes the HJB in terms of
. This linearization procedure gives rise to the name “linearly-solvable Markov decision problems” (
Todorov, 2006). The standard assumption taken in (19) means that the noise magnitude is inversely proportional to the control cost and that noise cannot act in constrained directions. The parameter
relates the overall noise magnitude to the control cost. As noise is usually regarded as a driver of exploration rather than a property of the underlying system, (19) merely dictates how the noise covariance
should be calculated.
According to the Feynman–Kac formula, the point-wise solution of this linear PDE can be estimated by an expectation operator (
Kappen, 2005)
with scalar where the expectation is over all trajectories that start at at time and are governed by the diffusion
The expectation (20) may be computed by any desired statistical method. As our algorithm applies to general system dynamics whose functional form may not be known, we choose to evaluate the expectation through Monte Carlo sampling. Importantly, this choice imposes minimal restrictions on the dynamics and cost terms and only requires that we have access to the evaluations of these quantities. Notably, this permits the use of implicitly given system dynamics, for example, through the evaluation of a physics simulator.
Finally, the optimal input to apply to the system at the current state follows from the state derivative of the value function and reads
where the first term in the expectation is the noisy input applied to the system in the first instance of the respective sampled path. Note that the computation of the OC is numerically robust against diverged samples because their high cost automatically makes their contribution to the expectation negligible. Alternatively, and with the same effect, failed samples may directly be discarded. The exact definition of failed samples is dependent on the problem context but generally corresponds to reaching a state from which the system cannot recover.
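As a rough sketch of how such a Monte Carlo estimate can be formed in practice, the snippet below (function and variable names are ours, not from the article’s implementation) replaces the expectation by a finite-sample average over the recorded rollouts and discards diverged samples:

```python
import numpy as np

def pi_optimal_input(path_costs, first_noisy_inputs, lam, diverged):
    """Estimate the optimal input at the current state from N rollouts.

    path_costs        : (N,) accumulated path cost of each rollout
    first_noisy_inputs: (N, m) noisy input applied at the first time step of each rollout
    lam               : temperature relating noise level and control cost
    diverged          : (N,) boolean mask of failed samples to discard
    """
    keep = ~np.asarray(diverged)
    S = np.asarray(path_costs)[keep]
    U = np.asarray(first_noisy_inputs)[keep]
    # Soft-max weighting by negative path cost; in practice the costs are
    # normalized first (see Section 4.2) to avoid numerical under-/overflow.
    w = np.exp(-(S - S.min()) / lam)
    w /= w.sum()
    return w @ U  # weighted average of the first-step noisy inputs
```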
4.2. Automatic temperature tuning
When computing the expectation by Monte Carlo sampling of random trajectories, the formulas (20) and (22) can be interpreted as a soft-max operation on the negative trajectory cost with as the temperature parameter. This temperature determines how sharply the algorithm discriminates between high- and low-cost samples.
Strict adherence to the PI theory requires setting
, i.e., the soft-max temperature is directly related to the noise level in the system. However, experience shows that, for most scenarios, such a direct implementation is a poor choice because the magnitude of the cost terms
can be arbitrary (
Williams et al., 2018a). Bad scaling may lead to numerical problems because the evaluation of the exponential can encounter numerical under- or overflow.
We use two methods to reduce the dependency on problem-specific scales. First, the weighted contribution of each sample is normalized by the minimum and maximum costs of the entire batch of drawn samples within the exponential:
The computation of the optimal input, corresponding to (22), then reads
where is the initial noise increment applied to sample and the sum runs over all samples .
The second mechanism we introduce is an automatic adjustment of the temperature coefficient . We initialize according to the theory, but then allow it to change depending on the effective sample size
This measure indicates the fraction of samples that are effectively contributing to the final OC. We find that is a reasonable target value that strikes a good compromise between choosing very few samples (high variance solution) and allowing many inferior samples to contribute to the optimal input. Our simple update strategy
automatically adapts each iteration to the current cost scale without introducing additional problem-specific tuning parameters.
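A minimal sketch of the normalization, the effective sample size, and the temperature adaptation is given below; the concrete multiplicative update and the target fraction of 10% are assumptions on our part, included only to illustrate the mechanism:

```python
import numpy as np

def normalized_weights(path_costs, temperature):
    """Soft-max weights with per-batch min-max cost normalization, cf. (23)."""
    S = np.asarray(path_costs)
    S_norm = (S - S.min()) / max(S.max() - S.min(), 1e-9)
    w = np.exp(-S_norm / temperature)
    return w / w.sum()

def effective_sample_size(w):
    """Fraction of samples that effectively contribute to the optimal input."""
    return 1.0 / (np.sum(w ** 2) * len(w))

def adapt_temperature(temperature, ess, target=0.1, rate=0.1):
    # Illustrative update: raise the temperature if too few samples contribute,
    # lower it if too many do (the article's exact rule is not reproduced here).
    return temperature * (1.0 + rate) if ess < target else temperature * (1.0 - rate)
```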
4.3. Importance sampling
In an online control application of our algorithm, it is critical that updates to the OC are computed as quickly as possible. Most of the computation time is spent in the sampling procedure for evaluating the expectation (22).
As with other Monte Carlo methods, the required number of samples for a reliable statistical estimate depends on the distribution from which the samples are drawn. The mechanism for shaping the sampling distribution is called importance sampling. The following sections outline two techniques that we use in our controller to improve the efficiency of the sampling procedure.
4.3.1. Importance sampling with an ancillary controller
When estimating the expectation (22) with Monte Carlo samples, the trajectory distribution induced by the stochastic dynamics and the default
1 policy
is likely inefficient because an overwhelming majority of samples will end up in high-cost areas of the state space that are irrelevant to the optimal solution. When only a few “lucky” samples dominate, the Monte Carlo estimator has an extremely high variance in estimating the OCs. The negative effect on robot control is two-fold in this case. First, the optimal input trajectory computed by an iteration of the algorithm is very noisy due to the independent noise terms
in each time step, which are only averaged over a small number of samples. Second, the solution in the next iteration may be significantly different, leading to a jump in the reference state and input that is passed to the robot.
Importance sampling with an ancillary controller can be used to change the sampling distribution to provide an estimator of lower variance. An active importance sampling policy that still satisfies the problem’s constraints is generally of the form
By construction the resulting controls will always remain constraint satisfactory. The likelihood ratio (or Radon–Nikodym derivative in this case) that corrects for the shifted sample distribution can be computed through Girsanov’s theorem (
Theodorou and Todorov, 2012). As a result, we may sample from the controlled dynamics
and compute the desirability function as
If the sampling policy stabilizes the system to remain in low-cost regions of the state space, the additionally arising control cost terms will be more than compensated for by a much lower state cost. Therefore, the introduction of the importance sampling policy results in much more informative samples.
Conveniently, the controller can take arbitrary forms. Suitable candidates that we use in this work are, for example: (a) last iteration’s optimal input sequence, (b) a model-based fallback controller, and (c) a learned sampling policy.
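The sketch below illustrates the structure of one such importance-sampled rollout. The simulator step, the ancillary policy, and the exact correction terms are placeholders; the control-dependent terms added to the path cost follow the standard change-of-measure form of the unconstrained PI literature and are only schematic here, since the constrained version additionally involves the projection operators of Section 4.1:

```python
import numpy as np

def controlled_rollout(x0, sampler_policy, step, stage_cost, R, noise_cov, dt, horizon, rng):
    """One importance-sampled rollout driven by an ancillary policy (schematic)."""
    x, S, noisy_inputs = x0, 0.0, []
    B = np.linalg.cholesky(noise_cov)
    for _ in range(horizon):
        u = sampler_policy(x)                      # drift from the ancillary controller
        noise = B @ (rng.standard_normal(len(u)) * np.sqrt(dt))
        x_next = step(x, u * dt + noise)           # controlled diffusion, cf. (28)
        # Path cost including the control-dependent correction terms that arise
        # from the change of measure, cf. (29) (schematic, unconstrained form).
        S += stage_cost(x) * dt + 0.5 * u @ R @ u * dt + u @ R @ noise
        noisy_inputs.append(u + noise / dt)
        x = x_next
    return S, np.array(noisy_inputs)
```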
4.3.2. Elite samples
Another convenient trick that reduces the computational burden of simulating stochastic trajectories is the concept of elite samples, which is inspired by particle filtering and evolutionary algorithms (
De Jong, 1975).
In a receding horizon application of the SOC method, the starting states of consecutive calls to our controller are usually very close to each other. Therefore, the input sequence of successful samples from the previous iteration is likely to produce good results in the current iteration too.
What we do, therefore, is to carry over the input sequence of the best samples from the previous iteration and re-evaluate their associated path costs. We find that saving approximately the top 10% of samples for the next iteration already results in more consistent expectation values and significantly smoother control commands across iterations.
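A minimal sketch of this carry-over (with hypothetical names) simply retains the lowest-cost input sequences so they can be re-evaluated from the new initial state in the next iteration:

```python
import numpy as np

def select_elites(path_costs, input_sequences, elite_fraction=0.1):
    """Return the input sequences of the best-performing (lowest-cost) samples."""
    costs = np.asarray(path_costs)
    n_elite = max(1, int(elite_fraction * len(costs)))
    elite_idx = np.argsort(costs)[:n_elite]
    return [input_sequences[i] for i in elite_idx]
```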
4.4. Learning an importance sampling policy
So far, the introduced importance sampling techniques have assumed that a reasonable control sequence is available, either from the previous iteration of the algorithm or from an external ancillary controller.
In theory, even when starting without a known control sequence or policy, the repeated application of the PI algorithm should yield successively better control sequences and, subsequently, more informative samples each iteration. However, this improvement cycle can fail to start if the majority of rollouts fail to find an appropriate solution initially. The problem is significantly exacerbated in the case of unstable systems where the chances of finding stabilizing controls are diminishing and where previously stable (open-loop) control sequences may crash when subjected to a different noise realization.
A remedy to these issues is feedback control: a state-dependent controller
can be used during the stochastic rollouts to reject disturbances instead of forward-simulating fragile open-loop controls. Indeed it was shown by
Thijssen and Kappen (2015) that the optimal importance sampler coincides with the optimal feedback controller. Therefore, if our ancillary controller is already close to the optimal policy, the sampling procedure yields mostly informative samples and the estimated control command will be close to the optimum.
If no such policy is available
a priori, we can use our same algorithm to train a parametrized policy with machine learning: based on the cross-entropy formulation of PI control,
Kappen and Ruiz (2016) derive an update rule for the general class of sampling policies
that are parametrized by the vector
. An alternative derivation of a sum-of-squares error function by
Farshidian and Buchli (2013) yields the same update equation. The overall idea is to update the sampler’s parameters at each iteration of the PISOC algorithm with information from the currently sampled trajectories. The update rule is similar to the behavior cloning setup of IL
where is a learning rate and the effectively applied control command (including noise) in rollout at time .
In this work, our parametrized sampling policy is a feedforward neural network, but one may apply the same principle to other parametrized function classes such as motion primitives. For the case of neural networks, the gradient updates can conveniently be calculated and applied through standard machine learning software frameworks.
Similar to an off-policy policy search algorithm in the RL context, the parametrized policy can learn from any sample irrespective of the underlying controller that was used during the rollout to produce the noisy controls . In contrast to policy search, however, our learned importance sampler does not need to become a stable feedback controller before it becomes useful in guiding the sampling process towards low-cost regions. The reason is that it is still “deployed” as an importance sampler in a controlled diffusion process (28), thereby merely guiding the samples instead of dictating the final controls directly.
On the practical side, we note that the update rule (30) can lead to overfitting to a small part of the state space when the system remains stationary for an extended time period. This effect is not surprising, considering that the distribution of training data becomes strongly concentrated around the current state. Unfortunately, such uniform data may result in catastrophic forgetting of how to behave in other parts of the state space. Inspired by prioritized experience replay methods in RL (
Lin, 1992;
Schaul et al., 2016), our countermeasure is to keep a history of the best performing sample for each of the last 500 time steps in memory and include them in the policy gradient computation.
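The following PyTorch-style sketch indicates how such a probability-weighted update and the small replay memory can be combined; the function names, the loss form, and the weighting of replayed samples are illustrative choices on our part rather than the article’s implementation:

```python
import torch

def update_sampler(policy, optimizer, states, applied_controls, weights, replay):
    """One gradient step on the parametrized sampling policy, cf. (30).

    states, applied_controls : visited states and effectively applied (noisy) controls
    weights                  : per-sample soft-max weights from the backward pass
    replay                   : list of (state, control) pairs of the best sample
                               of recent time steps (simple replay memory)
    """
    if replay:
        s_old, u_old = map(torch.stack, zip(*replay))
        states = torch.cat([states, s_old])
        applied_controls = torch.cat([applied_controls, u_old])
        weights = torch.cat([weights, torch.full((len(s_old),), weights.mean().item())])
    # Probability-weighted behavior cloning: pull the policy output towards
    # the noisy controls of low-cost samples.
    errors = ((policy(states) - applied_controls) ** 2).sum(dim=-1)
    loss = (weights * errors).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```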
5. Implementation
For the application of our PISOC algorithm on real robotic systems, we devise a software implementation that is agnostic to the problem-specific components such as the dynamics, cost function, and constraints. A schematic overview of the components of the method is depicted in
Figure 1. The algorithm is included as a solver module in the
Optimal Control for Switched Systems (OCS2) toolbox.
2 The key steps of the algorithm at each control cycle are outlined in
Algorithm 1.
5.1. Forward–backward pass
The computational procedure can be decomposed into two distinct phases, similar to OC shooting methods. First, our forward pass generates samples by rolling out the controlled stochastic dynamics. Sampling these trajectories can be thought of as a process of imagining possible future realizations. A discrete-time version of (28), which is suitable for implementation on a digital computer, reads
where we have used the Euler–Maruyama discretization method with a fixed time step for simplicity. All state-dependent quantities are evaluated at and the noise increment is sampled according to . The covariance matrix is determined from (19) according to the currently active constraints and control costs.
During each rollout, we record a trajectory of states, inputs, noise increments, and stage cost. Should the system’s state diverge during the simulation, we directly stop the time integration and remove this sample from the memory because its contribution to the expectation would have been negligible in any case. The entire forward-pass procedure is implemented as a set of independent tasks (one per sample), which can be processed in parallel by several worker threads.
Afterwards, the backward pass processes the batch of samples. The stage cost of each sample is accumulated from the final to initial time, providing a cost-to-go
for each time step. The relative importance of each sample and the resulting OC can subsequently be computed according to (23) and (24) for all time steps. Finally, the optimized control inputs and reference trajectory are transmitted to the robot to be tracked until the next iteration of PISOC is complete. Note that, although suggested by
Williams et al. (2018a), we do not require any control signal smoothing and no such filter is implemented in our experiments. We suspect that the auto-tuning mechanism of the temperature parameter
ensures that the averaging effect of many samples is always large enough to forgo additional smoothing.
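The backward pass can be summarized by the following sketch, which accumulates the recorded stage costs from the final to the initial time and then applies the weighting of Section 4.2 at every time step (array shapes and names are ours):

```python
import numpy as np

def backward_pass(stage_costs, noisy_inputs, temperature):
    """Compute the optimal input trajectory from a batch of recorded rollouts.

    stage_costs  : (N, T) per-sample stage cost at each time step
    noisy_inputs : (N, T, m) noisy inputs applied during the rollouts
    """
    # Accumulate from final to initial time -> cost-to-go per time step.
    cost_to_go = np.cumsum(stage_costs[:, ::-1], axis=1)[:, ::-1]
    u_opt = np.zeros(noisy_inputs.shape[1:])
    for k in range(noisy_inputs.shape[1]):
        S = cost_to_go[:, k]
        S_norm = (S - S.min()) / max(S.max() - S.min(), 1e-9)
        w = np.exp(-S_norm / temperature)
        u_opt[k] = (w / w.sum()) @ noisy_inputs[:, k, :]
    return u_opt
```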
5.2. Importance sampling
All but the simplest systems under our consideration require a form of importance sampling to achieve the required sampling efficiency for real-time control. Apart from the mentioned options of simply re-using the optimal input sequence from the previous iteration, we experiment in Section 7 with using a deterministic sequential linear-quadratic (SLQ) MPC solver based on
Farshidian et al. (2017) and
Grandia et al. (2019) as ancillary controller. To avoid additional delays in our MPC iteration, we calculate such a computationally intensive sampling policy
in parallel to the forward pass as yet another independent task. The employed sampling policy is, therefore, always the one computed during the previous iteration of our algorithm.
Furthermore, in experiments relating to a learned policy for importance sampling, we use a standard, fully connected feed-forward neural network architecture independent of the system at hand. The network has a single hidden layer with tanh activation and twice as many neurons as inputs.
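For concreteness, such a network could be defined as follows (a sketch; the machine learning framework is our choice and not prescribed by the article):

```python
import torch.nn as nn

def make_sampler_network(obs_dim, action_dim):
    """Fully connected sampler with a single hidden tanh layer that has
    twice as many neurons as inputs, as described above."""
    return nn.Sequential(
        nn.Linear(obs_dim, 2 * obs_dim),
        nn.Tanh(),
        nn.Linear(2 * obs_dim, action_dim),
    )
```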
6. Illustrative example
We use a simple, illustrative example to visually present some core properties of our algorithm. Consider a two-link, fixed base robot arm as depicted in
Figure 2. We want to model and control this system kinematically. Therefore, the state
of the robot contains the two joint angles and the control input
corresponds to the commanded joint velocities. The cost function penalizes control effort and encourages the end-effector to arrive at the goal location at the final time. We choose a time horizon of 1 second and an integration time step of 10 ms. In summary, the overall SOC problem reads
where is defined according to (19) with noise level . The control cost weight models a situation where the first joint is more costly to move: . Given this cost function, we intuitively expect the optimal solution to use the elbow joint almost exclusively, because its associated control cost is much smaller than that of the shoulder joint.
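The following sketch sets up a kinematic rollout and cost for this example; the link lengths, the goal position, and the numerical cost weights are placeholder values of ours, chosen only so that the first joint is more costly to move:

```python
import numpy as np

L1, L2, dt, T = 1.0, 1.0, 0.01, 100            # link lengths, 10 ms steps, 1 s horizon
R = np.diag([10.0, 1.0])                        # shoulder joint costlier than elbow
goal = np.array([1.2, 1.2])                     # desired end-effector position

def end_effector(q):
    """Forward kinematics of the planar two-link arm."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def rollout_cost(q0, joint_velocities):
    """Integrate commanded joint velocities and accumulate control effort
    plus a terminal cost on the end-effector position."""
    q, cost = np.array(q0, dtype=float), 0.0
    for u in joint_velocities:                  # shape (T, 2)
        cost += 0.5 * u @ R @ u * dt
        q = q + u * dt
    return cost + 100.0 * np.sum((end_effector(q) - goal) ** 2)
```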
We begin by investigating the effects of different noise levels
on this unconstrained problem. In
Figure 3, one can observe the relative frequency of task-space positions that are encountered by the sampling process for the unconstrained system without any importance sampling scheme. For very low noise levels, the shoulder joint is hardly excited by the perturbations and, subsequently, the bulk of the samples are concentrated around a circular arc around the elbow joint. As the noise level increases, more and more samples begin exploring a larger proportion of the state-space and, therefore, have a better chance of discovering possible solutions to the problem. For the same noise levels,
Table 1 lists the effective proportion of samples that contribute to the solution. For the naïve
choice, the effective sample size correlates positively with the noise magnitude because the 10-fold increase in the noise dominates any variations in the path costs. Counterintuitively, despite samples spreading further into sub-optimal regions (see
Figure 3: more samples explore the region close to the origin, which is not expedient here), a larger proportion of samples contributes to the optimal solution. We can avoid this unfortunate dependence on the absolute noise magnitude with our auto-tuning mechanism. Subsequently, our numbers show an opposing trend: as samples spread further, a smaller fraction of samples ends up in low-cost regions. Therefore fewer of them contribute to the overall solution. At the same time, we are automatically keeping the relative sample size at a level where still sufficiently many noise realizations are averaged to produce a smooth control command.
We now introduce an additional constraint on the end-effector motion by requiring that the end-effector shall only move along a straight line between current position and goal (shown in red in
Figure 2). Such a specification can be motivated through safety concerns, e.g., because the permissible workspace of the robot is limited or because it is preferred that the robot moves in a predictable way. The scalar constraint equation is
with describing the direction along the constraint and being the end-effector Jacobian that fulfills .
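One common way to realize such a projection, shown here only as a sketch and not as the article’s exact construction, is to build the null-space projector of the control-affine constraint and multiply the sampled joint velocities with it:

```python
import numpy as np

def line_constraint_projector(jacobian, direction):
    """Null-space projector for the line constraint.

    The end-effector velocity J(q) u must be parallel to `direction`, i.e.,
    its component along the line normal n must vanish: (n^T J) u = 0.
    """
    n = np.array([-direction[1], direction[0]])   # 2D normal of the line
    A = (n @ jacobian).reshape(1, -1)             # control-affine constraint matrix
    return np.eye(A.shape[1]) - np.linalg.pinv(A) @ A

# Usage: u_projected = line_constraint_projector(J, d) @ u_sampled
```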
Figure 4 shows sampled trajectories together with the chosen optimal path for the constrained and unconstrained problem (32). It is evident that they both reach the specified goal position, yet the paths they take are very different. Without constraint, as anticipated, the preferred solution uses mostly the cheaper and more noisy elbow joint. When subjected to the constraint, the samples have no other option than using both joints simultaneously to remain on the constraint manifold.
Note that it is unrealistic to expect that the samples exactly end up at the goal location because the initial state is relatively far away from the goal and it takes most samples at least 70 time steps to make their way to the goal. Although increasing the number of samples would help, it is also not a problem in practice because the algorithm usually runs in a receding horizon fashion, hence only the first few time steps are applied to the system before replanning.
Turning to the path cost distribution of the samples,
Figure 5 reveals that the best score achieved by any sample of the constrained problem is worse than the best score achieved in the unconstrained setting. This is expected because the constraint forces the system into a sub-optimal path with unnecessarily expensive shoulder joint motion. Important to note, however, is that the variance of the underlying sampling procedure is reduced by a factor of
in the constrained scenario. This variance reduction effect will gain more significance as more complicated (unstable, higher-dimensional) systems are considered in the following sections: While the optimal solution for the unconstrained 2D system in this section can still be found with a manageable number of samples, the sampling space in higher dimensions quickly grows beyond computational limits. We, therefore, regard it as vital to leverage constraints in order to reduce the sampling complexity and to generate more reliable, low-variance solutions at the expense of injecting some bias.
As a final showcase on this illustrative problem setting, we demonstrate that the introduction of a very simple importance sampling heuristic can already have noticeable effects on the sampling performance. To support this point, we introduce the importance sampling policy
, which solely moves the elbow joint at a fixed rate
such that it would end up in a configuration close to the desired one in the absence of noise.
Figure 5 shows that this simple heuristic already has visible effects on the sample distribution of the unconstrained system, which is now peaked at a better cost value. This suggests that the importance sampling policy need not be an elaborate controller to already have a significant effect on the sampling distribution.
7. Results
In this section, we apply the algorithm that is presented in Sections 4 and 5 to two unstable robotic systems. Extending the preliminary outcomes of the illustrative example, the following simulation and hardware results showcase the applicability of the method to a broad spectrum of robot designs and environments.
The noise level for each system is kept constant across all experiments unless specifically mentioned. Our rule of thumb to find an appropriate value for this scalar parameter is to increase it as much as possible for maximum exploration while keeping it limited to a level where the system remains controllable. Furthermore, the planning horizon is 1 second for both problems.
In both of the following examples, the internal dynamics (1) are implicitly given by the evaluation of a physics simulator. Specifically, we use the physics engine RaiSim (
Hwangbo et al., 2018) to forward-simulate the dynamics. This simulator was shown to perform fast and accurately with articulated robotic systems. In addition, it is simple to model the environment in the form of a heightmap inside the simulator in order to evaluate the performance of our algorithm in more realistic scenarios. In experiments that involve a heightmap, this map is predefined (either randomly generated or manually modeled) and static because a continuous perception and mapping pipeline is out of the scope of this work.
7.1. Ballbot: a ball-balancing robot
In this section, we show simulation results for Ballbot (
Figure 6), a ball-balancing robot. Its state comprises 10 DOF, corresponding to the planar position and 3D orientation of the base, as well as the time derivatives of said quantities. The system has three control inputs that match the torques applied between platform and ball. These commands can directly be passed to the simulation for forward integration without needing an intermediate tracking controller.
The OC problem that we are considering for this system has no constraints and a cost function of the form
where the time-dependent state reference is fixed to a nominal stationary state except for the linear base position and velocity components, which are determined by the operator’s command. Finally, the low-level control loop on the robot merely tracks the ball torques commanded by the PISOC optimizer, absent of any state feedback or pre-stabilization.
In the following, we consider different importance sampling policies for the Ballbot SOC problem. One of them is an existing SLQ MPC algorithm developed by
Minniti et al. (2019), which we denote with
(actively controlled sampling policy). This algorithm internally uses a system model based on the rigid-body dynamics equations of motion. Importantly, the deterministic OC is computed under the assumption of perfectly flat ground, no slippage between the ball and the ground, and no actuator effects (
Farshidian et al., 2014). Although the resulting ancillary policy should be a good sampling prior (
Thijssen and Kappen, 2015), the resulting estimator is generally not optimal owing to the restrictive modeling assumptions and the fact that no stochasticity is considered. We can expect, however, that the generated samples are in the vicinity of an optimal solution and, therefore, serve as a meaningful comparison for other sampling strategies in the following experiments.
7.1.1. PISOC performance
Despite the instability of the uncontrolled system, our algorithm manages to control the Ballbot even without an externally provided sampling policy. The relative ease of stabilization can be attributed to the fact that the system has an equilibrium point at the default (upright) state with zero input. Therefore, even the default passive policy automatically explores in all directions and yields sufficiently informative samples to compute an optimal stabilizing policy.
Although not strictly necessary, it is nonetheless insightful to observe the behavior of our algorithm when adding a neural network as parameterized importance sampling policy . In the case of Ballbot on flat ground, we observe that approximately 500 gradient update steps (i.e., controller time steps) are sufficient to make the policy converge to a stable importance sampler. Surprisingly, once converged, the sample size can even be reduced to one, meaning that the learned sampler has converged to a stable feedback controller and is, therefore, a zero-variance estimator of the optimal input (22).
In order to make the task more interesting, we place the robot on hilly terrain, as shown in
Figure 6. Under this condition, we observe that the system remains stable but does not converge to its target when standing on sloped ground. Remarkably, even when employing the SLQ ancillary controller
a noticeable offset persists. The reason is that sufficiently steep slopes cause all rollouts that evolve around terrain-unaware policies to drift downhill, thereby creating a biased sampling distribution. In the following, we show that learning an importance sampling policy provides a remedy to this drift problem.
7.1.2. Online learning
The attempt to learn a parametrized importance sampling policy of the form for the Ballbot system on uneven ground would make the policy specific to the exact terrain used during training, thereby preventing generalization to unseen environments. A more practical and interesting alternative is to apply the parameter updates (30) during online control. Indeed, we find that the learned sampler’s long-term performance in stabilizing the system with a small number of samples is significantly improved if the gradient updates (30) continue to be applied beyond an offline burn-in phase.
The receding-horizon application of our algorithm with learned importance sampling can be interpreted as a form of online learning: in contrast to learning a globally optimal policy before deployment, our sampler continuously adapts to the changing environment and state that it encounters at runtime. In particular, in the case of learning a linear state feedback controller, the update rule (30) can be interpreted as a form of integral controller that accumulates information from the state history. In contrast to a standard integral controller, however, the feedback signal to our learner is the error between the estimated OC and the applied control instead of the state tracking error. Even when the state error persists, this crucial distinction prevents the learner from experiencing unbounded integrator windup.
For the Ballbot system, the advantages of the integrating behavior can be seen when encountering inclined terrain.
Figure 7 shows that the samples that are unaware of the local terrain end up in locations far away from the goal at the origin. The learned controller, on the other hand, understands that a bias force needs to be applied to fight against the inclination. Therefore, despite being unaware of the terrain directly, the learned policy uses the information of previous rollouts to infer that it is advantageous to apply a bias force against the slope.
Finally, we also observe a failure mode of continuous learning in our experiment: when the system remains at a fixed state for longer periods of time, the collected data may eventually be harmful because the learner experiences local overfitting to the current state. This failure case emphasizes the need for our proposed replay memory structure to promote generalization. In addition, the PISOC structure easily allows switching the sampling policy to a fallback controller, which should be used in case a sudden drop in effective sample size is detected.
7.2. ANYmal: a quadrupedal robot
Our second system is the quadruped robot ANYmal, depicted in
Figure 8, for which we present hardware results. With 12 joints, 4 contact points, and a floating base, this system is significantly more complicated and unstable than Ballbot.
For the purpose of forward-simulating stochastic rollouts, we again use the RaiSim physics engine. Internally, the simulator treats ANYmal as an 18-DOF articulated system and expects a 12-dimensional control input, corresponding to the joint torques. However, to make our algorithm comparable to a deterministic MPC benchmark, the control space that the PISOC algorithm operates on is only a kinodynamic one. This means the control input
contains the contact forces
that act at each foot as well as the joint velocities. A full description of the kinodynamic model can be found in
Grandia et al. (2019). A standard inverse dynamics controller computes the transformation from these controls to joint torques. This controller and the simulator together are represented by the dynamic model quantities
. A subtle but important feature of this formulation is that the state of the system as seen by the PISOC algorithm evolves continuously because the joint velocities, which may be affected by an impact, are not part of the kinodynamic state.
We use a quadratic cost function analogous to (36). The only difference is that the control weight matrix penalizes velocities in the operational space of the foot instead of joint space. Importantly, this allows shaping of the noise magnitude in the Cartesian directions of the foot motion, which is more intuitive than doing so in joint space.
In this example, constraints are used to encode a pre-defined gait sequence that we want the walking robot to obey: support legs are forced to be stationary in the world reference frame while swing legs must not apply any contact force and are made to follow a predetermined velocity profile in the direction perpendicular to the ground:
Owing to the kinodynamic input vector, these conditions can be formulated in the control-affine form (12). For all following investigations on ANYmal, the switching times between stance and swing legs are assumed to be given according to a user-defined gait pattern.
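As an illustration of how these conditions fit the control-affine form (12), the sketch below assembles per-leg constraint rows for a kinodynamic input that stacks the 12 contact forces before the 12 joint velocities; the input ordering, the helper names, and the Jacobian convention are assumptions of ours:

```python
import numpy as np

def gait_constraints(leg_in_contact, foot_jacobians, swing_velocity, t):
    """Assemble (A, b) with A @ u + b = 0 for u = [contact forces, joint velocities].

    foot_jacobians : list of four (3, 12) Jacobians mapping joint velocities to
                     world-frame foot velocities
    swing_velocity : desired vertical foot velocity of a swing leg at time t
    """
    rows, offsets = [], []
    for leg, J in enumerate(foot_jacobians):
        force_sel = np.zeros((3, 24)); force_sel[:, 3 * leg:3 * leg + 3] = np.eye(3)
        foot_vel = np.zeros((3, 24)); foot_vel[:, 12:] = J
        if leg_in_contact[leg]:
            rows.append(foot_vel); offsets.append(np.zeros(3))        # stance foot stationary
        else:
            rows.append(force_sel); offsets.append(np.zeros(3))       # zero swing-leg force
            rows.append(foot_vel[2:3])                                # vertical velocity tracks profile
            offsets.append(np.array([-swing_velocity(leg, t)]))
    return np.vstack(rows), np.concatenate(offsets)
```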
For ANYmal, there also exists an SLQ MPC implementation by
Grandia et al. (2019), which we use as a baseline active sampling policy. For this deterministic OC solver the robot is modeled as a kinodynamic system with 24 states and 24 inputs. Kinodynamic in this context means that the torso of the robot is approximated as a single rigid body to which massless legs are attached. Thus, the state includes the pose and twist of the robot’s base as well as three joint positions per leg. The control input corresponds to the foot contact forces and the joint velocities, just as introduced for the PISOC problem above.
The increased complexity of the problem forces us to execute the PISOC algorithm on a separate computer, distinct from that which runs the state estimator and tracking controller (see
Figure 1: physical separation of the shaded boxes). With a well-performing importance sampler, we find that drawing 64 samples in each iteration is a good compromise between computation time and solution quality and smoothness. On the robot side, the optimized kinodynamic state and control trajectories are translated into joint torques through a hierarchical inverse dynamics controller (
Bellicoso et al., 2017). The tracking controller and PISOC algorithm run asynchronously on different machines and subsequently always operate on the most recent state or OC solution available. This physical separation makes the control task more difficult owing to additional network delays.
7.2.1. Importance sampling is critical
The illustrative problem in Section 6 already showed that even a straightforward importance sampling scheme has the potential to improve the effective sample size. Now, we demonstrate that this effect becomes more pronounced for systems with more DOF.
In
Figure 9 we plot the sampling performance under different importance sampling policies
for the task of stabilizing a standing quadruped. We compare four different options for importance sampling: the default constrained policy of sampling around zero contact forces (
, “zero”), a simple heuristic input that commands zero joint velocities and constant contact forces that compensate the weight of the robot equally (“weight/4”), open-loop SLQ MPC policies (“SLQ openloop”), and feedback SLQ MPC policies (“SLQ feedback”).
Several important insights can be gained from this figure. First, even a simple pre-stabilization such as gravity compensation already improves the average sample performance by an order of magnitude. Second, another order of magnitude improvement can be gained by computing a control sequence that is aware of the current state and system evolution, as SLQ does. In the presence of noise and a complex, unstable system, it is otherwise extremely unlikely to find good samples. Finally, a sampling policy with feedback drastically reduces the sampling variance because random perturbations can already be compensated for during the rollout. However, the price for feedback is a lack of exploration, which prevents discovering trajectories that may achieve even lower costs.
When repeating the same experiment for a walking motion, we find that it is virtually hopeless to produce a good sample without an appropriate sampling policy for a system that is both unstable and has a large sampling space. The reader is referred to the supplementary video for a demonstration.
In summary, our results provide empirical evidence that importance sampling is not merely helpful but vital as the complexity and instability of a system grow. Conveniently, the structure of PISOC allows almost any existing controller to be incorporated as an importance sampler, so that importance sampling requires almost no additional effort for many existing systems.
7.2.2. Learning importance sampling
Having established that using an ancillary controller as a sampling policy can significantly improve the sampling efficiency, we now show that such a policy can also be learned with our PISOC algorithm. Doing so is particularly helpful for online control if the ancillary controller cannot run in a real-time setting. In addition, learning can be useful because the trained policy can be conditioned on additional observations beyond the system’s state. We make use of this possibility when training a sampler for the ANYmal system: the augmented observation that serves as input to the neural network also includes a phase variable and a contact flag for each leg, which encode the requested gait. For a fair comparison, the state that the learned policy receives is the same kinodynamic state used by the SLQ optimizer rather than the full dynamic state from the simulator.
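One possible way to assemble this augmented observation is sketched below; the per-leg encoding of phase and contact information is our assumption, not the paper's exact feature design:

```python
import numpy as np

def augmented_observation(kinodynamic_state, gait_phase, contact_flags):
    """Observation for the learned sampler: kinodynamic state plus per-leg gait information.

    kinodynamic_state: 24-vector (the same quantities the SLQ optimizer uses).
    gait_phase:        4-vector, phase of each leg within its stance/swing cycle.
    contact_flags:     4-vector of 0/1 flags encoding the requested contact state.
    """
    return np.concatenate([np.asarray(kinodynamic_state, dtype=float),
                           np.asarray(gait_phase, dtype=float),
                           np.asarray(contact_flags, dtype=float)])
```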
To assess the quality of the sampling procedure, we compare the average sample path cost, the standard deviation of the costs, and the total number of successful samples. An optimal sampler would consistently draw a sample with the lowest possible cost for the given starting state and noise realization. The progression of learning a sampling policy for a static walk motion of ANYmal from randomly initialized parameters can be seen in Figure 10. Already after roughly 300 iterations (i.e., 300 gradient updates), the learned sampler begins to outperform the SLQ policy in terms of average reward and is on par with it in the number of successfully completed rollouts. A successful sample in this context is one that did not diverge numerically and did not end up in an unrecoverable state. Rollouts around the last iteration’s best input sequences return scores that are, on average, similar to those of the best importance sampling strategy. The SLQ policy is biased towards a solution that is only optimal under its internal deterministic kinodynamic model and not under the stochastic, simulated dynamics. However, its aggressive feedback gains yield very consistent sample returns independent of the noise realization, making it an especially low-variance sampling policy. The fact that the spread of the learned policy’s samples is smaller than that of the open-loop elite samples indicates that the learned policy also developed some form of state feedback, which allows it to counter noise perturbations during a rollout. Visually, all three policies exhibit very similar behavior.
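For illustration, such a success check could look as follows; the thresholds and state indices are placeholders rather than values used in the paper:

```python
import numpy as np

def is_successful_sample(state_trajectory, max_abs_state=1e4, max_tilt_rad=1.0):
    """Success check sketch: no numerical divergence and no unrecoverable final state."""
    states = np.asarray(state_trajectory)
    if not np.all(np.isfinite(states)):
        return False                               # numerical divergence
    if np.any(np.abs(states) > max_abs_state):
        return False                               # blow-up of some state component
    roll, pitch = states[-1, 3], states[-1, 4]     # assumed indices of base roll and pitch
    return abs(roll) < max_tilt_rad and abs(pitch) < max_tilt_rad
```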
These promising results can be reproduced for different gaits on ANYmal. However, we also note that the behavior of the learned sampling policy is only valid in the neighborhood of the data it was presented with. For unseen motions, e.g., larger setpoint changes than trained with, the learned sampler cannot consistently outperform the SLQ policy. This is due to the well-known problem of distribution mismatch between training and test data. Fortunately, the sampling procedure of PISOC acts as a form of safety net that does not assume that the importance sampler is an optimal policy in all states and hence provides an extra layer of robustness in these situations.
7.2.3. Constrained problems
Through our use of projection matrices onto the constraint null space for both the controlled input (18) and (27) and the noise (19), it is guaranteed by design that all samples fulfill the constraint (12). The OC is a convex combination of the noisy controls applied during sampling; therefore, the OC also satisfies the affine constraint.
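For completeness, the argument in generic notation (the symbols $C$, $d$, and $w_i$ are ours and need not match the paper's): writing the affine constraint (12) as $C(x,t)\,u + d(x,t) = 0$ and the OC as the weighted average $u^{\ast} = \sum_i w_i\, u^{(i)}$ over feasible samples with normalized weights $\sum_i w_i = 1$,

$$C\,u^{\ast} + d \;=\; \sum_i w_i \bigl( C\,u^{(i)} + d \bigr) \;=\; 0,$$

so feasibility of every individual sample carries over to their convex combination.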
In Figure 11, we validate this statement for the ANYmal example by plotting the task-space trajectories of one foot. From the figure, it is evident that the swing constraint (vertical direction) is fulfilled because all sampled trajectories coincide. In the other directions, the samples are free to explore in order to find suitable foothold positions, shown here exemplarily for one of the horizontal directions. Note also that once a swing phase is complete, the feet are forced to remain stationary, which is enforced individually for each sample.
The illustrative example in Section 6 has already demonstrated that constraints can significantly reduce the sampling complexity and that this variance reduction usually comes at the expense of a biased solution. In the case of ANYmal, the difference between a constrained and an unconstrained problem formulation is especially striking: with gait constraints, the samples reasonably explore where to step to keep balance, as can be observed in Figure 11. When we temporarily disable the constraints (37) and (38), not a single sample manages to find a suitable walking motion, even after several iterations of the algorithm. Instead, the best solution found is a vibrating leg motion that can hardly keep the robot balanced. Moreover, when issuing a forward-motion command in such an unconstrained setting, the robot begins to shift its body in the desired direction but fails to discover a stepping behavior and eventually collapses. The difference between the constrained and unconstrained setting is best appreciated by watching the accompanying video. The fact that not a single sample discovered how to achieve a lower path cost by stepping forward emphasizes that the use of constraints is an essential mechanism to bound the sampling complexity.
7.2.4. Robustness and exploration over local minima
A key motivation for stochastic, sampling-based algorithms is their ability to explore unknown states and thereby gain robustness and overcome local minima.
Figure 3 has already foreshadowed that the injection of noise promotes exploration, yet the cost landscape of the 2D robot arm was too simple to benefit from further exploration.
For ANYmal, we now show that there are indeed scenarios where exploration is critical for success. Consider the case where our robot is expected to cross a terrain segment where only a few sparse areas (“stepping stones”) represent suitable footholds.
Figure 12 displays the foot trajectories that are proposed by the sampling procedure when a single stone is reachable within the horizon. The visualization shows that the ancillary noise-free SLQ policy (drawn in yellow), which is unaware of the terrain, would have commanded the foot to miss the stepping stone and only hit it on the side. Such samples, however, are quickly rejected because the corresponding state diverges, as can be seen on the right-hand side of the figure. Therefore, although our importance sampling policy predicts a fatal move, the exploration allows us to probe the surrounding area and discover safe footholds. In the video, we additionally show the case where the algorithm explores the option of stepping on different stones and sometimes decides to go zero, one, or two stones forward in a single swing phase. This result shows that sampling-induced exploration is a powerful mechanism for reacting to unforeseen circumstances and refining the underlying importance sampling policy.
Another interesting question is the relationship between the noise level and the robustness against unseen environments. In Table 2, we relate the size of the gaps between stepping stones to the noise level that is required to discover the next safe foothold. As the overall magnitude of the noise depends on the scale of the cost function, we state the noise magnitude relative to the default value used throughout the rest of the experiments on ANYmal. From the table, it can be seen that more stochasticity is generally required as the task becomes harder (that is, further away from the nominal sampling policy). However, there is a limit to how much the noise can be increased because eventually the OC itself becomes too noisy to track and excites undesired vibration modes on the system.
7.2.5. Hardware deployment
Finally, we deploy the proposed algorithm on the physical ANYmal robot. Although the transfer from simulation to reality does not require any additional tricks on the algorithm side, a key challenge is the computational burden of the sampling procedure. The update rate on our machine (AMD Ryzen 9 3950X, 16 × 3.6 GHz) reaches approximately 15 Hz when rolling out 64 samples per iteration. The resulting average policy delay (computation and network transmission) of 0.12 seconds appears to be just small enough for real-time control on ANYmal. In the accompanying video, one can observe a robust walking behavior that tracks the user’s commands in terms of displacement and yaw orientation.
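Structurally, the per-iteration work is the parallel rollout of the 64 noisy input sequences. The sketch below illustrates this with a thread pool; it assumes that the simulator call releases the interpreter lock (as C++-backed bindings typically do) and is not the paper's actual CPU-parallel implementation:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def sample_rollouts(rollout_fn, state, noise_sequences, num_workers=16):
    """Roll out all noisy input sequences in parallel and collect their path costs.

    rollout_fn(state, noise) is assumed to step the physics simulator for one sample
    and return the accumulated path cost; it must release the GIL to run concurrently.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        costs = pool.map(lambda noise: rollout_fn(state, noise), noise_sequences)
    return np.asarray(list(costs))

# e.g., 64 noise sequences per iteration, matching the setting reported above.
```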
Beyond flat-ground walking, we also successfully reproduce the stepping-stone task (see Section 7.2.4 and Figure 8) on hardware. The heightmap corresponding to the physical scenario is generated and loaded into the simulator once before execution starts and is held static afterwards. For this scenario, too, our video shows that ANYmal reliably walks across the stones. Interestingly, the algorithm sometimes even plans a swing trajectory across two stones when the corresponding path cost is lower. The ease with which complicated scenarios can be modeled inside a physics engine, together with the fact that no algorithmic change is needed, underlines the versatility of our approach.
We also verify the robustness against unmodeled ground variations on hardware. To this end, a few stones are randomly placed in front of the robot, as depicted in Figure 13. Upon a forward-walking command from the operator, the robot successfully steps across the obstacles. Interestingly, when the robot steps on an obstacle and thereby deviates from the planned trajectory, audible vibrations indicate that momentarily fewer samples contribute to the solution, which makes the optimized controls noisier. The corresponding scenes can also be seen in the video.
8. Conclusion
In this article, we have developed a stochastic control algorithm that is suitable for controlling high-dimensional, non-linear systems. We have shown that the sampling mechanism removes the need for a differentiable dynamics model, thereby offering increased flexibility in modeling the system and the environment. In particular, we have demonstrated how a physics simulator can be used to represent the system’s interaction with complex geometries present in the environment.
Further, we have shown that standard Monte Carlo sampling for high-dimensional systems is intractable. Our results provide evidence that the two proposed countermeasures, namely constraints and importance sampling with an ancillary controller, are effective in restoring a workable sampling efficiency.
In addition to reducing the sampling complexity, our extension of PISOC to equality constraints gives the designer an intuitive handle to limit the stochastic exploration to safe and useful behaviors. In terms of future work, we imagine that an extension to path inequality constraints may be another interesting way of incorporating domain-specific knowledge that could further boost sampling efficiency. In that regard, Lagrange multiplier methods may provide interesting options to work around the restriction to quadratic path costs. Furthermore, it would be interesting to establish a relationship between the degree of simulation–reality mismatch and any performance or safety guarantees of the algorithm.
The mechanism of rolling out stochastic trajectories around an existing ancillary controller also has a secondary benefit: one may re-use existing control schemes and effectively view the sampling procedure as a safety net and refinement mechanism around proven, high-performing controllers. For example, one could assign the responsibility for perception and obstacle avoidance to the sampling procedure only, whereas the blind ancillary controller contributes a stable core behavior. When deploying such a setup in practice, care must be taken with regard to the feedback characteristics of the ancillary controller: we observed that using an overly aggressive feedback controller as the underlying sampling policy can have adverse effects on the overall performance because the feedback terms heavily suppress the exploration capabilities of the sampling procedure.
The showcase for learning an importance sampler puts our method in yet another perspective and highlights the versatility of the algorithm: one can imagine use cases where an importance sampler is first trained in an offline phase, either from scratch (RL) or from an existing controller (IL), and is then refined online according to the local sample performance. Our experiments with ANYmal have also shown that a learned policy can be conditioned on further observations, such as the gait. Although the scope of our work was limited to perception-less control, it is not difficult to devise a pipeline where the neural network policy receives additional information from a perception sensor. In that case, the learned sampler could, for example, recognize suitable footholds and steer the samples accordingly. Furthermore, in the online learning scenario, it would be interesting to investigate how the trade-off between reactiveness and the tendency to overfit locally can be balanced automatically.
Our sampling procedure is implemented in parallel on the CPU, making it fast enough for real-time control of a quadrupedal robot, as demonstrated in the previous section. The work of Williams et al. (2017) foreshadows the performance gains that we could realize by also transitioning to a GPU parallelization in the future. Our current limitation in this respect is the availability of physics simulators that can operate on the GPU. The recent appearance of GPU-accelerated simulation engines gives us reason to expect that a GPU version of our algorithm is realistic in the foreseeable future. Such an implementation would not only make the algorithm faster but also eliminate the network delays that arise from having to run the PISOC algorithm on a separate computer.
Lastly, in principle, it is possible to use a sampling-based method to sample from discrete decisions as well. This option is especially appealing in a robotic context, where contacts can be modeled as discrete decisions. For example, this idea may allow us to lift the constraint of a fixed contact schedule and start exploring over the different gaits of a walking robot. Initial experiments in this direction revealed a rapid combinatorial explosion that was not suitable for real-time control, but it remains an exciting research direction to pursue in the future.