# Constrained stochastic optimal control with learned importance sampling: A path integral approach

## Abstract

## 1. Introduction

*a priori*, the same sampling procedure can be used in an imitation learning (IL) or RL setting to bootstrap itself. Therefore, our algorithm can be viewed as an online sampling-based planner or an offline learning algorithm depending on the execution setting.

### 1.1 Statement of contributions

## 2. Related work

(PI^{2}) method by Theodorou et al. (2010). Their key idea is to reformulate the stochastic optimal control (SOC) problem such that the control action becomes a set of parameters of a feedback controller. The cardinality of the sampling space thereby reduces to the size of a single parameter vector, which is typically much smaller than that of the original control space (input dimension times number of time steps). Many works have followed up on the idea of optimizing parameters through PI ideas, for example, Buchli et al. (2011), Kalakrishnan et al. (2011), and Pastor et al. (2011).
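To illustrate the reduction in sampling-space cardinality: sampling in the original control space requires one noise value per input per time step, whereas PI^{2}-style sampling perturbs only a parameter vector. The dimensions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sampling in the original control space: one noise draw per input
# per time step (input dimension m, horizon N are illustrative).
m, N = 12, 200
control_space_draw = rng.standard_normal((N, m))   # 2400 numbers per rollout

# PI^2-style sampling: perturb only the parameters of a feedback
# controller u = K(theta) @ x, e.g. 20 basis-function weights.
theta = np.zeros(20)
parameter_space_draw = theta + 0.1 * rng.standard_normal(theta.shape)  # 20 numbers

print(control_space_draw.size, parameter_space_draw.size)  # 2400 vs 20
```

The same number of rollouts thus explores a 20-dimensional space instead of a 2400-dimensional one.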

## 3. Preliminaries: PI control

### 3.1. Problem formulation

### 3.2. Optimal cost-to-go

## 4. Method

### 4.1. Constrained PI control

*first* instance of the respective sampled path. Note that the computation of the OC is numerically robust against diverged samples because their high cost automatically makes their contribution to the expectation negligible. Alternatively, and with the same effect, failed samples may be discarded outright. The exact definition of a failed sample depends on the problem context, but it generally corresponds to reaching a state from which the system cannot recover.
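Concretely, with path costs $S_i$ and temperature $\gamma$, the exponential PI weights $w_i \propto \exp(-S_i/\gamma)$ of diverged samples underflow to zero, which is numerically equivalent to discarding them. A minimal sketch (the function name is ours):

```python
import numpy as np

def pi_weights(path_costs, gamma):
    """Normalized exponential PI weights w_i ∝ exp(-S_i / gamma).

    High-cost (diverged) samples receive vanishing weight, so they
    barely perturb the expectation even if they are kept.
    """
    S = np.asarray(path_costs, dtype=float)
    # Subtract the minimum cost for numerical stability; the constant
    # factor exp(S.min()/gamma) cancels in the normalization.
    z = np.exp(-(S - S.min()) / gamma)
    return z / z.sum()

# Three well-behaved samples and one diverged one with a huge path cost:
w = pi_weights([10.0, 11.0, 12.0, 1e6], gamma=1.0)
print(w)  # the diverged sample's weight underflows to ~0
```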

### 4.2. Automatic temperature tuning

### 4.3. Importance sampling

#### 4.3.1. Importance sampling with an ancillary controller

policy ${\mathbf{\pi}}_{c}$ is likely inefficient because the overwhelming majority of samples end up in high-cost regions of the state space that are irrelevant to the optimal solution. When only a few “lucky” samples dominate, the Monte Carlo estimator of the OCs has extremely high variance. The negative effect on robot control is then twofold. First, the optimal input trajectory computed in one iteration of the algorithm is very noisy because the independent noise terms $\mathit{Q}\mathrm{d}\mathit{w}$ at each time step are averaged over only a small number of samples. Second, the solution of the next iteration may differ significantly, causing a jump in the reference state and input passed to the robot.
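This sample degeneracy is commonly quantified by the effective sample size of the weights, the diagnostic reported as ${n}_{\mathrm{eff}}$ in Section 6. A sketch (function name is ours):

```python
import numpy as np

def effective_sample_size(weights):
    """n_eff = 1 / sum(w_i^2) for normalized weights, reported as a
    fraction of the number of samples N.

    When a few 'lucky' samples dominate, n_eff collapses toward 1/N,
    signalling a high-variance Monte Carlo estimate of the OCs.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2) / w.size

# Degenerate case: one sample carries almost all the weight.
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))  # ≈ 0.27
# Healthy case: weights spread evenly.
print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))  # = 1.0
```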

#### 4.3.2. Elite samples

### 4.4. Learning an importance sampling policy

*a priori*, we can use the same algorithm to train a parametrized policy with machine learning: based on the cross-entropy formulation of PI control, Kappen and Ruiz (2016) derive an update rule for the general class of sampling policies ${\mathbf{\pi}}_{\mathit{\theta}}(t,\mathit{x})$ parametrized by the vector $\mathit{\theta}$. An alternative derivation by Farshidian and Buchli (2013), based on a sum-of-squares error function, yields the same update equation. The overall idea is to update the sampler’s parameters at each iteration of the PISOC algorithm with information from the currently sampled trajectories. The update rule is similar to the behavior cloning setup of IL.
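For intuition: for a policy that is linear in its parameters, such an update reduces to a gradient step on a cost-weighted squared error over the sampled inputs, i.e., weighted behavior cloning toward low-cost samples. A toy sketch (a linear stand-in for the general ${\mathbf{\pi}}_{\mathit{\theta}}(t,\mathit{x})$; names, features, and step size are ours):

```python
import numpy as np

def sampler_update(theta, features, inputs, weights, lr=0.1):
    """One gradient step of cost-weighted regression for a linear
    sampling policy u = features @ theta.

    Minimizes sum_i w_i * ||u_i - Phi_i theta||^2 over the sampled
    trajectories -- behavior cloning weighted toward low-cost samples.
    """
    Phi = np.asarray(features)   # (N, d): one feature row per sample point
    U = np.asarray(inputs)       # (N,):   sampled control inputs
    w = np.asarray(weights)      # (N,):   normalized PI weights
    residual = U - Phi @ theta
    grad = -2.0 * Phi.T @ (w * residual)   # gradient of the weighted loss
    return theta - lr * grad

# Two samples: the low-cost one (weight 0.9) used input 2.0.
Phi = np.array([[1.0], [1.0]])
U = np.array([2.0, 0.0])
w = np.array([0.9, 0.1])
theta = np.zeros(1)
for _ in range(200):
    theta = sampler_update(theta, Phi, U, w)
print(theta)  # ≈ [1.8], the cost-weighted average of the sampled inputs
```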

## 5. Implementation

*Optimal Control for Switched Systems (OCS2)* toolbox.

The key steps of the algorithm at each control cycle are outlined in Algorithm 1.

**Algorithm 1** Single iteration of our PISOC algorithm

1: **Given:** current time ${t}_{0}$, current state ${\mathit{x}}_{0}$
2: **Given:** ancillary or parametrized policy $\mathbf{\pi}$
3: **Initialize:** ${\gamma}_{u}={\gamma}_{J}$, ${U}_{\mathrm{elite}}=\varnothing$
4: **Forward pass:**
5: clear sampleData
6: sample stochastic trajectories around $\mathbf{\pi}$
7: sample stochastic trajectories around ${\mathit{u}}_{\mathrm{elite}}$
8: save all samples to sampleData
9: **Backward pass:**
10: compute path cost $S$ for all samples
11: evaluate optimal control ${\mathit{u}}^{*}(\tau)=\mathbb{E}[\cdots]\ \forall \tau \in [{t}_{0},{t}_{f}]$
12: roll out ${\mathit{u}}^{*}(\cdot)$ to obtain ${\mathit{x}}^{*}(\cdot)$
13: save elite samples ${U}_{\mathrm{elite}}=\{{\mathit{u}}_{j}(\cdot)\}_{j}$
14: ${\gamma}_{u}\leftarrow$ temperature tuning
15: **if** $\mathbf{\pi}$ parametrized **then**
16: apply gradient update ${\mathit{\theta}}^{+}=\mathit{\theta}+\cdots$
17: **end if**
18: **return** optimal input and state sequence $\{{\mathit{u}}^{*}(\cdot),{\mathit{x}}^{*}(\cdot)\}$
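To make the control flow of the forward and backward passes concrete, here is a stripped-down, runnable rendition of one such iteration on a toy 1-D single integrator. Constraints, elite samples, and the automatic temperature tuning of Algorithm 1 are omitted, and all numerical choices are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def pisoc_iteration(x0, policy, n_samples=256, horizon=30, dt=0.05,
                    sigma=0.5, gamma=1.0):
    """One PISOC-style iteration on the toy system dx = u dt + sigma dw
    with quadratic state and input costs (illustrative only)."""
    # --- Forward pass: sample stochastic trajectories around the policy.
    U = np.empty((n_samples, horizon))
    S = np.zeros(n_samples)                      # path costs
    for j in range(n_samples):
        x = x0
        for k in range(horizon):
            noise = sigma * rng.standard_normal() / np.sqrt(dt)
            u = policy(x) + noise
            U[j, k] = u
            x += u * dt
            S[j] += (x ** 2 + 0.01 * u ** 2) * dt
    # --- Backward pass: weight samples exponentially by path cost ...
    w = np.exp(-(S - S.min()) / gamma)
    w /= w.sum()
    # ... and average the sampled inputs to obtain the optimal sequence.
    u_star = w @ U                               # shape (horizon,)
    return u_star

u_star = pisoc_iteration(x0=1.0, policy=lambda x: -x)
print(u_star[0])  # first optimal input pushes x toward the origin
```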

### 5.1. Forward–backward pass

### 5.2. Importance sampling

*in parallel* to the forward pass as yet another independent task. The employed sampling policy is, therefore, always the one computed during the previous iteration of our algorithm.
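This one-iteration delay can be captured by a simple double-buffering pattern (a minimal sketch of the idea; class and member names are ours, not the actual OCS2 task structure):

```python
import copy

class DelayedSampler:
    """Double-buffered sampling policy: the forward pass always draws
    from the parameters frozen at the end of the previous iteration,
    so the learner can update its own copy in parallel without racing
    the sampler.
    """

    def __init__(self, policy):
        self.active = copy.deepcopy(policy)   # read by the forward pass
        self.training = policy                # written by the learner

    def sample_policy(self):
        return self.active

    def finish_iteration(self):
        # Publish the freshly trained parameters for the next iteration.
        self.active = copy.deepcopy(self.training)

sampler = DelayedSampler({"theta": 0.0})
sampler.training["theta"] = 1.0          # learner updates its copy ...
print(sampler.sample_policy()["theta"])  # ... but sampling still uses 0.0
sampler.finish_iteration()
print(sampler.sample_policy()["theta"])  # next iteration samples with 1.0
```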

## 6. Illustrative example

| ${n}_{\mathrm{eff}}$ | ${\gamma}_{J}=0.01$ | ${\gamma}_{J}=0.1$ | ${\gamma}_{J}=1.0$ |
|---|---|---|---|
| ${\gamma}_{u}={\gamma}_{J}$ | 0.10% | 0.24% | 2.73% |
| Auto-tuned | 37.9% | 33.8% | 29.1% |
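One generic way to realize such auto-tuning (the paper's exact rule is not reproduced here and may differ) is to solve for the temperature ${\gamma}_{u}$ at which the weights' effective sample size reaches a target fraction of the sample count, exploiting that ${n}_{\mathrm{eff}}$ grows monotonically with the temperature:

```python
import numpy as np

def n_eff_frac(path_costs, gamma):
    """Effective sample size of the exponential PI weights,
    as a fraction of the number of samples N."""
    S = np.asarray(path_costs, dtype=float)
    w = np.exp(-(S - S.min()) / gamma)
    w /= w.sum()
    return 1.0 / np.sum(w ** 2) / w.size

def tune_temperature(path_costs, target_frac=0.3, iters=60):
    """Log-space bisection on gamma_u until n_eff hits the target
    fraction (a generic realization of automatic temperature tuning)."""
    lo, hi = 1e-6, 1e6   # n_eff grows monotonically with gamma
    for _ in range(iters):
        mid = float(np.sqrt(lo * hi))   # bisect in log space
        if n_eff_frac(path_costs, mid) < target_frac:
            lo = mid
        else:
            hi = mid
    return hi

# Synthetic path costs; the tuned temperature yields n_eff ≈ 30% of N.
S = np.random.default_rng(2).uniform(0.0, 10.0, size=100)
gamma_u = tune_temperature(S)
print(n_eff_frac(S, gamma_u))
```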

## 7. Results

### 7.1. Ballbot: a ball-balancing robot

#### 7.1.1. PISOC performance

#### 7.1.2. Online learning

### 7.2. ANYmal: a quadrupedal robot

#### 7.2.1. Importance sampling is critical

for a demonstration.

#### 7.2.2. Learning importance sampling

#### 7.2.3. Constrained problems

#### 7.2.4. Robustness and exploration over local minima

| Gap width (cm) | 15 | 20 | 25 | 30 | 35 |
|---|---|---|---|---|---|
| Relative noise level | 30% | 55% | 100% | 115% | — |

#### 7.2.5. Hardware deployment

## 8. Conclusion

## Acknowledgments

## Funding

## ORCID iD

## Footnotes

## Appendix A. Constrained path integral control derivation

### A.1. Optimal cost-to-go

### A.2. Minimization under constraints

### A.3. Transformation and diffusion process

## References

*IEEE Robotics and Automation Letters* 5(2): 2864–2871.

*Journal of Machine Learning Research* 13: 3207–3245.

*IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*. IEEE, pp. 3359–3365.

*Dynamic Programming and Optimal Control*. 3rd edn. Athena Scientific.

*The International Journal of Robotics Research* 30(7): 820–833.

*Springer Handbook of Robotics*. New York: Springer, pp. 163–194.

*Analysis of the Behavior of a Class of Genetic Adaptive Systems*. PhD Thesis, University of Michigan, USA.

*Sequential Monte Carlo Methods in Practice* (Statistics for Engineering and Information Science). New York: Springer, pp. 3–14.

*The 1st Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM)*, pp. 4–8.

*IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*. IEEE, pp. 1441–1446.

*IEEE International Conference on Robotics and Automation (ICRA)*, pp. 93–100.

*IEEE Control Systems Magazine* 39(1): 26–55.

*Machine Learning and Knowledge Discovery in Databases*. Berlin: Springer, pp. 482–497.

*IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*. IEEE, pp. 4730–4737.

*2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids)*, pp. 339–346.

*IEEE Robotics and Automation Letters* 3(2): 895–902.

*Journal of Statistical Mechanics: Theory and Experiment* 2005(11): P11011.

*Machine Learning* 87(2): 159–182.

*Journal of Statistical Physics* 162(5): 1244–1266.

*Reinforcement Learning for Robots Using Neural Networks*. PhD Thesis, Carnegie Mellon University, USA.

*IEEE Robotics and Automation Letters* 4(4): 3687–3694.

*Journal of Dynamic Systems, Measurement, and Control* 137(5): 051016.

*Stochastic Differential Equations*. Berlin: Springer, pp. 65–84.

*Springer Handbook of Robotics*. New York: Springer, pp. 357–398.

*Adaptive Computation and Machine Learning*. Cambridge, MA: MIT Press.

*Journal of Machine Learning Research* 11: 3137–3181.

*2010 IEEE International Conference on Robotics and Automation*, pp. 2397–2403.

*Entropy* 17(5): 3352–3375.

*Physical Review E* 91: 032104.

*Proceedings of the National Academy of Sciences USA* 106(28): 11478–11483.

*Journal of Guidance, Control, and Dynamics* 40(2): 344–357.

*IEEE Transactions on Robotics* 34(6): 1603–1622.

## Supplementary Material

### Supplemental Video 1

#### Published In

This article was published in *The International Journal of Robotics Research*.

**Article first published online**: October 12, 2021

**Issue published**: February 2022