Non-Linearity Measure for POMDP-based Motion Planning

Motion planning under uncertainty is essential for reliable robot operation. Despite substantial advances over the past decade, the problem remains difficult for systems with complex dynamics. Most state-of-the-art methods perform search that relies on a large number of forward simulations. For systems with complex dynamics, this generally require costly numerical integrations which significantly slows down the planning process. Linearization-based methods have been proposed that can alleviate the above problem. However, it is not clear how linearization affects the quality of the generated motion strategy, and when such simplifications are admissible. We propose a non-linearity measure, called Statistical-distance-based Non-linearity Measure (SNM), that can identify where linearization is beneficial and where it should be avoided. We show that when the problem is framed as the Partially Observable Markov Decision Process, the value difference between the optimal strategy for the original model and the linearized model can be upper bounded by a function linear in SNM. Comparisons with an existing measure on various scenarios indicate that SNM is more suitable in estimating the effectiveness of linearization-based solvers. To test the applicability of SNM in motion planning, we propose a simple on-line planner that uses SNM as a heuristic to switch between a general and a linearization-based solver. Results on a car-like robot with second order dynamics and 4-DOFs and 7-DOFs torque-controlled manipulators indicate that SNM can appropriately decide if and when a linearization-based solver should be used.


I. INTRODUCTION
An autonomous robot must be able to compute reliable motion strategies, despite various errors in actuation and prediction on its effect on the robot and its environment, and despite various errors in sensors and sensing.Computing such robust strategies is computationally hard even for a 3 DOFs point robot [1], [2].Conceptually, this problem can be solved in a systematic and principled manner when framed as the Partially Observable Markov Decision Process (POMDP) [3].A POMDP represents the aforementioned errors as probability distribution functions and estimates the state of the system as probability distribution functions called beliefs.It then computes the best motion strategy with respect to beliefs rather than single states, thereby accounting the fact that the actual state is never known due to the above errors.Although the concept of POMDPs was proposed in the '60s [4], only recently that POMDPs started to become practical for robotics problems (e.g.[5], [6], [7]).This advancement is achieved by trading optimality with approximate optimality for speed and memory.But even then, in general, computing close to optimal POMDP solutions for systems with complex dynamics remains difficult.
Several general POMDP solvers -solvers that do not restrict the type of dynamics and sensing model of the system, nor the type of distributions used to represent uncertainty-can now compute good motion strategies on-line with a 1-10Hz update rate for a number of robotic problems [8], [9], [10], [11].However, their speed degrades when the robot has complex non-linear dynamics.To compute a good strategy, today's POMDP solvers forward simulate the effect of many sequences of actions from different beliefs are simulated.For problems whose dynamics have no closed-form solutions, a simulation run generally invokes many numerical integrations, and complex dynamics tend to increase the cost of each numerical integration, which in turn significantly increases the total planning cost of these methods.Of course, this cost will increase even more for problems that require more or longer simulation runs, such as in problems with long planning horizons.
Many linearized-based POMDP solvers have been proposed [12], [13], [14], [15], [16].They rely on many forward simulations from different beliefs too, but use a linearized model of the dynamics and sensing for simulation.Together with linearization, many of these methods assume that beliefs are Gaussian distributions.This assumption improves the speed of simulation further, because the subsequent belief after an action is performed and an observation is perceived can be computed in closed-form.In contrast, the aforementioned general solvers typically represent beliefs as sets of particles and estimate subsequent beliefs using particle filters.Particle filters are particularly expensive when particle trajectories have to be simulated and each simulation run is costly, as is the case for motion-planning of systems with complex dynamics.As a result, the linearization-based planners require less time to estimate the effect of performing a sequence of actions from a belief, and therefore can potentially find a good strategy faster than the general method.However, it is known that linearization in control and estimation performs well only when the system's non-linearity is "weak" [17].The question is, what constitute "weak" non-linearity in motion planning under uncertainty?Where will it be useful and where will it be damaging to use linearization (and Gaussian) simplifications?
This paper extends our previous work [18] towards answering the aforementioned questions.Specifically, we propose a measure of non-linearity for stochastic systems, called Statistical-distance-based Non-linearity Measure (SNM), to help identify the suitability of linearization in a given problem of motion planning under uncertainty.SNM is based on the total variation distance between the original dynamics and sensing models, and their corresponding linearized models.It is general enough to be applied to any type of motion and sensing errors, and any linearization technique, regardless of the type of approximation of the true beliefs (e.g., with and without Gaussian simplification).We show that the difference between the value of the optimal strategy generated if we plan using the original model and if we plan using the linearized model, can be upper bounded by a function linear in SNM.Furthermore, our experimental results indicate that compared to recent state-of-the-art methods of non-linearity measures for stochastic systems, SNM is more sensitive to the effect that obstacles have on the effectiveness of linearization, which is critical for motion planning.
To further test the applicability of SNM in motion planning, we develop a simple on-line planner that uses a local estimate of SNM to automatically switch between a general planner [8] that uses the original POMDP model and a linearization-based planner (adapted from [12]) that uses the linearized model.Experimental results on a car-like robot with acceleration control, and a 4-DOFs and 6-DOFs manipulators with torque control indicate that this simple planner can appropriately decide if and when linearization should be used and therefore computes better strategies faster than each of the component planner.

II. BACKGROUND AND RELATED WORK A. Background
In this paper, we consider motion planning problems, in which a robot must move from a given initial state to a state in the goal region while avoiding obstacles.The robot operates inside deterministic, bounded, and perfectly known 2D or 3D environments populated by static obstacles.
The robot's transition and observation models are uncertain and defined as follows.Let S ⊂ R n be the bounded n-dimensional state space, A ⊂ R d the bounded d-dimensional control space and O ⊂ R l the bounded l-dimensional observation space of the robot.The state of the robot evolves according to a discrete-time non-linear function, which we model in the general form s t+1 = f (s t , a t , v t ) where s t ∈ S is the state of the robot at time t, a t ∈ A is the control input at time t, and v t ∈ R d is a random transition error.At each time step t, the robot perceives imperfect information regarding its current state according to a non-linear stochastic function of the form o t = h(s t , w t ), where o t ∈ O is the observation at time t and w t ∈ R d is a random observation error.
This class of motion planning problems under uncertainty can naturally be formulated as a Partially Observable Markov Decision Process (POMDP).Formally, a POMDP is a tuple S, A, O, T, Z, R, b 0 , γ , where S, A and O are the state, action, and observation spaces of the robot.T is a conditional probability function T (s, a, s ) = p(s | s, a) (where s, s ∈ S and a ∈ A) that models the uncertainty in the effect of performing actions, while Z(s, a, o) = p(o|s, a) (where o ∈ O) is a conditional probability function that models the uncertainty in perceiving observations.R(s, a) is a reward function, which encodes the planning objective.b 0 is the initial belief, capturing the uncertainty in the robot's initial state and γ ∈ (0, 1) is a discount factor.
At each time-step, a POMDP agent is at a state s ∈ S, takes an action a ∈ A, perceives an observation o ∈ O, receives a reward based on the reward function R(s, a), and moves to the next state.Now due to uncertainty in the results of action and sensing, the agent never knows its exact state and therefore, estimates its state as a probability distribution, called belief.The solution to the POMDP problem is an optimal policy (denoted as π * ), which is a mapping π * : B → A from beliefs (B denotes the set of all beliefs, which is called the belief space) to actions that maximizes the expected total reward the robot receives, i.e.
where τ (b, a, o) computes the updated belief estimate after the robot performs action a ∈ A and perceived o ∈ O from belief b, and is defined as: For the motion planning problems considered in this work, we define the spaces S, A, and O to be the same as those of the robotic system (for simplicity, we use the same notation).The transition T represents the dynamics model f , while Z represents the sensing model h.The reward function represents the task' objective, for example, high reward for goal states and low negative reward for states that cause the robot to collide with the obstacles.The initial belief b 0 represents uncertainty on the starting state of the robot.

B. Related Work on Non-Linearity Measures
Linearization is a common practice in solving non-linear control and estimation problems.It is known that linearization performs well only when the system's non-linearity is "weak" [17].To identify the effectiveness of linearization in solving non-linear problems, a number of non-linearity measure have been proposed in the control and information fusion community.
Many of these measures (e.g.[19], [20], [21]) have been designed for deterministic systems.For instance, [19] proposed a measure derived from the curvature of the non-linear function.The work in [20], [21] computes a measure based on the distance between the non-linear function and its nearest linearization.A brief survey of non-linearity measures for deterministic systems is available in [17].
Non-linearity measures for stochastic systems has been proposed.For instance, [17] extends the measures in [20], [21] to be based on the average distance between the non-linear function that models the motion and sensing of the system, and the set of all possible linearizations of the function.
Another example is [22] that proposes a measures which is based on the distance between distribution over states and its Gaussian approximation, called Measure of Non-Gaussianity (MoNG), rather than based on the non-linear function itself.Assuming a passive stochastic systems, this measures computes the negentropy between a transformed belief and its Gaussian approximation.The results indicate that this measure is more suitable to measure the non-linearity of stochastic systems, as it takes into account the effect that non-linear transformations have on the shape of the transformed beliefs.This advancement is encouraging and we will use MoNG as a comparator of SNM.However, for this purpose, MoNG must be modified since we consider non-passive problems in work.The exact modifications we made can be found in Section V-B.
Despite the various non-linearity measures that have been proposed, most are not designed to take the effect of obstacles to the non-linearity of the system into account.Except for MoNG, all of the aforementioned non-linearity measures will have difficulties in reflecting these effects, even when they are embedded in the motion and sensing models.For instance, curvature-based measures requires the non-linear function to be twice continuously differentiable, but the presence of obstacles is very likely to break the differentiability of the motion model.Furthermore, the effect of obstacles is likely to violate the additive Gaussian error, required for instance by [17].Although MoNG can potentially take the effect of obstacles into account, it is not designed to.In the presence of obstacles, beliefs have support only in the valid region of the state space, and therefore computing the difference between beliefs and their Gaussian approximations is likely to underestimate the effect of obstacles.
SNM is designed to address these issues.Instead of building upon existing non-linearity measures, SNM adopts approaches commonly used for sensitivity analysis [23], [24] of Markov Decision Processes (MDP) -a special class of POMDP where the observation model is perfect, and therefore the system is fully observable.These approaches use statistical distance measures between the original transition dynamics and their perturbed versions.Linearized dynamics can be viewed as a special case of perturbed dynamics, and hence this statistical distance measure can be applied as a non-linearity measure, too.We do need to extend these analysis, as they are generally defined for discrete state space and are defined with respect to only the transition models (MDP assumes the state of the system is fully observable).Nevertheless, such extensions are feasible and the generality of this measure could help identifying the effectiveness of linearization in motion planning under uncertainty problems.

III. SNM
Intuitively, our proposed measure SNM is based on the total variation distance between the effect of performing an action and perceiving an observation under the true dynamics and sensing model, and the effect under the linearized dynamic and sensing model.The total variation distance D T V between two probability measures µ and ν over a measurable space Ω is defined as D T V (µ, ν) = sup E∈Ω |µ(E) − ν(E)|.An alternative expression of D T V which we use throughout the paper is the functional form Formally, SNM is defined as: Definition 1.Let P = S, A, O, T, Z, R, b 0 , γ be the POMDP model of the system and P = S, A, O, T , Z, R, b 0 , γ be a linearization of P , where T is a linearization of the transition function T and Z is a linearization of the observation function Z of P , while all other components of P and P are the same.Then, the SNM (denoted as Ψ) between P and P is Ψ(P, P ) = Ψ T (P, P ) + Ψ Z (P, P ), where Note that SNM can be applied as both a global and a local measure.In the latter case, the supremum over the state s can be restricted to a subset of S, rather than the entire state space.Furthermore, SNM is general enough for any approximation to the true dynamics and sensing model, which means that it can be applied to any type of linearization and belief approximation techniques, including those that assume and those that do not assume Gaussian belief simplifications.
We want to use the measure Ψ(P, P ) to bound the difference between the expected total reward received if the system were to run the optimal policy of the true model P and if it were to run the optimal policy of the linearized model P .Note that since our interest is in the actual reward received, the values of these policies are evaluated with respect to the original model P (we assume P is a faithful model of the system).More precisely, we want to show that: Theorem 2. If π * denotes the optimal policy for P and π * denotes the optimal policy for P , then for any b ∈ B, ))do for any policy π and τ (b, a, o) is the belief transition function as defined in eq.( 2) To proof Theorem 2, we first assume, without loss of generality, that a policy π for a belief b is represented by a conditional plan σ ∈ Γ, where Γ is the set of all conditional plans.σ can be specified by a pair a, ν , where a ∈ A is the action of σ and ν : O → Γ is an observation strategy which maps an observation to a conditional plan σ ∈ Γ.
Every σ corresponds to an α-function α σ : S → R which specifies the expected total discounted reward the robot receives when executing σ starting from s ∈ S, i.e. α σ (s) = R(s, a) where a ∈ A is the action of σ and α ν(o) is the α-function corresponding to conditional plan ν(o).
For a given belief b, the value of the policy π represented by the conditional plan σ is then V π (b) = s∈S b(s)α σ (s)ds.Note that eq.( 5) is defined with respect to POMDP P .Analogously we define the linearized α-function α σ with respect to the linearized POMDP P by replacing the transition and observation functions in eq.( 5) with their linearized versions.Now, suppose that for a given belief b, σ * = arg sup σ∈Γ s∈S b(s)α σ (s)ds and σ * = arg sup σ∈Γ s∈S b(s) α σ (s)ds.σ * and σ * represent the policies π * and π * that are optimal at b for POMDP P and P respectively.For any s ∈ S we have that Since σ * is the optimal conditional plan for POMDP P at b, we also know that From eq.( 6), eq.( 7) and eq.( 8) it immediately follows that Before we continue, we first have to show the following Lemma: where R min = min s,a R(s, a) and R max = max s,a R(s, a).For any conditional plan σ ∈ Γ and any s ∈ S, the absolute difference between the original and linearized α-functions is upper bounded by The proof of Lemma 3 is presented in the Appendix A. Using the result of Lemma 3, we can now conclude the proof for Theorem 2. Substituting the upper bound derived in Lemma 3 into the right-hand side of eq.( 9) and re-arranging the terms gives us which is what we are looking for.

IV. APPROXIMATING SNM
Now, the question is how can we compute SNM sufficiently fast, so that this measure can be used as a heuristic during on-line planning to decide when a linearization-based solver will likely yield a good policy and when a general solver should be used.Unfortunately, such a computation is often infeasible when the planning time per step is limited.Therefore, we approximate SNM off-line and re-use the results during run-time.Here we discuss how to approximate the transition component Ψ T of SNM, however, the same method applies to the observation component Ψ Z .
Let us first rewrite the transition component of Ψ T as where Ψ T (s) is the transition component of SNM, given a particular state.To approximate Ψ T , we replace S in eq.( 11) by a sampled representation of S, which we denote as S. The value Ψ T (s) is then evaluated for each s ∈ S off-line, and the results are saved in a lookup-table.This lookup-table can then be used during run-time to get a local approximation of Ψ T around the current belief.The first question that arises is, how do we efficiently sample the state space?A naive approach would be to employ a simple uniform sampling strategy.However, for large state spaces this is often wasteful, because for motion planning problems, large portions of the state space are often irrelevant since they either can't be reached from the initial belief or are unlikely to be traversed by the robot during run-time.A better strategy is to consider only the subset of the state space that is reachable from the support set of the initial belief under any policy, denoted as S b0 .To sample from S b0 , we use a simple but effective method: Assuming deterministic dynamics, we solve the motion planning problem off-line using kinodynamic RRTs and use the nodes in the RRT-trees as a sampled representation of S b0 .In principle any deterministic sampling-based motion planner can be used to generate samples from S b0 , however, in our case RRT is a particularly suitable due to its space-filling property [25].Note that RRT generates states according to a deterministic transition function only.If required, one could also generate additional samples according to the actual stochastic transition function of the robot.However, in our experiments the state samples generated by RRT were sufficient.
The second difficulty in approximating Ψ T (s) is the computation of the supremum over the action space.Similar to restricting the approximation to a discrete set of states reachable from the initial belief, we can impose a discretization on the action space which leaves us with a maximization over discrete actions, denoted as Ã.Using the set Ã, we approximate eq.( 11) for each state in Sb0 -the sampled set of S b0 -as follows: Given a particular state s ∈ Sb0 and action a ∈ Ã, we draw n samples from the original and linearized transition function and construct a multidimensional histogram from both sample sets.In other words, we discretize the distributions that follow from the original and linearized transition function, given a particular state and action.Suppose the histogram consists of k bins.The value Ψ T (s, a) is then approximated as where and n i is the number of states inside bin i sampled from the original transition function, while p i = ni k j=1 nj and n i is the number of states inside bin i sampled from the linearized transition function.The right-hand side of eq.( 12) is simply the definition of the total variation distance between two discrete distributions.
By repeating the above process for each action in Ã and taking the maximum, we end up with an approximation of Ψ T (s).This procedure is repeated for every state in the set Sb0 .As a result we get a lookup-table, assigning each state in Sb0 an approximated value of Ψ T (s).
During planning, we can use the lookup-table and a sampled representation of a belief b to approximate SNM at b. Suppose b is the sampled representation of b (e.g., a particle set), then for each state s ∈ b, we take the state s near ∈ Sb0 that is nearest to s, and assign Ψ T (s) = Ψ T (s near ).The maximum SNM value max s∈ b Ψ T (s) gives us an approximation of the transition component of SNM with respect to the belief b.
Clearly this approximation method assumes that states that are close together should yield similar values for SNM.At first glance this is a very strong assumption.In the vicinity of obstacles or constraints, states that are close together could potentially yield very different SNM values.However, we will now show that under mild assumptions, pairs of states that are elements within certain subsets of the state space indeed yield similar SNM values.
Consider a partitioning of the state space into a finite number of local-Lipschitz subsets S i that are defined as follows: Definition 4. Let S be a metric space with distance metric D S .S i is called a local-Lipschitz subset of S if for any s 1 , s 2 ∈ S i , any s ∈ S and any a ∈ Furthermore let S i be a local-Lipschitz subset of S, then The proof for this Lemma is presented in Appendix B. This Lemma indicates that the difference between the SNM values for two states from the same local-Lipschitz subset S i depends only on the distance D S between them, since C Ti and C Ti are constant for each subset S i .Thus, as the distance between two states converges towards zero, the SNM value difference converges towards zero as well.This implies that we can approximate SNM for a sparse, sampled representation of S b0 and re-use these approximations on-line with a small error, without requiring an explicit representation of the S i subsets.

V. SNM-PLANNER: AN APPLICATION OF SNM FOR PLANNING
SNM-Planner is an on-line planner that uses SNM as a heuristic to decide whether a general, or a linearization-based POMDP solver should be used to compute the policy from the current belief.The general solver used is Adaptive Belief Tree (ABT) [8], while the linearization-based method called Modified High Frequency Replanning (MHFR), which is an adaptation of HFR [12].HFR is designed for chance-constraint POMDPs, i.e., it explicitly minimizes the collision probability, while MHFR is a POMDP solver where the objective is to maximize the expected total reward.An overview of SNM-Planner is shown in Algorithm 1.During run-time, at each planning step, SNM-Planner computes a local approximation of SNM around the current belief b i (line 5).If this value is smaller than a given threshold, SNM-Planner uses MHFR to compute a policy from the current belief, whereas ABT is used when the value exceeds the threshold (line 8-12).The robot then executes an action according the computed policy (line 13) and receives and observation (line 14).Based on the executed action and perceived observation, we update the belief (line 15).SNM-Planner represents beliefs as sets of particles and updates the belief using a SIR particle filter [26].Note that MHFR assumes that beliefs are multivariate Gaussian distributions.Therefore, in case MHFR is used for the policy computation, we compute the first two moments (mean and covariance) of the particle set to obtain a multivariate Gaussian approximation of the current belief.The process then repeats from the updated belief until the robot has entered a terminal state (we assume that we know when the robot enters a terminal state) or until a maximum number of steps is reached.
In the following two subsections we provide a brief an overview of the two component planners ABT and MHFR.
ABT is a general and anytime on-line POMDP solver based on Monte-Carlo-Tree-Search (MCTS).ABT updates (rather than recomputes) its policy at each planning step.To update the policy for the current belief, ABT iteratively constructs and maintains a belief tree, a tree whose nodes are beliefs and whose edges are pairs of actions and observations.ABT evaluates sequences of actions by sampling episodes, that is, sequences of state-action-observation-reward tuples, starting from the current belief.Details of ABT can be found in [8].

B. Modified High-Frequency Replanning (MHFR)
The main difference between HFR and MHFR is that HFR is designed for chance constraint POMDP, i.e., it explicitly minimizes the collision probability, while MHFR is a POMDP solver, whose objective is to maximize the expected total reward.Similar to HFR, MHFR approximates the current belief by a multivariate Gaussian distribution.To compute the policy from the current belief, MHFR samples a set of trajectories from the mean of the current belief to a goal state using multiple instances of RRTs [25] in parallel.It then computes the expected total discounted reward of each trajectory by tracking the beliefs around the trajectory using a Kalman Filter, assuming maximum-likelihood observations.The policy then becomes the first action of the trajectory with the highest expected total discounted reward.After executing the action and perceiving an observation, MHFR updates the belief using an Extended Kalman Filter.The process then repeats from the updated belief.To increase efficiency, MHFR additionally adjusts the previous trajectory with the highest expected total discounted reward to start from the mean of the updated belief and adds this trajectory to the set of sampled trajectories.More details on HFR and precise derivations of the method are available in [12].

VI. EXPERIMENTS AND RESULTS
The purpose of our experiments is two-fold: To test the applicability of SNM to motion planning under uncertainty problems and to test SNM-Planner.For our first objective, we compare SNM with a modified version of the Measure of Non-Gaussianity (MoNG) [22].Details on this measure are in Section VI-A.We evaluate both measures using two robotic systems, a car-like robot with 2 nd -order dynamics and a torque-controlled 4DOFs manipulator, where both robots are subject to increasing uncertainties and increasing numbers of obstacles in the operating environment.Furthermore we test both measures when the robots are subject to highly non-linear collision dynamics and different observation models.Details on the robot models are presented in Section VI-B, whereas the evaluation experiments are presented in Section VI-C.
To test SNM-Planner we compare it with ABT and MHFR on three problem scenarios, including a torque-controlled 7DOFs manipulator operating inside a 3D office environment.Additionally we test how sensitive SNM-Planner is to the choice of the SNM-threshold.The results for these experiments are presented in Section VI-D.
All problem environments are modelled within the OPPT framework [27].The solvers are implemented in C++.For the parallel construction of the RRTs in MHFR, we utilize 8 CPU cores throughout the experiments.All parameters are set based on preliminary runs over the possible parameter space, the parameters that generate the best results are then chosen to generate the experimental results.

A. Measure of Non-Gaussianity
The Measure of Non-Gaussianity (MoNG) proposed in [22] is based on the negentropy between the PDF of a random variable and its Gaussian approximation.Consider an n-dimensional random variable X distributed according to PDF p(x).Furthermore, let X be a Gaussian approximation of X with PDF p(x), such that X ∼ N (µ, Σ x ), where µ and Σ x are the first two moments of p(x).The negentropy between p and p (denoted as J(p, p)) is then defined as where are the differential entropies of p and p respectively.A (multivariate) Gaussian distribution has the largest differential entropy amongst all distributions with equal first two moments, therefore J(p, p) is always non-negative.In practice, since the PDF p(x) is not known exactly in all but the simplest cases, H(p) has to be approximated.
In [22] this measure has originally been used to assess the non-linearity of passive systems.Therefore, in order to achieve comparability with SNM, we need to extend the Non-Gaussian measure to general active stochastic systems of the form s t+1 = f (s t , a t , v t ).We do this by evaluating the non-Gaussianity of distribution that follow from the transition function T (s, a, s ) given state s and action a.
In particular for a given s and a, we can find a Gaussian approximation of T (s, a, s ) (denoted by T G (s, a, s )) by calculating the first two moments of the distribution that follows from T (s, a, s ).
Using this Gaussian approximation, we define the Measure of Non-Gaussianity as Similarly we can compute the Measure of Non-Gaussianity for the observation function: where Z G is a Gaussian approximation of Z.
In order to approximate the entropies H(T (s, a, s ))) and H(Z(s, a, o)), we are using a similar histogram-based approach as discussed in Section IV.The entropy terms for the Gaussian approximations can be computed in closed form, according to the first equation in eq.( 14) [28].

B. Robot Models 1) 4DOFs-Manipulator.
The 4DOFs-manipulator consists of 4 links connected by 4 torque-controlled revolute joints.The first joint is connected to a static base.In all problem scenarios the manipulator must move from a known initial state to a state where the end-effector lies inside a goal region located in the workspace of the robot, while avoiding collisions with obstacles the environment is populated with.
The state of the manipulator is defined as s = (θ, θ) ∈ R 8 , where θ is the vector of joint angles, and θ the vector of joint velocities.Both joint angles and joint velocities are subject to linear constraints: The joint angles are constrained by (−3.14, 3.14)rad, whereas the joint velocities are constrained by (6, 2, 2, 2)rad/s in each direction.Each link of the robot has a mass of 1kg.
The control inputs of the manipulator are the joint torques, where the maximum joint torques are (20, 20, 10, 5)N m/s in each direction.Since ABT assumes a discrete action space, we discretize the joint torques for each joint using the maximum torque in each direction, which leads to 16 actions.
The dynamics of the manipulator is defined using the well-known Newton-Euler formalism [29].For both manipulators we assume that the input torque for each joint is affected by zero-mean additive Gaussian noise.Note however, even though the error is Gaussian, due to the non-linearities of the motion dynamics the beliefs will not be Gaussian in general.Since the transition dynamics for this robot are quite complex, we assume that the joint torques are applied for 0.1s and we use the ODE physics engine [30] for the numerical integration of the dynamics, where the discretization (i.e.δt) of the integrator is set to δt = 0.004s.
The robot is equipped with two sensors: The first sensor measures the position of the end-effector inside the robot's workspace, whereas the second sensor measures the joint velocities.Consider a function g : R 8 → R 3 that maps the state of the robot to an end-effector position inside the workspace, then the observation model is defined as where w t is an error term drawn from a zero-mean multivariate Gaussian distribution with covariance matrix Σ w .The initial state of the robot is a state where the joint angles and velocities are zero.
When the robot performs an action where it collides with an obstacle it enters a terminal state and receives a penalty of -500.When it reaches the goal area it also enters a terminal state, but receives a reward of 1,000.To encourage the robot to reach the goal area quickly, it receives a small penalty of -1 for every other action.
2) 7DOFs Kuka iiwa manipulator.The 7DOFs Kuka iiwa manipulator is very similar to the 4DOFs-manipulator.However, the robot consists of 7 links connected via 7 revolute joints.We set the POMDP model to be similar to that of the 4DOFs-manipulator, but expand it to handle 7DOFs.For this robot, the joint velocities are constrained by (3.92, 2.91, 2.53, 2.23, 2.23, 2.23, 1.0)rad/s in each direction and the link masses are (4, 4, 3, 2.7, 1.7, 1.8, 0.3)kg.Additionally, the torque limits of the joints are (25, 20, 10, 10, 5, 5, 0.5)N m/s in each direction.For ABT we use the same discretization of the joint torques as in the 4DOFs-manipulator case, i.e.we use the maximum torque per joint in each direction, resulting in 128 actions.Similarly to the 4DOFs-manipulator, we assume that the input torques are applied for 0.1s and we use the ODE physics engine with an integration step size of 0.004s to simulate the transition dynamics.The observation and reward models are the same as for the 4DOFs-manipulator.The initial joint velocities are all zero and almost all joint angles are zero too, except for the second joint, for which the initial joint angle is −1.5rad.Figure 1(c) shows the Kuka manipulator operating inside an office scenario.
3) Car-like robot.A nonholonomic car-like robot of size (0.12 × 0.07 × 0.01) drives on a flat xy-plane inside a 3D environment populated by obstacles The robot must drive from a known start state to a position inside a goal region without colliding with any of the obstacles.The state of the robot at time t is defined as a 4D vector s t = (x t , y t , θ t , υ t ) ∈ R 4 , where x t , y t ∈ [−1, 1] is the position of the center of the robot on the xy-plane, θ t ∈ [−3.14, 3.14]rad the orientation and υ t ∈ [−0.2, 0.2] is the linear velocity of the robot.The initial state of the robot is (0.7, 0.7, 1.57rad, 0) while the goal region is centered at (0.7, 0.7) with radius 0.1.The control input at time t, a t = (α t , φ t ) is a 2D real vector consisting of the acceleration α ∈ [−1, 1] and the steering wheel angle φ t ∈ [−1rad, 1rad].The robots dynamics is subject to control noise v t = (α t , φt ) ∼ N (0, Σ v ).The robots transition model is x t + ∆tυ t cos θ t y t + ∆tυ t sin θ t θ t + ∆t tan(φ t + φt )/0.11 where ∆t = 0.3s is the duration of a timestep and the value 0.11 is the distance between the front and rear axles of the wheels.This robot is equipped with two types of sensors, a localization sensor that receives a signal from two beacons that are located at (x 1 , ŷ1 ) and (x 2 , ŷ2 ).The second sensor is a velocity sensor mounted on the car.With these two sensors the observation model is defined as where w t is an error vector drawn from a zero-mean multivariate Gaussian distribution with covariance matrix Σ w .Similar to the manipulators described above, the robot receives a penalty of -500 when it collides with an obstacle, a reward of 1,000 when reaching the goal area and a small penalty of -1 for any other action.

C. Testing SNM
In this set of experiments we want to understand the performance of SNM compared to MoNG in various scenarios.In particular, we are interested in the effect of increasing uncertainties and the effect that obstacles have on the effectiveness of SNM, and if these results are consistent with the performance of a general solver relative to a linearization-based solver.Additionally, we want to see how highly-nonlinear collision dynamics and different observation models -one with additive Gaussian noise and non-additive Gaussian noise -affect our measure.For the experiments with increasing motion and sensing errors, recall from Section VI-B that the control errors are drawn from zero-mean multivariate Gaussian distributions with covariance matrices Σ v .We define the control errors (denoted as e T ) to be the standard deviation of these Gaussian distributions, such that Σ v = e 2 T × 1. Similarly for the covariance matrices of the zero-mean multivariate Gaussian sensing errors, we define the observation error as e Z , such that Σ w = e 2 Z × 1.Note that during all the experiments, we use normalized spaces, which means that the error vectors affect the normalized action and observation vectors.For SNM and MoNG we first generated 100,000 state samples for each scenario, and computed a lookup table for each error value off-line, as discussed in Section IV.Then, during run-time we calculated the average approximated SNM and MonG values.
1) Effects of increasing uncertainties in cluttered environments.
To investigate the effect of increasing control and observation errors to SNM, MoNG and the two solvers ABT and MHFR in cluttered environments, we ran a set of experiments where the 4DOFs-manipulator and the car-like robot operate in empty environments and environments with obstacles, with increasing values of e T and e Z , ranging between 0.001 and 0.075.The environments with obstacles are the Factory and Maze environments shown in Figure 1(a) and (b).For each scenario and each control-sensing error value (we set e T = e Z ), we ran 100 simulation runs using ABT and MHFR respectively with a planning time of 2s per step.
The average values for SNM and MoNG and the relative value differences between ABT and MHFR in the empty environments are presented in Table I.The results show that for both scenarios SNM and MoNG are sensitive to increasing transition and observation errors.This resonates well with the relative value difference between ABT and MHFR.The more interesting question is now, how sensitive are both measures to obstacles in the environment?Table II(a) and (b) shows the results for the Factory and the Maze scenario respectively.It is evident that SNM increases significantly compared to the empty environments, whereas MoNG is almost unaffected.Overall obstacles increase the relative value difference between ABT and MHFR, except for large uncertainties in the Maze scenario.This indicates that MHFR suffers more from the additional non-linearities that obstacles introduce.SNM is able to capture these effects well.
An interesting remark regarding the results for the Maze scenario in Table II(b) is that the relative value difference actually decreases for large uncertainties.The reason for this can be seen in Figure 2. As the uncertainties increase, the problem becomes so difficult, such that both solvers fail to compute a reasonable policy within the given planning time.However, clearly MHFR suffers earlier from these large uncertainties compared to ABT.
2) Effects of increasingly cluttered environments.To investigate the effects of increasingly cluttered environments on both measures, we ran a set of experiments in which the Car-like robot and the 4DOFs-manipulator operate inside environments with an increasing number of randomly distributed obstacles.For this we generated test scenarios with 5, 10, 15, 20, 25 and 30 obstacles that are uniformly distributed across the environment.For each of these test scenarios, we randomly generated 100 environments.Figure 3(a)-(b) shows two example environments with 30 obstacles for the Car-like robot and the 4DOFs-manipulator.For this set of experiments we don't take collision dynamics into account.The control and observation errors are fixed to e t = e z = 0.038 which corresponds to the median of the uncertainty values.Table III presents the results for     SNM, MoNG and the relative value difference between ABT ant MHFR for the 4DOFs-manipulator (a) and the car-like robot (b).From these results it is clear that, as the environments become increasingly cluttered, the advantage of ABT over MHFR increases, indicating that the obstacles have a significant effect on the Gaussian belief assumption of MHFR.Additionally SNM is clearly more sensitive to those effects compared to MoNG, whose values remain virtually unaffected by the clutterness of the environments.
3) Effects of collision dynamics.Intuitively, collision dynamics are highly non-linear effects.Here we investigate SNM's capability in capturing these effects compared to MoNG.For this, the robots are allowed to collide with the (a) 4DOFs-manipulator with increasing number of obstacles  obstacles.In other words, colliding states are not terminal and the dynamic effects of collisions are reflected in the transition model.For the 4DOFs-manipulator these collisions are modeled as additional constraints (contact points) that are resolved by applying "correcting velocities" to the colliding bodies in the opposite direction of the contact normals.
For the Car-like robot, we modify the transition model eq.( 18) to consider collision dynamics such that where This transition function causes the robot to slightly "bounce" off obstacles upon collision.There are two interesting remarks regarding this transition function: The first one is that eq.( 21) is a deterministic.In other words, a collision causes an immediate reduction of the uncertainty regarding the state of the robot.Second, while the collision effects eq.( 21) are linear, eq.( 20) is not smooth since the collision dynamics induce discontinuities when the robot operates in the vicinity of obstacles.
Table IV shows the comparison between SNM and MoNG and the relative value difference between ABT and MHFR for the 4DOFs-manipulator operating inside the Factory environment (a) and the car-like robot operating inside the Maze environment (b) while being subject to collision dynamics.It can be seen that the additional non-linear effects are captured well by SNM.Interestingly, compared to the results in Table II(a), where the 4DOFs-manipulator operates in the same environment without collision dynamics, MoNG captures the effects of collision dynamics as well, which indicates that collision dynamics have a large effect on the Gaussian assumption made by MHFR.Looking at the relative value difference between ABT and MHFR confirms   In the previous experiments we assumed that the observation functions are non-linear functions with additive Gaussian noise, a special class of non-linear observation functions.This class of observation functions has some interesting implications: First of all, the resulting observation distribution remains Gaussian.This in turn means that MoNG for the observation function evaluates to zero.Second, linearizing the observation function results in a Gaussian distribution with the same mean but different covariance.We therefore expect that the observation component SNM remains small, even for large uncertainties.To investigate how SNM reacts to non-linear observation functions with non-additive noise, we ran a set of experiments for the 4DOFs-manipulator operating inside the Factory environment and the car-like robot operating inside the Maze environment where we replaced both observation functions with non-linear functions with non-additive noise.For the 4DOFs-manipulator we replaced the observation function defined in eq.(17) with where w t ∼ N (0, Σ w ).In other words, the manipulator has only access to a sensor that measure the position of the end-effector in the workspace.For the car-like robot we use the following observation function: where w 1 t , w 2 t , w 3 t T ∼ N (0, Σ w ).For both robots, we set e t = 0.038.Table V shows the values for the observation components of SNM and MoNG for the 4DOFs-manipulator operating inside the Factory environment as the observation errors increase.As expected, for additive Gaussian errors, MoNG is zero, whereas SNM is small but measurable.This shows that SNM is able to capture the difference of the variance between the original and linearized observation functions.For non-additive errors, the observation function is non-Gaussian, therefore we can see that both measures increase as the observation errors increase.Interestingly for both measures the observation components yield significantly smaller values compared to the transition components.This indicates that the non-linearity of the problem stems mostly from the transition function.For the car-like robot operating inside the Maze environment we see a similar picture.For the observation function with additive Gaussian errors, Table VI(a) shows that MoNG remains zero for all values of e Z , whereas SNM yields a small but measurable value.Again, both measures increase significantly in the non-additive error case in

D. Testing SNM-Planner
In this set of experiments we want to test the performance of SNM-Planner in comparison with the two component planners ABT and MHFR.To this end we tested SNM-Planner on three problem scenarios: The Maze scenario for the car like robot shown in Figure 1(a) and the Factory scenario for the 4DOFs-manipulator.Additionally we tested SNM-Planner on a scenario in which the Kuka iiwa robot operates inside an office environment, as shown in Figure 1(b).Similarly to the Factory scenario, the robot has to reach a goal area while avoiding collisions with the obstacles.The planning time per step is 8s in this scenario.For the SNM-threshold we chose 0.5.Here we set e T = e Z = 0.038.
The results in  able to approximately identify when it is beneficial to use a linearization-based solver and when a general solver should be used.In all three scenarios, SNM-Planner outperforms the two component planners.In the Maze scenario, the difference between SNM-Planner and the component planners is significant.The reason is, MHFR is well suited to compute a long-term strategy, as it constructs nominal trajectories from the current state estimate all the way to the goal, whereas the planning horizon of ABT is limited by the depth of the search tree.However, in the proximity of obstacles, the Gaussian belief assumption of MHFR are no long valid, and careful planning is required to avoid collisions with the obstacles.
In general ABT handles these situations better than MHFR.SNM-Planner combines the benefits of both planners and alleviates their shortcoming.Figure 4 shows state samples for which the SNM-values exceed the given threshold of 0.5.It is obvious that many of these samples are clustered around obstacles.In other words, when the support set of the current belief (i.e. the subset of the state space that is covered by the belief particles) lies in open areas, MHFR is used to drive the robot towards the goal, whereas in the proximity of obstacles, ABT is used to compute a strategy that avoids collisions with the obstacles.A similar behavior was observed in the KukaOffice environment.During the early planning steps, when the robot operates in the open area, MHFR is well suited to drive the end-effector towards the goal area, but near the narrow passage at the back of the table, ABT in general computes better motion strategies.Again, SNM-Planner combines both strategies to compute better motion strategies than each of the component planners alone.
1) Sensitivity of SNM-Planner.In this experiment we test how sensitive the performance of SNM-Planner is to the choice of the SNM-threshold.Recall that SNM-Planner uses this threshold to decide, based on a local approximation of SNM, which solver to use for the policy computation.For small thresholds SNM-Planner favors ABT, whereas for large thresholds MHFR is favored.
For this experiment we test SNM-Planner on the Factory problem (Figure 1(b)) with multiple values for the SNM-threshold, ranging from 0.1 to 0.9.For each threshold value we estimate the average total expected discounted reward achieved by SNM-Planner using 1,000 simulation runs.Here we set e T = e Z = 0.038.
Table IX summarizes the results.It can be seen that the choice of the threshold can affect the performance of SNM-Planner, particularly for values that are on either side of the spectrum (very small values or very large values) where SNM-Planner favors only one of the component solvers.However, between the threshold values of 0.2 and 0.5 the results are fairly consistent, which indicates that there's a range of SNM-threshold values for which SNM-Planner performs well.

VII. SUMMARY AND FUTURE WORK
This paper presents our preliminary work in identifying the suitability of linearization for motion planning under uncertainty.To this end, we present a general measure of non-linearity, called Statistical-distance-based Non-linearity Measure (SNM), which is based on the distance between the distributions that represent the system's motion-sensing model and its linearized version.Comparison studies with one of state-of-the-art methods for non-linearity measure indicate that SNM is more suitable in taking into account obstacles in measuring the effectiveness of linearization.
We also propose a simple on-line planner that uses a local estimate of SNM to select whether to use a general POMDP solver or a linearization-based solver for robot motion planning under uncertainty.Experimental results indicate that our simple planner can appropriately decide where linearization should be used and generates motion strategies that are comparable or better than each of the component planner.
Future work abounds.For instance, the question for a better measure remains.The total variation distance relies on computing a maximization, which is often difficult to estimate.Statistical distance functions that relies on expectations exists and can be computed faster.How suitable are these functions as a non-linearity measure?Furthermore, our upper bound result is relatively loose and can only be applied as a sufficient condition to identify if linearization will perform well.It would be useful to find a tighter bound that remains general enough for the various linearization and distribution approximation methods in robotics.

APPENDIX
For writing compactness we use the following shorthand notations for the transition and observation functions thoughout the next two subsections: T = T (s, a, s ), T = T (s, a, s ) and Z = Z(s, a, o), Z = Z(s, a, o).Additionally in section Section B we use the notations T k = T (s k , a, s ) and T k = T (s k , a, s ).
From the definition of the total variation distance, it follows that s ∈S T − T ds = 2D s,a T V (T, T ) for any given s ∈ S and a ∈ A and o∈O Z − Z do = 2D s ,a T V (Z, Z) for any given s ∈ S. Substituting these equalities into eq.(27) and taking the supremum over the conditionals s, s and a allows us to upper-bound eq.( 27 For the last inequality we bound the terms |T 1 − T 2 | and T 1 − T 2 using Definition 4. Furthermore we use the fact that sup |f |≤1 s ∈S f (s )ds = 1, assuming that the state space S is normalized.This concludes the proof of Lemma 5.
where C Ti ≥ 0 and C Ti ≥ 0 are finite local-Lipschitz constants In other words, S i are subsets of S in which the original and linearized transition functions are Lipschitz continuous with Lipschitz constants C Ti and C Ti .With this definition at hand, we can now show the following lemma: Lemma 5. Let S be a n − dimensional metric space with distance metric D S and assume S is normalized to [0, 1]n .

else 10 :
a = ABT(b i , t p )

Fig. 1 :
Fig. 1: Test scenarios for the different robots.The objects colored black and gray are obstacles, while the green sphere is the goal region.(a) The Maze scenario for the car-like robot.The blue squares represents the beacons, while the orange square at the bottom left represents the initial state.(b) The 4DOFs-manipulator scenario.(c) The KukaOffice scenario

Fig. 2 :Fig. 3 :
Fig. 2: The average total discounted rewards achieved by ABT and MHFR in the Maze scenario, as the uncertainties increase.Vertical bars are the 95% confidence intervals.
(a) Maze environment with collision dynamics

Fig. 4 :
Fig. 4: State samples in the Maze scenario for which the approximated SNM value exceeds the chosen threshold of 0.5

TABLE II :
Average values of SNM, MonG and the relative value difference between ABT and MHFR for the 4DOFs-manipulator operating inside the Factory environment (a) and the car-like robot operating inside the Maze environment (b).

TABLE III :
Average values of SNM, MonG and relative value difference between ABT and MHFR for the 4DOFs-manipulator (a) and the car-like robot (b) operating inside environments with increasing numbers of obstacles.

TABLE IV :
Average values of SNM, MonG and the relative value difference between ABT and MHFR for the 4DOFs-manipulator operating inside the Factory environment (a) and the car-like robot operating inside the Maze environment (b) while being subject to collision dynamics.this.MHFR suffers more from the increased non-linearity of the problems caused by collision dynamics compared to ABT.This effect aggravates as the uncertainty increases, which is a clear indication that the problem becomes increasingly non-linear with larger uncertainties.Looking at the results for the car-like robot operating in the Maze scenario presents a similar picture.Comparing the results in Table IV(b) where collision dynamics are taken into account to Table II(b), shows that collision dynamics have a significant effect both to SNM as well as Measure of

TABLE VI :
Comparison between the observation component of SNM and MoNG for the car-like robot operating inside the Maze environment with observation function eq.(19)(a) and observation function eq.(23)(b) as the observation errors increase.
Table VI(b).The question is now, how do ABT and MHFR perform in both scenarios when observation functions with non-additive Gaussian errors are used?Table VII(a) shows this relative(a) Factory environment with non-additive observation errors e T = e Z SNM MoNG V ABT (b 0 )−V MHFR (b 0 ) V ABT (b 0 )

TABLE VII :
Average values of SNM, MonG and the relative value difference between ABT and MHFR for the 4DOFs-manipulator operating inside the Factory environment (a) and the car-like robot operating inside the Maze environment (b) with non-additive observation errors

TABLE VIII :
Average total discounted reward and ± 95% confidence interval over 1,000 simulation runs.The proportion of ABT being used in the Maze, Factory and Office scenarios is 37.85%, 56.43% and 42.33% respectively.valuedifferenceforthe4DOFs-manipulator operating inside the Factory environment.It can be seen that as the errors increase, the relative value difference between ABT and MHFR increase significantly, compared to the relative value difference shown in TableII(a), where an observation function with additive errors is used.Similarly, for the car-like robot operating inside the Maze scenario using the observation function with non-additive errors, the relative value difference shown in tableTableVII(b) between the two solvers is much larger compared to Table II(b).

TABLE IX :
Average total discounted reward and 95% confidence intervals of SNM-Planner on the Factory problem for varying SNM-threshold values.The average is collected over 1,000 simulation runs.The last column shows the percentage of ABT being used as the component solver.