International Journal of Advanced Robotic Systems

Near-optimal Tracking Control of a Nonholonomic Mobile Robot with Uncertainties

Regular Paper

A combined kinematic/torque control law is developed using a backstepping design approach for a nonholonomic mobile robot with two driving wheels mounted on the same axis to track a reference trajectory. The auxiliary velocity control inputs are designed for the kinematic steering system to make the posture error asymptotically stable. Next, a computed-torque controller is designed such that the mobile robot's velocities converge to the given velocity inputs in an optimal manner, by converting the tracking control problem into a regulation problem in which the uncertainties in the dynamics of mobile robots are considered. The proposed online, forward-in-time policy iteration (PI) algorithm based on approximate dynamic programming (ADP) is used to solve the optimal control problem with unknown internal dynamics, using a single neural network (NN) to approximate the cost function. The near-optimal control policy can then be computed directly from the cost function, which removes the action network appearing in the ordinary ADP method. The stability of the dynamical extension system is demonstrated using Lyapunov methods. Simulation results are provided to demonstrate the effectiveness of the proposed approach.


Introduction
A differentially driven wheeled mobile robot (WMR) is a typical nonholonomic system in which the wheels are assumed to roll without slipping [1,2]. At the same time, it is an intrinsically nonlinear system with uncertainties in the dynamic model. The tracking control of such a system turns out to be a nontrivial problem due to both its challenging theoretical nature and its practical importance.
Originally, many works [3-6] considered only the kinematic model of the mobile robot, with the assumption that the control signals instantaneously generate the actual velocity control inputs. However, the perfect velocity tracking assumption [7] does not hold in practice.
Controllers based on a full dynamic model [1,2,8-10] capture better behaviour because they account for dynamic effects such as mass, friction and inertia, which are neglected by kinematic controllers. Optimization algorithms, such as genetic algorithms (GAs), Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO), have been used to find optimal intelligent controllers for the WMR [11-15]. However, these control schemes only ensure the stability of the closed-loop system and satisfactory tracking of the output to the given reference signal; no optimality criterion is considered in the control objective. In many cases, it is desirable that the tracking control law not only stabilizes the system but also renders optimality based on a pre-defined cost function [16-19].
From a mathematical point of view, the sufficient condition for solving this optimal control problem is the solution of the Hamilton-Jacobi-Bellman (HJB) equation [18,19]. However, for nonlinear systems, finding a cost function that satisfies the HJB equation is challenging because it requires the solution of a partial differential equation that cannot be solved explicitly. For this reason, considerable effort has been devoted to developing ADP algorithms [20,21], including attempts to use, analyse or develop general-purpose methods that find good approximate answers to this optimization problem, using learning or approximation methods to cope with complexity. Actor-critic (AC) architectures [16,22] have been proposed as models of ADP algorithms since AC methods are amenable to online implementation. Typically, an AC architecture consists of two NNs: an actor NN and a critic NN. The actor NN approximates the optimal control law and generates the control signals, while the critic NN rates the quality of the control signals through its approximation of the cost function.
As part of optimal control, and as one of several important new tools for intelligent control, the ADP algorithm presented in this paper does not require preliminary learning. It works online and uses only a single NN to approximately solve the HJB equation, while the internal dynamics in terms of the velocity tracking errors are considered unknown, in contrast with the AC architectures mentioned above.
The paper is organized as follows. Section 2 provides the kinematic and dynamic model of the WMR. The formulation of the adaptive optimal tracking control problem is given in Section 3, and a unifying design framework is proposed based on a backstepping control approach and nonlinear optimal control theory. Lyapunov theory guarantees the stability of the dynamical extension system while considering the error between the cost function and its NN approximation. The convergence proof of the combined control law is presented in Section 4. Section 5 evaluates the control performance of the near-optimal controller by comparing it with the initial stabilizing control. Finally, Section 6 gives some concluding remarks.

Kinematic and Dynamic Model of the WMR
The WMR shown in Fig. 1 consists of a vehicle with two driving wheels mounted on the same axis and a passive self-adjusting supporting wheel, which carries the mechanical structure. The two driving wheels are independently driven by two actuators (e.g., DC motors). It is assumed that the mobile robot under study is made up of a rigid frame equipped with non-deformable wheels and that it moves on a horizontal plane.

Kinematic Model
For the WMR system considered here, the pure-rolling, non-slipping nonholonomic condition (1) states that the robot can only move in the direction perpendicular to the axis of the driving wheels:

$\dot{y}\cos\theta - \dot{x}\sin\theta = 0$   (1)

The kinematic constraint (1) can be written as:

$A(q)\dot{q} = 0, \quad A(q) = [-\sin\theta \quad \cos\theta \quad 0]$   (2)

The null space of $A(q)$ is given by the matrix $S(q)$. As such:

$A(q)S(q) = 0$   (3)

The vector $\dot{q}$ has to lie in this null space; therefore:

$\dot{q} = S(q)\nu = \begin{bmatrix} \cos\theta & 0 \\ \sin\theta & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v \\ w \end{bmatrix}$   (4)

where $S(q)$ is a Jacobian matrix that transforms the velocities $\nu = [v \quad w]^T$ in the WMR base coordinates into the velocities $\dot{q}$ in the Cartesian coordinates, and $v$ and $w$ are the linear velocity of the point C along the robot axis and the angular velocity, respectively. System (4) is called the kinematic model of the robot.
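As a small numerical illustration (ours, not from the paper), the kinematic model (4) and the constraint $A(q)\dot{q} = 0$ can be sketched in a few lines; the function names are assumptions:

```python
import numpy as np

def unicycle_kinematics(q, v, w):
    """Kinematic model q_dot = S(q) [v, w]^T of the WMR, Eq. (4).

    q = [x, y, theta]; v is the linear velocity of the point C along
    the robot axis and w is the angular velocity.
    """
    theta = q[2]
    S = np.array([[np.cos(theta), 0.0],
                  [np.sin(theta), 0.0],
                  [0.0,           1.0]])
    return S @ np.array([v, w])

def constraint_residual(q, q_dot):
    """Nonholonomic constraint A(q) q_dot with A(q) = [-sin(theta), cos(theta), 0]."""
    theta = q[2]
    return -np.sin(theta) * q_dot[0] + np.cos(theta) * q_dot[1]

# Every velocity produced by the kinematic model lies in the null space of A(q):
q = np.array([1.0, 2.0, 0.7])
q_dot = unicycle_kinematics(q, 0.5, 0.3)
print(abs(constraint_residual(q, q_dot)) < 1e-12)  # True
```

This makes the null-space property (3) concrete: any $\dot{q}$ generated through $S(q)$ automatically satisfies the rolling constraint.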

Dynamic Model
Using the Euler-Lagrange equations, the dynamical equations of motion can then be derived as:

$M(q)\ddot{q} + V_m(q,\dot{q})\dot{q} + F(\dot{q}) + \tau_d = B(q)\tau - A^T(q)\lambda$   (5)

Here, $M(q)$ is the symmetric, positive-definite inertia matrix, built from the mass $m$ and the moment of inertia $I$ of the robot around its centre of mass; $V_m(q,\dot{q})$ is the centripetal and Coriolis matrix; $F(\dot{q})$ denotes the surface friction; $\tau_d$ denotes bounded unknown disturbances, including the unstructured unmodelled dynamics; $B(q)$ is the input transformation matrix; $\tau$ is the input vector, which comprises the right and left wheel torques; and $\lambda$ is the vector of the constraint forces.
Next, we differentiate (4) with respect to time, substitute the expression for $\ddot{q}$ into (5), and multiply the resulting expression by $S^T$ to eliminate the vector of constraint forces $\lambda$; the dynamic system (5) is thereby transformed into a representation (6) more appropriate for control purposes:

$\bar{M}\dot{\nu} + \bar{V}_m\nu + \bar{F}(\nu) + \bar{\tau}_d = \bar{B}\tau$   (6)

where $\bar{M} = S^T M S$, $\bar{V}_m = S^T(M\dot{S} + V_m S)$, $\bar{F} = S^T F$, $\bar{\tau}_d = S^T\tau_d$ and $\bar{B} = S^T B$.

From the perspective of backstepping control, the control design problem of the WMR can be described as follows: first, the desired velocity profiles are generated for the mobile robot to follow a reference trajectory (called motion control); then, the control inputs to the robot (mostly the driving torques/voltages of the motors) are determined to achieve the required velocities, taking into account the mass, friction, etc., of the actual cart (called speed control).
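For concreteness, the reduced dynamics (6) can be sketched for a symmetric cart whose centre of mass lies on the wheel axis, in which case the reduced inertia matrix is diagonal. The numerical parameter values and the linear friction model below are illustrative assumptions, not the paper's:

```python
import numpy as np

m, I = 10.0, 5.0   # mass [kg] and inertia [kg m^2] (assumed values)
r, R = 0.05, 0.27  # wheel radius and half of the axle length [m] (assumed)

M_bar = np.diag([m, I])                    # reduced inertia for this symmetric cart
B_bar = (1.0 / r) * np.array([[1.0,  1.0],
                              [R,   -R]])  # maps wheel torques to (force, moment)

def reduced_dynamics(nu, tau, friction=lambda nu: 0.5 * nu):
    """nu_dot from M_bar nu_dot + F(nu) = B_bar tau; nu = [v, w], tau = [tau_r, tau_l]."""
    return np.linalg.solve(M_bar, B_bar @ tau - friction(nu))

# Equal wheel torques from rest produce pure forward acceleration:
print(reduced_dynamics(np.zeros(2), np.array([1.0, 1.0])))  # [4. 0.]
```

Opposite wheel torques, by the same map, produce a pure angular acceleration, which is the intuition behind the differential drive.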

Tracking Control Problem Formulation
In the trajectory tracking task, the mobile robot is required to follow a trajectory generated by a reference robot moving at the desired linear and angular velocities $v_r$ and $w_r$, prescribed as:

$\dot{x}_r = v_r\cos\theta_r, \quad \dot{y}_r = v_r\sin\theta_r, \quad \dot{\theta}_r = w_r$   (7)

where $q_r = [x_r \quad y_r \quad \theta_r]^T$ is the reference posture. To track a reference trajectory is to find a control law which makes the real robot follow the given reference moving posture $q_r$ with stability; as time approaches infinity, $\lim_{t\to\infty}\|q_r(t) - q(t)\| = 0$.

Motion Control
The tracking error is expressed relative to the local coordinate frame fixed on the mobile robot as:

$e_p = [e_1 \quad e_2 \quad e_3]^T = T_e(q_r - q), \quad T_e = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$   (8)

and the derivative of the error (8) is:

$\dot{e}_1 = we_2 - v + v_r\cos e_3, \quad \dot{e}_2 = -we_1 + v_r\sin e_3, \quad \dot{e}_3 = w_r - w$   (9)

An auxiliary velocity control input [1] that achieves tracking for (4) is given by:

$\nu_c = \begin{bmatrix} v_c \\ w_c \end{bmatrix} = \begin{bmatrix} v_r\cos e_3 + k_1 e_1 \\ w_r + k_2 v_r e_2 + k_3 v_r\sin e_3 \end{bmatrix}$   (10)

where $k_1, k_2, k_3 > 0$ are the design parameters; the derivative $\dot{\nu}_c$ in (11) is obtained by differentiating (10). Since the perfect velocity tracking assumption is unrealistic, the actual control inputs to the robot must be considered in order to achieve the required speeds.
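A minimal sketch of the posture error (8) and a Kanayama-style auxiliary velocity law of the kind cited as [1]; the gain values are illustrative assumptions:

```python
import numpy as np

def posture_error(q_r, q):
    """Tracking error (8) in the robot frame: e_p = T_e (q_r - q)."""
    theta = q[2]
    T_e = np.array([[ np.cos(theta), np.sin(theta), 0.0],
                    [-np.sin(theta), np.cos(theta), 0.0],
                    [ 0.0,           0.0,           1.0]])
    return T_e @ (q_r - q)

def auxiliary_velocity(e, v_r, w_r, k1=2.0, k2=5.0, k3=3.0):
    """Auxiliary velocity input (v_c, w_c); k1, k2, k3 > 0 are design gains."""
    e1, e2, e3 = e
    v_c = v_r * np.cos(e3) + k1 * e1
    w_c = w_r + k2 * v_r * e2 + k3 * v_r * np.sin(e3)
    return np.array([v_c, w_c])

# With zero posture error the law reproduces the reference velocities:
print(auxiliary_velocity(np.zeros(3), v_r=0.5, w_r=0.25))
```

The design choice to express the error in the robot frame is what makes the feedforward terms $v_r\cos e_3$ and $w_r$ appear naturally: at zero error the commanded velocities coincide with the reference velocities.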

Near-optimal Speed Control
Define the auxiliary velocity tracking error as:

$e_c = \nu_c - \nu$   (12)

Differentiating (12) and using (6), the mobile robot dynamics may be written in terms of the velocity tracking error as:

$\bar{M}\dot{e}_c = \bar{M}\dot{\nu}_c + \bar{V}_m\nu + \bar{F}(\nu) + \bar{\tau}_d - \bar{B}\tau = h(z) - \bar{B}\tau$   (13)

where the function $h(z) = \bar{M}\dot{\nu}_c + \bar{V}_m\nu + \bar{F}(\nu) + \bar{\tau}_d$ contains all of the mobile robot parameters, such as masses, moments of inertia, friction coefficients, and so on. These quantities are often imperfectly known and difficult to determine; in applications, the nonlinear function $h(z)$ is at least partially unknown.
Therefore, a suitable control input for velocity following is given by the computed-torque-like control:

$\tau = \bar{B}^{-1}\left(\hat{h}(z) + K_4 e_c - u\right)$   (14)

where $K_4 = k_4 I$ is a diagonal, positive-definite gain matrix, $\hat{h}(z)$ is the nominal part of $h(z)$, and the reinforcement control input $u$ is designed to ensure the speed control in an optimal manner.
Using this control input (14) in (13), the closed-loop system becomes:

$\bar{M}\dot{e}_c = h(z) - \hat{h}(z) - K_4 e_c + u$   (15)

Eq. (15) can be rewritten as:

$\dot{e}_c = f(e_c) + g(e_c)u$   (16)

with the unknown internal dynamics $f(e_c) = \bar{M}^{-1}\left(h(z) - \hat{h}(z) - K_4 e_c\right)$ and $g(e_c) = \bar{M}^{-1}$.

Define the infinite-horizon integral cost function as:

$V(e_c(t)) = \int_t^{\infty}\left(e_c^T Q e_c + u^T R u\right)d\tau$   (17)

Then, an infinitesimal version of (17) is the so-called nonlinear Lyapunov equation:

$0 = e_c^T Q e_c + u^T R u + (\nabla V)^T\left(f(e_c) + g(e_c)u\right), \quad V(0) = 0$   (18)

The optimal control problem can now be formulated: given the continuous-time system (16), the admissible control set $u \in \Psi(\Omega)$ and the infinite-horizon cost function (17), find an admissible control policy such that the cost index (17) associated with the system (16) is minimized.
Defining the Hamiltonian of the problem:

$H(e_c, u, \nabla V) = e_c^T Q e_c + u^T R u + (\nabla V)^T\left(f(e_c) + g(e_c)u\right)$   (19)

the optimal cost function $V^*(e_c)$ satisfies the HJB equation:

$0 = \min_{u \in \Psi(\Omega)} H(e_c, u, \nabla V^*)$   (20)

Assuming that the minimum on the right-hand side of (20) exists and is unique, the optimal control for the given problem is:

$u^* = -\frac{1}{2}R^{-1}g^T(e_c)\nabla V^*$   (21)

Inserting this optimal control policy (21) into the Hamiltonian (19), we obtain the formulation of the HJB equation in terms of $\nabla V^*$:

$0 = e_c^T Q e_c + (\nabla V^*)^T f(e_c) - \frac{1}{4}(\nabla V^*)^T g(e_c)R^{-1}g^T(e_c)\nabla V^*, \quad V^*(0) = 0$   (22)

In order to find the optimal control solution $u^*$, one only needs to solve the HJB equation (22) for the cost function $V^*$ and then substitute the solution into (21). However, the HJB equation is a nonlinear partial differential equation, and finding its solution is generally difficult or even impossible. Moreover, it requires complete knowledge of the system dynamics, which is hard to obtain. Considering the practical applications of the WMR, the difficulties involved in modelling practical systems exactly, and the unavoidable disturbances, an effective tracking control design for an uncertain WMR needs to be studied.
In the following, we discuss a new online algorithm based on PI which solves the continuous-time (CT) optimal control problem without using any knowledge of the system's internal dynamics (i.e., the system function $f(e_c)$) for nonlinear systems with an infinite-horizon quadratic cost. Proof of the convergence of the algorithm to the optimal control function is provided in the next section.

Modified PI Algorithm for Solving the HJB Equation
Instead of trying to solve the HJB equation directly, the PI method starts by evaluating the cost of a given initial policy and then tries to use this information to obtain a new improved control policy.
The modified PI algorithms [19,23], built on the online adaptive critic techniques [16,21,24,25], form a two-step iteration, where $i$ denotes the iteration index. Given an initial stabilizing control policy $u^{(0)}$:

(1) Policy evaluation: solve for the cost function $V^{(i)}$:

$V^{(i)}(e_c(t)) = \int_t^{t+T}\left(e_c^T Q e_c + (u^{(i)})^T R u^{(i)}\right)d\tau + V^{(i)}(e_c(t+T))$   (23)

(2) Policy improvement: update the control policy using:

$u^{(i+1)} = \arg\min_u\left[H(e_c, u, \nabla V^{(i)})\right] = -\frac{1}{2}R^{-1}g^T(e_c)\nabla V^{(i)}$   (24)

We solve iteratively between Eqs. (23) and (24) without making use of any knowledge of the internal dynamics of the system.
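The two-step loop is easiest to see in the linear-quadratic special case, where policy evaluation reduces to a Lyapunov equation and policy improvement to a gain update (Kleinman's classical algorithm). The sketch below is that model-based special case, not the paper's model-free version; the system matrices and the initial gain are illustrative:

```python
import numpy as np

def lyap(Acl, Qbar):
    """Solve Acl^T P + P Acl + Qbar = 0 via the Kronecker/vec identity."""
    n = Acl.shape[0]
    L = np.kron(np.eye(n), Acl.T) + np.kron(Acl.T, np.eye(n))
    P = np.linalg.solve(L, -Qbar.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against round-off

def policy_iteration(A, B, Q, R, K, iters=20):
    """Policy evaluation (Lyapunov solve) + policy improvement (gain update)."""
    for _ in range(iters):
        P = lyap(A - B @ K, Q + K.T @ R @ K)   # cost of the current policy
        K = np.linalg.solve(R, B.T @ P)        # improved gain
    return K, P

# Double integrator with an initial stabilizing gain K0 = [1, 1]:
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K, P = policy_iteration(A, B, Q, R, K=np.array([[1.0, 1.0]]))
print(np.round(K, 4))  # converges to the LQR-optimal gain [1, sqrt(3)]
```

Starting from any stabilizing gain, the iterates converge monotonically to the optimal solution, which is the same structural property the nonlinear PI above relies on.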

Single NNs-based ADP Algorithm for the HJB's Approximate Solution
In order to solve Eq. (22) using the modified PI algorithm proposed previously, we will use an NN to obtain an approximation of the unknown cost function, exploiting its universal approximation property [2].

The cost function $V(e_c)$ can be approximated by a two-layer feedforward NN:

$V(e_c) \approx \hat{V}(e_c) = W^T\phi(e_c)$   (25)

where $\phi(e_c) = [\phi_1(e_c) \quad \ldots \quad \phi_N(e_c)]^T$ is the vector of linearly independent activation functions of the hidden layer with $N$ neurons. The weights of the hidden layer are all equal to one and are not changed during the training procedure; $W$ denotes the weight vector of the output layer, which has linear activation functions.

Using the NN's approximate description of the cost function in Eq. (23) leaves the temporal-difference residual error:

$\delta(t) = \rho(t) + W^T\left[\phi(e_c(t+T)) - \phi(e_c(t))\right], \quad \rho(t) = \int_t^{t+T}\left(e_c^T Q e_c + u^T R u\right)d\tau$

Using the inner product notation for the Lebesgue integral and the properties of the inner product, setting the projection of the residual onto the regression vector $\Delta\phi(t) = \phi(e_c(t)) - \phi(e_c(t+T))$ to zero gives $\langle\Delta\phi, \Delta\phi^T\rangle W = \langle\Delta\phi, \rho\rangle$. If there exist values of $T$ such that $\langle\Delta\phi, \Delta\phi^T\rangle$ is invertible [17], then we obtain:

$W^{(i)} = \langle\Delta\phi, \Delta\phi^T\rangle^{-1}\langle\Delta\phi, \rho^{(i)}\rangle$   (31)

From (24), we get the new control policy:

$u^{(i+1)} = -\frac{1}{2}R^{-1}g^T(e_c)\left(\frac{\partial\phi}{\partial e_c}\right)^T W^{(i)}$   (32)

Eqs. (31) and (32) are solved successively at each iteration $i$ until convergence.
A structure for the tracking control system is presented in Figure 2.

Convergence Proof
Theorem 1. The policy iterations (23) and (24) converge uniformly to the solution of the optimal control problem (17), without using any knowledge of the internal dynamics of the controlled system (16).

Proof: Following Theorem 1 in [19], from Eq. (17) the infinitesimal version along the trajectories generated by the policy $u^{(i)}$ is the Lyapunov equation:

$0 = e_c^T Q e_c + (u^{(i)})^T R u^{(i)} + (\nabla V^{(i)})^T\left(f(e_c) + g(e_c)u^{(i)}\right)$   (34)

Integrating (34) over the time interval $[t, t+T]$, we obtain (23). This means that the unique solution of (34) also satisfies (23). To complete the proof, we have to show by contradiction that Eq. (23) has a unique solution.
Subtracting (18) from (34), we obtain that (23) has a unique solution, which is equal to the unique solution of (34).
The iteration between (34) and (24) is equivalent to the iteration between (23) and (24), which has been proven to converge to the solution of the HJB equation [27].
Theorem 2. Given an initial admissible control $u^{(0)}$ and a sufficiently large number of neurons in the hidden layer, the control policy obtained from the PI algorithm (31) and (32), with the cost function approximation (31), converges to the optimal control (21) without using any knowledge of the internal dynamics of the controlled system (16).

Theorem 3. The auxiliary velocity tracking error $e_c$ and the posture tracking error $e_p$ are exponentially stable when a persistence of excitation condition is satisfied. The first two results in Theorem 3 follow directly from Theorems 1 and 2.
Establishing a Lyapunov function $V_L$ and using (9), one obtains an expression for $\dot{V}_L$ that is negative definite in the tracking errors. From the above, the exponential stability of the auxiliary velocity tracking error $e_c$ and the posture tracking error $e_p$ follows.

Simulation Results
We would like to implement the near-optimal control scheme presented in Fig. 2 and compare its performance with the initial one to show the effectiveness of the control law developed in this work.
We took the WMR parameters, the control gains and the reference velocities of the reference model in (7) as given. The weighting matrix $Q$ should be large enough to weigh heavily in the cost function (17), so that the accumulated state error can be small, while the weighting matrix $R$ should be selected large if we want to keep the energy consumption as small as possible. For the NNs, we selected activation functions with $N = 8$ hidden-layer neurons.

Fig. 4 shows the initial reinforcement control inputs (46) and the near-optimal reinforcement control inputs (47) for controller (14); the change in the reinforcement control input of controller (14) can be seen. The velocity tracking error with the near-optimal controller is shown in Fig. 5. The responses with the different controllers are shown in Fig. 6: the tracking error $e_p$ converges more quickly than before. It is clear that the performance of the WMR with near-optimal control is improved with respect to the initial one. As such, this policy iteration method can be used to improve the effectiveness of pre-existing controllers.
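To make the control structure reproducible at the kinematic level, the following self-contained sketch tracks a circular reference with the auxiliary velocity law under the perfect-velocity-tracking assumption (i.e., without the torque loop); all numerical values are illustrative assumptions, not the paper's simulation settings:

```python
import numpy as np

def simulate(T=20.0, dt=0.01, k1=2.0, k2=5.0, k3=3.0):
    """Kinematic closed loop: reference robot (7) + posture error (8) + velocity law."""
    v_r, w_r = 0.5, 0.25                 # reference linear/angular velocities
    q_r = np.array([0.0, 0.0, 0.0])      # reference posture [x_r, y_r, theta_r]
    q = np.array([0.5, -0.5, 0.5])       # robot posture, started off the path
    for _ in range(int(T / dt)):
        # reference robot, Eq. (7)
        q_r = q_r + dt * np.array([v_r * np.cos(q_r[2]),
                                   v_r * np.sin(q_r[2]), w_r])
        # posture error in the robot frame, Eq. (8)
        th = q[2]
        T_e = np.array([[ np.cos(th), np.sin(th), 0.0],
                        [-np.sin(th), np.cos(th), 0.0],
                        [ 0.0,        0.0,        1.0]])
        e1, e2, e3 = T_e @ (q_r - q)
        # auxiliary velocity input (perfect velocity tracking assumed)
        v = v_r * np.cos(e3) + k1 * e1
        w = w_r + k2 * v_r * e2 + k3 * v_r * np.sin(e3)
        q = q + dt * np.array([v * np.cos(th), v * np.sin(th), w])
    return np.linalg.norm(q_r - q)

print(simulate() < 1e-2)  # True: the posture error has essentially vanished
```

Adding the torque loop and the PI-based reinforcement input $u$ on top of this skeleton reproduces the full structure of Fig. 2.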

Conclusion
In this paper, we proposed an online PI algorithm based on ADP to solve the optimal control problem with unknown internal dynamics. It uses a single NN to approximate the cost function and then computes the near-optimal control law directly from that approximation. As an important additional advantage, the action networks [22,25,28] are no longer needed, and the associated iterative training loops are also eliminated. This leads to a notable simplification of the architecture and results in substantial computational savings. Besides this, it also eliminates the approximation error that an action network would introduce.
Both wheels have the same radius, denoted by $r$. The two driving wheels are separated by $2R$. The centre of mass of the mobile robot is located at point C. The pose of the robot in the global coordinate frame OXY can be completely specified by three generalized coordinates $q = [x \quad y \quad \theta]^T$, where $x, y$ are the coordinates of the point C in the global coordinate frame and $\theta$ is the orientation of the local frame CXcYc attached to the robot platform, measured from the X-axis.

Figure 1. System configuration of the WMR

The weighting matrix $Q$ is symmetric and positive semi-definite and $R$ is a symmetric positive definite matrix.

Definition 1 (Admissible Control). A control $u(e_c)$ is defined as admissible with respect to (17), denoted by $u \in \Psi(\Omega)$, if $u$ is continuous, $u(0) = 0$, $u$ stabilizes (16) on $\Omega$, and the cost (17) is finite for every $e_c \in \Omega$.

The data are sampled online. In the spirit of the reinforcement learning algorithms, the integral term in (23) can be regarded as the reinforcement over the time interval $[t, t+T]$.

Figure 2. Combined kinematic/torque near-optimal tracking control structure

then $e_2 \to 0$ as $t \to \infty$.

The initial stabilizing controller was taken as:

Figure 3. Convergence of the NNs' weights

The simulation was conducted using data obtained from the system every $T = 0.1$ s. At each iteration step, we solved for the NNs' weights $W \in \mathbb{R}^8$ using 1000 data points associated with a given control policy over 30 time intervals. In this way, after every 3 s, the cost function was solved for a policy update. The weights of the NNs converged to the coefficients of the optimal cost function, as one can see from Fig. 3.

Figure 4. Initial control vs. near-optimal control