Optimal Control Based on CACM-RL in a Two-Wheeled Inverted Pendulum

Abstract

This work aims to present a new optimal control scheme based on the CACM-RL technique applied to unstable systems such as a Two-Wheeled Inverted Pendulum (TWIP). The main challenge in this work is to verify and validate the good behaviour of CACM-RL in this kind of system. Learning while maintaining equilibrium is a complex task: it is easy on stable platforms, because the system never reaches an unstable state, but it is very difficult in unstable systems. The study also investigates implementing CACM-RL so that it coexists with a classic control solution. The results show that the proposed method works well in unstable systems, providing better results than a PID controller.


Introduction
Path planning and optimal control algorithms for solving the problem of trajectory generation for vehicles and robots have been addressed in several research projects. These mobile platforms are nonlinear dynamic systems whose motion laws have been widely studied [1-2].
According to [2], the goal of motion planning can be stated as: "given an initial configuration and a desired final configuration of the robot, [to] find a path, starting at the initial configuration and terminating at the final configuration, while avoiding collisions with obstacles". In our case, the trajectory definition is extended to other systems where there are no obstacles but there are special states which the system should not reach. Furthermore, if a cost function (time, distance or energy) is minimized, the optimal motion problem is addressed. Besides solving the basic problem, intelligent vehicles must exhibit optimal behaviour in real scenarios. This imposes an additional aim: the design of efficient motion-planning algorithms for autonomous vehicles with restricted computational resources.
Global optimal motion planning of vehicles and robots remains an open problem. Research is focused on different methods based on trajectory generation strategies, and work carried out in pursuit of optimal motion planning has used solutions based on two different methods: open-loop and closed-loop. Open-loop or offline approaches [1] [3-6] find a collision-free path from previous information about the environment (obstacle positions, characteristics and geometrical shapes of the space where the vehicle can move, etc.); the goal of these methods is to model and calculate in advance the dynamics of the environment in which the vehicle is being driven. Closed-loop or online approaches [7-9] are more robust against perturbations. There are other techniques that address the optimal motion problem without guaranteeing global optimality, trading off between the optimality achieved and the computation period. Other methods offer an effective local solution [11][12][13][14][15][16]. Combining the advantages of both kinds of methods motivates the optimal motion planning approach proposed by the authors.

Two-wheeled inverted pendulum

The inverted pendulum is a classic control engineering problem, traditionally used as an exercise in teaching, which has also been turned into practical products marketed in recent years. The objective is to balance the system (TWIP) so that the pendulum remains in the upright position, with its mass on top.

Cell mapping techniques
Cell mapping techniques include, on the one hand, an efficient application of numerical methods in order to integrate nonlinear (even unstable) systems and, on the other, Bellman's Principle of Optimality to find the optimal control efficiently. The result is a new optimal control method based on a cell state-space, able to design efficient optimal controllers for highly unstable and nonlinear systems. The design method of controllers based on cell mapping techniques [14] can be divided into two stages:

1. Obtain a family of cell-to-cell mappings that constitutes the necessary knowledge to calculate the optimal control laws associated with the states of the system.

2. Search through the optimal control laws, taking into account the Principle of Optimality (intelligent searching techniques). The process finishes when an Optimal Control Table (OCT) is found, which acts as a controller.
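The second (search) stage can be sketched as a backward shortest-path search from the goal cell over a given cell-to-cell mapping, producing an OCT. The transition structure and costs below are hypothetical illustrations, not the authors' implementation:

```python
import heapq

def build_oct(transitions, goal):
    """Backward Dijkstra over a cell-to-cell mapping.

    transitions[(cell, action)] = (image_cell, cost) is the stage-1 knowledge.
    Returns oct_table: cell -> optimal action, and the cost-to-goal of each cell.
    """
    # Invert the mapping: for each image cell, which (origin, action, cost) reach it?
    incoming = {}
    for (cell, action), (image, cost) in transitions.items():
        incoming.setdefault(image, []).append((cell, action, cost))

    cost_to_goal = {goal: 0.0}
    oct_table = {}
    frontier = [(0.0, goal)]
    while frontier:
        c, cell = heapq.heappop(frontier)
        if c > cost_to_goal.get(cell, float("inf")):
            continue  # stale queue entry
        for origin, action, step_cost in incoming.get(cell, []):
            new_cost = c + step_cost
            if new_cost < cost_to_goal.get(origin, float("inf")):
                cost_to_goal[origin] = new_cost
                oct_table[origin] = action  # Bellman: best action at 'origin'
                heapq.heappush(frontier, (new_cost, origin))
    return oct_table, cost_to_goal

# Toy 1-D chain: cells 0 -> 1 -> 2 (goal) under action '+', 2 -> 1 under '-'
transitions = {(0, '+'): (1, 1.0), (1, '+'): (2, 1.0), (2, '-'): (1, 1.0)}
oct_table, cost = build_oct(transitions, goal=2)
print(oct_table)  # optimal action for each controllable cell
```

Because only transitions actually recorded in stage 1 are examined, the search touches no more than Nc × Nu entries.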
Cell-to-cell mapping methods are based on a discretization of the state variables of the system, defining a partition of the state space into cells. A cell-to-cell mapping can be derived from the dynamic evolution of the system. In [14], the Control Adjoining Cell Mapping (CACM) algorithm for optimal control of highly nonlinear systems is proposed. This method is based on the Adjoining Cell Mapping (ACM) technique, whose central concept is the creation of a cell mapping where only transitions between adjoining cells are allowed. Let us consider a system under the action of a P-dimensional control vector, which can only take some finite number of values Nu belonging to a set U. It is assumed that the control is maintained at a constant level during a time interval t.
When an action is applied to the system during a time interval, the system can go to a new state (image cell), remain at the same state or go to the drain (out of the state space). The knowledge of the system is given by the set of transitions to the image states. The maximum size of the set of transitions is Nc × Nu, where Nc is the total number of cells or states.
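The discretization just described can be sketched as follows; the cell widths and bounds are illustrative, not the paper's values:

```python
import math

def cell_of(state, lower, h):
    """Map a continuous state vector to a tuple of integer cell indices.

    lower: lower bound of each state variable; h: cell width per dimension.
    """
    return tuple(math.floor((s - lo) / hi) for s, lo, hi in zip(state, lower, h))

# Example: 2-D state (position, velocity) on [-1, 1] x [-2, 2], 0.1-wide cells
lower, h = (-1.0, -2.0), (0.1, 0.1)
print(cell_of((0.05, -0.37), lower, h))  # -> (10, 16)
```

Each (cell, action) pair stores at most one image cell, which bounds the transition set at Nc × Nu entries.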
The adjoining property states that the distance D between any cell and its map (image) is equal to some predefined integer value k ≥ 1. The distance between two cells z and z' is defined as D(z, z') = max_i |z_i − z'_i|. The integration time of each transition is determined adaptively to enforce the adjoining property. Such a property provides substantial improvements with respect to other optimal control techniques, since the CACM algorithm only computes the meaningful information required to obtain a good approximation to the optimal solution. Sometimes it may happen that the transition never maps into a D-k cell, for example when it is trapped in the origin cell or it goes to a drain cell. In these cases, the algorithm stops and changes the control action in order to emerge from the cell or re-enter the state space. The CACM technique carries out a shortest-path search in an efficient manner, reducing both memory requirements and computation time.
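A minimal sketch of the adjoining property and the adaptive integration time; the max-norm cell distance and the halving/doubling rule are plausible choices for illustration, not necessarily the authors' exact implementation:

```python
import math

def cell_distance(z, z2):
    """D(z, z') = max_i |z_i - z'_i|; adjoining cells satisfy D = k."""
    return max(abs(a - b) for a, b in zip(z, z2))

def adjoining_transition(step, cell_of, x0, action, dt, k=1, max_tries=30):
    """Adapt the integration time until the image cell lies at distance k.

    step(x, action, dt): one integration of the dynamics (hypothetical signature).
    cell_of(x): continuous state -> tuple of cell indices.
    """
    z0 = cell_of(x0)
    for _ in range(max_tries):
        z1 = cell_of(step(x0, action, dt))
        d = cell_distance(z0, z1)
        if d == k:
            return z1, dt                 # adjoining transition found
        dt = dt / 2 if d > k else dt * 2  # overshoot -> shorter step, and vice versa
    return None, dt  # never maps to a D-k cell: change the control action

# Toy usage: dynamics x' = x + a*dt on a 0.1-wide grid (illustrative only)
step = lambda x, a, dt: tuple(xi + a * dt for xi in x)
grid = lambda x: tuple(math.floor(xi / 0.1) for xi in x)
print(adjoining_transition(step, grid, x0=(0.05, 0.05), action=1, dt=0.4))
```

When the loop exhausts its tries (trapped in the origin cell or gone to the drain), the caller switches to a different control action, as described above.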
When building the OCT, a cost function is defined to indicate the cost for a control action to map a cell to its image. The cost can be defined in terms of time, energy, distance or any other factor. Since a cost function is specific to a cell, it can be used as a local performance measure for controller evaluation. An overall performance measure for a controller can be generated from the local measures by simply averaging the costs over all controllable cells.
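As an illustration, the overall performance measure can be computed by averaging the local costs; the per-cell cost values below are made up:

```python
def controller_performance(local_cost, controllable):
    """Overall measure: average of the local costs over all controllable cells.

    local_cost: cell -> cost of its optimal transition (time, energy, ...).
    controllable: iterable of cells from which the goal can be reached.
    """
    cells = list(controllable)
    return sum(local_cost[c] for c in cells) / len(cells)

# Example with made-up per-cell time costs (seconds)
costs = {0: 0.2, 1: 0.1, 2: 0.3}
print(controller_performance(costs, costs))
```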

Reinforcement Learning
Reinforcement learning methods only require a scalar reward (or punishment) to learn to map situations (states) into actions [10]. They only need to interact with the environment to learn from experience. The knowledge is saved in a look-up table that contains an estimation of the accumulated reward to reach the goal from each situation or state and applied policy. The objective is to find the actions (policies) that maximize the accumulated reward in each state. Q-learning is one of the most popular reinforcement learning methods, since with a simple formulation it can address model-free optimization problems [10] [17].
The convergence of this algorithm towards the optimal policy p* was proven by Watkins [18]. The accumulated reward for each state-action pair Q(s, a) is updated (backup) by the one-step equation

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

where Q is the expected value of performing action a in state s, r is the reward, α is the learning rate that controls convergence and γ is the discount factor. The discount factor makes rewards earned earlier more valuable than those received later. If the reward function is proper [19], the discount factor can be omitted (γ = 1). The action a with the highest Q value at state s is the best policy up to instant t.
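The one-step backup above can be written as a short sketch; the states, actions and reward values are made up for illustration, with γ = 1 as allowed for a proper reward function:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """One-step Q-learning backup:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example: hypothetical states 0 and 1, two actions, time-like negative reward
Q = defaultdict(float)
q_update(Q, s=0, a='left', r=-1.0, s_next=1, actions=['left', 'right'])
print(Q[(0, 'left')])  # -0.1 = 0.1 * (-1 + 0 - 0)
```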

CACM-RL
The cell mapping techniques include, on the one hand, an efficient application of numerical methods in order to integrate nonlinear systems and, on the other, Bellman's Principle of Optimality to find the optimal control efficiently. Designing controllers based on the cell mapping techniques [12][13][14][15][16] results in an efficient optimal control method for nonlinear systems. Cell-to-cell mapping methods are based on a discretization of the state variables of the system, defining a partition of the state space into cells. A cell-to-cell mapping can be derived from the dynamic evolution of the system. In [12-14], solutions based on cell mapping techniques for the design of optimal controllers are proposed. In [14], the Control Adjoining Cell Mapping (CACM) method is implemented, which consists of the creation of a cell mapping where only transitions between adjoining cells are allowed.
It is necessary to define a control vector that can only take some finite number of values, Nu. It is assumed that the control is kept constant during a time interval, t. When an action is applied to the system during a time interval, the system may go to a new state, remain at the same state or go to the drain (out of the state space). The knowledge of the system is given by the set of transitions from different origin states. The maximum size of the set of transitions is Nc × Nu, where Nc is the total number of cells or states.
Reinforcement learning methods only require a scalar reward (or punishment) to learn to map situations (states) into actions [10]. They only need to interact with the environment to learn from experience. The knowledge is saved in a look-up table that contains an estimation of the accumulated reward to reach the goal from each state and applied policy. The objective is to find the actions (policies) that maximize the accumulated reward in each state.
The new algorithm proposed by the authors in [15], CACM-RL, combines the cell mapping techniques and the reinforcement learning approaches in order to conceive a single efficient optimal control algorithm.
CACM-RL deals with different data structures to store the partial results of the learning process. These structures are described below.


Q_Table(s, a) is where the accumulated reward for each state-action pair, Q(s, a), is saved. From this table, the optimal policy a* is obtained. The table is updated according to the one-step equation [15].
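A possible sketch of Q_Table and the extraction of the optimal policy a*; the data layout is hypothetical, not the authors' implementation:

```python
from collections import defaultdict

class QTable:
    """Q_Table(s, a): accumulated reward per state-action pair.

    States are cell-index tuples; actions are discrete control values.
    """
    def __init__(self, actions):
        self.actions = list(actions)
        self.q = defaultdict(float)  # (s, a) -> Q(s, a), default 0.0

    def get(self, s, a):
        return self.q[(s, a)]

    def set(self, s, a, value):
        self.q[(s, a)] = value

    def best_action(self, s):
        """Optimal policy a* at state s: the action with the highest Q value."""
        return max(self.actions, key=lambda a: self.q[(s, a)])

# Usage with made-up values: three discrete control actions
qt = QTable(actions=[-1, 0, +1])
qt.set(s=(3, 5), a=+1, value=2.5)
print(qt.best_action((3, 5)))  # -> 1
```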


During the learning phase, we must pay special attention to the reward values implemented in the CACM-RL algorithm. Since the optimal criterion was to minimize the time, the reward can take three different values in the following three cases: 1) a generic transition different from the goal is reached; 2) the goal is reached; 3) the TWIP goes out of the considered state space. In the first case, the reward is equal to the transition time, but with a negative sign; usually, it will be an integer number of sample periods. In the second case, the reward is equal to the maximum value (positive). In the third case, the system is punished with a very large negative reward. When learning, the maximum value of the reward stored in the goal is spread to all the state space. In this way, those controllable states (from which the goal is reached) far away from the goal will have a low positive reward, and vice versa.

Model_Table(s, a) is the model of the TWIP system. It contains the transitions from the origin state for each velocity (θ', x') (there are as many velocities as there are cells), for each tilt (θ) and for each control action. The transitions have to satisfy the adjoining property to ensure a good approximation of the optimal policy. For simplicity, the origin state is defined as: θ = 0; x = 0.

policy(s) selects a specific policy to estimate the Model_Table and to exploit the best policy acquired.

IT(x, x') is the operator that transforms a generic transition x → x' into a transition at the origin. When all possible transitions from the origin have been performed, we can conclude that the learning stage has finished (Model_Table just created). In order to perform a real approximation of the transformation to the origin, θ' and x' are averaged out and filtered.

A generic transition is defined as one occurring within the considered state space and whose starting state is different from the origin state.

DT(x) is the operator that transforms a transition at the origin into a generic transition. In this way, the whole knowledge (generic transitions) of the state space can be generated.
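The three reward cases above can be sketched as follows; the GOAL_REWARD and DRAIN_PENALTY magnitudes are illustrative, not the paper's values:

```python
GOAL_REWARD = 100.0      # case 2: maximum positive reward, stored at the goal
DRAIN_PENALTY = -1000.0  # case 3: very large punishment for leaving the state space

def reward(transition_time, reached_goal, left_state_space):
    """Time-optimal reward with the three cases described in the text."""
    if left_state_space:  # case 3: TWIP out of the considered state space
        return DRAIN_PENALTY
    if reached_goal:      # case 2: goal reached
        return GOAL_REWARD
    # case 1: generic transition -- negative transition time (usually an
    # integer number of sample periods)
    return -transition_time

print(reward(0.02, False, False))  # -0.02
```

As the maximum reward is spread from the goal during learning, controllable states far from the goal accumulate a low positive value, matching the behaviour described above.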

Figure 3 shows the balancing of the TWIP. The simplicity of the mathematical model of the system has allowed the new optimal controller to be applied. In Figure 4, the behaviour of the system during 8 seconds is shown; the plots compare the control actions applied by the PID and by the optimal controller (red line), shown in detail in Figure 5. Two phases can be identified: balancing as before, with an energy reduction during the control period.