A learning-based optimal tracking controller for continuous linear systems with unknown dynamics: Theory and case study

In this article, a novel continuous-time optimal tracking controller is proposed for single-input-single-output linear systems with completely unknown dynamics. Unlike existing solutions to the optimal tracking control problem, the proposed controller introduces an integral compensation to reduce the steady-state error and regulates the feedforward part simultaneously with the feedback part. An augmented system composed of the integral compensation, error dynamics, and desired trajectory is established to formulate the optimal tracking control problem. The input energy and tracking error are minimized according to an objective function over the infinite horizon. Owing to the use of reinforcement learning techniques, the proposed controller requires no prior knowledge of the system drift or input dynamics. The integral reinforcement learning method is employed to approximate the Q-function and update the critic network on-line, while the actor network is updated with the deterministic learning method. Lyapunov stability is proved under the persistence of excitation condition. A case study on a hydraulic loading system demonstrates the effectiveness of the proposed controller in both simulation and experiment.


Introduction
Accurate tracking control has drawn great research interest in a number of application fields. [1][2][3] Optimal control deals with minimizing a prescribed objective function over an infinite or finite horizon. Traditional optimal control for linear systems solves the algebraic Riccati equation (ARE) off-line. 4,5 The optimal control policy is a state feedback obtained from the gradient of the value function. 6 However, such a controller may suffer from steady-state error caused by disturbances in the system, 7 and the design of the feedforward part is separate from the optimal regulation.
In this study, an integral compensation and a feedforward part are introduced into the optimal controller. The integral term is necessary to keep the system state near the equilibrium points and reduce the steady-state error. Inspired by adaptive robust control, a discontinuous projection is applied to the integral compensation to ensure robustness. 8,9 The feedforward part, in turn, can remarkably improve the response of the control system, 10 which is necessary for high-accuracy control problems. Therefore, an augmented system with the integral compensation, error dynamics, and desired trajectory is established in this study, and the optimal tracking control problem (OTCP) of the augmented system is formulated as minimizing a performance function over the infinite horizon.
However, the introduction of the integral compensation and feedforward brings difficulties to the optimal controller design. The optimal control policy for the augmented system can hardly be obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation directly, and the completely unknown system dynamics pose a further challenge. In this study, a new optimal controller based on reinforcement learning techniques is proposed to deal with the problems mentioned above. [11][12][13][14] Different from many state-of-the-art continuous-time optimal controllers, the proposed controller is built on a Q-function-based actor-critic architecture. 15 The critic is updated by Q-function approximation, and the actor is optimized by the deterministic learning method.
The implementation of Q-function approximation in the continuous-time domain is inspired by the integral reinforcement learning (IRL) method. 16,17 An integral of the linear quadratic cost is calculated to obtain the Bellman error and update the critic weights. The optimal control policy is obtained by the deterministic learning method. 18,19 Different from off-line deterministic policy gradient (DPG) methods, the deterministic learning method in this study enables an on-line policy update. 20,21 In this article, an adaptive optimal controller based on the deterministic learning technique is developed. The contributions of this article are as follows. First, the integral compensation and feedforward are added to the control input, so that the control performance is improved. Second, the OTCP of the system is solved on-line with completely unknown dynamics by employing the Q-function approximation and deterministic learning method. Third, the convergence and Lyapunov stability of the proposed controller are proved, and the effectiveness of the controller is validated by simulation and experiment.
The rest of this article is organized as follows. Section ''Optimal tracking problem formulation'' presents the OTCP for the augmented system. Section ''Optimal controller design'' presents the design of the optimal controller. Section ''Stability analysis'' presents the Lyapunov stability of the proposed controller. Section ''Case study'' presents a case study on a hydraulic loading system. Section ''Conclusion'' presents the conclusion.
Optimal tracking problem formulation

Linear control system with integral compensation
Consider the single-input-single-output (SISO) affine continuous-time linear system described as

$$\dot{x}(t) = Ax(t) + Bu(t), \qquad y(t) = Cx(t)$$

where $x \in \mathbb{R}^n$ is the system state vector, $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n}$ are the unknown system dynamics, $u \in \mathbb{R}$ is the control input, $y \in \mathbb{R}$ is the system output, and $C \in \mathbb{R}^{1 \times n}$ is the output matrix.
Assumption 1. The system state x and the control input u are contained in compact sets.
Assumption 2. The pair $(A, B)$ is controllable. The vector $Ax(t)$ and $B$ are bounded.
Assumption 3. The desired trajectory of the system $x_d(t) \in \mathbb{R}^n$ is bounded, and there exists a Lipschitz continuous function $f_d$ such that

$$\dot{x}_d(t) = f_d(x_d(t))$$

The tracking error $e_d(t) \in \mathbb{R}^n$ is defined as

$$e_d(t) = x(t) - x_d(t)$$

The control input $u(t)$ is described as

$$u(t) = u_f(t) + u_s(t)$$

where $u_f(t)$ is the optimal control input composed of the feedforward and feedback parts and $u_s(t)$ is an integral compensation

$$u_s(t) = k_i s_e(t)$$

where $k_i$ is the integral coefficient, which is kept unchanged in this study. The integral term $s_e(t) \in \mathbb{R}$ is described as a discontinuous projection of the integrated tracking error, saturated at $s_{e\max}$ and $s_{e\min}$, the maximum and minimum of $s_e(t)$, respectively. It can be seen that $u_s$ depends linearly on $e_d$ when $s_e(t)$ is in the range $(s_{e\min}, s_{e\max})$.
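To make the projection concrete, the following minimal discrete-time sketch updates a scalar integral state and clips it to the projection bounds; the bound values, step size, and scalar error signal are illustrative assumptions rather than the values used in the paper.

# Minimal sketch of the projected integral compensation u_s = k_i * s_e.
# Bounds, gains, and the scalar error signal are illustrative assumptions.
def update_integral_compensation(s_e, e_d, dt, k_i=1.0,
                                 s_e_min=-1.0, s_e_max=1.0):
    s_e = s_e + e_d * dt                    # integrate the tracking error
    s_e = min(max(s_e, s_e_min), s_e_max)   # discontinuous projection (clipping)
    return s_e, k_i * s_e                   # projected state and u_s = k_i * s_e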
Remark 1. The traditional optimal regulation method yields a proportional-derivative (PD)-type controller with feedback only, which may cause steady-state error under uncertain dynamics. In this study, an integral compensation is introduced to eliminate the steady-state error.
Augmented system and performance function

Define the augmented system state $X \in \mathbb{R}^{2n}$ as

$$X(t) = \left[ e_d^T(t), \; x_d^T(t) \right]^T$$

The augmented state vector is composed of the tracking error and the desired trajectory.
Then, the dynamics of the augmented system can be written as

$$\dot{X}(t) = F(X) + G(X) u_f(t)$$

where the drift dynamics $F(X)$ and the input dynamics $G(X)$ can be written as

$$F(X) = \begin{bmatrix} A e_d + A x_d - f_d(x_d) + B k_i s_e \\ f_d(x_d) \end{bmatrix}, \qquad G(X) = \begin{bmatrix} B \\ 0 \end{bmatrix}$$

Remark 2. Because $A$ and $B$ are assumed to be unknown in this study, the dynamics of the augmented system $F(X)$ and $G(X)$ are also unknown.
Remark 3. The coefficient of the integral compensation is set as a constant in this study. Only the feedforward and feedback parts $u_f(t)$ need to be regulated, and it is not necessary for the integral term $s_e(t)$ to be a part of the state vector. The integral compensation is regarded as part of the system drift dynamics instead.

The objective of the OTCP is to minimize the performance function of the augmented system

$$V_c(t) = \int_t^{\infty} e^{-\gamma(\tau - t)} \left[ e_d^T(\tau) Q_T e_d(\tau) + R_T u_f^2(\tau) \right] d\tau$$

$V_c(t)$ is a discounted sum of costs over the infinite horizon, with discount factor $\gamma > 0$. The diagonal matrix $Q_T \in \mathbb{R}^{n \times n}$ and the real number $R_T \in \mathbb{R}$ are the coefficients of the quadratic cost. The optimal control policy for $u_f^{*}$ should be obtained on-line under the unknown dynamics.

According to Leibniz's rule, the derivative of $V_c(t)$ can be obtained as

$$\dot{V}_c(t) = \gamma V_c(t) - e_d^T Q_T e_d - R_T u_f^2$$

and the tracking HJB equation can be written as

$$e_d^T Q_T e_d + R_T u_f^{*2} + \nabla V_c^{*T} \left( F(X) + G(X) u_f^{*} \right) - \gamma V_c^{*} = 0$$

Remark 4. The system dynamics $F(X)$ and $G(X)$ are unknown in this study, so the traditional linear quadratic regulator (LQR) method is not applicable to this problem, and the problem can neither be solved by dealing with the HJB equation directly.
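As a concrete reading of the performance function, the sketch below numerically accumulates the discounted quadratic cost along a sampled trajectory; truncating the infinite horizon at the end of the recorded data and the sampling step are illustrative assumptions.

import numpy as np

# Sketch: V_c(t) ~ sum_k exp(-gamma*k*dt) * (e_d^T Q_T e_d + R_T u_f^2) * dt,
# evaluated on sampled trajectories (horizon truncated at the data length).
def discounted_cost(e_d_traj, u_f_traj, Q_T, R_T, gamma=0.1, dt=0.01):
    cost = 0.0
    for k, (e_d, u_f) in enumerate(zip(e_d_traj, u_f_traj)):
        stage = e_d @ Q_T @ e_d + R_T * u_f**2   # quadratic stage cost
        cost += np.exp(-gamma * k * dt) * stage * dt
    return cost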

Optimal controller design
For systems with completely unknown dynamics, the HJB equation can hardly be solved directly. In this section, an optimal controller is proposed with the actor-critic architecture; the structure of the controller is shown in Figure 1. The Q-value approximation is employed to evaluate the performance function, and the optimal control policy is updated on-line by the deterministic learning technique. The feedback and feedforward parts of the control input are obtained simultaneously.

Critic network and Q-function approximation
The state vector is pre-processed by a normalization, considering the difference in scale between the desired trajectory $x_d$ and the tracking error $e_d$. The transformed state vector is denoted by $\bar{X}$. The value function $V_c(t)$ is the expectation of the performance function, which can be estimated on-line by the Q-value

$$V_c(t) = Q_c(\bar{X}, u_f) + e_v$$

where $e_v$ is the approximation error and the Q-function $Q_c(\bar{X}, u_f)$ is evaluated based on the actual control input $u_f$.
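A minimal sketch of this pre-processing step is given below, assuming a diagonal scaling chosen from the expected signal ranges; the paper only states that the normalization balances the scales of x_d and e_d, so the scaling entries here are hypothetical.

import numpy as np

# Sketch: normalize the augmented state X = [e_d; x_d] with a diagonal
# scaling D so that all elements have comparable amplitude.
D = np.diag([1.0, 1.0, 0.1, 0.1])   # hypothetical: x_d ~10x larger than e_d

def normalize_state(X):
    return D @ X                     # X_bar = D X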

Remark 5. During the on-line learning process, a probing noise is added on $u_f$ (see section ''Experiment results'' for more detail), so the difference $e_v$ between the value function and the Q-function is mainly caused by the probing noise.
Because the amplitude of the probing noise is relatively low, the approximation error $e_v$ can be kept bounded in a compact set according to Assumption 2.

The Q-function can be obtained by a linear approximation

$$Q_c(\bar{X}, u_f) = W_c^T \phi_c(\bar{X}, u_f)$$

where $W_c(t) \in \mathbb{R}^l$ are the weights for ideal approximation and $\phi_c$ is the basis function. The Bellman equation can be obtained from equation (10) in IRL form, which is written as

$$Q_c(\bar{X}(t), u_f(t)) = \int_t^{t+T} e^{-\gamma(\tau - t)} \left[ e_d^T Q_T e_d + R_T u_f^2 \right] d\tau + e^{-\gamma T} Q_c(\bar{X}(t+T), u_f(t+T))$$

According to equations (16) and (19), the tracking Bellman error can be written as

$$e_B(t) = \rho(t) + W_c^T \Delta\phi_c$$

where $\rho(t) = \int_t^{t+T} e^{-\gamma(\tau - t)} [e_d^T Q_T e_d + R_T u_f^2] \, d\tau$ is the integral reinforcement and $\Delta\phi_c = e^{-\gamma T} \phi_c(t+T) - \phi_c(t)$. According to equations (16) and (18), the tracking Bellman equation error $e_B(t)$ can also be expressed in terms of the approximation error $e_v$, so $e_B(t)$ is bounded according to equation (17). The Q-value is estimated by the critic network

$$\hat{Q}_c(\bar{X}, u_f) = \hat{W}_c^T \phi_c(\bar{X}, u_f)$$

According to equation (20), the Bellman error with respect to the weights of the critic network $\hat{W}_c$ can be written as

$$\hat{e}_B(t) = \rho(t) + \hat{W}_c^T \Delta\phi_c$$

In this study, the policy iteration method is employed to minimize the Bellman error. The objective function of the critic network can be written as

$$E_c = \frac{1}{2} \hat{e}_B^2(t)$$

The update rate of $\hat{W}_c$ can be written as

$$\dot{\hat{W}}_c = -\alpha_c \frac{\Delta\phi_c}{k_{nc}^2} \hat{e}_B(t) - u_c \hat{W}_c$$

where $u_c$ is the regularization coefficient and $k_{nc}$ is the normalization term, $k_{nc} = 1 + \Delta\phi_c^T \Delta\phi_c$. The normalization term is applied to limit the update rate of the network weights.
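For illustration, one critic update over a reinforcement interval can be sketched as follows, assuming the integral reinforcement rho has been accumulated over [t, t+T] and using Euler integration of the update law; the basis evaluation and all step sizes are assumptions.

import numpy as np

# Sketch of one IRL critic update over a reinforcement interval T.
# phi_t, phi_tT: critic basis evaluated at t and t+T; rho: discounted
# integral reinforcement accumulated over [t, t+T]. Illustrative only.
def critic_update(W_c_hat, phi_t, phi_tT, rho, gamma=0.1, T=0.1,
                  alpha_c=0.2, u_c=0.01, dt=0.01):
    d_phi = np.exp(-gamma * T) * phi_tT - phi_t    # basis difference
    e_B_hat = rho + W_c_hat @ d_phi                # estimated Bellman error
    k_nc = 1.0 + d_phi @ d_phi                     # normalization term
    W_c_dot = -alpha_c * d_phi / k_nc**2 * e_B_hat - u_c * W_c_hat
    return W_c_hat + W_c_dot * dt                  # Euler step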
Define the estimation error of the critic network as $\tilde{W}_c = W_c - \hat{W}_c$. According to equations (20) and (28), the Bellman error $e_B(t)$ can be written as $e_B(t) = \hat{e}_B(t) + \tilde{W}_c^T \Delta\phi_c$, so the critic neural network (NN) estimation error dynamics become

$$\dot{\tilde{W}}_c = -\alpha_c \frac{\Delta\phi_c \Delta\phi_c^T}{k_{nc}^2} \tilde{W}_c + \alpha_c \frac{\Delta\phi_c}{k_{nc}^2} e_B(t) + u_c \hat{W}_c$$

with $\Delta\phi_c$ as defined above.

Actor network and deterministic learning technique

The control policy is improved by the actor network, and the deterministic learning technique is applied to update the actor network weights on-line. The optimal control policy $p_f^{*}$ satisfies

$$p_f^{*}(\bar{X}) = \arg\min_{u_f} Q_c(\bar{X}, u_f)$$

The optimal control input $u_f^{*}$ can be expressed by a linear approximation

$$u_f^{*} = W_a^T \phi_a(\bar{X}) + e_u$$

where $e_u$ is the approximation error, which is bounded. $\hat{u}_f$ is the estimation of $u_f^{*}$ according to the actor network

$$\hat{u}_f = \hat{W}_a^T \phi_a(\bar{X})$$

The deterministic learning method is employed to update the weights $\hat{W}_a$ with the Q-value in the critic, 22 where $u_a$ is a regularization coefficient added to ensure convergence. The term $\partial Q_c(\bar{X}, u_f)/\partial u_f$ can be written as $\nabla_{u_f} \phi_c^T \hat{W}_c$, so the update rate $\dot{\hat{W}}_a$ can be obtained as

$$\dot{\hat{W}}_a = -\alpha_a \frac{\phi_a}{k_{na}^2} \left( \nabla_{u_f} \phi_c^T \hat{W}_c \right) - u_a \hat{W}_a$$

where $k_{na}$ is the normalization term, $k_{na} = 1 + \phi_a^T \phi_a$.

Remark 6. The initial weights of the actor network $\hat{W}_a(0)$ should constitute an admissible control policy which is able to stabilize the system.

The estimation error $\tilde{W}_a$ is defined as $\tilde{W}_a = W_a - \hat{W}_a$. Then, the estimation error dynamics of the weights $\tilde{W}_a$ follow directly from the update rate above.

Persistently exciting condition

According to equation (26), the convergence of the weights $\hat{W}_c$ requires the persistent excitation (PE) condition of $\Delta\phi_c$. For all $t \ge 0$, there exist $\mu_1 > 0$ and $\mu_2 > 0$ such that

$$\mu_1 I \le \int_t^{t+T} \Delta\phi_c(\tau) \Delta\phi_c^T(\tau) \, d\tau \le \mu_2 I$$

According to equation (38), the term $\nabla_{u_f} \phi_c^T \hat{W}_c$ denotes the gradient of the estimated Q-function with respect to the control input.
The vector $\phi_a$ should also satisfy the PE condition to ensure the convergence of the weights $\hat{W}_a$. For all $t \ge 0$, there exist $\mu_3 > 0$ and $\mu_4 > 0$ such that

$$\mu_3 I \le \int_t^{t+T} \phi_a(\tau) \phi_a^T(\tau) \, d\tau \le \mu_4 I$$

However, the PE condition can hardly be verified on-line. 23,24 So, in this study, a probing noise is added to the control input $u_f$, as sketched below.
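The sketch combines the deterministic-learning actor update with an additive probing noise used to promote excitation; the multi-sine noise form and all gains are assumptions for illustration.

import numpy as np

# Sketch: actor update descending the estimated Q-function gradient,
# plus a small probing noise on the applied control. Illustrative only.
def actor_update(W_a_hat, phi_a, grad_Qc_u, alpha_a=0.1, u_a=0.02, dt=0.01):
    k_na = 1.0 + phi_a @ phi_a                     # normalization term
    W_a_dot = -alpha_a * phi_a / k_na**2 * grad_Qc_u - u_a * W_a_hat
    return W_a_hat + W_a_dot * dt

def probing_noise(t, amp=0.05):
    # hypothetical multi-sine probing signal for persistent excitation
    return amp * (np.sin(7.0 * t) + np.sin(11.3 * t) + np.sin(19.7 * t))

# applied control during learning: u_f = W_a_hat @ phi_a + probing_noise(t)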

Stability analysis
In this section, the stability of the proposed method is proved in the Lyapunov sense.
The Lyapunov function is defined as

$$J(t) = V_c(t) + \frac{1}{2} \tilde{W}_c^T \tilde{W}_c + \frac{1}{2} \tilde{W}_a^T \tilde{W}_a$$

and its derivative is given by

$$\dot{J}(t) = \dot{V}_c(t) + \tilde{W}_c^T \dot{\tilde{W}}_c + \tilde{W}_a^T \dot{\tilde{W}}_a$$

According to equations (12) and (18), $\dot{V}_c(t)$ is expanded along the augmented dynamics. Noting that $u_f \in \mathbb{R}$, the term $u_f(t)^2$ in equation (46) is rewritten through the actor approximation, and the first term in equation (46) is bounded accordingly, which yields the bound on $\dot{V}_c(t)$. According to equation (30), the second term in equation (45) is evaluated with the critic estimation error dynamics. Using equation (41), the third term in equation (45) is evaluated with the actor estimation error dynamics. According to equations (23) and (38), $\dot{J}_2(t)$ is expressed in terms of $\Delta\phi_c$, and, by the basic inequality, the second term in equation (52) is bounded. Combining equations (49), (50), and (54), $\dot{J}(t)$ is bounded by a negative-definite quadratic form in the estimation errors plus residual terms with coefficients $k_1$ and $k_2$. According to the range of $e_B$ and the assumptions on $\phi_a$ and $\phi_c$, it can be concluded that $k_1$ and $k_2$ are bounded.
The terms $N_1$ and $N_2$ collect the corresponding coefficients, and the regularization coefficients $u_a$ and $u_c$ should be chosen so that $N_1$ and $N_2$ are larger than zero. Then, $\dot{J}(t)$ becomes negative provided that the estimation errors lie outside the compact set determined by $k_1$, $k_2$, $N_1$, and $N_2$, so the closed-loop signals are uniformly ultimately bounded.

Case study

In this section, the tracking control of a hydraulic loading system for hydraulic motors is taken as a case study. 25 The hydraulic loading system utilizes an energy regeneration technique to improve efficiency. [26][27][28] A photograph of the experimental setup is shown in Figure 2. Simulation and experiment results are given to verify the effectiveness of the proposed controller. The objective is to achieve high-accuracy pressure control, which can be defined as an OTCP.

OTCP of the hydraulic loading system
The simplified schematic of the hydraulic loading system with energy regeneration is shown in Figure 3. The hydraulic loading system is used to test the hydraulic motor (Rexroth A2FM63) mounted on the transmission shaft. The system is driven by a variable frequency induction motor (ABB QABP 355L2A). The variable displacement loading pump (Rexroth A6V2F63) regenerates the mechanical energy and adjusts the system pressure. Two flow meters (KRACHT VC12) and pressure sensors (KELLER PA-33X/600BAR) are mounted at the two outlets of the tested motor. A personal computer (PC) receives all the sensor signals and sends the control signal through an I/O card (ADVANTECH USB4716). The objective of the OTCP is to obtain the optimal displacement input of the loading pump so that the performance function (10) is minimized.

The dynamics of the loading pump can be simplified as a first-order system

$$T_m \dot{q}_m(t) + q_m(t) = u_m(t)$$

where $T_m$ is the time constant of the loading pump, $q_m$ is the output displacement of the loading pump neglecting the leakage flow, and $u_m$ (cm³/rev) is the displacement input of the loading pump.
The system pressure dynamics can be described as

$$\dot{p}_c(t) = \frac{\beta}{V} \left( n q_m - n q_c - Q_l \right)$$

where $p_c$ (MPa) is the test pressure, $V$ is the volume of the high-pressure chamber, $\beta$ is the bulk modulus, $n$ is the rotational speed, $q_c$ is the rated displacement of the tested motor, and $Q_l$ (L/min) is the leakage flow rate of the system. $Q_l$ can be linearized as

$$Q_l = k_l p_c$$

Define the system state and output as $x = [p_c, q_m]^T$ and $y = p_c$. According to equations (63) and (64), the hydraulic loading system can be described as a second-order linear system. The control input $u(t)$ is written as $u(t) = \hat{W}_a^T \phi_a(\bar{X}) + k_i s_e(t)$ with $\phi_a(\bar{X}) = [\bar{X}_1, \bar{X}_2, \bar{X}_3, \bar{X}_4]^T$. The controller is designed with $k_i = 1$, $k_{yd} = 10$, $Q_T = 10 I_2$ ($I_2$ is the $2 \times 2$ identity matrix), $R = 1$, $\gamma = 0.1$, $u_c = 0.01$, and $u_a = 0.02$. The sample time is $T_s = 0.01$ s and the reinforcement interval is $T = 0.1$ s. The learning rates are chosen as

$$\alpha_c = 0.2 e^{-0.05 t} \quad (76)$$
$$\alpha_a = 0.1 e^{-0.05 t} \quad (77)$$
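For reference, a minimal Euler simulation of the simplified plant model above is sketched here; every numerical parameter value is a placeholder, not an identified value of the test rig, and unit conversions are omitted.

# Sketch: Euler simulation of the simplified loading-system model
#   T_m * dq_m/dt + q_m = u_m                       (loading pump)
#   dp_c/dt = (beta/V) * (n*q_m - n*q_c - k_l*p_c)  (pressure dynamics)
# All parameter values are illustrative placeholders.
T_m, beta_V, n, q_c, k_l = 0.05, 8.5e5, 50.0, 63e-6, 1e-9
dt = 0.01                                           # sample time

def plant_step(p_c, q_m, u_m):
    q_m = q_m + (u_m - q_m) / T_m * dt              # first-order pump lag
    p_c = p_c + beta_V * (n * q_m - n * q_c - k_l * p_c) * dt
    return p_c, q_m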

Simulation results
Notice that the system dynamics $A$ and $B$ are unknown while designing the optimal controller with the proposed method. Figure 4 shows the control performance of the proposed controller while tracking a non-periodic signal, compared with a proportional-integral-derivative (PID) controller. 29,30 The feedback gain of the PID controller $K_{PD}$ is regulated by the LQR method, and the integral coefficient $K_I$ is the same as that of the proposed method. It can be seen that the tracking error of the proposed controller is remarkably reduced after the learning process. With the feedforward term in the output, the proposed controller outperforms the PID controller, as shown in Figure 4(b). Figure 5 shows the system state $\bar{X}$; it can be seen that the amplitudes of the four elements are similar after normalization.
The gradient $\nabla_{u_f} \hat{Q}_c(\bar{X}, u_f)$ is shown in Figure 6. When the pressure rises with overshoot, the gradient is positive, and the control input is expected to decrease to improve the control performance.
The convergence during the learning process is shown in Figure 7. The Bellman error keeps bounded and converges to zero gradually. The initial weights of the actor network are set to

$$\hat{W}_a(0) = [2, 0, 0, 0]^T$$

In Figure 7(c), it can be seen that the weights $\hat{W}_a$ finally converge to

$$\hat{W}_a = [1.905, 2.316, 0.081, 0.834]^T$$

The feedback and feedforward parts are learned simultaneously.

Experiment results

Figure 8 shows the experiment results of the proposed optimal controller while tracking a non-periodic signal. The performance of the proposed controller is again compared with the PID controller. It can be seen that the difference in tracking errors between the two controllers is relatively small at the beginning. After several seconds of learning, the tracking error of the proposed controller is remarkably reduced. Figure 9 shows the convergence of the proposed controller. It can be seen that the Bellman error keeps bounded under the experimental circumstances. The weights of the critic $\hat{W}_c$ finally converge to

$$\hat{W}_c = [0.255, -0.217, 0.007, -0.030, -0.094, 0.163, -0.020, -0.003, 0.084, 0.031, -0.001, -0.015, \ldots]^T$$

Thus, the convergence of $\hat{W}_a$ and $\hat{W}_c$ is also validated in the experiment.

Conclusion
In this article, a SISO continuous-time optimal tracking controller is proposed for linear systems with completely unknown dynamics. The proposed controller differs from conventional proportional-derivative-type optimal controllers in two aspects. First, the integral compensation and the feedforward part are introduced into the controller, so the control performance is improved. Second, reinforcement learning techniques are applied in the controller design, and the optimal control policy is obtained on-line without prior knowledge of the system dynamics. The Lyapunov stability and the convergence of the system have been proved. A case study on a hydraulic loading system with energy regeneration is given to validate the control performance. The simulation and experiment results have shown the effectiveness of the proposed controller.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research is supported by the National Natural Science Foundation of China (51475414 and 51875504), as well as the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (51821093).