Dynamic visual servoing with Kalman filter-based depth and velocity estimator

Camera calibration error, vision latency, nonlinear dynamics, and other factors present major challenges for the design of control schemes for visual servoing systems. Although many approaches to visual servoing have been proposed, surprisingly few of them take system dynamics into account in the control design. In addition, the depth information of feature points is essential in the image-based visual servoing architecture. To cope with the aforementioned problems, this article proposes a Kalman filter-based depth and velocity estimator and a modified image-based dynamic visual servoing architecture that takes system dynamics into consideration in its control design. In particular, the Kalman filter is exploited to deal with the problems caused by vision latency and image noise so as to facilitate the estimation of the joint velocity of the robot using image information only. Moreover, in the modified architecture, the computed torque control scheme is used to compensate for system dynamics and the Kalman filter is used to provide accurate depth information of the feature points. Results of visual servoing experiments conducted on a two-degree-of-freedom planar robot verify the effectiveness of the proposed approach.


Introduction
As the computing power of CPUs continues to increase and computer technology keeps improving, the idea of visual servoing has enjoyed huge success in many applications since the debut of the renowned tutorial paper by Hutchinson et al. in 1996.1 In general, there are two basic visual servoing architectures: image-based visual servoing (IBVS) and position-based visual servoing (PBVS).1-5 Despite their many attractive features, the performance of visual servoing systems has been hindered by issues such as camera calibration error, nonlinear dynamics, and vision latency. Although many approaches to visual servoing have been proposed,6-12 only a few of them have taken system dynamics into account in the control design of a visual servoing system.6,7,11 For a robotic system involving highly nonlinear dynamics, control performance will not be satisfactory unless the nonlinear dynamics of the system is carefully dealt with. In the work by Corke and Good,6,7 the dynamics issue of a visual servoing system is investigated and the idea of feedforward control is exploited to cope with the vision latency problem. To ameliorate the poor dynamic response due to the low sampling rate of visual servoing applications, some researchers have exploited an acceleration command computed directly from image information.13,14 The image-based dynamic visual servoing (IBDVS)13 architecture is a modified version of the classical IBVS architecture. In the IBDVS architecture, the velocity loop of the robot controller adopts the computed torque control (CTC) scheme,15 while a conventional feedback-type velocity loop is adopted in the classical IBVS architecture. Since the CTC scheme contains a feedforward compensation term, it is not surprising that the IBDVS architecture yields better control performance than the classical IBVS architecture. A similar idea for IBDVS was also proposed by Keshmiri et al.14 However, the IBDVS architecture only provides the desired joint acceleration command for the CTC scheme; that is, the desired joint angle command and the desired joint velocity command are completely ignored.
In addition, the depth values of feature points are essential in calculating the image Jacobian when implementing the IBVS architecture. One of the easiest methods for estimating the depth values of feature points is to use a binocular camera together with the concept of disparity16 and/or epipolar constraints.17 However, this kind of approach lacks robustness and computational efficiency, since two image planes are involved in the calculation. Beyond disparity/epipolar constraint-based approaches, the nonlinear observer-based approach and the virtual visual servoing approach18 can also be employed to estimate the depth values of feature points.19,20 Generally, these two approaches provide good depth estimation results as long as image measurements are accurate and their noise levels are very low. However, in practice, image noise cannot be ignored; as such, the accuracy of depth estimation with these approaches may not be consistent.
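The binocular baseline idea mentioned above reduces, for a rectified stereo pair, to recovering depth directly from disparity. A minimal sketch with made-up numbers (not the paper's setup):

```python
def depth_from_disparity(f_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by a rectified stereo pair: z = f * b / d,
    where d = u_left - u_right is the disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return f_px * baseline_m / disparity_px

# e.g. an 800-pixel focal length, 10 cm baseline, and 20-pixel disparity give a 4 m depth
z = depth_from_disparity(800.0, 0.10, 20.0)
```

Because the depth is inversely proportional to the disparity, small pixel errors on distant points translate into large depth errors, which is one source of the inconsistency noted above.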
It is well known that the Kalman filter21-24 can deal with dynamic systems subject to noise and provides good predictions of system states. Consequently, to alleviate the effects of image noise and vision latency encountered in the depth estimation process when implementing the image Jacobian, this article proposes a Kalman filter-based depth and velocity estimator that exploits the concepts of virtual visual servoing and Kalman filtering. Furthermore, as mentioned previously, the original IBDVS architecture uses only the desired acceleration command when implementing the CTC scheme, which is not the standard way to implement CTC. Therefore, in this article, the desired joint velocity command and the desired joint angle command are used alongside the desired joint acceleration command when implementing the CTC scheme. The modified image-based dynamic visual servoing architecture is called MIBDVS in this article. Several experiments have been conducted on a two-degree-of-freedom (2-DOF) planar manipulator to assess the performance of the proposed Kalman filter-based depth and velocity estimator and the proposed MIBDVS architecture.
According to the above literature review and analysis, the main contributions of this article are summarized in the following.
1. By employing the Kalman filter to cope with image noise, the proposed Kalman filter-based depth and velocity estimator outperforms its counterpart that does not use the Kalman filter. In addition, the proposed Kalman filter-based approach can be employed to estimate the joint velocity of the robot using image information only.
2. By exploiting the desired joint angle command, the desired joint velocity command, and the desired joint acceleration command in the implementation of the CTC scheme, the proposed MIBDVS architecture exhibits better tracking performance than the classical IBVS architecture.
The remainder of the article is organized as follows. The second section briefly reviews the camera model and the IBVS architecture. The third section proposes the Kalman filter-based depth and velocity estimator that can be used to estimate object depth as well as joint velocity. The fourth section introduces the proposed modified image-based dynamic visual servoing architecture. Experimental results and conclusions are given in the fifth and sixth sections, respectively.

Brief review on camera model and camera parameters
Perspective projection (i.e. the pinhole model)25 is adopted in this article. In order not to have an inverted image, a virtual image plane located between the optical center cO and the object point cP = [cx, cy, cz]^T in the camera frame is used. Intrinsic camera parameters describe the relationship between the coordinates of the object point cP in the camera frame and the coordinates of its corresponding image point p(u, v) on the image plane. In practice, the width and height of a pixel may not be the same, so it is reasonable to assume that the focal length l_u for the horizontal axis (u-axis) and the focal length l_v for the vertical axis (v-axis) are different. In addition, due to imperfections in the manufacturing process, the angle between the horizontal axis and the vertical axis of a pixel is not exactly 90°; a skew factor d is commonly used to describe this phenomenon. Moreover, in order not to have negative pixel coordinates, the origin of the image plane is moved to the upper-left corner instead of the center. Based on the above discussion and perspective projection, one will have equation (9). Substituting equation (9) into equation (7) and rearranging terms will result in equation (10), which can be further expressed as equation (11).
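As a concrete sketch of the intrinsic model above, the following projects a camera-frame point cP = [cx, cy, cz]^T to pixel coordinates (u, v) using focal lengths l_u and l_v, a skew factor d, and an origin shifted to the upper-left corner. All numeric values are illustrative assumptions, not the paper's calibration:

```python
import numpy as np

# Hypothetical intrinsic parameters (illustrative only).
l_u, l_v = 800.0, 820.0      # focal lengths in pixels along the u- and v-axes
d = 0.0                      # skew factor (non-zero if pixel axes are not at 90 degrees)
u0, v0 = 640.0, 512.0        # principal point after shifting the origin to the upper-left corner

K = np.array([[l_u, d,   u0],
              [0.0, l_v, v0],
              [0.0, 0.0, 1.0]])

def project(P_c):
    """Perspective projection of cP = [cx, cy, cz]^T (camera frame) to (u, v)."""
    uvw = K @ np.asarray(P_c, dtype=float)
    return uvw[:2] / uvw[2]          # perspective division by the depth cz

u, v = project([0.1, -0.05, 1.35])   # a point 1.35 m in front of the camera
```

A point on the optical axis projects to the principal point (u0, v0), which is a quick sanity check on any calibration.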

Depth and velocity estimation based on Kalman filter and virtual visual servoing
The image Jacobian matrix described by equation (10) consists of five parameters: l_x, l_y, u, v, and cz. However, in practice, the depth value cz of an object point (i.e. feature point) is often not available. Common approaches to coping with this problem include designing a depth observer or using stereo cameras. In this article, a novel approach based on the concepts of virtual visual servoing18 and the Kalman filter21 is developed to estimate the depth value that is essential in calculating the image Jacobian matrix. The idea of virtual visual servoing proposed by Marchand and Chaumette18 was originally used in augmented reality applications. Since the virtual image must appear at the correct position in the real scene, the relationship between the camera frame and the real object is crucial; that is, the calibration accuracy of the extrinsic camera parameters is very important. The concept of virtual visual servoing13 is illustrated in Figure 2 and will be elaborated in the next subsection.

Pose and velocity estimation based on virtual visual servoing
In Figure 2, m* = [u*, v*]^T represents the image point on the image plane corresponding to the actual object point P_j, and m = [u, v]^T represents the image point corresponding to the virtual object point P_o before the extrinsic camera parameters are updated. If the motion of the camera is properly controlled so that P_o converges to P_j, one can expect that m will converge to m* as well; namely, odT_j can be found. Since the position of the point P_o is randomly initialized, cT_o is known. As a result, one can obtain the accurate extrinsic camera parameters cT_o odT_j as well as odT_j. The concept of IBVS is exploited to find odT_j. Similar to the derivation of equations (10) and (11), one can calculate the corresponding image Jacobian L, which describes the relationship between the virtual velocity screw cV_vir = [υ^T(t), ω^T(t)]^T of P_o and the time derivative of the image feature m, as in equation (12). The image feature error e_vir between m and m* is defined in equation (13). If the goal is to make e_vir converge exponentially, one can impose equation (14). Substituting equation (13) into equation (14) yields equation (15), and substituting equation (12) into equation (15) yields equation (16). From equation (16), one obtains equation (17). As shown in Figure 2, the rigid transformation odT_i consists of the translation increment dP_i and the rotation increment dq_i described by equations (18) and (19), respectively. Note that t_i − t_0 = iΔt, where Δt in equations (18) and (19) is the sampling time. Note also that dq_i is a rotation increment around a specific axis, and its corresponding rotation matrix dR_i can be obtained using the Rodrigues formula28 or, alternatively, the exponential matrix map described by equation (20). As illustrated in Figure 2, after the time duration t_i − t_0 has passed, the original image point m moves to the new image point m'. The original virtual object point P_o is updated to P_i by repeatedly applying equations (12) to (20).
Eventually, the image point will converge to m*, while the virtual object point will converge to P_j.
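The rotation increment dq_i above maps to its rotation matrix dR_i through the Rodrigues formula (equivalently, the matrix exponential of its skew-symmetric form). A generic sketch of that construction, not the article's own implementation:

```python
import numpy as np

def rotation_increment(dq):
    """Rotation matrix dR for an axis-angle increment dq (a 3-vector whose norm
    is the rotation angle and whose direction is the rotation axis)."""
    theta = np.linalg.norm(dq)
    if theta < 1e-12:
        return np.eye(3)                     # no rotation
    k = dq / theta                           # unit rotation axis
    Kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])      # skew-symmetric matrix [k]_x
    # Rodrigues formula: R = I + sin(theta) [k]_x + (1 - cos(theta)) [k]_x^2
    return np.eye(3) + np.sin(theta) * Kx + (1.0 - np.cos(theta)) * (Kx @ Kx)
```

For small increments this is numerically close to I + [dq]_x, which is why the per-sample updates of the virtual object point remain well behaved.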
One interesting application of virtual visual servoing is that it can be used to estimate the velocity of the actual object point P_j. The idea is to integrate the virtual velocity screw cV_vir within a fixed time interval Δt. By properly adjusting the value of the gain constant K in equation (17), it is possible to obtain a specific cV_vir so that the resulting rigid transformation will be very close to odT_j. That is, m will converge to m* and P_o will converge to P_j within one sampling time period. In this case, the virtual velocity screw at P_o will be very close to the velocity screw at P_j.
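The per-iteration update of the virtual object point can be sketched as follows. The 2 × 6 interaction matrix is written here in normalized image coordinates (the standard Chaumette/Hutchinson form for a point feature); the article's pixel-space Jacobian in equation (10) additionally involves the intrinsic parameters, so this is a simplified stand-in, and the depth Z is held fixed within each step:

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Standard 2x6 interaction matrix for a point feature (x, y) in normalized
    image coordinates at depth Z, relating the feature velocity to the
    velocity screw [v; w] of the camera/virtual point."""
    return np.array([
        [-1.0 / Z, 0.0,      x / Z, x * y,         -(1.0 + x * x),  y],
        [0.0,     -1.0 / Z,  y / Z, 1.0 + y * y,   -x * y,         -x],
    ])

def vvs_step(m, m_star, Z, K_gain=1.0):
    """One virtual visual servoing update: impose exponential decay on the
    feature error e_vir = m - m*, giving cV_vir = -K L^+ e_vir."""
    e_vir = np.asarray(m, dtype=float) - np.asarray(m_star, dtype=float)
    L = interaction_matrix(m[0], m[1], Z)
    return -K_gain * np.linalg.pinv(L) @ e_vir   # 6-vector [v; w]
```

Integrating the returned screw over one sampling period and repeating drives m toward m*, mirroring the convergence described above.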
To improve the depth estimation accuracy, the acceleration information of the virtual object point P_o in the camera frame can be taken into consideration.13 Detailed derivations are provided in the following.
Suppose that the virtual object point P_o undergoes a rigid body motion,29 as described by equation (21). To obtain the acceleration term, one can differentiate equation (21) with respect to time to get equation (22). Suppose that the sampling time Δt is very small. The velocity of the virtual object point P_o in the camera frame at time instant t_0 + Δt can then be approximated as equation (23). Substituting equations (21) and (22) into equation (23) gives equation (24), which can be rewritten as equation (25). After some manipulations, equation (25) can be further expressed as equation (26), which in matrix form becomes equation (27). Equation (27) describes the relationship between the velocity cṖ in the camera frame and the velocity screw. With the acceleration term taken into consideration, equations (18) and (19) can be rewritten as equations (28) and (29).

Depth and velocity estimation based on Kalman filter and virtual visual servoing
Considering that captured images often contain noise and that there are limitations on computational resources and camera sampling rate, this article proposes a depth and velocity estimator that combines the Kalman filter with the virtual visual servoing technique so as to reduce noise effects and improve estimation accuracy. Figure 3 shows the schematic diagram of the proposed depth and velocity estimator. The discrete-time state equation and output equation of a typical dynamic system can be expressed as equations (33) and (34), where X(k) is the state vector, U(k) is the input vector, and Y(k) is the output vector; ξ(k) is the process noise vector and η(k) is the measurement noise vector; and A_d, B_d, and C_d are constant matrices of proper dimensions. In this article, the process noise vector ξ(k) is assumed to be a zero vector. The position and velocity of the actual object point P_j in the camera frame are defined as the state variables X(k) in equation (33). In addition, the acceleration of the actual object point P_j in the camera frame is defined as the input U(k) (equations (33) and (35)). In the following, we determine the transformation matrix C_d between the system states X̂(k) and the measured output Y(k). The system states x(k), y(k), and z(k) can be estimated by using perspective projection, the current image point m*, the previously estimated depth value z(k−1), and the virtual visual servoing technique. The remaining three system states υ_x(k), υ_y(k), and υ_z(k) can be estimated using the virtual visual servoing technique. Since all the states can be either directly or indirectly estimated/measured, C_d is an identity matrix, as given in equation (36). The Kalman filter-based depth and velocity estimator is implemented using equations (33) to (37), where K(k) is the Kalman filter gain matrix and S(k) is the covariance matrix of the state estimate X̂(k). Equation (37) gives the state estimate; namely, the depth value and velocity can be estimated.
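The estimator's predict/update cycle can be sketched with a constant-acceleration model whose states are the position and velocity of P_j and whose input is its acceleration, with C_d the identity as stated above. The 60 Hz period matches the camera frame rate reported in the experiments, but the covariance values here are illustrative guesses:

```python
import numpy as np

dt = 1.0 / 60.0   # sampling period; 60 Hz matches the camera frame rate
I3 = np.eye(3)

# Constant-acceleration model: X = [x, y, z, vx, vy, vz]^T, input U = acceleration of P_j.
A_d = np.block([[I3, dt * I3],
                [np.zeros((3, 3)), I3]])
B_d = np.vstack([0.5 * dt**2 * I3, dt * I3])
C_d = np.eye(6)            # every state is directly or indirectly estimated/measured

Q = np.zeros((6, 6))       # process noise covariance: null matrix, as in the article
R = 1e-4 * np.eye(6)       # measurement noise covariance (a trial-and-error guess)

def kf_step(X, S, U, Y):
    """One predict/update cycle: X is the state estimate, S its covariance,
    U the acceleration input, Y the measurement from virtual visual servoing."""
    Xp = A_d @ X + B_d @ U                                   # predicted state
    Sp = A_d @ S @ A_d.T + Q                                 # predicted covariance
    Kk = Sp @ C_d.T @ np.linalg.inv(C_d @ Sp @ C_d.T + R)    # Kalman gain K(k)
    Xn = Xp + Kk @ (Y - C_d @ Xp)                            # corrected state estimate
    Sn = (np.eye(6) - Kk @ C_d) @ Sp                         # corrected covariance
    return Xn, Sn
```

The depth estimate is simply the z component of X, and the last three components give the velocity used for joint velocity estimation.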
In this article, the covariance matrix R for the measurement noise η(k) is determined in a trial-and-error manner, whereas the covariance matrix Q for the process noise ξ(k) in equation (37) is set to a null matrix. The proposed depth and velocity estimator that combines the Kalman filter with the virtual visual servoing technique is easy to implement. It is used in the proposed MIBDVS architecture investigated in the next section; in particular, it estimates the parameter values of the interaction matrix. It is worth noting that the virtual visual servoing technique exploits the idea of IBVS and therefore inherits the drawbacks of IBVS as well. For instance, if the straight line that passes through the real object point and the virtual object point is parallel to the optical axis, then their corresponding image points on the image plane will coincide. In this case, it is impossible to exploit the error between these two image points to estimate the position/velocity of the real object point. Nevertheless, the user can choose the initial position of the virtual object point so as to avoid such a case.

Dynamic model of a 2-DOF planar robot manipulator and CTC
The dynamic model of a 2-DOF planar robot manipulator can be described by equation (38), where τ is the 2 × 1 torque vector; M(q) and C(q, q̇) are the 2 × 2 inertia matrix and Coriolis matrix, respectively; F(q̇) is the 2 × 1 friction vector; and q is the 2 × 1 generalized coordinate vector (in this article, q is the 2 × 1 joint angle vector).
Unlike most classical visual servoing schemes, which only use a proportional-type feedback control law, both IBDVS and the proposed MIBDVS exploit the idea of CTC.15,30,31 In general, the CTC law τ_ctc can be expressed as equation (39), where M̂(q), Ĉ(q, q̇), and F̂(q̇) are the estimated inertia matrix, Coriolis matrix, and friction vector, respectively, obtained through system identification32,33; q_d is the 2 × 1 desired joint angle vector; e_q = q − q_d is the 2 × 1 joint angle error vector; K_D is the 2 × 2 constant diagonal derivative gain matrix; and K_P is the 2 × 2 constant diagonal proportional gain matrix. Suppose that the system identification results are perfect; that is, M̂(q) = M(q), Ĉ(q, q̇) = C(q, q̇), and F̂(q̇) = F(q̇). Letting τ in equation (38) be equal to τ_ctc described by equation (39) will yield equation (40). Since the inertia matrix M(q) is a nonsingular square matrix, multiplying both sides of equation (40) by the inverse of M(q) leads to

ë_q + K_D ė_q + K_P e_q = 0 (41)

One interesting observation is that the CTC method can yield satisfactory performance if the dynamic model obtained through system identification is accurate. However, if the identified dynamic model is not accurate, then the CTC method may result in poor control performance. Figure 4 shows a typical block diagram of CTC. Figure 5 illustrates the control block diagram of IBDVS. The IBDVS architecture incorporates a depth and velocity estimator, a second-order visual loop controller, and a robot control loop that uses the position feedback provided by the encoder. The IBDVS architecture is similar to the classical IBVS architecture: both use the image feature command for the visual loop. The difference is that in the IBDVS architecture, the velocity loop of the robot control architecture adopts the CTC scheme rather than a conventional feedback controller.
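A minimal sketch of the CTC law in equation (39), using e_q = q − q_d as defined above; the callable interface for the identified model terms M̂, Ĉ, and F̂ is an illustrative assumption, not the article's implementation:

```python
import numpy as np

def computed_torque(q, qd, q_des, qd_des, qdd_des, M_hat, C_hat, F_hat, K_D, K_P):
    """CTC law: tau = M_hat(q) (qdd_des - K_D e_dot - K_P e) + C_hat(q, qd) qd + F_hat(qd),
    with e = q - q_des.  With a perfect model, substituting this torque into the
    robot dynamics yields the linear error dynamics of equation (41)."""
    e = q - q_des
    e_dot = qd - qd_des
    v = qdd_des - K_D @ e_dot - K_P @ e     # stabilizing joint acceleration
    return M_hat(q) @ v + C_hat(q, qd) @ qd + F_hat(qd)
```

Because the model terms enter as feedforward compensation, any identification error in M̂, Ĉ, or F̂ shows up directly as unmodeled torque, which is the sensitivity noted above.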
However, as shown in Figure 5, the IBDVS architecture only provides the desired joint acceleration command q̈_d for the CTC scheme from the visual loop. That is, the desired joint angle command q_d and the desired joint velocity command q̇_d are completely ignored, which is not the standard way to implement CTC. Therefore, to cope with this problem, in this article, q_d, q̇_d, and q̈_d are all used in the CTC scheme. The modified visual servoing architecture is called MIBDVS in this article. Figure 6 shows the block diagram of the proposed MIBDVS architecture. Note that in the proposed MIBDVS architecture, the depth and velocity estimator is implemented based on the Kalman filter. In Figure 6, the depth and velocity estimator estimates the parameter values essential in the calculation of the interaction matrix L̂, and IDM(q, q̇) is the inverse dynamic model.13 Figure 7 shows the detailed computations of the commands used in the proposed MIBDVS architecture. As shown in Figure 7, using the Kalman filter-based depth and velocity estimator and the image feature m, the position feedback needed in the CTC scheme can be obtained.

Controller design of MIBDVS
The controller design of the MIBDVS architecture in Figure 6 is explicated in the following. The task function E is defined by equation (42). Suppose that the goal is to make the image feature error converge like a second-order system. As a result, one will have equation (43), where L_v and L_p are suitable user-designed gains. Substituting equation (42) into equation (43) will yield equation (44). The velocity command cV and acceleration command cV̇ are then obtained from equations (28) and (44).
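A minimal sketch of the second-order visual loop described above, under the assumption that the task function is the image feature error and that the commanded feature acceleration is mapped through a pseudo-inverse of the interaction matrix; the article's actual command equations may contain additional terms (such as the time derivative of L), so this is illustrative only:

```python
import numpy as np

def visual_loop_commands(e, e_dot, L_pinv, L_v, L_p):
    """Impose e_dd + L_v e_dot + L_p e = 0 on the image feature error and map
    the commanded feature acceleration through the pseudo-inverse of the
    interaction matrix to obtain a camera acceleration command."""
    s_dd = -(L_v @ e_dot + L_p @ e)   # commanded image feature acceleration
    return L_pinv @ s_dd
```

Choosing L_v and L_p as for a critically damped second-order system gives a non-oscillatory decay of the feature error.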

Image feature command generation and interpolation
In the experiment, the image feature command is generated through the so-called teach by showing method. During the "teach by showing" stage, the user holds and moves a fiducial marker to the goal position and the camera is used to record the entire moving trajectory of the fiducial marker.
In the "execution" stage, the recorded moving trajectory is adopted as the image feature command for the visual servoing scheme, and the selective compliance assembly robot arm (SCARA) robot is controlled to repeat (i.e. move along) the recorded trajectory. Note that in this article, the recorded moving trajectory is represented by a PH curve.34,35

Experimental setup and results
Figure 8 shows the experimental system, which consists of a 2-DOF SCARA robot (as shown in Figure 9), two eye-to-hand cameras (mounted on the ceiling as shown in Figure 10), a personal computer, and an Intelligent Motion Control Platform-2 card by the Industrial Technology Research Institute, Zhudong Township. Note that the two eye-to-hand cameras are used in the hand-eye calibration process36 (for later use in the joint velocity estimation experiment). When performing visual servoing, only the left eye-to-hand camera (denoted as "L" in Figure 10) is used. The two joints of the planar robot are actuated by two AC servomotors, and the motor drives are set to torque mode throughout the experiments. In particular, the "L" eye-to-hand camera, which is equipped with a lens of 16 mm focal length, has a maximum resolution of 1280 × 1024 pixels and a 60 Hz frame rate. In addition, the distance (measured by a ruler) between the "L" eye-to-hand camera and the 2-DOF SCARA robot is around 135 cm.

Experimental results of Kalman filter-based joint velocity estimation
In this experiment, the SCARA robot is controlled to perform a contour following motion. Three different approaches are used to estimate the joint velocity of the robot: the depth and velocity estimator without the Kalman filter, the proposed Kalman filter-based depth and velocity estimator, and the least-squares fit (LSF) method.37 In particular, the LSF method uses the encoder data of the servomotor installed at each joint to estimate the joint velocity, whereas the other two approaches use only the image information obtained by the camera. Since the resolution of the encoder data is much higher than that of the image data provided by the camera, the estimation accuracy of the LSF method is expected to be better than that of the other two approaches. Therefore, the estimation results of the LSF method are used as a reference to assess the estimation accuracy of the proposed Kalman filter-based depth and velocity estimator and of the depth and velocity estimator without the Kalman filter. Note that in this experiment, the object feature point is on the tip of the second link (i.e. the end-effector). Both image-based estimators can estimate the velocity of the end-effector in the camera frame using image information only. By exploiting the results of hand-eye calibration and the inverse robot Jacobian, one can convert the velocity of the end-effector in the camera frame into the joint velocity of the robot. According to the joint velocity estimation results shown in Figures 11 and 12, the estimation performance of the proposed Kalman filter-based depth and velocity estimator is clearly better than that of the estimator without the Kalman filter.
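One common form of the LSF velocity estimate fits a straight line to a short window of timestamped encoder samples and takes the slope as the velocity; the window length and this exact formulation are assumptions, since the article does not detail them:

```python
import numpy as np

def lsf_velocity(t, q):
    """Slope of the least-squares line q(t) ~ a + b t fitted over a short window
    of encoder samples; b is the velocity estimate."""
    t = np.asarray(t, dtype=float)
    q = np.asarray(q, dtype=float)
    A = np.vstack([np.ones_like(t), t]).T   # design matrix for a + b t
    (a, b), *_ = np.linalg.lstsq(A, q, rcond=None)
    return b
```

Fitting over a window rather than differencing consecutive samples averages out encoder quantization noise, which is why LSF serves as a clean velocity reference here.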

Experimental results of Kalman filter-based depth estimation
In this experiment, the SCARA robot is controlled to perform a contour following motion. Two approaches are tested: the proposed Kalman filter-based depth and velocity estimator and the depth and velocity estimator without the Kalman filter. Note that in this experiment, the depth of the feature point on the robot is estimated using image information only. In addition, the ground truth of the object depth, measured by a ruler, is around 135 cm. Results of the depth estimation experiment are shown in Figure 13. Clearly, the proposed Kalman filter-based depth and velocity estimator exhibits better depth estimation accuracy than the estimator without the Kalman filter.

Comparison of tracking performance between IBVS and MIBDVS
In this experiment, the SCARA robot is controlled to perform a contour following motion, and both the classical IBVS and the proposed MIBDVS are tested. Figure 14 shows the desired contour, Figure 15 shows the image command after interpolation, and Figure 16 shows the image velocity command. Tracking results on the image plane are shown in Figure 17, whereas Figure 18 shows the tracking errors of the image features. The performance comparison between the classical IBVS and the proposed MIBDVS is summarized in Table 1, where "RMS" denotes the root-mean-square value and "MAX" the maximum value. Based on Table 1, both the RMS and MAX values of the tracking error on the u-axis and v-axis are smaller for the proposed MIBDVS than for the classical IBVS. In addition to tracking error, contour error, an important indicator of contour following accuracy, is also compared. Again, both the RMS and MAX values of the contour error are smaller for the proposed MIBDVS than for the classical IBVS. These experimental results indicate that the proposed MIBDVS architecture outperforms the classical IBVS architecture in both tracking performance and contour following accuracy.

Conclusions
This article exploits the concept of virtual visual servoing and Kalman filter to develop a method for estimating the depth value that is essential in calculating the image Jacobian matrix used in IBVS architectures. In particular, the Kalman filter is employed to cope with image noise so as to improve the accuracy of depth estimation. In addition, the proposed Kalman filter-based approach is also employed to estimate the joint velocity of the robot using image information only. Moreover, to achieve better visual servoing performance, this article proposes the MIBDVS architecture that exploits the desired joint angle command, the desired joint velocity command, and the desired joint acceleration command in the implementation of the CTC scheme. Several experiments conducted on a 2-DOF planar manipulator are used to evaluate the performance of the proposed Kalman filter-based depth and velocity estimator and the proposed MIBDVS architecture. Experimental results indicate that the two proposed approaches outperform the ones based on the classical IBVS architecture. In this article, the inertia matrix, Coriolis matrix, and friction vector, which are essential in the implementation of the CTC scheme, are obtained through system identification. However, the accuracy of the identification results of these matrices/vectors greatly affects the effectiveness of the CTC scheme as well as that of the proposed MIBDVS architecture. Improving identification accuracy is one possible future direction. In addition, the sampling rate for the inner servo control loop is often more than 10 times that for the outer vision loop. This results in a major challenge for the control design of MIBDVS. How to ease this difficulty so as to facilitate the control design of MIBDVS is another possible research direction.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project is supported by the Ministry of Science and Technology, Taiwan, under MOST 105-2221-E-006-105-MY3.