Computationally efficient visual– inertial sensor fusion for Global Positioning System–denied navigation on a small quadrotor

Because of the complementary nature of visual and inertial sensors, the combination of both is able to provide fast and accurate 6 degree-of-freedom state estimation, which is the fundamental requirement for robotic (especially, unmanned aerial vehicle) navigation tasks in Global Positioning System–denied environments. This article presents a computationally efficient visual–inertial fusion algorithm, by separating orientation fusion from the position fusion process. The algorithm is designed to perform 6 degree-of-freedom state estimation, based on a gyroscope, an accelerometer and a monocular visual-based simultaneous localisation and mapping algorithm measurement. It also recovers the visual scale for the monocular visual-based simultaneous localisation and mapping. In particular, the fusion algorithm treats the orientation fusion and position fusion as two separate processes, where the orientation fusion is based on a very efficient gradient descent algorithm, whereas the position fusion is based on a 13-state linear Kalman filter. The elimination of the magnetometer sensor avoids the problem of magnetic distortion, which makes it a power-on-and-go system once the accelerometer is factory calibrated. The resulting algorithm shows a significant computational reduction over the conventional extended Kalman filter, with competitive accuracy. Moreover, the separation between orientation and position fusion processes enables the algorithm to be easily implemented onto two individual hardware elements and thus allows the two fusion processes to be executed concurrently.


Introduction
The combination of visual and inertial sensors has been shown to be viable, and the significant performance improvement over a single sensor system has attracted many researchers into the field after the success of SFly project, 1 which enabled the world's first autonomous unmanned aerial vehicle (UAV) in Global Positioning System-denied environments. 2n the past 5 years, many prominent research institutions began to develop advanced monocular visual-based simultaneous localisation and mapping (mSLAM) algorithms based on structure from motion (SFM) theory, [3][4][5][6][7][8][9][10][11] which are suitable to modern onboard embedded computers.][14][15][16][17][18][19][20] Almost all of the visual-inertial fusion algorithms, to our knowledge, rely on nonlinear Kalman filter (KF) techniques (extended KF, unscented KF, etc.) to process both the orientation and the position measurement in the same process, this results in a large state vector (usually more than 20 states) and a complex nonlinear system model.However, recent advances in computationally efficient inertial measurement unit (IMU) orientation estimation 21 show a competitive accuracy against Kalman-based algorithms by utilising optimisation-based methods.Thus, in this article, a computationally efficient visual-inertial fusion algorithm is proposed by separating the orientation and the position fusion processes, this maintains the same level of accuracy with nonlinear KF approach.The algorithm is designed to perform a 6 degree-of-freedom (DOF) state estimation, based on a gyroscope, an accelerometer and an mSLAM measurement.It also recovers the visual scale for the mSLAM.
The rest of this article is organised as follows: section 'Algorithm preliminaries' gives an overview of the visual-inertial fusion algorithm; section 'Orientation fusion process' presents the mathematical expression of the orientation filter; and section 'Position fusion process' presents the mathematical expression of the position filter.The implementation, test result and conclusion are shown in the last three sections.

Coordinate system
The coordinate system used is shown in Figure 1.All the coordinate frames are defined following the righthand rule.The earth frame fEg is fixed to the world, and z E Àaxis is defined to be parallel to gravity vector.The sensor frame fSg is shared by the gyroscope, the accelerometer and the camera, where x S Àaxis points to sensor front and z S Àaxis points to sensor top.The vision frame fV g is the world frame assumed in the mSLAM algorithm, in which the projection of x v Àaxis on the x E À y E plane is parallel to x E , and the orientation of z V Àaxis is arbitrary depending on how the mSLAM is initialised.Note that there are two assumptions under this coordinate system: first, the origin of fEg is assumed to be very close to the origin of fV g (here, we separate the two frames in Figure 1 for the sake of clearance); second, the x V Àaxis is assumed to not be perpendicular to the x E À y E plane in fEg.
The orientation of fSg with respect to fEg can be expressed as q ES 2 R 4 in quaternion or R ES 2 SO(3) in rotation matrix and the position as p ES 2 R 4 .Similarly, the orientation of fSg with respect to fV g can be expressed as q VS 2 R 4 or R VS 2 SO(3) and the position as p VS 2 R 4 .

Algorithm overview
As shown in Figure 2, the visual-inertial fusion algorithm assumes that rotation q VS , as well as the unscaled position p VS is provided by an mSLAM algorithm, which is treated as a black box.Moreover, it receives angular rates measurement v S 2 R 4 from gyroscope and acceleration measurement a S 2 R 4 from accelerometer.The output of the fusion process is to estimate rotation q ES and position p ES 2 R 3 of fSg in fEg.Furthermore, the position filter also estimates the linear velocity v ES 2 R 3 , linear acceleration a ES 2 R 3 and accelerometer bias b ES 2 R 3 , as well as the metric scale of the mSLAM position measurement l.
The fusion is separated into two fusion processes: orientation fusion process and position fusion process.The orientation fusion is based on very efficient gradient descent algorithm, 21 and position fusion is based on a 13-state linear KF.The following two sections will  present the mathematical expressions of the two algorithms, respectively.

Orientation fusion process
The orientation fusion algorithm is based on the gradient descent algorithm in quaternion representation.The origin of the algorithm comes from 21 where the detailed mathematical derivation and proof are presented.However, different from the original algorithm, the following fusion derivation eliminates the magnetometer sensor, while, instead, the rotation correction about gravity vector is compensated by fusing the vision (mSLAM) measurement.Therefore, it avoids the problem of magnetic field distortion, thus only factory calibration is required once for accelerometer.
Moreover, given that all the mSLAM orientation measurements tend to drift over time and distance due to the accumulated error, fusing the vision measurement in the same way as the magnetometer will result in the estimation error on gravity direction.This can be very sensitive for the quadrotor stabilisation and control.Thus, the following algorithm decouples vision measurement with the gravity direction estimation, while maintaining the effective fusion about the gravity vector.
In order to perform orientation estimation, as shown in Figure 3, three coordination frames are used: sensor frame fSg represents the orientation of all the coincide sensors (gyroscope, accelerometer and camera); earth frame fEg represents the reference frame of the inertial sensors (gyroscope and accelerometer) and vision frame fV g represents the reference frame of the mSLAM algorithm based on the camera.Additionally, g E is the unit vector representing the true gravity direction, which is also called gravity field vector in the rest of the article, and proj½x V is the unit projection vector of the x V Àaxis onto the x E À y E plane, which is also called vision field vector in the rest of the article.
The purpose of the orientation fusion is to estimate optimal quaternion transformation q ES from fEg to fSg so that (1) the z E Àaxis aligns with the gravity field and (2) the x v Àaxis aligns with the vision field.
The essential mathematical expression of one iteration at time t is shown as follows.Note that the orientation estimation from last iteration q ES, tÀ1 is assumed to be known, and the sampling period is denoted as Dt.

Orientation derivative estimated by gyroscope
The three-axis gyroscope measures the angular velocity (rate) in rads about the three axes of fSg frame, which we denote as v S = ½0, v x , v y , v z T .The quaternion derivative of the gyroscope estimation _ q v, t at time t can be computed by equation ( 1), given q ES, tÀ1 as the previous orientation estimation from all sensors where denotes a quaternion product.Note that all the quaternions in this article follow q = ½q w , q x , q y , q z T and they are all unit quaternions (q 2 w + q 2 x + q 2 y + q 2 z = 1).

Orientation optimisation from homogeneous field
In order to obtain the optimal orientation estimation q ES = ½q 0 , q 1 , q 2 , q 3 T , the field measured from the sensor has to be aligned with the predefined reference field as close as possible.Thus, this can be formularised as an optimisation problem, where, for any homogeneous field b E 2 R 4 in fEg, equation ( 2) is solved to minimise an objective error function where s S 2 R 4 is the field vector measured by the corresponding sensor in fSg.The gradient descent algorithm is one of the most computationally efficient optimisation algorithms to solve the above problem.Equation (3) describes it for n iterations, which starts from initial orientation estimation q ES, 0 to final estimation q ES, n + 1 where m is the a non-negative scalar, named as step-size, and the error direction Df (q ES, k , b E , s) is computed by the objective error function f and its Jacobian matrix Gravity field objective error function.Gravity field objective error function represents the error between gravity vector in principle and the sensed gravity acceleration vector, expressed in fSg.Thus, given the gravity field vector g E , the gravity field objective error function f g is defined as where q H ES is the conjugate of q ES , and a S = ½0, a x , a y , a z T is the normalised accelerometer measurement.Since g E = ½0, 0, 0, À 1 T , then it can be further simplified as equation ( 6) Therefore, its Jacobian matrix is Vision field objective error function.Vision field objective error function represents the difference between the vision field vector proj½x V and the x E Àaxis of fEg represented in fSg, thus the vision field objective error function f v is defined as equation ( 8) where x VS is proj½x V represented in fSg as shown in Figure 3.It is treated as a constant vector once been pre-computed by (equations ( 9)-( 11)) where x V 3 T is the unit x V Àaxis vector in fEg, and proj½x V is the normalised projection of x V onto the x E À y E plane measured by mSLAM algorithm, which can be computed by equation (10).Since x E = ½0, 1, 0, 0 T , it can then be simplified as where x VS = ½0, x VS1 , x VS2 , x VS3 .Thus, its Jacobian matrix is

Fusion of three sensors
As stated in the gradient descent algorithm, 21 given that the convergence rate of the estimated orientation is equal or greater than the angular rate of the physical orientation, only one iteration is required to be computed per sample time, Dt.Therefore, an unconventional gradient descent algorithm is derived to fuse all the three sensor measurements.To compute the orientation in next time stamp q ES, t + 1 , the process is summarised as where Df = Df k k can be assigned a physical meaning as the normalised direction of the error of _ q ES, t + 1 , and it can be expressed as _ q e, t + 1 Moreover, b is the only adjustable parameter of this filter.It represents the magnitude of the gyroscope measurement error, which is removed in the direction according to the accelerometer and vision sensor.The higher the b, the faster that the estimated orientation converge to the accelerometer estimation.The theoretically optimal value of b is where ṽmax is the maximum gyroscope measurement error for each axis.Moreover, since IMU and the vision sensor operate in different speed, depending on which sensor measurement is available, Df can then be expressed as where f g V (q ES , a S , x VS ) and J g V (q ES ) are the combined measurements of both field from the sensors, which can be expressed as

Gyroscope bias online estimation
Given the fact that the gyroscope bias drifts with temperature and motion, in practice, any high-accuracy fusion algorithm must be able to estimate the varying gyroscope bias online.Kalman-based methods generally cope with this by introducing the bias variables into the state vector.However, a much more computationally efficient estimation method is used by DC component of the angular error feedback, similar with Madgwick. 22he normalised direction of the error in the rate of change of orientation, _ q e , which is defined by equation ( 16), can be converted to the angular error v e in fSg frame by inverting equation ( 1).This yields The gyroscope bias, v b , can then be represented as the DC component of v e and thus can be removed from the gyroscope measurement, v S , as the integral of v e , weighted by a gain, z where v c is the corrected gyroscope reading, thus it can be used to replace the raw gyroscope reading, v s , in equation (1).
Here, the weighting factor, z, decides the convergence rate of the gyroscope bias estimation, where the higher the z, the faster and noisier the convergence.While the theoretical optimal value of z is defined as where _ ṽmax is the maximum rate of change of the gyroscope measurement error of each axis.

Position fusion process
This position fusion algorithm assumes that the orientation of the sensors is known and only estimates the translational state of the system.Being able to avoid estimating the orientation not only reduces the length of the state vector significantly but also keeps the system model almost linear.It takes three inputs: (1) the orientation estimation q ES in fEg from the result of the orientation fusion process; (2) the raw sensor acceleration measurement r a S 2 R 4 (m/s 2 ) from accelerometer and (3) the unscaled position p VS 2 R 4 and orientation q VS 2 R 4 in fV g from the mSLAM.It outputs its state vector, which contains position estimation p ES 2 R 3 , velocity estimation v ES 2 R 3 and acceleration estimation a ES 2 R 3 , in fEg, and accelerometer bias b S 2 R 3 , as well as the metric scale l.0 of the mSLAM position estimation, which is defined as p VS = lp VS .The position fusion algorithm is formed of a coordinate frame management process and a 13-state linear KF.The KF conducts in the earth frame fEg, thus all the sensor measurement values have to be converted to fEg in the coordinate frame management process.

Coordinate frame management process
Dynamic acceleration in earth frame.Different from the orientation fusion process, in the position fusion process, we consider that the accelerometer not only measures the gravity but also measures the pure dynamic acceleration caused by the movement of the body frame.And since the orientation estimation q ES obtained from orientation fusion process recovers the gravity vector g E = ½0, 0, 0, À 1 in fEg, therefore the dynamic acceleration s a ES 2 R 3 in fEg can be easily obtained by recalling the normalised accelerometer measurement a S = r a S =jj r a S jj.
Unscaled vision position in earth frame.The unscaled position estimation from mSLAM algorithm s p ES 2 R 3 in fEg can be obtained by Therefore, the resulting measurements in fEg frame ( s a ES and s p ES ) can be passed to the position KF as two individual sensor measurements, which will be described as follows.

Linear Kalman filter
The conventional KF framework consists of a prediction step, which performs the state vector time update in constant time interval; and a measurement update step, which performs the correction of the state vector based on the new sensor measurement.Here, in order to encounter the asynchronous measurements from both accelerometer and mSLAM algorithm, two different measurement update models are constructed and will be executed depending on which sensor measurement is available.
State representation and prediction model.The state of the KF is represented as a state vector x 2 R 13 where position estimation p ES , velocity estimation v ES and acceleration estimation a ES are in fEg, and accelerometer bias b S is in fSg, as well as the metric scale l.0 of the mSLAM position estimation, which is defined as p VS = lp VS .The state vector is updated once every time interval, following the rule defined by the prediction model, which defines the physics of the inertial system.It is summarised as The linear acceleration a ES , accelerometer bias b S and mSLAM metric scale factor l are modelled as Gaussian random walk.Thus, n a 2 R 3 , n b 2 R 3 and n l are independent zero-mean normal distribution Gaussian process noise where Q a = s 2 a I 3 and Q b = s 2 b I 3 are the process noise covariance of acceleration and accelerometer bias, respectively.s a , s b and s l are the standard deviations of n a , n b and n l , respectively, and I 3 is three-by-three identity matrix.
Following Welch and Bishop, 23 the discrete KF time update includes two steps: 1. State vector propagation to predict the state vector in next time step where A is the state transition matrix, which is the Jacobian matrix of partial derivatives of the prediction model with respect to x.
2. State covariance matrix P 2 R 13 3 13 propagation to predict the state noise in next time step where W is the Jacobian matrix of partial derivatives of the prediction model with respect to the noise vector, P is the state covariance matrix and Q = diag(0 6 3 6 , Q a , Q b , s 2 l ) is the process noise covariance matrix.

Measurement model. The measurement model is derived in the form of
where z H is the measurement from the mSLAM vision sensor or the IMU, H H is the measurement model matrix and e H denotes the measurement error from the sensor, where H can be as or vs depending on which sensor measurement is available between acceleration sensor measurement and vision sensor measurement.
Here, e H is also modelled as independent zero-mean normal distribution Gaussian process noise where R H is the measurement noise covariance of e H .When the accelerometer measurement s a ES is available, the KF performs accelerometer measurement update, to adjust state vector and state covariance matrix according to the accelerometer measurement model.Here, the accelerometer measurement model is defined by matrix H as 2 R 3 3 13 where R ES 2 SO(3) is the corresponding rotation matrix to q ES and the accelerometer measurement noise covariance R as = s 2 as I 3 , where s as is the standard deviations of e as .When the unscaled position measurement s p ES is available from mSLAM algorithm, the KF performs vision measurement update to adjust state vector and state covariance matrix according to the vision measurement model.Here, the vision measurement model is defined by matrix H vs 2 R 3 3 13 And the vision measurement noise covariance R vs = s 2 vs I 3 , where s vs is the standard deviations of e vs .Following Welch and Bishop, 23 the measurement update steps are summarised as follows: 2. State vector update 3. State covariance update Measurement update process handles different sampling rate between mSLAM and IMU estimation, by only updating state with the corresponding measurement, which becomes available.Thus, by assuming that the orientation fusion reaches steady state, the state vector x can be effectively estimated over time.

Implementation
This section describes the details on the algorithm implementation on an embedded platform.FreeIMU v0.4.3 (http://www.varesano.net/projects/hardware/FreeIMU) hardware was used as the IMU, which includes an MPU6050 gyroscope-accelerometer combo chip, an HMC5883 magnetometer and MS5611 highresolution pressure sensor.However, only the MPU6050 was used in the algorithm.We performed orientation estimation in Teensy 3.1 (https:// www.pjrc.com/teensy),which features an ARM Cortex-M4 processor with 96 MHz clock cycle.Besides, the real-time video frame was captured by an onboard uEye global shutter monocular camera in maximum 80 frames/s.Then, both the video frame and the orientation estimation are processed by the semidirect monocular visual odometry (SVO) mSLAM framework with the extended Kalman filter (EKF) position fusion algorithm operating in parallel on Odroid-U3 (http://www.hardkernel.com/)single boardembedded Linux computer, which features an 1.7 GHz Quad-Core processor with 2 GB RAM.The communication between software packages is realised by Robot Operating System (ROS) (http://www.ros.org).
The entire system was installed on a quadrotor platform, 24 as shown in Figure 4.The autopilot board including the FreeIMU, Teensy processor, servo controller and XBee radio is shown in Figure 4(a).Since the components are overlaid to minimise the size, Figure 5 shows the top and bottom layers of the disassembled autopilot board, indicating the position of the FreeIMU and Teensy processor.The uEye camera and Odroid-U3 computer is installed underneath the quadrotor as shown in Figure 4(b).
The Teensy processor is capable of executing the orientation fusion alongside with autopilot control algorithm at 300 Hz while communicating with Odroid-U3 computer with ROS protocol, including publishing orientation estimation and acceleration measurement at 100 Hz and subscribing the pose estimation from SVO mSLAM framework in Odroid-U3.Moreover, in the Odroid-U3 computer, the SVO mSLAM is executed at 40 FPS with the KF position fusion algorithm running at 100 Hz in parallel.
We measured the simultaneous localisation and mapping (SLAM) processing delay and set the message buffer manually without hard synchronisation.The parameters set-up for both orientation and KF position fusion algorithm is summarised in Table 1.b was left as default value, 0:5, by assuming ṽmax was approximately   0:58 rad=s.The value of s as was selected according to the accelerometer standard noise from the data sheet of the MPU6050 and s vs was set to be 0.005 in the map scale.s a , s b and s l were manually tuned through experiments.Here, we assumed accelerometer bias and visual scale change very slowly, thus s b and s l were set very small.s a determines the confidence level for the prediction model, which means the higher was the s a , the less confidence is the estimator about its prediction model, and thus the easier the sensor measurements affect the estimation.

Test results
Two test trials were performed to shown the effectiveness of the algorithm.
The first trial focuses on evaluating the scale factor (l) estimation from an arbitrary initial value.The trial was conducted in real time under general indoor condition and handholding the quadrotor with gentle moving within 0:3 3 0:3 3 0:3 m 3 space.Note that the position fusion assumes that the orientation fusion reaches the steady state before initialisation.The KF position fusion algorithm is initialised with the state vector x 0 = ½0 1 3 12 , 10 T ; note that we initialise the scale factor l to 10 as a arbitrary positive value to show how it converges to the true value.As shown in Figure 6, the initialisation occurs at 227 s and the record shows a 39-s trial, indicating how the true scale factor is recovered, and how the output position estimation relates to the raw input position measurement over the same period of time.It is clear that the scale factor l is   effectively discovered as 1:26, despite that its initial value is set to 10, as shown in Figure 6(a).And during the converging period, the position estimation output from KF position estimator is scaled accordingly with l over time.Since KF position estimator fuses the acceleration with SVO position measurement, the output position estimation performs better in dynamic operation.
The second trial evaluates the position and yaw angle estimation performance by comparing the estimation against the ground truth.The 6-DOF ground truth was provided by VICON (https://www.vicon.com)motion capturing system.Again, the movement was generated by handheld motion in about 2 3 2 3 2 m 3 space for 31-s duration.The estimated scale factor of the SVO is shown in Figure 7.The position estimation for each axis is shown in Figure 8, and the yaw estimation is shown in Figure 9.It is shown that the system has a good orientation and position tracking, with 0.1 radians orientation error and 0:1 m position error.However, the position error is believed to be mainly introduced by the small error in visual scale estimation, which could be the result of small timing error from manual synchronisation or slow converging rate close to the true value.Moreover, the latency shown in the two graphs are from the wireless communication from the two source systems.The three-dimensional (3D) position can be illustrated as shown in Figure 10.

Conclusion and future work
This article has shown the design and implementation work of a sensor fusion framework, which is capable of performing the 6-DOF sensor state estimation, by the fusion of a three-axis gyroscope, a three-axis accelerometer and an mSLAM algorithm.
Two test trials were carried out to show the effectiveness of the system.The first trial shows that scale factor of the mSLAM can be sufficiently estimated from an arbitrary value, thus the position output is scaled accordingly.Whereas in the second trial, the comparison against ground truth shows that the 3D position can be effectively estimated, and the drift-free yaw rotation can be estimated accurately without magnetometer.
The future work includes estimation of error comparison and computational performance evaluation by benchmarking with other existing fusion algorithms.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figure 3 .
Figure 3. Gravity field and vision field.

Figure 6 .
Figure 6.KF position fusion result: (a) scale factor l, (b) the unscaled position measurement from SVO SLAM and (c) the scaled position measurement from KF position estimator.

Figure 10 .
Figure 10.Three-dimensional trajectory illustration against ground truth comparison.

Table 1 .
Parameter set up.