A patch-based real-time six degrees of freedom object pose refinement method for robotic manipulation

A fundamental vision technique for industrial robots involves the six degrees of freedom pose estimation of target objects from a single image. However, the direct estimation of the six degrees of freedom object pose solely from a single image is subject to limited accuracy. Various refinement approaches have been proposed to improve the accuracy by utilizing rendered images from a 3D model. Nevertheless, balancing speed and accuracy in an industrial setting remains a challenge for these methods. In this study, we propose a novel six degrees of freedom pose refinement approach centered around matching real image patches. Unlike previous approaches, our method does not rely on a 3D model, resulting in increased speed and accuracy. In the offline phase, we construct an offline database using image patches obtained from real images. During the inference phase, our method initially identifies the image patch within the offline database that is closest to the initial pose. Subsequently, we refine the six degrees of freedom pose by matching the corresponding image patches from the offline database. Experimental results indicate that our six degrees of freedom pose refinement method achieves real-time capability with a frame rate of 71 Frames Per Second (FPS), along with high precision. When the threshold is set to 0.5% of the object diameter, the average distance of dots score on the test data surpasses 70%. Moreover, experiments involving gripping and assembling tasks on an industrial robot demonstrate the ability of our method to autonomously select appropriate grasping angles and positions in real time. It further generates suitable motion paths, ultimately ensuring production efficiency.


Introduction
Industrial robots are widely used in the electronics, manufacturing, logistics, chemical, and other industrial fields. They can perform grasping, transporting, welding, spraying, assembling, and other tasks. With the support of artificial intelligence technology, robots can achieve higher efficiency, lower energy consumption, and better environmental adaptability. To achieve high-quality development, the requirements for robot accuracy have greatly increased. For example, assembly accuracy has improved from the previous millimeter level to the submillimeter level, such as smart part assembly. On existing production lines, the grasping point of an industrial robot is fixed, and the workpiece must be placed in advance at a fixed position in a fixed pose, which cannot meet complex industrial production requirements and is inefficient. Using cameras as the primary sensor to perform high-precision assembly tasks in a semi-structured environment remains a serious challenge for robots.
High-precision pose estimation plays a crucial role in robot grasping and assembly, enabling precise manipulation of objects. Some six degrees of freedom (6 DOF) pose estimation methods 1 employ red, green, blue (RGB) 2,3 images to predict key points and use the perspective-n-point (PnP) algorithm 4 for object pose estimation. However, achieving high accuracy is challenging because the PnP algorithm is sensitive to small errors in the 2D representation. To enhance pose estimation accuracy, researchers have developed methods that refine the pose by analyzing the feature difference between the input image and an image rendered from a 3D model, such as the Deep Iterative Matching (DeepIM) 5 approach. These methods rely on high-quality 3D models to generate detailed rendered images. However, acquiring such models can be costly, and the rendering process generally requires significant computational resources.
These challenges significantly impact the speed, accuracy, and usefulness of pose estimation methods for accurate robotic manipulation. To address these limitations, our method improves pose estimation accuracy by leveraging stereovision techniques. Drawing inspiration from stereovision, we identify pixel-level precise feature point pairs through feature matching. Instead of relying solely on corners or key points on the object's surface, we use the scale-invariant feature transform (SIFT) 6 and the Fast Library for Approximate Nearest Neighbors (FLANN) 7 for feature point extraction and matching to generate 2D key points. We then use these 2D key points and their corresponding 3D points, computed from multiview images, in the PnP algorithm. This approach ensures high-precision 2D-3D correspondences and avoids introducing errors from the 3D model acquisition process. Our method refines the 2D key points used for the PnP algorithm with patch matching, which is more robust than point matching. Inspired by the Siamese neural network 8 and Siam-FC, 9 we designed a matching refinement network (M_R_Net) to refine 2D points using a template image patch. Rather than using rendered images derived from a 3D model, as in DeepIM, our approach uses real images from a database to generate template image patches. This allows us to circumvent 3D model acquisition and to exploit the realistic textures present in real images, thereby yielding improved matching accuracy. The database can easily be built by collecting multiview images. To improve the inference speed of our method, we designed a multiway-tree-based search method to efficiently access the database and generate template image patches.
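As an illustration of this offline step, the sketch below uses OpenCV's SIFT and FLANN implementations to match two calibrated views and triangulate the matches into 3D points. The projection matrices P1 and P2 and the Lowe ratio threshold are assumptions for the example, not the authors' exact settings.

```python
import cv2
import numpy as np

def build_3d_points(img1, img2, P1, P2):
    """Match SIFT features between two calibrated views (FLANN) and triangulate
    the matched pairs into 3D points in the frame that P1 and P2 are expressed in."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # FLANN with a KD-tree index, plus Lowe's ratio test to drop ambiguous matches
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    matches = flann.knnMatch(des1, des2, k=2)
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).T  # 2 x N
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).T  # 2 x N

    # Triangulate to homogeneous 3D points, then dehomogenize
    pts4d = cv2.triangulatePoints(P1, P2, pts1, pts2)
    return (pts4d[:3] / pts4d[3]).T  # N x 3 array of 3D points
```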
In the initial stage, our approach constructs an offline database of real image patches. During the inference phase, our method begins by identifying image patches in the offline database that closely correspond to the initial pose. Subsequently, we refine the 6 DOF pose by matching these image patches from the offline database against sub-windows extracted from the input image. The pose refinement process of our method is shown in Figure 1. Our approach operates in four stages: (a) according to the initial 6 DOF pose, the 3D points $p_i(x, y, z)$ and template image patches $B_{ij}$ are found in the database (more details are given in the section "Find template image patch by initial 6 DOF pose"); (b) the 3D points $p_i(x, y, z)$ are projected into the input image, and the target image patches centered on the projected 2D coordinates are extracted; (c) the target image patches and template image patches are passed to M_R_Net, which outputs the refined coordinates of the 2D key points (details are given in the section "Image patch matching"); (d) the PnP algorithm uses the 3D points $p_i(x, y, z)$ and their refined 2D image coordinates to calculate the refined 6 DOF pose. Thanks to our efficient search module and fully convolutional network, our method achieves high inference speed. M_R_Net exhibits robustness and pixel-level matching precision. Our experimental results validate the efficacy of the matching model in handling input disparities and maintaining high accuracy, even when the poses of the real images in the database are discrete and the perspective transformation is therefore uncertain.
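For readers who prefer code, the sketch below summarizes one inference pass under the assumption that the search module, patch extraction, and M_R_Net are available as callables; find_template_patches, extract_patch, and m_r_net are hypothetical stand-ins, and only the OpenCV projection and PnP calls are concrete.

```python
import cv2
import numpy as np

def refine_pose(image, T_init, K, find_template_patches, extract_patch, m_r_net):
    """Schematic refinement pass: (a) look up template patches for the initial pose,
    (b) project the 3D points and crop target patches, (c) match patches, (d) re-run PnP.
    The three callables are hypothetical stand-ins for the modules described in the text."""
    # (a) 3D points p_i and template patches B_ij closest to the initial pose
    pts3d, templates = find_template_patches(T_init)

    # (b) project p_i with the initial pose and crop target patches around the projections
    rvec_init, _ = cv2.Rodrigues(T_init[:3, :3])
    pts2d_init, _ = cv2.projectPoints(pts3d, rvec_init, T_init[:3, 3], K, None)
    targets = [extract_patch(image, uv) for uv in pts2d_init.reshape(-1, 2)]

    # (c) the matcher is assumed to return the refined image coordinate of each key point
    pts2d_refined = np.float32([m_r_net(x, z) for x, z in zip(targets, templates)])

    # (d) PnP on the refined 2D-3D correspondences gives the refined 6 DOF pose
    ok, rvec_ref, t_ref = cv2.solvePnP(np.float32(pts3d), pts2d_refined, K, None)
    R_ref, _ = cv2.Rodrigues(rvec_ref)
    return R_ref, t_ref
```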
More specifically, our contributions are as follows: (a) We propose a tiny fully convolutional Siamese neural network that is fast, accurate, and robust enough for image patch matching. (b) We propose a 6 DOF pose refinement method that does not require a 3D mesh model and is fast, accurate, versatile, and robust enough to be easily integrated with robotic systems in industrial environments, so that a robot can automatically select an appropriate grasping angle and position according to the pose of the incoming material in real time and generate a suitable motion path to avoid collisions and complete high-precision grasping and assembly tasks.
The organization of this paper is as follows: The second section reviews related work. The "Proposed method" section describes our method in detail, with the section "Find template image patch by initial 6 DOF pose" explaining how the template image patch is found from the initial 6 DOF pose. The "Image patch matching" section introduces the image patch matching phase, and the "6 DOF pose refinement" section discusses the optimization process. In the fourth section, we conduct accuracy tests on a dedicated data set to verify the real-time performance, accuracy, and robustness of our proposed monocular 6 DOF pose estimation method on a robot platform. Lastly, the fifth section summarizes the paper, its contributions, and future prospects.

Related work
Locating the 6 DOF pose of objects from RGB images alone is an important task for many robot applications. Recently, studies using deep learning have shown excellent performance in RGB-based 6 DOF pose estimation. At present, most RGB-based deep learning methods for 6 DOF pose estimation train a model to establish correspondences between 2D points on the image and 3D points on the object model and then use PnP to recover the 6 DOF pose of the object. They are mainly divided into two categories: (1) establishing 2D-3D correspondences for predefined key points 10,11 and (2) establishing correspondences between all pixels of the object on the image and the 3D points on the model. 12,13 BB8 14 estimates the corners of the 3D bounding box and runs the PnP algorithm. Inspired by the 2D object detector You Only Look Once (YOLOv2), 15 YOLO6D (a variation of the You Only Look Once algorithm designed for 6D object pose estimation) 16 trains an end-to-end neural network to directly predict the projected points of the 3D corners. CullNet 17 introduces a culling module to further improve performance over YOLO6D. Instead of regarding the corners as key points, PVNet (Pixel-wise Voting Network) 18 places key points on the object surface via the farthest point sampling algorithm and then regresses pixel-wise unit vectors pointing to the key points. However, since the PnP algorithm is sensitive to small errors in the 2D representation, obtaining high accuracy remains challenging.
6 DOF pose refinement using deep learning: To further improve the accuracy of 6 DOF pose estimation, some works further optimize the pose by training a pose refinement network. DeepIM trains a deep iterative matching neural network and drives the generation of rendered images through the pose difference output by the model. The deep point-based object detector 19 directly extracts the feature difference between the input image and the image rendered from the initial pose and then estimates the optimized pose. Stevšič and Hilliges 20 introduce a spatial attention mechanism into pose refinement to recognize and utilize spatial detail information. These methods require high-quality rendered images of a 3D model for image comparison. However, obtaining a high-quality 3D model is expensive, and generating richly detailed rendered images takes additional time.
Siamese neural network: The Siamese neural network measures the similarity between two inputs by sharing the weights of two network branches, which is of great significance in the field of image matching. The traditional Siamese neural network accepts two inputs of the same size and uses the distance between the two output vectors to represent the similarity of the inputs. Siam-FC combines the Siamese neural network with a fully convolutional neural network and improves the matching speed between image patches through parallel computation.

Proposed method
The overview of our method is depicted in Figure 1. In the offline phase, we use real images to build an offline database. In the inference phase, our method first identifies the image patches within the offline database that closely correspond to the initial pose, as outlined in Yang et al. 3 Subsequently, M_R_Net is utilized to refine the 6 DOF pose by aligning the image patches found in the offline database with the image patches extracted from the input image. The loss function used to train the network in this paper is calculated as follows, where $loss_{xy}$ represents the coordinate loss of the 3D bounding box. $l_{obj}$ is the existence parameter for the labeled 3D bounding box (label): if the current grid has a labeled 3D bounding box, $l_{obj}$ is 1; otherwise, it is 0. $x_j$ and $y_j$ represent the image coordinates of the nine key points of the predicted 3D bounding box, while $\hat{x}_j$ and $\hat{y}_j$ represent their corresponding ground truth values. $loss_{conf}$ is the confidence loss, where $a_{obj}$ and $a_{noobj}$ are the loss weights for positive and negative samples, respectively. The specific parameter settings are provided in the section "6 DOF pose refinement" of the experimental part. $c_i$ and $\hat{c}_i$ represent the predicted confidence and its ground truth for each grid. $loss_{cls}$ is the classification loss, where $CE(\cdot)$ denotes the cross-entropy loss. $l_i$ and $\hat{l}_i$ represent the predicted class probability vectors and their corresponding ground truth vectors for each grid, and $loss$ denotes the total loss.
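Based on the term definitions above, this YOLO-style loss can be written in the following general form (a reconstruction consistent with the description; the exact weighting and summation ranges used by the authors may differ):

$$
\begin{aligned}
loss_{xy} &= \sum_{i} l^{obj}_{i} \sum_{j=1}^{9} \Big[ (x_j - \hat{x}_j)^2 + (y_j - \hat{y}_j)^2 \Big] \\
loss_{conf} &= \sum_{i} \Big[ a_{obj}\, l^{obj}_{i}\, (c_i - \hat{c}_i)^2 + a_{noobj}\, (1 - l^{obj}_{i})\, (c_i - \hat{c}_i)^2 \Big] \\
loss_{cls} &= \sum_{i} l^{obj}_{i}\, CE(l_i, \hat{l}_i) \\
loss &= loss_{xy} + loss_{conf} + loss_{cls}
\end{aligned}
$$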

Find template image patch by initial 6 DOF pose
At the initial step, in the offline phase, we place QR codes [30] at known locations around the object to establish the object coordinate system, use $N$ real images collected under discrete viewpoints to build an offline database, and estimate the pose $T_j = [R\,|\,t]$ between camera and object for each image according to the QR codes. We use image pairs from the offline database to compute $m$ 3D points $p_i(x, y, z)$ in the object coordinate system via SIFT feature extraction and FLANN matching, as shown in Figure 2. Then we establish the correspondence between the Euler angles $q_j(\alpha, \beta, \gamma)_{z\text{-}y\text{-}x}$ of the rotation matrix in pose $T_j$ and $\{B_{ij}, p_i(x, y, z)\}_{1 \le i \le m}$:
$$\big\{\, q_j(\alpha, \beta, \gamma)_{z\text{-}y\text{-}x} \leftrightarrow B_{ij},\, p_i(x, y, z) \,\big\}$$
where $i$ and $j$ index the $m$ 3D points and the $N$ real images, respectively, and $B_{ij}$ is the template image patch from the real image, centered on the 2D projection of $p_i(x, y, z)$ under $T_j$. In the inference phase, for a given initial 6 DOF pose $T_{init}$ between camera and object, we need to efficiently find the Euler angle $q_{target}$ closest to $q_{init}$ in the discrete set $\{q_j(\alpha, \beta, \gamma)_{z\text{-}y\text{-}x}\}_{1 \le j \le N}$; therefore, we use a multiway tree structure to speed up the search. We divide the ranges $\alpha \in (-180, 180]$, $\beta \in (-90, 90]$, and $\gamma \in (-180, 180]$ into $r$, $s$, and $t$ intervals, respectively, and build a multiway tree, as shown in Figure 3. Our search process is divided into three stages: (a) obtain the Euler angles $q(\alpha, \beta, \gamma)_{z\text{-}y\text{-}x}$ of the initial 6 DOF pose; (b) compute $H(x) = (x + \Delta)/k$, with $x = \alpha, \beta, \gamma$, $k = 360/r, 180/s, 360/t$, and $\Delta = 180, 90, 180$, and select the branch of the multiway tree according to $H(x)$, which yields a candidate set of size $n$; (c) find the candidate among all possible outputs that minimizes $\|q - q_i\|$, $1 \le i \le n$. The final output contains $\{p_i(x, y, z)\}_{1 \le i \le m}$ and the template image patches $\{B_{ij}\}_{1 \le i \le m}$.
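To make the lookup concrete, here is a small Python sketch of the binned search: a dictionary keyed by the (α, β, γ) bin indices stands in for the multiway tree, and the stored payload (the 3D points and patch references for a view) is schematic rather than the authors' exact data structure.

```python
import math
from collections import defaultdict

def make_index(r=60, s=45, t=60):
    """Bin widths k and offsets D for (alpha, beta, gamma), per H(x) = (x + D) / k."""
    return {"k": (360.0 / r, 180.0 / s, 360.0 / t), "D": (180.0, 90.0, 180.0)}

def bin_key(euler, idx):
    """Map Euler angles (z-y-x order) to a branch of the multiway tree."""
    return tuple(int((x + D) // k) for x, k, D in zip(euler, idx["k"], idx["D"]))

def build_tree(entries, idx):
    """entries: iterable of (euler, payload), where payload holds {p_i, B_ij} for that view."""
    tree = defaultdict(list)
    for euler, payload in entries:
        tree[bin_key(euler, idx)].append((euler, payload))
    return tree

def lookup(tree, q_init, idx):
    """Return the payload whose Euler angles are closest to q_init within its branch."""
    candidates = tree.get(bin_key(q_init, idx), [])
    if not candidates:
        return None
    return min(candidates, key=lambda e: math.dist(e[0], q_init))[1]
```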

Image patch matching
At the second step, we use image patch matching to refine the initial 2D coordinates $u^{init}_i$ in the input image. $u^{init}_i$ is the 2D projection of the 3D point $p_i(x, y, z)$ according to $T_j = [R\,|\,t]$ using
$$\lambda\, u^{init}_i = K\, [R\,|\,t]\; p_i(x, y, z)$$
where $\lambda$ is a scale factor, $R$ and $t$ are the rotation matrix and translation vector that define the camera pose, and $K$ is the camera intrinsic matrix. 21 We designed a tiny fully convolutional image patch matching network named M_R_Net to find the refined position $u^{refined}_i$ of $u^{init}_i$ (shown in Figure 4). In the M_R_Net architecture, image patches $x$ and $z$ are passed to the network to obtain a matching score map; the Convolutional Neural Network (CNN) blocks in the $x$ and $z$ branches share weights, the outputs of the CNN blocks undergo a 2D convolution layer to obtain the final output, and the position with the highest score in the matching score map represents the best matching position of $z$ in $x$.
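A minimal sketch of this kind of fully convolutional Siamese matcher is given below in PyTorch; the layer sizes are illustrative and do not reproduce the exact M_R_Net architecture. The template embedding acts as a correlation kernel over the search embedding, and the arg-max of the score map gives the matching position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySiameseMatcher(nn.Module):
    """Shared-weight embedding branches plus cross-correlation score map (SiamFC-style)."""
    def __init__(self):
        super().__init__()
        # Illustrative backbone; the real M_R_Net layer sizes may differ.
        self.embed = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1),
        )

    def forward(self, x, z):
        # x: search patch (B, 3, M, M); z: template patch (B, 3, N, N), N <= M
        fx, fz = self.embed(x), self.embed(z)
        b, c, h, w = fx.shape
        # Correlate each template embedding with its own search embedding via a
        # grouped convolution, then sum over channels -> (B, 1, M-N+1, M-N+1) score map.
        score = F.conv2d(fx.reshape(1, b * c, h, w),
                         fz.reshape(b * c, 1, *fz.shape[-2:]),
                         groups=b * c)
        return score.reshape(b, c, *score.shape[-2:]).sum(dim=1, keepdim=True)

def best_match(score_map):
    """Row/column of the highest matching score, i.e. the offset of z within x."""
    width = score_map.shape[-1]
    flat = score_map.flatten(-2).argmax(-1)
    return torch.stack((torch.div(flat, width, rounding_mode="floor"), flat % width), dim=-1)
```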
Because each image in the offline database is collected under a specific pose, the distance $d_x$ between camera and object at inference time may be greater or less than the distance $d_z$ at which the offline database image was collected. Therefore, before being passed to M_R_Net, $z$ needs to be normalized; specifically, it is zoomed in or out by $d_x / d_z$. Passing $x$ and $z$ through M_R_Net yields the pixel coordinates of the matching position of $z$ on $x$, and we use this pixel coordinate as the refined position $u^{refined}_i$. Similar to correlation filter networks, 22 we use a training loss that measures the gap between the matching position of the template image patch $z$ on the target image patch $x$ and the ground truth. We train positive and negative pairs with a logistic loss, where $v$ is a value on the output matching score map $O$, $O$ is an $(M-N+1) \times (M-N+1) \times 1$ tensor in Figure 4, and $y \in \{+1, -1\}$ is the ground-truth label. We define the total loss as the mean of the individual losses over all matching positions. Elements of the score map are considered positive if they are within radius $R$ of the ground truth position $c$ on the map.
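Given these definitions, the per-position logistic loss and its mean over the score map take the standard SiamFC-style form, restated here for clarity (the exact normalization used by the authors may differ):

$$
\ell(y, v) = \log\!\big(1 + e^{-y\,v}\big), \qquad
L(y, O) = \frac{1}{|O|} \sum_{u \in O} \ell\big(y[u],\, v[u]\big), \qquad
y[u] = \begin{cases} +1 & \text{if } \lVert u - c \rVert \le R \\ -1 & \text{otherwise.} \end{cases}
$$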

6 DOF pose refinement
At the third step, we use the PnP algorithm to calculate the refined pose. The 3D points $p_i(x, y, z)$ in the object coordinate system are projected into the input image according to the initial 6 DOF pose $T_{init}$ to obtain $u^{init}_i$; because the initial pose is inaccurate, $u^{init}_i$ deviates from the ground truth position $u^{gt}_i$. We use the image patch matching described in the section "Image patch matching" to reduce this offset and obtain the more accurate position $u^{refined}_i$. Then we use the PnP algorithm to obtain the refined 6 DOF pose from the correspondences between the 2D and 3D points. The visualization of the pose refinement process is shown in Figure 5, where the orange, green, and blue 3D bounding boxes represent the initial, ground truth, and refined 6 DOF poses, respectively. When the refined box overlaps with the ground truth box, only the refined box is displayed. The orange, green, and blue image points represent $u^{init}_i$, $u^{gt}_i$, and $u^{refined}_i$, respectively. When the green and blue image points overlap, only the green image points are displayed. Given an initial pose $T_{init}$ (shown in the top row), the 3D points $p_i(x, y, z)$ and template image patches $B_{ij}$ can be obtained as described in the section "Find template image patch by initial 6 DOF pose." Image patch $x$ from the input image,

Implementation details
We captured images and built our data set using a color industrial camera (MER2-160-75GC-P) with a resolution of 1292 × 964 and a focal length of 6 mm. Our data set contains the six objects depicted in Figure 6. Around each object, we placed QR codes 23 to establish an object coordinate system. The camera was mounted on a robotic arm, and we collected 1700 images discretely and uniformly over the upper half of the camera's viewpoint space by setting robot motion trajectories. We estimated the camera pose for each image based on the QR codes and then rotated the plane to expand the data set to 170,000 images. For each object, we marked an initial 3D bounding box in the object coordinate system to generate ground truth for all the data.
Search module. We use all 170,000 images to build the offline database for each object and set r, s, and t to 60, 45, and 60, respectively, as described in the section "Find template image patch by initial 6 DOF pose." Matching module. We train a single patch matching model for all objects. There are 6000 image patch pairs in the training data set; each pair consists of image patches at two resolutions (64 × 64 and 32 × 32), and the ground truth matching position is the center of the image patch. The 64 × 64 and 32 × 32 image patches are fed into M_R_Net as x and z, respectively.
We train our model on a GeForce RTX Titan GPU with the stochastic gradient descent optimizer. We set the loss weights of positive and negative samples to 1 and 2, respectively. We train our model for 10 epochs with a learning rate of $1 \times 10^{-2}$ and a momentum of $9 \times 10^{-1}$. We set the batch size to 16 and the radius $R$ of the training loss to 1. In the training phase, we randomly adjust the hue, saturation, and brightness of z and x.

Evaluation metric
In our experiments, we use the average distance of dots (ADD) 24 as the evaluation metric. ADD computes the average distance between the 3D points in the object coordinate system transformed by the estimated 6 DOF pose and those transformed by the ground truth 6 DOF pose. Generally, if the average distance is less than a threshold, the estimated pose is considered correct. In our experiments, we set the thresholds to 0.1d, 0.01d, 0.005d, and 0.0025d, where d represents the diagonal length of the 3D bounding box. The symbol d serves as a scaling factor for comparing the average distance with the threshold and thereby determining whether the estimated pose is accurate, and it allows the threshold to be adjusted based on specific application requirements. By setting the threshold size, we can evaluate the accuracy of pose estimation for different objects and scenes; smaller distance values correspond to better estimation accuracy. For the Cuboid, Car, Cylinder, Beer, Crest, and Hitiger, the thresholds were set to 133.7 mm, 172.3 mm, 113.3 mm, 180.5 mm, 206.1 mm, and 210.3 mm, respectively.
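For reference, the ADD metric 24 is commonly written as follows, where $\mathcal{P}$ is the set of 3D points defined in the object coordinate system, $(R, t)$ is the ground truth pose, and $(\tilde{R}, \tilde{t})$ is the estimated pose (this is the standard formulation; the point set used here is the one described above):

$$
\mathrm{ADD} = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \big\lVert (R\,p + t) - (\tilde{R}\,p + \tilde{t}) \big\rVert_2
$$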

Pose refinement results
For training YOLO6D, CullNet, and PVNet, we randomly selected 10% of the data set described in the section "Implementation details." Our method adopts YOLO6D as the initial pose estimator. For evaluation, we selected 500 images per object as test data from the remaining 90% of the data set.
We calculate ADD on the test data, as shown in Table 1; the values in the table indicate the percentage of images for which the estimated pose is deemed correct within a specific threshold (smaller distance values correspond to better estimation accuracy). The results demonstrate that our method achieves high accuracy in pose estimation. In addition, increasing the image resolution directly enhances the accuracy of our method.
We change only the resolution of the image patches in our method and calculate the ADD on the test data; the results are shown in Table 2. The accuracy of our patch matching network depends mainly on image patch z, and a lower resolution of patch z means that fewer features are used for matching, resulting in lower accuracy.
We change the resolution of the input image and calculate the ADD on the test data; the results are shown in Table 3. We reduce the image resolution from 1292 × 964 to 640 × 480 through cubic spline interpolation while adjusting the camera's intrinsic matrix accordingly. Since our patch matching model always outputs pixel-level results, reducing the resolution of the input image reduces the accuracy of our method. However, the camera resolution of a robotic platform is usually no lower than 640 × 480 in industrial environments.
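A downscaling step of this kind, with the pinhole intrinsics rescaled to match, can be sketched as follows; cv2.INTER_CUBIC is used here as the interpolation method, which is an assumption standing in for the cubic interpolation described above.

```python
import cv2
import numpy as np

def downscale_with_intrinsics(image, K, new_size=(640, 480)):
    """Resize the image and rescale the intrinsics (fx, fy, cx, cy) by the same factors."""
    h, w = image.shape[:2]
    sx, sy = new_size[0] / w, new_size[1] / h
    resized = cv2.resize(image, new_size, interpolation=cv2.INTER_CUBIC)
    K_scaled = K.astype(np.float64).copy()
    K_scaled[0, 0] *= sx  # fx
    K_scaled[1, 1] *= sy  # fy
    K_scaled[0, 2] *= sx  # cx
    K_scaled[1, 2] *= sy  # cy
    return resized, K_scaled
```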
We vary the distance between the camera and the object, experimenting at three distances: 0.6 m, 0.8 m, and 1.0 m. For each distance, we collect the same number of images for database building, select the same number of images for testing, and use the same number of images to train the YOLO6D model. We calculate the ADD scores on the test data for each distance, as shown in Table 4, where the results for 0.6 m are taken from Table 1. The visualization of the results is shown in Figure 7. As the distance between the object and the camera increases, the pixel area of the object in the image decreases, and one pixel corresponds to a larger object region. The effect of the matching error of our patch matching network on the pose error is therefore amplified, and the accuracy of our method decreases. However, the distance between the object and the camera is usually not fixed in practical robotics applications, and we can move the object or the camera to obtain a suitable camera-object distance, reducing the abovementioned error and achieving high pose estimation accuracy.

Analysis of the matching module
The performance of our 6 DOF pose refinement method depends on the robustness of our image patch matching model. As shown in Figure 8(a), we fix the input image patch z and randomly select 200 locations (orange points) as centers to generate image patches x. We feed $(x_i, z)_{1 \le i \le 200}$ into our image patch matching model to calculate $\Delta P$, where $\Delta P$ is the average pixel distance between the blue matching position output by our matching model and the green ground truth. As shown in Figure 8(c), we select 100 image patches z with different ground truth 6 DOF poses for each fixed image patch x to compose image patch pairs $(x, z_i)_{1 \le i \le 100}$, group the $(x, z_i)$ according to the Euclidean distance $\Delta d$ between the Euler angles of the pose of $z_i$ and the pose of x, and calculate the average pixel distance $\Delta P$ between the blue matching position output by our matching model and the green ground truth for each group. Even as the 6 DOF pose gap between image patch x and image patch $z_i$ increases, our matching model still shows high performance. Even though the poses of the real images in our offline database are discrete, our matching module can still achieve pixel-level matching accuracy. As shown in Figure 8(b), only the green point is displayed when the blue point output by our matching model overlaps with the green ground truth matching position. We randomly change the hue, saturation, and value (HSV) of image patch x while fixing z, and the matching results show that our matching model is not sensitive to fluctuations of ambient light. As shown in Table 5, we compared our M_R_Net with the sum of squared distance and normalized cross correlation methods implemented in OpenCV. We use 1500 image patch pairs at (64 × 64, 32 × 32) resolution to calculate the matching accuracy of the three methods; for all image patch pairs, the patch center is the ground truth matching position. We scale, change HSV, rotate, and perspective-transform each image patch pair (x, z) and recompute the matching position, applying (0, 10%], (0, 50%], and (0, 50%] random disturbances of H, S, and V to both x and z, respectively, a [0.8, 1.2] random scale to z, a [-30 deg, 30 deg] random rotation to z, and a (0, 0.08) random perspective transform to z using the Python library imgaug.
A matching result is considered correct when the difference between the output matching position and the ground truth is within one pixel (inclusive), and the values in the table are the percentages of correct matching results. The experimental results show that our M_R_Net is more robust.
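The two classical baselines in Table 5 correspond to OpenCV's standard template matching modes; a minimal sketch of evaluating such a baseline is shown below. cv2.TM_SQDIFF is used for the sum of squared differences and cv2.TM_CCORR_NORMED for normalized cross correlation, and the one-pixel criterion is implemented with a Chebyshev distance; these are assumptions standing in for the authors' exact evaluation script.

```python
import cv2

def opencv_match(x, z, method=cv2.TM_CCORR_NORMED):
    """Slide template z over search patch x and return the center of the best window.
    For cv2.TM_SQDIFF the best match is the minimum of the response map; for the
    correlation-based modes it is the maximum."""
    response = cv2.matchTemplate(x, z, method)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(response)
    top_left = min_loc if method == cv2.TM_SQDIFF else max_loc
    h, w = z.shape[:2]
    return (top_left[0] + w // 2, top_left[1] + h // 2)

def matching_accuracy(pairs, gt_positions, method, tol=1):
    """Fraction of (x, z) pairs whose predicted position is within `tol` pixels of ground truth."""
    hits = 0
    for (x, z), gt in zip(pairs, gt_positions):
        pred = opencv_match(x, z, method)
        if max(abs(pred[0] - gt[0]), abs(pred[1] - gt[1])) <= tol:
            hits += 1
    return hits / len(pairs)
```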

Running time
Our experiments are run on a desktop with an Intel i7-9700KF 3.60 GHz CPU and an NVIDIA GeForce RTX GPU. The running time of our method depends on the search module and the matching module.

Search module. The time complexity of the search is O(k); that is, the search time grows linearly with the number of elements k in the search set, which characterizes the computational resources required by the search algorithm in the worst case. Here k is the average size of the subsets of the offline database determined by r, s, and t (shown in Figure 3).

Matching module.
The running time of the matching module depends on the batch size and the resolution of the input image patches. As shown in Table 6, we vary the batch size (Bs) and the resolution (R) of the input image patches to test our matching module. For each combination of batch size and resolution, we report the average time over 1000 inferences.
When considering both initial pose estimation and pose refinement, we set the resolution of the input image to 640 × 480, use a batch size of 8, and set the resolutions of x and z to 32 × 32 and 16 × 16, respectively. We then compare the FPS with YOLO6D and PVNet in Table 7.

Robotic evaluation
To evaluate the practicability of the proposed method, we apply it on a robotic manipulation platform and assess its accuracy. We set up an eye-to-hand platform comprising an AUBO I5 robot, a 1.0 mm diameter Phillips screwdriver, and an industrial RGB camera, shown on the left of Figure 9, and an eye-in-hand platform comprising an AUBO I5 robot, an industrial RGB camera, and a calibration board, shown on the right of Figure 9. Eye-to-hand. The evaluation process is shown in Figure 10. First, we calibrate the camera-robot system. Second, we select a 3D point A in the object coordinate system. Third, we use the QR code to calculate the pose of the object, move the tip of the robot's Phillips screwdriver to point A, and record the current robot end pose T1, as shown in the left column of Figure 10. Fourth, we use the proposed method to estimate the pose of the object, again move the tip of the Phillips screwdriver to point A, and record the current robot end pose T2, as shown in the middle and right columns of Figure 10. We use T1 and T2 to calculate the poses TP1 and TP2 of the Phillips screwdriver tip in the robot coordinate system, respectively, and then calculate the difference between the six components of TP1 and TP2, as shown in Table 8. These results verify the accuracy of the pose estimation method in this paper. The method obtains high pose estimation accuracy from a single RGB image alone, which has high practical value for low-cost, high-precision robot scenarios.
Table 6.
Running time (ms) of matching module for various batch sizes (Bs) and resolutions of image patches x and z: (32 × 32, 16 × 16), (48 × 48, 24 × 24), (64 × 64, 24 × 24), and (64 × 64, 32 × 32).

Eye-in-hand. Similar to fast directional chamfer matching (FDCM), 25 we evaluate the consistency of the pose estimates across multiple camera viewpoints. More specifically, we placed the object and the calibration board in a complex environment and kept the object, the calibration board, and the robot base coordinate system relatively fixed throughout the experiment. We then command the robot arm to move through several rotations and translations so that images can be collected under various camera viewpoints. For each collected image, we use the YOLO6D model to estimate the initial pose and our method to refine it; the visual results of the optimized poses are shown in Figure 11. Since the pose output by our method represents the pose between the camera and the object, we transform all the poses output by our method into the robot base coordinate system. Since the object is static, the estimated pose of the object in the robot base coordinate system should be identical irrespective of the camera viewpoint. For each target in Figure 6, we collect 20 images, transform the 20 poses into the robot base coordinate system, and for each pose calculate the deviation from the median of the 20 poses, as shown in Figure 12, where $\Delta r_x$, $\Delta r_y$, $\Delta r_z$, $\Delta t_x$, $\Delta t_y$, and $\Delta t_z$ represent the six components of the pose. For each subfigure, the horizontal axis represents the deviation from the median pose, and the vertical axis represents the proportion of poses whose deviations fall in a given interval, such as [0.0, 0.1) or [0.1, 0.2).
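The viewpoint-consistency check relies on chaining the hand-eye calibration with the estimated camera-object pose. A small sketch of that transform is given below, under the assumption that the hand-eye and forward-kinematics matrices are available as 4 × 4 homogeneous transforms.

```python
import numpy as np

def object_pose_in_base(T_cam_obj, T_gripper_cam, T_base_gripper):
    """Chain eye-in-hand calibration and forward kinematics:
    T_base_obj = T_base_gripper @ T_gripper_cam @ T_cam_obj.
    All inputs are 4x4 homogeneous transforms; for a static object the result
    should be (nearly) identical across camera viewpoints."""
    return T_base_gripper @ T_gripper_cam @ T_cam_obj

def deviations_from_median(translations):
    """Per-axis deviation of each estimated object position from the median position."""
    translations = np.asarray(translations)  # shape (num_views, 3)
    return translations - np.median(translations, axis=0)
```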
Nut and bolt assembly experiments. We use the proposed method to estimate the poses of the bolt and nut relative to the camera, as shown in the left column of Figure 13, and then use the robot's suction end effector to lift the nut. To eliminate the pose disturbance caused by the suction process, we use the proposed method to estimate the pose of the nut on the robot relative to the camera, as shown in the second column of Figure 13. Then, from the estimated pose between the nut and the camera and the robot's hand-eye calibration results, the relative pose between the nut and the robot end is obtained. At the same time, from the estimated pose between the bolt and the camera and the hand-eye calibration results, the pose of the nut required to complete the assembly relative to the robot base coordinate system is obtained, and finally the pose transformation of the robot end required to complete the assembly in the robot base coordinate system is derived. The robot end is then controlled to complete the assembly, as shown in the right column of Figure 13. To evaluate the assembly accuracy, we fix the relative pose between the bolt and the robot base coordinate system and change the relative pose between the nut and the bolt in each experiment. Each row in Figure 13 represents an assembly experiment. Because the bolt is fixed relative to the robot, in an ideal situation the pose of the assembled nut in the robot base coordinate system should always be the same. Due to pose estimation errors, the relative pose between the nut and the robot base frame differs slightly each time the assembly is completed. We calculate the pose between the nut and the robot base frame for 20 completed assemblies and plot the deviation histogram of each component with respect to the median value to evaluate the assembly accuracy, as shown in Table 9, where the meanings of $\Delta r_x$, $\Delta r_y$, $\Delta r_z$, $\Delta t_x$, $\Delta t_y$, and $\Delta t_z$ are the same as in Figure 12.
The experiments show that our method obtains high pose estimation accuracy; even in complex backgrounds, it still performs well, which makes it suitable for practical robotic applications.

Conclusion
In this paper, we propose a pose refinement method based on real image patch matching. Compared with methods that require a 3D model, [26][27][28] our method does not need 3D object mesh models and avoids the difficulty of generating real-world 3D models with high-quality textures. Meanwhile, thanks to our efficient template search method and the fully convolutional patch matching network M_R_Net, our method achieves real-time performance with high precision. In our experiments, our 6 DOF pose refinement method achieves high speed (71 FPS) and accuracy (an ADD score higher than 70% on the test data when the threshold is 0.5% of the object diameter). The grasping and assembly experiments on industrial robots show that the method can automatically select an appropriate grasping angle and position in real time and generate a suitable motion path to avoid collisions, thus ensuring production efficiency. However, in the process of image patch matching, some false positive matches occur due to the existence of very similar objects. In the future, we can further improve the system by adding a validation step that evaluates the quality and accuracy of the matched feature points, including assessing the geometric consistency of the matched features and using outlier exclusion techniques to eliminate mismatches.

Figure 1 .
Figure 1. The process of the algorithm: (a) find template image block, (b) project 3D points, (c) image block matching, and (d) compute refined pose.

Figure 6 .
Figure 6. Objects in our data set.

Figure 7 .
Figure 7. Pose refinement visualization on test data.

Figure 8 .
Figure 8. Robustness of image patch matching model.

Figure 10 .
Figure 10. The evaluation process visualization of eye-to-hand.

Figure 11 .
Figure 11. The pose estimation results visualization of eye-in-hand.

Figure 12 .
Figure 12. Histograms of deviations from the median pose.

Table 9 .
Statistical histograms of 6 DOF pose estimation deviations, per pose component ($\Delta r_x$, $\Delta r_y$, $\Delta r_z$, $\Delta t_x$, $\Delta t_y$, $\Delta t_z$): deviation (deg/mm) versus proportion.

Figure 13 .
Figure 13. The process of robot assembly.

Table 1 .
ADD scores for various thresholds on our test data.
ADD: average distance of dots.

Table 2 .
ADD scores for various resolutions of image patch.

Table 3 .
ADD scores for various resolutions of input image.
ADD: average distance of dots.

Table 4 .
ADD scores for various distance on the test data.
ADD: average distance of dots.

Table 5 .
Matching refinement network.

Table 7 .
Running time of pose estimation.

Table 8 .
Evaluation results of the three objects.