Robotic object recognition and grasping with a natural background

In this article, a novel, efficient grasp synthesis method is introduced that can be used for closed-loop robotic grasping. Using only a single monocular camera, the proposed approach detects contour information from an image in real time and then determines the precise position of the object to be grasped by matching its contour with a given template. This approach is much lighter than the currently prevailing methods, especially vision-based deep-learning techniques, in that it requires no prior training. Through the use of state-of-the-art techniques for edge detection, superpixel segmentation, and shape matching, our visual servoing method does not rely on accurate camera calibration or position control and is able to adapt to dynamic environments. Experiments show that the approach provides high levels of compliance, performance, and robustness in diverse experimental environments.


Introduction
An aging population and rising labor costs are acute challenges facing society, resulting in high demand for indoor service robots. Service robots working in indoor environments, such as homes or offices, often need to handle a variety of grasping tasks that require the ability to recognize the target object against a complex or dynamic background. 1,2 Due to uncertain factors such as illumination, occlusion, and object posture, as well as the need for real-time response and the proper choice of gripping positions, it is difficult to design a lightweight recognition algorithm whose target objects can be redefined whenever necessary.
Research on robotic grasping has produced many different grasping methods. [3][4][5][6] Recently, deep-learning techniques have emerged as the preferred approach in the field of grasp synthesis. 7,8 These methods use various versions of convolutional neural networks (CNNs) to identify the objects to be grasped, 9,10 which means they demand large amounts of data, as well as time for training and testing; the approaches also require an expensive hardware environment. Moreover, the resulting models often suffer from overfitting and lack reasonable generalization ability and interpretability. Therefore, methods based on deep-learning technology are difficult to apply to indoor robotic grasp tasks with variable target objects, viewing angles, and a dynamic environment.
In this article, a novel, fast, and lightweight method is proposed for robotic object recognition and grasping tasks.
The method can extract the contour information of objects contained in an image using edge detection and superpixel segmentation techniques and calculate the similarity between the two contours with a shape descriptor technique to complete the object recognition. Then, using the relative distance between the object centroid and the gripper, the algorithm guides the robot to move the gripper to the object and form a proper grabbing posture to complete the grasping task.
When compared with the prevailing deep-learning methods, our approach has the following advantages. First, it can flexibly adapt to variable positions and postures of the target objects and to changes in the environment, because the recognition is based on the shape information of the objects, which is a stable, long-lasting, and essential feature. Second, since the object is identified by shape features, the method does not require a large number of training samples, which saves cumbersome manual labeling work and greatly lowers the requirements for computing hardware. Third, the method combines the object recognition module with the robot control module to form a hand-eye coordination mechanism with feedback, that is, a closed-loop control process. As a result, it is not necessary to calculate exact absolute coordinate values, only the relative positional offset between the object and the gripper, which greatly simplifies the conversion between multiple coordinate systems. This also improves the robot's adaptability to the environment and its response speed. Fourth, the method is highly interpretable: the human visual system recognizes objects mainly based on contour information, 11,12 so our method builds on the results of cognitive research.

Related work
Robotic grasping is a widely studied topic. Generally, grasping techniques can be grouped into two categories: analytic methods and empirical methods. Analytic methods 3,13 use mathematical and physical models of geometry, kinematics, and dynamics to calculate stable grasping strategies. However, such methods are not easily applied to real-world scenarios, since it is difficult to model the physical interaction between the gripper and the object. Empirical methods [14][15][16] avoid the computation of physical or mathematical models and instead mimic human grasping strategies. These techniques associate appropriate grasp points with objects by querying a database of object models or shape information organized by object type.
Recently, techniques based on deep learning have become popular. 8,10,[17][18][19] The strategies are similar: a number of grasp candidates are extracted from the image or point cloud, the algorithm ranks them with a CNN, and the candidate with the highest score is taken as the grasp to execute. Once the object is identified, the robot performs an open-loop grasp, which requires precise calibration between the camera and the gripper, as well as a completely static grasping environment. These methods require a large number of labeled samples for training and testing, which not only imposes high requirements on the hardware environment but also consumes considerable manpower and time, and the resulting models usually lack reasonable generalization ability. Moreover, CNNs often contain millions of parameters and rank grasp candidates with a sliding window at discrete intervals of offset and rotation, which results in processing times of up to tens of seconds. Finally, deep-learning methods often achieve only coarse positioning with bounding boxes, which is not sufficient for precise grasping tasks.
The approach proposed in this article identifies an object based on the shape information, can quickly complete the pixel-level recognition tasks, and does not require a pretraining process or expensive hardware environment. In addition, this method is highly adaptable to changes in the environment as well as the position and posture of the object because of the shape-based recognition algorithm. Instead of bounding boxes, the recognition result is the pixel-level outline of the object, which is more conducive to the following grasping tasks.

Shape-based object representation method
In earlier work by one of the authors, multiscale triangular centroid distance (MTCD) descriptors were proposed to represent shapes. 20,21,[22][23][24][25] MTCD descriptors are robust to translation, scale, rotation, and deformation. In addition, it is convenient and quick to calculate the difference between shapes represented by MTCD descriptors, so we use this method here to calculate the similarity between two contours.
Given a shape S, let P_i(x_i, y_i), i = 1, 2, ..., N, denote the sample points of its outer contour, where (x_i, y_i) are the coordinates of the sample points and N is the number of sample points (Figure 1). The centroid G(x_G, y_G) of S is calculated as

$$x_G = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_G = \frac{1}{N}\sum_{i=1}^{N} y_i$$

Given a certain point P_i(x_i, y_i), any other point P_j(x_j, y_j), (i ≠ j), of S forms a triangle ΔP_iP_jG together with P_i and the centroid point G. As shown in Figure 2, the coordinates of the centroid G_ij(x_{G_ij}, y_{G_ij}) of ΔP_iP_jG are calculated as

$$x_{G_{ij}} = \frac{x_i + x_j + x_G}{3}, \qquad y_{G_{ij}} = \frac{y_i + y_j + y_G}{3}$$

By choosing P_j at T different contour offsets, T triangles can be obtained for each sample point P_i, where T is the scale number. The distances from the triangle centroids G_ij to the shape centroid G form a T × N matrix M, whose row r_t collects the centroid distances of all N sample points at scale t. Next, to obtain invariance to the starting point of our shape descriptor, the Fourier transform is applied to each row of M and the phase information is discarded. Letting r_t denote each row of M, the discrete Fourier transform of r_t is calculated as

$$F_t(k) = \sum_{i=1}^{N} r_t(i)\, e^{-j 2\pi k i / N}, \qquad k = 1, 2, \ldots, N$$

It is not difficult to prove that |F_t(k)| is invariant to the starting point of the contour of S. Furthermore, to improve the efficiency and effectiveness of the subsequent shape matching, the dimensionality of M is reduced from N to Q, where Q ≪ N, by keeping only the first Q magnitude coefficients of each row. Thus, the final definition of our shape descriptor is

$$D_S = \{\, |F_t(k)| : t = 1, \ldots, T;\; k = 1, \ldots, Q \,\}$$

We set Q = 16 in all our experiments (see Figure 3).
Given two shapes S_1 and S_2, whose shape descriptors are D_{S_1} = { |F^{S_1}_t(k)| } and D_{S_2} = { |F^{S_2}_t(k)| }, t = 1, ..., T, k = 1, ..., Q, respectively, the measure of dissimilarity between S_1 and S_2 is obtained as the L_1 distance

$$D(S_1, S_2) = \sum_{t=1}^{T} \sum_{k=1}^{Q} \Big|\, |F^{S_1}_t(k)| - |F^{S_2}_t(k)| \,\Big|$$

The smaller the dissimilarity D(S_1, S_2) is, the more similar the two shapes are.

Elimination of redundant information in images
To obtain the line information in the image, we referred to Dollár and Zitnick's work 26 on edge detection and the work of Radhakrishna et al. 27 on superpixel segmentation. In one study, 26 the use of the structured forest technique for edge detection achieved good results.
However, since real images often contain a lot of noise, the results of edge detection involve much redundant information for object recognition tasks. On the other hand, superpixel segmentation, which groups pixels into perceptually meaningful atomic regions, can effectively eliminate the effects of noise in the image.
The superpixel segmentation algorithm we apply is easy to understand, and it requires only one parameter provided by users, that is, k, which is the desired number of the superpixels.
Given a color image in Commission Internationale de l'Eclairage Lab (CIELAB) color space with N pixels, the algorithm first samples k initial cluster centers on a regular grid with interval S = \sqrt{N/k}. To avoid centering a superpixel on an edge and seeding a superpixel with a noisy pixel, the centers are moved to the lowest-gradient positions in a 3 × 3 neighborhood. Next, we traverse all the centers and associate each pixel with the nearest cluster center whose search region overlaps its position. The algorithm searches a limited region, the size of which is set to 2S × 2S in our experiments. Given a pixel P_i = [l_i, a_i, b_i, x_i, y_i]^T and a cluster center C_j = [l_j, a_j, b_j, x_j, y_j]^T, the color distance d_c and the spatial distance d_s between them are

$$d_c = \sqrt{(l_i - l_j)^2 + (a_i - a_j)^2 + (b_i - b_j)^2}, \qquad d_s = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

Then, we normalize the spatial proximity and the color proximity by their respective maximum distances within a cluster, N_S and N_C, to combine d_c and d_s into a single measure D of the distance between a pixel and a cluster center:

$$D = \sqrt{\left(\frac{d_c}{N_C}\right)^2 + \left(\frac{d_s}{N_S}\right)^2}$$

The maximum spatial distance N_S is set to the grid interval S, and the maximum color distance N_C, which varies from cluster to cluster, is fixed to a constant m, which we set to 10 in our experiments; m weighs the relative importance between spatial proximity and color similarity. The cluster centers are then adjusted to the mean vector v = [l, a, b, x, y]^T of all the pixels within the cluster. A residual error E is calculated between the new cluster center and the previous one using the L_2 norm. The clustering and adjusting steps are repeated iteratively until E converges; we found that 10 iterations perform well enough in all our experiments. Finally, we traverse all the pixels and assign the disjoint pixels to their closest superpixels to enforce connectivity. The segmentation result is shown in Figure 5.
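The combined distance measure D can be written compactly. The sketch below is a hedged reading of the formula above, in which the spatial normalizer N_S equals the grid interval S and the color normalizer N_C is the fixed constant m = 10:

```python
import numpy as np

def slic_distance(pixel, center, S, m=10.0):
    """Combined SLIC-style distance between a pixel and a cluster center.

    pixel, center: [l, a, b, x, y] vectors; S: grid interval (spatial
    normalizer N_S); m: fixed color normalizer N_C, weighing color vs. space.
    """
    p, c = np.asarray(pixel, float), np.asarray(center, float)
    d_c = np.linalg.norm(p[:3] - c[:3])   # CIELAB color distance
    d_s = np.linalg.norm(p[3:] - c[3:])   # spatial distance
    return np.hypot(d_c / m, d_s / S)
```

A smaller m lets color similarity dominate (superpixels hug image edges), while a larger m emphasizes spatial compactness.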
However, superpixel segmentation also produces a considerable amount of edge information along the superpixel block boundaries, which is likewise redundant for object recognition. The approach proposed in this article combines the results of edge detection, shown in Figure 4, and superpixel segmentation to extract the true contour information of the objects contained in the image. Figure 6 shows this algorithm in action. Assume that the input image is I, of size c × r. I_ed is obtained by performing edge detection on I, and I_sp is obtained by performing superpixel segmentation on I, where both I_ed and I_sp are grayscale images whose gray values lie in [0, 1]. The larger the gray value of a pixel, the greater the probability that it belongs to an edge. For a pixel p in I_ed with coordinate (x, y), (1 ≤ x ≤ c, 1 ≤ y ≤ r), and gray value g, (0 ≤ g ≤ 1), let p' be the corresponding pixel in I_sp with the same coordinate (x, y) and gray value g'. If (g + g')/2 > t, where t is a set threshold, the pixel at coordinate (x, y) is considered likely to lie on the actual contour of some object. Thus, a map I_C is obtained that contains the possible contour information of the objects in the original image I, where I_C is a binary map in which a pixel value of 1 indicates a contour pixel and a pixel value of 0 indicates a background pixel.
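The fusion rule is simple to implement. The following sketch (with a hypothetical threshold t = 0.5, as the article does not state its value) averages the two response maps and binarizes the result:

```python
import numpy as np

def fuse_contours(I_ed, I_sp, t=0.5):
    """Fuse an edge map and a superpixel boundary map into a binary contour map.

    I_ed, I_sp: grayscale response maps in [0, 1] of identical size. A pixel
    is kept as a contour pixel when the mean of the two responses exceeds t.
    """
    assert I_ed.shape == I_sp.shape
    return ((I_ed + I_sp) / 2.0 > t).astype(np.uint8)
```

A pixel survives only if it is supported by both cues to some degree, which suppresses edge-detector noise as well as spurious superpixel boundaries.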
Next, we refine the contour lines in I_C: the lines are thinned to a width of one pixel, and the branch points, that is, pixels with more than two adjacent contour pixels, are taken as line end points to split the map into the contour line information C, which is a set of lines, where each line is a set of pixel coordinate values.
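On the thinned map, branch-point detection reduces to counting 8-connected neighbors. The sketch below flags skeleton pixels with more than two neighbors; note that with 8-connectivity, pixels diagonally adjacent to a junction may also be flagged, which is harmless for splitting lines at junctions.

```python
import numpy as np

def branch_points(skel):
    """Find branch points of a one-pixel-wide binary contour map.

    skel: binary array; a skeleton pixel is a branch point when it has more
    than two 8-connected skeleton neighbors.
    """
    s = np.pad(skel.astype(np.uint8), 1)
    # sum of the eight neighbors for every pixel (pad ring keeps borders correct)
    nbrs = sum(np.roll(np.roll(s, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0))[1:-1, 1:-1]
    return skel.astype(bool) & (nbrs > 2)
```

Cutting the skeleton at the flagged pixels leaves simple open curves, each of which becomes one line in the set C.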

Extraction of contour template
To obtain the shape contour of a given object, we photograph the object against a relatively monotonous background and remove the redundant information of the captured image using the method above. As shown in Figure 8, the image of animal models against a white background captured by a camera is processed by edge detection and superpixel segmentation, respectively. Then, the remaining pixels are filtered and the lines are refined so that the shape contours of the models are obtained. The shape contour of a given object obtained in this way can be used as a contour template in the following process, as shown in Figure 7.

Line segments combination based on heuristic search
Heuristic search, also known as informed search, reduces the search scope and complexity of the problem to be solved by referring to heuristic information. The objective of heuristic search is to produce, in a reasonable time frame, a solution that is good enough for the problem at hand. Heuristic search can avoid combinatorial explosion by guiding the search in the most promising direction using heuristic information; the stronger the heuristic information, the fewer the search branches. The function used to evaluate the importance of search nodes is called the valuation function, which generally takes the form

$$f(x) = g(x) + h(x)$$

where g(x) denotes the actual cost from the initial node to node x, and h(x) denotes the estimated cost of the optimal path from node x to the target node. The heuristic information is mainly contained in h(x) and is determined according to the characteristics of the problem.
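A minimal generic form of such a search, ordered by the valuation function f(x) = g(x) + h(x), can be sketched as follows. This is a textbook best-first (A*-style) skeleton, not the grasping-specific search of this article; `neighbors` and `h` are supplied by the problem.

```python
import heapq

def best_first_search(start, goal, neighbors, h):
    """Generic heuristic search ordered by f(x) = g(x) + h(x).

    neighbors(x) yields (successor, step_cost); h(x) estimates the remaining
    cost from x to the goal. Returns the cost of the path found, or None.
    """
    open_set = [(h(start), 0, start)]        # (f, g, node)
    best_g = {start: 0}
    while open_set:
        f, g, x = heapq.heappop(open_set)
        if x == goal:
            return g
        if g > best_g.get(x, float("inf")):
            continue                          # stale queue entry
        for y, cost in neighbors(x):
            gy = g + cost
            if gy < best_g.get(y, float("inf")):
                best_g[y] = gy
                heapq.heappush(open_set, (gy + h(y), gy, y))
    return None
```

With an admissible h, the node popped at the goal carries the optimal cost; with h ≡ 0 the search degenerates to uniform-cost search, illustrating how stronger heuristic information prunes more branches.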
Given an image, after the preprocessing described above eliminates the redundant information, the image is turned into a set of line segments that includes the contour information. If we traversed all combinations of the line segments in the set, the combination most similar to the shape contour of a given object could always be found, but doing so is obviously very time-consuming. By using the shape descriptor and shape dissimilarity introduced above, we can guide the search path to avoid unnecessary search nodes and thus greatly improve search efficiency.
Given the contour template M of the object to be grasped, M is a binary map in which pixels with a value of 1 belong to contours. As shown in Figure 9, the algorithm looks for a seed line C_s in C as the starting state for the following search, using the heuristic search strategy. C_s should have a certain length, because short lines contain very few sampling points, which results in finding too many similar parts on the template. C_s should also have a certain degree of curvature, since real-scene images always involve many line segments that tend to be mismatched with line segments of the template. With these restrictions, the search domain for C_s is greatly reduced, and C_s, the line most similar to some part of the object to be grasped, is searched exhaustively in the remaining line set C'. Specifically, for a line c of length l in C', the algorithm slides a window of size l over the contour template M. In each iteration, the line c_t in the window is taken out to calculate its shape similarity s with c. The maximum value of s is considered the similarity between c and M, and the corresponding part of M is recorded as c_m.
After the similarity with M of every line in C' has been calculated, the algorithm takes the c with the largest s as the seed line C_s, corresponding to the contour line C_m on M.
After C_s is found, the following searches consider only the line set C_nei connected to the ends of C_s. In each iteration, a line c_nei ∈ C_nei is selected to combine with C_s into a new line C'_s, C'_m is obtained by extending C_m on M by the corresponding length in the same direction, and the similarity between C'_s and C'_m is calculated. At the end of each iteration, the C'_s most similar to M is taken as the result of the round and set as the new seed line C_s. The search proceeds according to this rule until C_s becomes a closed line and the template match is completed, as shown in Figure 10. Finally, the centroid P of C_s is calculated to guide the robot to execute the grasping task.
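The seed-and-grow loop can be sketched abstractly as follows. This is a deliberate simplification: the real algorithm grows until the line closes on the template, whereas this toy version stops when no neighboring segment improves the match score, and `neighbors_of` and `template_score` stand in for the connectivity check and the descriptor-based similarity.

```python
def grow_contour(seed, neighbors_of, template_score):
    """Greedy seed-and-grow combination of line segments (illustrative sketch).

    seed: initial list of contour points (the seed line C_s);
    neighbors_of(line): candidate segments connected to either end of line;
    template_score(line): similarity of line to the template (higher is
    better). Grows until no neighboring segment improves the match.
    """
    current, score = list(seed), template_score(seed)
    while True:
        candidates = [(template_score(current + seg), current + seg)
                      for seg in neighbors_of(current)]
        if not candidates:
            return current
        best_score, best_line = max(candidates, key=lambda c: c[0])
        if best_score <= score:        # no extension improves the match
            return current
        current, score = best_line, best_score
```

Because only segments touching the ends of the current line are scored, each round examines a handful of candidates instead of all remaining combinations, which is the source of the speedup described above.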

Determination of the grasp position with contour detection
After the recognition task is completed, the shape contour C_s = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} of the object has been obtained. Let P(\bar{x}, \bar{y}) denote the centroid point of the shape contour, where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

Since the gripper opens only to a certain extent, an appropriate gripping position is needed to guide the robot to rotate the sixth joint into a proper grip posture to execute the grasp task. We took into consideration the general size of the gripper mounted on the robotic arm and the irregularity of the shape of the object to be grasped. The proposed approach uses the relatively narrow concave portion of the object outline as the grasping position, so that it can handle different situations, such as when the gripper is too small to grasp a relatively large object, and make the grasping state as stable as possible.
The result of the recognition method discussed in this article, a precise outline of the object's contour, means that a gripping position meeting the above requirements can be obtained in a simple way. First, draw a straight line l through the centroid P of the object outline and measure the width of the contour as the Euclidean distance d between the intersections p_1 and p_2 of l and the contour C_s. Then rotate l by a fixed angle interval θ, obtaining a set of intersection-point pairs and corresponding distance values. After a rotation of 180°, the pair of intersection points with the smallest distance value is taken as the final grasping position.
As shown in Figure 11, the normal line between the two jaws of the gripper is initially collinear with the horizontal axis. After the appropriate grasping position is calculated, it is only necessary to control the gripper to rotate θ degrees counterclockwise to form the optimal grasping posture.
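The width-minimization step above can be sketched as follows, under the assumptions that the contour is densely sampled and that the contour point nearest to the rotating line on each side of the centroid approximates the true intersection:

```python
import numpy as np

def grasp_angle(contour, step_deg=5.0):
    """Rotate a line through the contour centroid and find the narrowest width.

    contour: (N, 2) array of contour points. For each angle, the contour
    point with the smallest perpendicular distance to the line on each side
    of the centroid approximates the intersections p1 and p2; the angle whose
    pair is closest together is returned along with that width.
    """
    pts = np.asarray(contour, float)
    rel = pts - pts.mean(axis=0)                      # relative to centroid
    best_width, best_ang = np.inf, 0.0
    for ang in np.arange(0.0, 180.0, step_deg):
        d = np.array([np.cos(np.radians(ang)), np.sin(np.radians(ang))])
        along = rel @ d                               # signed position along the line
        perp = np.abs(rel @ np.array([-d[1], d[0]]))  # distance from the line
        i1 = np.argmin(np.where(along > 0, perp, np.inf))
        i2 = np.argmin(np.where(along < 0, perp, np.inf))
        width = np.linalg.norm(pts[i1] - pts[i2])
        if width < best_width:
            best_width, best_ang = width, ang
    return best_ang, best_width
```

The returned angle is the direction across which the object is narrowest, that is, the counterclockwise rotation θ to apply to the gripper, and the width can be checked against the gripper's maximum opening.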
Guiding the robot to approach the object to be grasped using the recognition results

Interaction between the recognition module and the robot control module
As shown in Figure 12, the computer is connected to the control cabinet of the robotic arm through a twisted-pair cable. The world coordinates (with the origin at the center of the base of the robotic arm) to which the gripper is to be moved are transferred to the operating system of the robotic arm using the Transmission Control Protocol (TCP). The coordinate data are then transformed into the rotation angles of each of the six joints.
As shown in Figure 13, the images captured by the camera are transmitted to the computer, which takes the images and the appropriate template image as parameters and invokes the recognition module to execute the recognition tasks. Next, the relative position of the object centroid to the center of the camera view is calculated. If the centroid is in the central area of the camera's field of view, the control module moves the gripper down to grasp the object. Otherwise, it moves the gripper toward the centroid position of the object according to the relative position and captures a new image for the next iteration.
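One iteration of this feedback loop can be sketched as follows; the gain, tolerance, and motion units are illustrative assumptions, not values from the article:

```python
def servo_step(centroid_px, image_size, gain=0.001, tol=10):
    """One iteration of the closed-loop hand-eye control (illustrative sketch).

    centroid_px: (u, v) pixel coordinates of the recognized object centroid;
    image_size: (width, height) of the camera image; gain maps the pixel
    offset to a relative gripper motion (assumed units); tol: pixel radius of
    the image's central area. Returns ('grasp', None) when the centroid is
    centered, otherwise ('move', (dx, dy)).
    """
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    ex, ey = centroid_px[0] - cx, centroid_px[1] - cy
    if ex * ex + ey * ey <= tol * tol:
        return 'grasp', None
    return 'move', (gain * ex, gain * ey)
```

Because only the relative offset is used, the loop never needs an absolute camera-to-world calibration: each move shrinks the pixel error until the grasp condition is met, which is exactly the closed-loop property emphasized earlier.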

Object recognition flowchart
The flowchart in Figure 14 corresponds to Algorithms 1 and 2.

Robot control module flow chart
As shown in Figure 15, the robot control module accesses the memory location storing the object recognition result. If the centroid is in the center of the camera's field of view, it controls the gripper to move down for grasping; otherwise, it moves the gripper closer to the centroid of the object to be grasped.

Experiments
Our environment
Figure 16 shows our experiment environment, which consists of a computer connected to the robot control cabinet and an SD700E industrial robot arm (yellow) with six degrees of freedom, ±0.03 mm repeatability, and a 700 mm working radius. An EFG20 electric gripper (silver white) was mounted on the end flange of the arm, above which a simple camera (Logitech C310) was mounted.
The total cost of the hardware setup was less than US$15,000. The surface of the workbench was deliberately made complicated and messy. The state of the table was changed during the experiments, for example, by disturbing the relative order of the objects and changing their postures and positions, to highlight the robustness of our approach.

Figures 17 to 19 show the application of our method in the real environment. Each row in Figure 17 shows the relative positions between the gripper and the object to be grasped, that is, the rhino model, before and after the recognition process was done. To demonstrate the robustness of our method, each time the gripper was moved above the object according to the recognition result, the object was moved to a different position and posture, as were the surrounding objects, which triggered another round of the recognition process. Once the state of the object was left unchanged after the gripper was moved above it, the gripper was moved down to execute the grasp, as shown in the last row.

Figure 17. The left side of each row shows the initial position of the gripper and the object to be grasped, that is, the rhino model; the right side shows the position the gripper was moved to once the object was recognized. As the first four rows show, the object was moved to different positions and placed in different postures once the gripper was moved above it, and the state of the surrounding objects was also changed (as shown in the third row). The last row shows that when the state of the table was not changed after the gripper was moved above the object, the gripper was then moved down to execute the grasp task.

Figure 18 shows the details of the recognition process corresponding to the different positions of the gripper displayed in Figure 17. From left to right in each row, the image captured by the camera mounted on the gripper, the line segments representing the contour information extracted from the image, and the recognition result are displayed, respectively. The red line on the right side of each row, connecting the centers of the camera view and the object to be grasped, shows the relative position between the gripper and the object, which can be used to guide the robotic arm to move the gripper above the object.
After the precise contour of the object to be grasped is obtained, the appropriate grasp point can be calculated in a simple way to guide the robot to perform grasp tasks, as is shown in Figure 19.
We are unable to include enough pictures to show the entire recognition and grasp process owing to space constraints. However, a video clip is provided to show the whole experiment process.

Discussion
Although object recognition methods based on deep learning are outstanding for classification tasks, they can only generate bounding boxes containing the object to be grasped when guiding a robot to perform grasp tasks. In addition, deep-learning methods require a large amount of training and test data and computing time, as well as an expensive hardware environment. Therefore, these techniques are not well suited to robot grasp tasks, especially those whose target objects are defined on the fly.
The object recognition module required for robot grasp tasks should be sufficiently lightweight and fast while still being able to handle a noisy environment, because the environment affects the motion planning of the robot. The recognition method proposed in this article starts with the shape of the object and extracts the contour information from the original image, storing it in the form of lines. In addition, the heuristic search strategy greatly reduces the search domain so that the object can be effectively identified in a cluttered environment, which makes our method fast and robust enough to be suitable for various robot applications.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by NSFC under Project 61771146 and Project 61375122.

Supplemental material
Supplemental material for this article is available online.

Figure 19. Once the precise shape contour of the object to be grasped is obtained, the appropriate grasping points are calculated to guide the robotic arm to execute the grasp task. (a) The initial state of the robotic arm; (b) the gripper was moved above the object according to the recognition result; (c) the gripper was moved down to the object; and (d) the gripper was rotated to form an appropriate grasp posture according to the grasp points. It is difficult to form such a posture without such a pixel-level shape contour.