Recognizing highly similar sewing gestures with a robot

The autonomous and efficient learning of sewing gestures by robots will bring great convenience to the garment industry. To improve the accuracy of robots in detecting sewing gestures with high similarity, three detection models based on deep learning are proposed in this paper. First, to improve the detection accuracy and speed for sewing gestures under complex backgrounds, we add a dense connection layer to the low-resolution network layers of YOLO-V3 to enhance the transmission and reuse of image features. Second, the deeper ResNet50 residual network replaces the VGG16 backbone in the original SSD model, and a feature pyramid structure fuses high-level and low-level semantic features, which improves the detection accuracy for small-sized sewing gestures. Finally, a parallel spatial-temporal dual-stream network separately extracts the temporal and spatial features of sewing gestures; fusing the two improves the detection accuracy for coherent sewing gestures. The results show that the three proposed models can effectively detect four sewing gestures with high similarity. Among them, the spatial-temporal two-stream convolutional neural network has the highest detection accuracy, while the improved SSD model has a faster detection speed than the improved YOLO-V3 model and other mainstream algorithms.


Introduction
In the textile field, garment fabric is soft and extensible, which makes the sewing process difficult; many factors such as the human body and the environment need to be considered during sewing. 1 Combining robot technology with the skills of experienced workers can achieve better sewing performance. 2 Collaborative robots' accurate and rapid understanding of workers' sewing skills is the key to human-machine collaboration. 3 Scholars have done a lot of research on how robots can learn sewing skills. Takashi 4 applied a variable-gain learning control method to automatic sewing trajectory tracking by a robot arm. Panagiotis 5 used fuzzy logic to define the expected environmental feedback and designed a hierarchical control system that realized the robot's adaptive sewing of unknown materials. Paraskevi 6 combined visual servoing and neural networks to design an adaptive reasoning and learning system for curved-fabric sewing robots. Huang et al. 7 divided the worker's sewing actions into segments, modeled each segment with a GMM, and used GMR to reproduce the entire sewing action, completing the learning of human sewing actions. The above methods can, to a certain extent, handle perturbations of robot parameters and uncertainty in the contact stiffness of the external working environment.
The sewing action recognition method based on computer vision 8 detects action categories by extracting detailed feature information from the sewing action. At this stage, human action recognition methods are mainly based on single-keyframe image samples or on video samples. 9 The single-keyframe image recognition method is fast and easy to implement, and it detects actions with representative keyframes well. The video-sample-based recognition method captures information in both the spatial and temporal dimensions; it is more flexible and scalable, and it recognizes complex, coherent actions well. Generally, action-image characteristics are obtained either by artificially designed features 10 or by deep-learning features. 11 Yilmaz and Shah 12 extracted action-specific information from the change of the space-time volume of the moving target over a time series, which is robust to changes in viewing angle. Gorelick et al. 13 incorporated the structural positional relationship of the action in time and space into the features. Jiang et al. 14 used key-point information to construct action descriptors, clustered the key points, and analyzed actions by establishing judgment rules based on Euclidean distance. Dollar et al. 15 found that such holistic methods struggle to achieve good results when obstacles are present. Matikainen et al. 16 found that describing the outline of an action alone cannot represent the texture features inside the outline well; therefore, local feature representations are more important. Liu et al. 17 used spatial-temporal interest point (STIP) detection for target detection, providing new ideas for local representation. Matikainen et al. 16 also used the trajectory of the moving target to extract local features, with trajectory direction information serving as the local descriptor. Jiang et al. 18 and Wang and Schmid 19 optimized the motion trajectories captured by the vision sensor, improving on methods that directly use trajectory speed. These artificially designed features only perform well in simple action recognition scenarios.
With the continuous development of deep learning in gesture recognition, Asadi-Aghbolaghi et al. 20 surveyed deep-learning methods for gesture recognition in image sequences. Deep-learning detection algorithms extract a variety of detailed action features from single-keyframe image samples and from video samples, and the various sewing action categories are detected by describing the posture of the movement feature information. Gesture detection methods based on a single keyframe include region-based and regression-based target detection networks. 21 Representative region-based networks include Faster R-CNN 22 and Mask R-CNN, 23 which have higher detection accuracy but slower detection speed. Regression-based target detection networks include YOLO 24,25 and SSD. 26 Narayana et al. 27 concentrated the spatial channel on the hand and used a sparse network for fusion, which improved gesture recognition accuracy. Regression-based networks use end-to-end detection and therefore run faster. However, single-keyframe image detection has low accuracy and speed when recognizing sewing gestures in complex scenes. Video-sample-based methods can effectively use context information and offer high utilization, flexibility, and scalability. Zhang et al. 28 used multi-scale feature fusion and pyramid spatial pooling to detect salient target regions of different sizes in video. Ji et al. 29 extended two-dimensional CNNs to three-dimensional CNNs; this method feeds the spatial information maps of the decomposed video action sequence into the spatial-stream convolutional network for iterative training. Wang et al. 30 proposed a 3D-CNN combined with an LSTM network to detect the saliency of behavioral targets, which reduces model parameters and training difficulty. Wu et al. 31 used a dual-stream CNN and Long Short-Term Memory to model the spatial and temporal relationships between video frames; the semantic description of the video generated by the model realizes action classification and labeling, with significant improvements on public data sets. Simonyan and Zisserman 32 divided video samples into spatial frames that were sequentially fed to a convolutional neural network for iterative training. In the dual-stream convolutional neural network, the spatial-stream CNN extracts the position features of the limbs in the static picture, while the temporal-stream CNN extracts the movement information of the limb trajectory over a time series. Fusing temporal and spatial features greatly improves gesture recognition accuracy.
The four sewing gestures of inner overlock seam, hemmed seam, cladding seam, and fabric cutting are very similar. Due to the complex environment of the sewing workshop, robots cannot accurately and quickly understand and recognize the sewing movements of workers' hands. Therefore, accurately identifying similar sewing gestures against complex backgrounds is the focus of this article.
The rest of the paper is organized as follows. Section 2 uses the improved YOLO-V3 convolutional neural model to detect sewing gestures against complex backgrounds. Section 3 uses the improved SSD method to address YOLO's low accuracy on small-target sewing gestures. Section 4 uses the spatial-temporal dual-stream convolutional neural network to address the poor detection, from independent single-frame images, of sewing gestures that are highly similar and strongly continuous in motion. Section 5 presents the experimental results, and Section 6 gives conclusions and future research work.

YOLO-V3 improved model
The YOLO-V3 target detection network converts the detection problem into a regression problem. It uses the DarkNet53 network as the backbone to extract features and divides each picture into multiple grids. Each grid cell predicts target bounding boxes, a confidence score, and category conditional probabilities. During detection, the grid cell into which the center of a sewing gesture falls is responsible for identifying that sewing gesture target. When multiple bounding boxes detect the same target at the same time, the non-maximum suppression (NMS) method selects the best bounding box. The accuracy of a predicted bounding box is reflected by its confidence. The YOLO-V3 network model detection process is shown in Figure 1.
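The NMS selection step described above can be sketched as follows. The box format [x1, y1, x2, y2] and the 0.5 overlap threshold are illustrative assumptions, not values specified in the paper.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-confidence box; drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
# boxes 0 and 1 overlap heavily, so only the more confident box 0 survives
# nms(boxes, scores) -> [0, 2]
```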
Due to the limitations of the Darknet53 network, simply increasing the number of network layers will cause gradients to vanish or explode, which leads to low accuracy in detecting sewing gestures against complex backgrounds. This paper adds a DenseNet densely connected network to the lower-resolution transmission layers of the original YOLO-V3 network, which enhances the transmission of sewing gesture feature information and promotes feature reuse and integration. Each layer in a DenseNet dense connection can directly access the gradient from the loss function and the original input signal, which largely solves the problem of gradient vanishing when the network is very deep. A transfer function of the form BN-ReLU-Conv(1 × 1) + BN-ReLU-Conv(3 × 3) is used in this paper; the 1 × 1 convolution kernel prevents the input feature map from becoming too large.
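The dense-connection idea can be sketched numerically: each layer receives the channel-wise concatenation of all preceding feature maps. Here H() is a placeholder stand-in for the learned BN-ReLU-Conv(1 × 1) + BN-ReLU-Conv(3 × 3) transfer function, and the channel counts are assumptions for illustration.

```python
import numpy as np

def H(x, growth_rate=32):
    """Stand-in transfer function: maps any input to `growth_rate` channels."""
    c, h, w = x.shape
    return np.zeros((growth_rate, h, w))  # placeholder for a learned conv output

def dense_block(x0, num_layers=4, growth_rate=32):
    features = [x0]
    for _ in range(num_layers):
        # every layer sees the concatenation of all earlier outputs
        x = H(np.concatenate(features, axis=0), growth_rate)
        features.append(x)
    return np.concatenate(features, axis=0)

x0 = np.zeros((64, 32, 32))   # 64 input channels at 32 x 32 resolution
out = dense_block(x0)
# output channels = 64 + 4 * 32 = 192, so out.shape == (192, 32, 32)
```

The channel count growing as c0 + num_layers × growth_rate is what makes the 1 × 1 bottleneck convolution in the transfer function necessary in practice.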
The improved YOLO-V3 network model is shown in Figure 2.
DenseNet-1 and DenseNet-2, each composed of four dense layers, replace the original 10th layer (resolution 64 × 64) and 27th layer (resolution 32 × 32), respectively. In the dense structure, the input x_0 and the outputs of all preceding layers are concatenated to form the input of each subsequent layer, x_l = H_l([x_0, x_1, …, x_(l−1)]). The stochastic gradient descent (SGD) method is used to optimize the network model, and the learning rate is adjusted according to the number of iterations. The initial learning rate is 0.001; when the model reaches 8000 and 9000 iterations, the learning rate decays by a factor of 10. We set the burn-in to 1000 at the beginning of training: while the number of updates is less than 1000, the learning rate increases from small to large; after 1000 updates, the learning rate follows a large-to-small update strategy.
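The schedule above (burn-in warm-up, then step decay at 8000 and 9000 iterations) can be sketched as a single function. The power-4 warm-up curve follows the common Darknet convention and is an assumption here, not something the paper specifies.

```python
def learning_rate(it, base_lr=0.001, burn_in=1000, steps=(8000, 9000)):
    """Burn-in warm-up followed by step decay, as described in the text."""
    if it < burn_in:
        # small -> large during the burn-in phase
        return base_lr * (it / burn_in) ** 4
    lr = base_lr
    for s in steps:
        if it >= s:
            lr /= 10.0  # large -> small: decay by a factor of 10 at each step
    return lr

# learning_rate(1000) == 0.001, learning_rate(8000) == 0.0001,
# learning_rate(9500) == 0.00001
```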
From the bounding boxes of all tested sample data sets, we use the K-means cluster analysis method to find nine prior box dimensions suited to the sample data. The K-means distance measure is shown in formula (1):

d(box, centroid) = 1 − IOU(box, centroid)  (1)

The prior boxes are arranged from small to large and evenly divided among three feature maps of different scales; feature maps with larger scales use smaller prior boxes. Finally, these prior boxes are used to detect sewing gestures.
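The anchor clustering can be sketched as K-means on the labelled box sizes (w, h) with the 1 − IOU distance of formula (1), so that large and small boxes are compared fairly. The toy box list and the deterministic size-spread initialization are assumptions for illustration.

```python
def wh_iou(a, b):
    """IOU of two boxes (w, h) assumed to share a common top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=50):
    # spread the initial centroids across the size-sorted boxes
    srt = sorted(boxes, key=lambda b: b[0] * b[1])
    centroids = [srt[i * len(srt) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # assign each box to the centroid with the smallest 1 - IOU
            i = min(range(k), key=lambda j: 1 - wh_iou(b, centroids[j]))
            clusters[i].append(b)
        centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)

boxes = [(10, 10), (12, 11), (11, 12), (100, 90), (95, 100), (105, 95)]
anchors = kmeans_anchors(boxes, k=2)
# anchors == [(11.0, 11.0), (100.0, 95.0)]: one small and one large prior box
```

The paper clusters into nine anchors (k = 9) and assigns three to each detection scale; k = 2 here just keeps the example small.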
Training of the YOLO-V3 improved model. The improved YOLO-V3 model uses 4000 sewing gesture images and is trained for 5000 iterations. The dynamic process of training is observed by drawing the loss curve, shown in Figure 3.
It can be seen from the curve in Figure 3 that the loss value decreases rapidly in the early iterations, which means that the model fits quickly; at around 2000 iterations the decline slows, and by 5000 iterations the loss value converges to 0.0025. The detection performance of the model is further evaluated by calculating the average IOU between the predicted bounding box and the true bounding box. The Intersection-over-Union curve is shown in Figure 4.
As the number of iterations increases, the Intersection-over-Union between the real and predicted boxes keeps improving; at 5000 iterations, it tends to 0.9.
Determination of the optimal threshold of the YOLO-V3 improved model. We select the best prediction model from the trained models by calculating the accuracy rate, recall rate, F1 value, and average Intersection-over-Union under different confidence thresholds, and the model with the best threshold is screened out. The result is shown in Figure 5.
In the threshold interval (0, 1), the metrics are computed at every 0.05 step, giving 20 sets of data. We prioritize the metrics as follows: accuracy rate > recall rate > IOU. After the threshold reaches 0.6, the accuracy rate gradually stabilizes, and the optimal range is about 0.6-1.0. In this range, the best recall rate is 0.85, the corresponding confidence threshold is 0.6, and the IOU value is about 0.71. Therefore, we choose 0.6 as the optimal threshold.
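The threshold sweep can be sketched as follows: for each candidate confidence threshold, detections below it are discarded and the accuracy rate (precision), recall rate, and F1 value are recomputed. The detection list of (confidence, is_true_positive) pairs and the ground-truth count are made-up example values.

```python
def prf(dets, n_gt, thresh):
    """Precision, recall, and F1 for detections kept above `thresh`."""
    kept = [tp for score, tp in dets if score >= thresh]
    tp = sum(kept)                      # kept detections matching ground truth
    fp = len(kept) - tp                 # kept detections with no match
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / n_gt
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

dets = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.3, 0)]
# sweep thresholds in 0.05 steps over (0, 1), as in the paper
table = {t / 20: prf(dets, n_gt=4, thresh=t / 20) for t in range(1, 20)}
```

Scanning `table` for the smallest threshold in the stable high-precision region mirrors the selection rule (accuracy > recall > IOU) used above.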
After selecting the best model, we plot the accuracy rate on the vertical axis against the recall rate on the horizontal axis to obtain the P-R curve of the best model, as shown in Figure 6.
The P-R curve shows that, at the balance point, the recall rate equals the accuracy rate at about 0.85.
The improved YOLO-V3 model detects large-size sewing gestures well, but accuracy drops when recognizing smaller sewing gestures. The reason is that the single convolutional network in YOLO-V3 easily ignores low-level feature information, and the high-level features provide the model with only a small part of the feature information about a small target, which hampers the recognition of small-size sewing gestures.

SSD improved model
The SSD network uses multi-scale target features for detection, which improves robustness to targets of different scales. Compared with the YOLO-V3 network, the SSD network has a higher recognition rate for small-sized targets. However, simply increasing the number of network layers aggravates gradient instability problems such as vanishing and exploding gradients; the SSD model based on VGG16 only deepens the network, which worsens network degradation.
We introduce a deep residual network (ResNet) and a feature pyramid structure (FPN) into the basic SSD model. The deep residual network changes the multiplicative transfer between feature layers into an additive transfer, which strengthens the connection between front and back network layers. ResNet effectively avoids gradient instability without increasing the parameters or complexity of the model. The calculation of the deep residual network is shown in formula (2):

x_l = H_l(x_(l−1)) + x_(l−1)  (2)

In formula (2), l is the number of network layers, x_l is the output of layer l, and H_l is a non-linear transformation. For the ResNet network, the output of layer l is the non-linear transformation of the output of layer l − 1, added to that output itself.
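Formula (2) can be illustrated with a minimal numeric sketch: the block output is the non-linear transform of the input plus the input itself, so the signal (and gradient) can always flow through the identity path. H() here is an arbitrary stand-in for the learned convolution stack.

```python
import numpy as np

def H(x):
    """Placeholder non-linear transform standing in for the conv stack."""
    return np.maximum(0.0, 0.1 * x)

def residual_block(x):
    # additive (not multiplicative) transfer: x_l = H(x_{l-1}) + x_{l-1}
    return H(x) + x

x = np.array([-1.0, 0.0, 2.0])
y = residual_block(x)
# y = [-1.0, 0.0, 2.2]: the identity term preserves the signal even where
# H(x) is zero, which is what keeps gradients stable in deep stacks
```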
The feature pyramid structure realizes the fusion of high-level and low-level feature information (the structure is shown in Figure 7). The down-sampling path computes a feature hierarchy of multiple scale feature maps, with the deepest layer carrying the strongest feature information. The up-sampling path restores higher-resolution features, and the up-sampled features are enhanced by the down-sampled features through horizontal connections. Each horizontal connection fuses feature information of different sizes from high and low layers, which strengthens the detection of small-size sewing gestures.
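One lateral connection of the pyramid in Figure 7 can be sketched as: upsample the coarse top-down map 2× and add it to the same-resolution bottom-up map. A real FPN also applies 1 × 1 and 3 × 3 convolutions around the addition, which are omitted here; nearest-neighbour upsampling and the toy values are assumptions.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a 2-D feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(top_down, lateral):
    # horizontal connection: enhanced map = upsampled deep map + shallow map
    return upsample2x(top_down) + lateral

top = np.ones((2, 2))          # deep layer: semantically strong, low resolution
lat = np.full((4, 4), 0.5)     # shallow layer: spatially precise, high resolution
merged = fuse(top, lat)        # 4 x 4 map carrying both kinds of information
```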
The input image of the SSD improved model is a 512 × 512 RGB image. Its backbone superimposes five convolution modules based on ResNet50 to form a deep network. The structure of the improved model of SSD is shown in Figure 8.
The network of the improved SSD model contains 59 convolutional layers and five max-pooling layers. The shallow and middle layers extract sewing gesture feature information used to detect small-sized sewing gesture targets, while the deep network extracts features covering the full image, used to detect large-sized targets. The SSD network draws feature maps from Conv4, Conv5, three additional convolutional layers, and one pooling layer. Each feature map generates prior boxes of multiple scales and, based on different aspect ratios, prediction boxes of multiple scales to judge sewing gestures of different sizes, achieving multi-scale target position and category prediction. The detection result is produced by non-maximum suppression.
Parameter setting of the SSD improved model. The improved SSD model is trained on a self-built sewing gesture data set. During training, 64 samples form a batch, and parameters are updated once per batch; each batch is divided into eight sub-batches that are fed to the trainer. The learning rate adopts a dynamic adjustment strategy with an initial value of 0.001. The matching strategy between real boxes and default boxes is maximum Intersection-over-Union: a match is made when the Intersection-over-Union exceeds the threshold of 0.5. The parameter settings are shown in Table 2.
We use transfer learning to apply the feature extraction capability the network learned on the large ImageNet dataset to our small sample data and fine-tune it, further improving sewing gesture detection. Training starts from a ResNet50 classification network pre-trained on the ImageNet image data set. The fully connected layer after the classification network is removed, and a convolution detection module is added to form the target detection network. We migrate the convolution model and parameters of the ResNet50 network and freeze the parameters of the intermediate layers. Finally, a deep transfer-training model based on ResNet50 is obtained, and its parameters are fine-tuned for category prediction and position regression.
Training of the SSD improved model. The improved SSD model optimizes the loss function during training. Regression training is carried out on location and target category, and back-propagation continuously updates the model so that the loss value keeps decreasing. We observe the dynamic training process by drawing the loss curve, shown in Figure 9. The loss function of the model is defined in formula (3):

L(z, c, l, g) = (1/N)[L_conf(z, c) + α L_loc(z, l, g)]  (3)

N is the number of prior boxes matched to object boxes; z is the matching indicator between default boxes and real object boxes of different categories; α is the weighting factor between confidence loss and location loss (the default value is 1); c is the confidence of the predicted object box; l is the position of the predicted object box; and g is the position of the real object box. It can be seen from Figure 9 that the loss value of the improved SSD model drops rapidly early in training, and the model fits quickly. After 3000 iterations, the loss decreases slowly. To prevent over-fitting, early stopping is used: training ends automatically when the loss value does not decrease for four consecutive epochs.
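The combination rule in formula (3) can be sketched numerically: the confidence and localization terms are summed with weight α and averaged over the N matched prior boxes. The individual loss values below are made-up numbers for illustration; computing L_conf and L_loc themselves requires the full network outputs.

```python
def ssd_loss(conf_loss, loc_loss, n_matched, alpha=1.0):
    """Total SSD loss per formula (3): (L_conf + alpha * L_loc) / N."""
    if n_matched == 0:
        return 0.0  # no matched prior boxes, so the loss is defined as 0
    return (conf_loss + alpha * loc_loss) / n_matched

total = ssd_loss(conf_loss=6.0, loc_loss=2.0, n_matched=4)
# total = (6.0 + 1.0 * 2.0) / 4 = 2.0
```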
Both the improved SSD model and the improved YOLO-V3 model use only the spatial information of a single frame to detect sewing gestures. These methods lack time-series information between adjacent frames, which greatly limits the recognition of sewing actions.

Model of spatial-temporal dual-stream convolutional neural network
The temporal-flow information of sewing gestures is a powerful feature for distinguishing similar gestures. Therefore, we use a spatial-stream CNN and a temporal-stream CNN to extract the spatial and temporal features of sewing gestures and improve recognition. The spatial-flow network and temporal-flow network extract the spatial-position and temporal-motion features of sewing gestures, respectively; the two kinds of features are then merged. The structure of the dual-stream network model is shown in Figure 10.
The backbone of the spatial-flow network adopts the VGG model. The model takes continuous 224 × 224 × 3 RGB images as input and outputs the probability distribution over the sewing gesture categories. The spatial stream is composed of 13 convolutional layers and three fully connected layers. The 3 × 3 convolution kernels with stride 1 are stacked into five blocks, each followed by a max-pooling layer with a 2 × 2 kernel and stride 2. A fully connected layer maps the sewing gesture features to a feature vector. Finally, a Softmax classifier with four neurons outputs the probability distribution over the four sewing gesture categories.
The parameters of the temporal-flow network are consistent with the spatial-flow convolutional neural network, and the size of the input optical flow stack is 224 × 224 × 2L. To make better use of the motion timing characteristics in optical flow, we superimpose multiple optical flow maps and feed the stack to the temporal-flow network. The optical flow stack captures the "instantaneous velocity" of the pixels of the moving target across consecutive video frames. It clearly and effectively characterizes human motion information, which greatly improves the performance of the dual-stream model. The dense optical flow method is used to compute the optical flow maps. After inter-frame segmentation of the sewing video samples, a set of dense optical flow vector fields between adjacent video frames t and t + 1 is calculated. The two-dimensional gray-scale optical flow map of each time step is generated by tracking into the next frame and is decomposed into horizontal and vertical components. The single optical flow maps are stacked into optical flow groups that are input to the temporal-flow convolutional neural network for training. Formulas (4) and (5) give the horizontal and vertical components of the input volume:

I_τ(u, v, 2k − 1) = d^x_(τ+k−1)(u, v)  (4)

I_τ(u, v, 2k) = d^y_(τ+k−1)(u, v)  (5)

In formulas (4) and (5), d^x and d^y denote the horizontal and vertical flow components; u ∈ [1, w], v ∈ [1, h], and k ∈ [1, L], where w, h, and L are the width, height, and length of the sewing gesture video frame stack; τ is the current video frame. At which layer to merge the features extracted by the temporal-flow and spatial-flow networks is a question we need to consider; in the following, the fusion layer with the highest detection accuracy is taken as the best fusion position.
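The stacking in formulas (4) and (5) can be sketched as interleaving the horizontal (d^x) and vertical (d^y) flow fields of L consecutive frame pairs into one w × h × 2L input volume for the temporal stream. The random arrays below stand in for real dense optical flow fields.

```python
import numpy as np

def stack_flows(dx_list, dy_list):
    """Interleave L horizontal and L vertical flow fields into an h x w x 2L volume."""
    h, w = dx_list[0].shape
    L = len(dx_list)
    volume = np.empty((h, w, 2 * L))
    for k in range(L):
        volume[:, :, 2 * k] = dx_list[k]      # channel 2k - 1 in 1-based indexing
        volume[:, :, 2 * k + 1] = dy_list[k]  # channel 2k in 1-based indexing
    return volume

L = 10
dx = [np.random.rand(224, 224) for _ in range(L)]  # stand-ins for dense flow
dy = [np.random.rand(224, 224) for _ in range(L)]
vol = stack_flows(dx, dy)
# vol.shape == (224, 224, 20), matching the 224 x 224 x 2L input with L = 10
```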
Parameter setting of the spatial-temporal dual-stream network. The dual-stream network initializes its parameters by loading a pre-trained weight file, which puts the network parameters in a good starting position and speeds up training. Some training parameters are set according to experience; the values are shown in Table 3.
During training, 96 samples form a batch, and parameters are updated once per batch. Ten continuous RGB video frames are used as the input of the spatial stream to balance computational complexity against the amount of data information. The size of the input spatial map is 224 × 224 × 3. The input of the temporal-flow network is the stack of horizontal and vertical optical flow maps, with size 224 × 224 × 2L. According to previous experience, superimposing L = 10 optical flow maps in the time domain works best. The small-batch stochastic gradient descent method is used to optimize the dual-stream network model. The initial learning rate is 0.001 and is reduced on a fixed schedule: after 5000 iterations, the learning rate is changed to 10⁻³, and after 7000 iterations, to 10⁻⁴.
Training of the spatial-temporal dual-stream network model. The dual-stream network model is trained on 4000 images for a total of 12,000 iterations. During training, we draw the loss curve and the detection accuracy curve of the dual-stream network model to observe the dynamic training process. The loss curve is shown in Figure 11, and the accuracy curve in Figure 12.
The loss curve shows that the loss of the parallel spatial-temporal dual-stream convolutional neural network drops rapidly in the early iterations and the model fits quickly. After 4000 iterations, the downward trend slows; after about 8000 iterations, the loss converges to 0.3. Figure 12 shows that accuracy increases with the number of iterations: it rises rapidly up to about 3500 iterations, and after 4000 iterations, the detection accuracy remains at 92.6%.

Processing of experimental data sets
Collection of the sewing gesture data set. The sewing gesture data set samples mainly come from the sewing workshop. The data set contains more than 1000 sewing gesture videos covering four highly similar sewing gestures: inner overlock seam, hemmed seam, cladding seam, and fabric cutting (as shown in Figure 13). It also includes videos of the gestures at different periods.
In the experiment, we extract frames from the 1000 self-built video samples at a fixed sampling frequency. The total number of samples after framing is 6670, covering the four sewing gesture categories: 1600 inner overlock seam targets, 1660 hemmed seam targets, 1700 fabric cutting targets, and 1710 cladding seam targets. The framed data samples are combined into data set 1 in framing order. The optical flow maps of all 6670 samples in data set 1 are then computed and sequentially form data set 2 (as shown in Figure 14). Data set 1 is the input for the improved YOLO-V3 model, the improved SSD model, and the spatial-stream CNN model; the optical flow maps in data set 2 are the input for the temporal-stream CNN model. The training, validation, and test sets of both data sets are split in the ratio 6:2:2.
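The 6:2:2 split can be sketched as follows. Splitting by index order follows the "sequence of framing" mentioned above; shuffling before the split would also be a reasonable choice and is not specified by the paper.

```python
def split_622(samples):
    """Split a sample list into training, validation, and test sets at 6:2:2."""
    n = len(samples)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

train, val, test = split_622(list(range(6670)))
# 4002 training, 1334 validation, and 1334 test samples
```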
Enhancement of the data set. To improve the quality and richness of the experimental data set, we use traditional image enhancement methods to adjust the color, brightness, and orientation of the segmented images. The 6670 sewing gesture video frames in data set 1 are enhanced, and the generated pictures are added to the data samples of the same category. The time-series optical flow maps in data set 2 are randomly flipped and cropped before being fed into the temporal-flow network of the dual-stream network. The number of enhancements in the data set is shown in Table 4, and the enhanced pictures are shown in Figure 15. Table 4 lists the per-category enhancement counts: pique seam (a): 120, 120, 1100, 160, 100; hemmed seam (b): 120, 120, 1160, 160, 100; cut fabric (c): 120, 120, 1200, 160, 100; cladding seam (d): 120, 120, 1210, 160, 100.
The detection result is shown in Figure 16. Table 5 shows that the improved YOLO-V3 model has a higher accuracy value than the other three networks, because it has a higher image feature reuse rate than the original YOLO-V3; the average accuracy is increased by 2.29% compared with the original YOLO-V3. In terms of detection speed, the improved model reaches 43.0 frames/s, much faster than R-CNN and on par with YOLO-V2 and YOLO-V3. The actual detection results in Figure 16 show that the improved YOLO-V3 model detects the four different sewing gestures more accurately than the other three models, and the improved network is more accurate where boxes overlap. The comparison results for small targets are shown in Table 6, and the detection result is shown in Figure 17.
Comparing detection speed and accuracy: the improved SSD, YOLO, and SSD models run at 46, 47, and 51 frames/s, respectively, so the three detection speeds are almost the same. In terms of detection accuracy, the improved SSD network is higher than the other three networks. Because the improved SSD model introduces the FPN structure and improves the original up-sampling structure, high-level and low-level semantic features are better integrated, which greatly reduces false and missed detections of small targets. Compared with the Faster R-CNN, YOLO, and SSD networks, the detection accuracy for small-area targets is increased by 4.81%, 10.59%, and 12.67%, respectively. Therefore, while maintaining speed, the improved SSD model significantly improves the detection accuracy for small-scale targets.
Detection results of sewing gestures with different fusion layers in the spatial-temporal dual-stream convolutional neural network. To analyze the influence of the fusion position in the spatial-temporal dual-stream convolutional neural network on sewing gesture detection, we fuse the spatial features of sewing gestures with the temporal features represented by dense optical flow maps at the convolutional layers and at the fully connected layers, respectively. We train on the public data sets UCF101 and HMDB51 and test on the self-built sewing gesture data set. The accuracy of the spatial-temporal dual-stream network is used to measure the recognition effect on complex, coherent sewing gestures. Table 7 shows the recognition results for sewing gestures when fusing at different feature layers. It can be found from Table 7 that recognition accuracy is closely related to the fusion location: the deeper the fusion layer, the higher the semantic information the model can obtain, and recognition accuracy improves accordingly. Comparing fusion at the fully connected layers with fusion at the convolutional layers, the dual-stream network captures more efficient high-level semantic expressions when fused at the fully connected layers, raising recognition accuracy by at least 3.3%. However, when fusing the fc7 layer of the spatial network with the fc6 layer of the temporal network, the recognition accuracy drops by 0.8%; the reason is that such deep feature fusion cannot closely link the optical flow information with the spatial information at the same time.
Fusing the sixth fully connected layer (fc6) of the spatial stream with the seventh fully connected layer (fc7) of the temporal stream closely couples the spatial position information and the temporal motion information of the sewing gesture. This fusion gives very good detection results for sewing gestures and performs best on all three data sets.
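The fc-layer fusion described above can be sketched as follows. This is a minimal numpy illustration of late fusion in a two-stream network, not the paper's code: the 4096-dimensional fc features, the concatenation-plus-linear classifier, and the random weights are all assumptions for illustration; the four outputs stand for the four sewing gesture classes.

```python
import numpy as np

def fuse_fc(spatial_fc, temporal_fc, w_out):
    """Late fusion of the two streams.

    Concatenate the fully connected features of the spatial and temporal
    streams, then classify with a single linear layer and a softmax.
    """
    fused = np.concatenate([spatial_fc, temporal_fc])
    logits = w_out @ fused
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

# Toy example: hypothetical 4096-d fc outputs from each stream.
rng = np.random.default_rng(1)
spatial = rng.standard_normal(4096)   # e.g. spatial-stream fc6 features
temporal = rng.standard_normal(4096)  # e.g. temporal-stream fc7 features
w = rng.standard_normal((4, 8192)) * 0.01  # 4 sewing gesture classes
probs = fuse_fc(spatial, temporal, w)
print(probs.shape, probs.sum())
```

Because fusion happens only at the fully connected stage, each stream can keep its own convolutional backbone, and the classifier sees spatial appearance and optical-flow motion jointly.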

Comparison of detection accuracy of sewing gestures by YOLO-V3 improved network, SSD improved network, and dual-stream convolutional neural network
To verify the recognition effect of the three improved models proposed in the paper on sewing gestures, we train the improved YOLO-V3 detection network and the improved SSD detection network on video-frame images containing sewing gestures; in addition, the dual-stream network detection method is trained on the sewing gesture data set. The detection results of the three networks are shown in Table 8.
It can be seen from Table 8 that the recognition accuracy of the dual-stream network method is better than that of the improved YOLO-V3 model and the improved SSD model; a comparison with mainstream methods is given in Table 9. From the results in Table 9, methods based on deep neural networks learn the high-level semantic expression of sewing gestures more easily, and their effect is better than that of the traditional IDT method. The dual-stream network can effectively use time-dimension information and achieves better detection accuracy than the other detection models. Compared with the P-CNN, C3D, and TDD methods, the improved YOLO-V3 method improves detection accuracy by 8.2%, 4.1%, and 2.1%, respectively. Compared with the IDT, Two-Stream, P-CNN, TDD, and C3D methods, the improved SSD network improves detection accuracy by 5.9%, 5.2%, 8.5%, 2.4%, and 4.4%, respectively. Compared with the original dual-stream, P-CNN, C3D, and TDD methods, the spatial-temporal dual-stream network improves recognition accuracy by 5.9%, 9.2%, 5.1%, and 3.1%, respectively. The improved SSD model has the fastest detection speed at 51 frames/s. The improved gesture recognition algorithms therefore improve recognition accuracy while maintaining detection speed. In summary, the improved YOLO-V3 model, the improved SSD model, and the dual-stream convolutional neural network model recognize sewing gestures better than the other networks.

Conclusion
To recognize sewing gestures quickly and accurately, we recognize four gestures: inner overlock seam, hemming seam, cladding seam, and cutting fabric. The improved YOLO-V3 model embeds a densely connected network in the lower-resolution transmission layers of the original YOLO-V3 to promote feature reuse and fusion; compared with the original YOLO-V3 model, its recognition accuracy on the four similar sewing gestures increases by 2.29%. The improved SSD model replaces the original SSD backbone with a residual network and, by adopting the feature pyramid idea, improves the original up-sampling structure to better combine high-level and low-level semantic features. The overall recognition accuracy of the improved SSD model is 88.79%, and its real-time detection speed reaches 51 frames/s. Compared with the SSD, YOLO, and Faster R-CNN models, the improved SSD model increases the recognition accuracy of small targets by 4.81%, 10.59%, and 12.67%, respectively. The spatial-temporal dual-stream convolutional neural network makes full use of the temporal and spatial characteristics of sewing gestures, which greatly improves the utilization of feature information, so it recognizes sewing gestures more accurately. Experiments verify that the spatial-temporal dual-stream network has the highest recognition accuracy for sewing gestures, at 92.6%, but its detection speed is low, only 0.6 frames/s. The three networks meet the accuracy and real-time requirements of sewing gesture recognition under different environments and application conditions. Future work will recognize more, and more complex, sewing gestures and further improve recognition speed, with the goal of real application in human-machine collaborative sewing.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.