A lightweight model for digital printing fabric defect detection based on YOLOX

Online detection of digital printing defects is a necessary but challenging task. The performance of current detection methods is still not ideal given the diversified patterns of digital printing fabric defects and the real-time requirements of online detection. In this paper, we propose a lightweight model for digital printing fabric defect detection based on YOLOX. Firstly, according to the characteristics of digitally printed fabrics, namely many defect types and complex backgrounds, a defect detection network structure based on YOLOX is constructed. Then, the SE attention module is introduced to enhance important features and weaken unimportant ones, which makes the extracted features more directional and further mitigates the influence of small feature sizes on the detection accuracy of small targets. The experimental results show that the proposed model achieves a detection accuracy of 66.2 mAP on our self-built dataset, which is 2.7 percentage points higher than YOLOX. The method effectively addresses the problem of low detection accuracy on small defects; the proposed model meets real-time requirements and improves the detection accuracy of small target defects.


Introduction
Due to its environmental friendliness, technological and fashionable characteristics, digital printing has become more and more widely used in the textile field. However, during the printing process many factors may affect the final fabric quality, such as uneven sizing and coloring of the fabric, nozzle failure, mechanical failure, worker error and so on. There are eight common fabric defects: PASS track, PASS tracks, ink leakage, blank, fluff, water leakage, ghost, and pulp,[1] as shown in Figure 1. In order to detect defects in time and reduce losses, it is very important to find a fast and accurate method for fabric defect detection. Existing fabric defect detection mostly relies on manual inspection, which not only has high labor cost but also suffers from inefficiency and insufficient detection accuracy.[2] In recent years, with the rapid development of artificial intelligence, it has become possible to replace manual inspection with deep learning network models.
Target detection algorithms based on deep learning can be divided into single-stage and two-stage algorithms. Compared with single-stage algorithms, two-stage algorithms have higher detection accuracy.[4][5] For example, Luo et al.[3] adopted a decoupled two-stage object detection framework based on a convolutional neural network (CNN) for surface defect detection of flexible printed circuit boards and achieved good detection results. Although two-stage algorithms have high detection accuracy, their detection speed makes it difficult to meet real-time detection scenarios in practical industrial applications. The You Only Look Once (YOLO)[6-9] series networks are single-stage models with higher real-time performance.[17][18] For example, one study[10] proposed a metal surface defect detection model based on an improved YOLOv3, which made great progress over traditional manual inspection in terms of both accuracy and real-time performance. Kou et al.[12] proposed a YOLOv3-based strip steel surface defect detection model that introduces a specially designed dense convolution to extract rich feature information, and its detection effect is better than that of other models. Liu et al.[16] proposed an improved Single Shot MultiBox Detector (SSD)-based model for fabric defect detection. Xie et al.[17] proposed an improved fabric defect detection model based on the SSD algorithm, adding a fully convolutional squeeze-and-excitation (FCSE) module to improve detection accuracy and adjusting the number of default boxes to suit the detection of long defects on the fabric surface. Although the above-mentioned fabric defect detection algorithms improve detection performance from different aspects, they all detect small objects poorly, and because they are all anchor-based, the anchor boxes increase the computational cost during training. Therefore, this paper takes the single-stage anchor-free network YOLOX[19] as the model basis and adds the SE attention module, which enhances important features and weakens unimportant ones, thereby making the extracted features more directional and improving the detection accuracy of small objects. The results show that the improved YOLOX algorithm achieves a clear improvement on our own fabric defect dataset and can effectively alleviate the problem of low detection accuracy for small objects.

YOLOX model
The previous YOLO series of networks are all anchor-based: multiple anchor boxes are used in the prediction stage to predict the category and position of objects. YOLOX changes to an anchor-free design, which greatly reduces the number of parameters in the detection head while still improving performance. According to network size, YOLOX can be divided into YOLOX-x, YOLOX-Darknet53, YOLOX-l, YOLOX-m, YOLOX-s, YOLOX-Tiny and YOLOX-Nano. The network structure of YOLOX can be divided into three parts: the backbone network, the neck network and the prediction head.

Backbone network
The backbone network of YOLOX adopts the YOLOv3-SPP structure. In a CNN, the final classification layer is composed of fully connected layers. The number of fully connected features is fixed, so the size of the input image must also be fixed. However, real images come in a variety of sizes, so they need to be cropped, stretched or deformed, which distorts the image and affects the final detection accuracy. SPP, short for Spatial Pyramid Pooling, was proposed by He et al.[20] and effectively solves the problem of image distortion. At the same time, the SPP module integrates local and global features, which enriches the expressive ability of the feature map and improves detection accuracy.
In addition, the backbone network of YOLOX uses the Focus structure, which takes one value from every other pixel of the image to obtain four independent feature layers and then stacks them, so that the width and height information is concentrated into the channel dimension and the number of input channels is quadrupled. This is equivalent to downsampling without any computation, which reduces the computational cost. The specific method is shown in Figure 2.
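The Focus slicing described above can be sketched in a few lines of PyTorch (a minimal illustration, not the original implementation; in YOLOX the stacked tensor is then passed through a convolution):

```python
import torch

def focus(x: torch.Tensor) -> torch.Tensor:
    """Slice every other pixel into four maps and stack them on the
    channel axis: (B, C, H, W) -> (B, 4C, H/2, W/2)."""
    tl = x[..., ::2, ::2]    # top-left pixels
    bl = x[..., 1::2, ::2]   # bottom-left pixels
    tr = x[..., ::2, 1::2]   # top-right pixels
    br = x[..., 1::2, 1::2]  # bottom-right pixels
    return torch.cat([tl, bl, tr, br], dim=1)

x = torch.randn(1, 3, 640, 640)
print(focus(x).shape)  # torch.Size([1, 12, 320, 320])
```

Because the slicing only rearranges pixels, no learnable parameters or multiplications are involved, which is why it behaves like a cost-free downsampling step.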
In addition, the authors also changed some training strategies, adding techniques such as EMA weight updates and a cosine learning-rate schedule, using the Intersection over Union (IoU) loss function to train the reg branch and the BCE loss function to train the cls and obj branches.

Neck network
The neck network of YOLOX adopts an FPN + PAN structure. FPN[21] is top-down: it fuses high-level and low-level features through upsampling and conveys strong semantic features from top to bottom. PAN, on the other hand, fuses bottom-up features through downsampling with high-level features and conveys strong localization features from bottom to top, so that both the semantic and the location features of the convolutional layers are enhanced. The specific method is shown in Figure 3.

Head network
The prediction head of YOLOX has been changed from the YOLO Head of the previous YOLO series to a Decoupled Head, which greatly improves the convergence speed. It consists of a 1 × 1 conv layer to reduce the number of channels, followed by two parallel branches with two 3 × 3 conv layers each. The specific structure is shown in Figure 4.
The previous YOLO series of networks were anchor-based, which requires anchor boxes to be designed in advance, samples the image densely, and contains a lot of negative sample information, resulting in a huge amount of computation. YOLOX adopts an anchor-free method, reducing the number of predictions per position from 3 to 1. This reduces the number of parameters in the prediction head and improves both performance and speed.
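A simplified sketch of such a decoupled head is shown below (the hidden channel width and the omission of BN/activation layers are simplifying assumptions; the real YOLOX head builds these branches from its own conv blocks):

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Sketch of a YOLOX-style decoupled head: a 1x1 stem conv, then
    parallel classification and regression branches."""
    def __init__(self, in_ch: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, hidden, 1)  # 1x1 conv: reduce channels
        self.cls_branch = nn.Sequential(         # classification branch
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.Conv2d(hidden, hidden, 3, padding=1))
        self.reg_branch = nn.Sequential(         # regression branch
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.Conv2d(hidden, hidden, 3, padding=1))
        self.cls_pred = nn.Conv2d(hidden, num_classes, 1)  # class scores
        self.reg_pred = nn.Conv2d(hidden, 4, 1)            # box offsets
        self.obj_pred = nn.Conv2d(hidden, 1, 1)            # objectness

    def forward(self, x):
        x = self.stem(x)
        c = self.cls_branch(x)
        r = self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)
```

Since the network is anchor-free, each spatial position emits exactly one box prediction (4 values), one objectness score, and one class-score vector.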

Loss function
The loss function of YOLOX includes three parts: iou_loss, obj_loss and cls_loss. Among them, iou_loss measures the agreement between the predicted box and the ground-truth box, as shown in formulas (1) and (2); obj_loss determines whether a target box contains an object or background, and cls_loss determines the category, and both use BCEWithLogitsLoss, as shown in formulas (3) and (4):

IoU = |A ∩ B| / |A ∪ B|    (1)

iou_loss = 1 − IoU    (2)

l_n = −[y_n · log σ(x_n) + (1 − y_n) · log(1 − σ(x_n))]    (3)

loss(x, y) = (1/N) · Σ_{n=1}^{N} l_n    (4)

In formula (1), A is the predicted box and B is the ground-truth box. The input tensors in formulas (3) and (4) are (x, y), N is the batch size with n labels predicted for each batch, and σ represents the sigmoid function.
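As an illustration, formulas (1)-(4) can be implemented as follows (a minimal sketch; the (x1, y1, x2, y2) box format and the epsilon term are assumptions added for numerical safety):

```python
import torch
import torch.nn as nn

def iou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Formulas (1) and (2): loss = 1 - IoU for (x1, y1, x2, y2) boxes."""
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)  # epsilon avoids /0
    return (1.0 - iou).mean()

# Formulas (3) and (4): the obj and cls branches use BCE with logits,
# which applies the sigmoid internally and averages over the batch.
bce = nn.BCEWithLogitsLoss()
logits = torch.tensor([2.0, -1.0])
labels = torch.tensor([1.0, 0.0])
loss = bce(logits, labels)  # mean of the per-element BCE terms
```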

SE module
The full name of the SE module is the Squeeze-and-Excitation module, which mainly includes two parts: Squeeze and Excitation. The basic principle of SE is to process the convolved feature map to obtain a one-dimensional vector whose length equals the number of channels; each value of this vector serves as an evaluation score for the corresponding channel, from which the channel attention is obtained. The whole process is shown in Figure 5.
The first step is the squeeze operation, which uses global average pooling to compress the feature maps into a 1 × 1 × C vector. The second step is the excitation operation, which uses two fully connected layers and an activation function to model the relationships between channels; normalized weights are then obtained through the Sigmoid activation function. The last step is the scale operation, in which each channel of the original feature map is weighted by multiplication, completing the re-calibration of the original features by channel attention.
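The three steps can be written as a compact PyTorch module (a standard SE implementation; the reduction ratio of 16 is the common default, not a value specified in this paper):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze (global average pool), excitation
    (two FC layers + Sigmoid), then channel-wise rescaling (scale)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: H x W x C -> 1 x 1 x C
        self.fc = nn.Sequential(             # excitation: model channel links
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())                    # normalized weights in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # scale: re-calibrate channels
```

The output has exactly the shape of the input, so the block can be dropped into an existing network without changing any downstream layer.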

Improved YOLOX model
Since small targets occupy only a tiny area of the feature map, their features are difficult to train effectively, resulting in poor feature learning and low detection accuracy for small targets. In this regard, this paper adds the SE module to enhance important features and weaken unimportant ones by controlling the channel scale, making the extracted features more directional and thereby improving small target detection. The network structure is based on YOLOX, and the improved structure is shown in Figure 6. The overall idea is as follows: SE modules are added to the three effective feature layers that are input to the feature pyramid, so that during upsampling the feature pyramid of the neck fuses important features that have already been calibrated and screened, thereby improving the model's small target detection capability.
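The placement described above can be sketched as one SE block per effective feature layer, applied before the features enter the neck (the channel widths and module names here are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Compact Squeeze-and-Excitation block (reduction ratio assumed 16)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x).view(x.size(0), -1, 1, 1)

class SENeckInput(nn.Module):
    """Recalibrate the three effective feature layers with SE before
    they are fed into the FPN/PAN neck."""
    def __init__(self, channels=(128, 256, 512)):  # assumed widths
        super().__init__()
        self.blocks = nn.ModuleList(SE(c) for c in channels)

    def forward(self, feats):
        # feats: list of the three backbone outputs, shallowest first
        return [se(f) for se, f in zip(self.blocks, feats)]
```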

Datasets and data augmentation
The experimental data consist of fabric defect images collected by ourselves and annotated with the LabelImg software. There are eight types of defects in the dataset: PASS track, PASS tracks, ink leakage, blank, ghost, fluff, water leakage, and pulp. There are 10,320 images in total, 80% of which are used as the training set, 10% as the validation set, and 10% as the test set.
Data augmentation can expand the dataset and improve the robustness of the model. This experiment uses the Mosaic data augmentation method. The basic principle is to randomly crop four images and then stitch the crops into one image, which not only enriches the image backgrounds but also effectively increases the batch size. The specific operation is shown in Figure 7.
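A minimal sketch of the stitching step is shown below (bounding-box remapping, random scaling and the exact cropping rules of the real Mosaic implementation are omitted; the image sizes and random-center range are assumptions):

```python
import numpy as np

def mosaic(imgs, out_size: int = 640) -> np.ndarray:
    """Stitch four HWC uint8 images (each at least out_size x out_size)
    into one out_size x out_size canvas split at a random center."""
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),        # top half
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]  # bottom
    for img, (x1, y1, x2, y2) in zip(imgs, regions):
        h, w = y2 - y1, x2 - x1
        canvas[y1:y2, x1:x2] = img[:h, :w]  # crop each image's corner to fit
    return canvas
```

In a real pipeline the ground-truth boxes of each source image must be translated and clipped to the region they land in, otherwise labels and pixels no longer match.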

Model training
The experiment is trained and tested on the Ubuntu operating system. The CPU is an Intel(R) Xeon(R) E5-2680 v3 @ 2.50 GHz with 128 GB of memory, the graphics card is an NVIDIA RTX 2080Ti, and the deep learning framework is PyTorch. During training, the batch size is set to 16; with 8256 images in the training set, one epoch contains 516 batches. The training schedule is 200 epochs, of which the first 50 are the freezing stage: the backbone is frozen so that the feature-extraction network does not change, which keeps the memory footprint small and speeds up training because only the rest of the network is fine-tuned. The model is trained by stochastic gradient descent (SGD), the initial learning rate is set to 0.01, and the learning rate is adjusted by a cosine annealing schedule.
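The optimizer and learning-rate settings described above correspond to the following PyTorch setup (the momentum value and the stand-in model are assumptions; the real training loop, freezing logic and data loading are omitted):

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 16, 3)  # stand-in for the detector
# Initial lr 0.01 as in the paper; momentum 0.9 is an assumed value.
opt = SGD(model.parameters(), lr=0.01, momentum=0.9)
sched = CosineAnnealingLR(opt, T_max=200)  # anneal over 200 epochs

for epoch in range(200):
    # ... one epoch (516 batches of 16 images) would run here ...
    opt.step()    # placeholder so the scheduler can advance correctly
    sched.step()  # cosine-anneal the learning rate once per epoch

# After 200 epochs the learning rate has annealed toward 0.
```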

Evaluation indicators
In this paper, Average Precision (AP), mean Average Precision (mAP), Frames Per Second (FPS), and the number of parameters are used as evaluation indicators. AP and mAP are used to evaluate the detection accuracy of the model, and FPS is used to evaluate the detection speed. The precision rate and the recall rate are used to calculate AP. The calculation formulas of precision and recall are formulas (5) and (6), and the calculation formulas of AP and mAP are formulas (7) and (8):

Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

AP = ∫₀¹ P(R) dR    (7)

mAP = (1/k) · Σ_{i=1}^{k} AP_i    (8)

where TP is the number of correctly predicted positive samples, FN is the number of incorrectly predicted negative samples, FP is the number of incorrectly predicted positive samples, and k is the total number of detected categories.
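Formulas (5)-(8) can be computed as follows (all-point interpolation of the precision-recall curve is one common way to evaluate the integral in formula (7); the matching of predictions to ground truth that yields TP/FP/FN is omitted):

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Formulas (5) and (6)."""
    return tp / (tp + fp), tp / (tp + fn)

def voc_ap(recalls, precisions) -> float:
    """Formula (7): area under the precision-recall curve,
    using all-point interpolation."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):  # make precision monotone decreasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]   # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Formula (8): mAP is the mean of the per-class APs over the k categories.
ap_per_class = [voc_ap([1.0], [1.0]) for _ in range(8)]
m_ap = sum(ap_per_class) / len(ap_per_class)
```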

Experimental results and analysis
In general, a pre-trained model is downloaded from the Internet before network training. Online pre-trained models are trained on different datasets; the commonly used public datasets are the VOC dataset and the MS COCO dataset.
In order to verify the influence of the pre-trained model on the final network performance, we downloaded two pre-trained models based on different datasets, trained each with our algorithm, and evaluated on two datasets to ensure the reliability of the experimental results. The final experimental results are shown in Table 1.
As can be seen from Table 1, for the same network and the same dataset, the model pre-trained on the COCO dataset ultimately yields better network performance. Therefore, the COCO pre-trained model is selected for all networks.
The loss curve of the improved YOLOX network model is shown in Figure 8. According to the idea of transfer learning, the features of the backbone network are generic, so a model pre-trained on the COCO dataset is used to speed up training. The training is divided into two stages, a freezing stage and an unfreezing stage; the freezing stage is used to speed up training when machine performance is insufficient. The backbone network is unfrozen after 50 epochs, and training then continues for another 150 epochs, for a total of 200 epochs. As can be seen from Figure 8, the loss value decreased in the first 50 epochs, fluctuated after the backbone network was unfrozen, and then began to decrease rapidly. The training loss and validation loss gradually stabilized after 175 epochs, at 2.59 and 2.88 respectively, and the model reached convergence.
In order to verify the performance of the improved model, we compared commonly used lightweight object detection network models with ours; the detection accuracy results are shown in Table 2. The comparison shows that our model performs best on mAP, AP50, AP75, APM and APL. Compared with YOLOX-Tiny, mAP improved by 2.7 percentage points, AP50 by 2.3, AP75 by 3.3, APS by 1.3, APM by 2.4 and APL by 4.2 percentage points.
The detection speed results are also shown in Table 2. The network model with the fewest parameters is YOLOX-Tiny; ours is second because of the added SE module, which however adds only 0.024 M parameters. The model with the fastest detection speed is YOLOv4-Tiny; ours is the third fastest, but YOLOv4-Tiny has the worst detection accuracy, so in terms of overall performance our model performs best. Compared with other models, the improved algorithm adds the SE module, so the computation increases and the detection speed drops slightly; however, while the speed reduction is small, its detection accuracy is significantly higher. Therefore, the improved algorithm proposed in this paper can effectively improve the performance of fabric defect detection. The final detection effect is shown in Figure 9.
In order to verify the robustness of our model, we also conducted comparative tests on the public VOC dataset; the experimental results are shown in Table 3.
As can be seen from Table 3, in terms of detection accuracy SSD is optimal on mAP, AP50, AP75 and APL, with values of 52.3, 77.5, 59.0 and 60.9, while our model is optimal on APS and APM, with values of 19.6 and 31.8. Because SSD has a large number of parameters, its detection performance is stronger after training on a large amount of data; however, the large number of parameters also requires more computation, so SSD performs poorly in detection speed, reaching only 10 FPS, which cannot meet real-time requirements. Our model reaches 30 FPS, which meets the real-time requirement. Therefore, our algorithm still offers the best comprehensive performance.

Conclusion
This paper proposes a YOLOX defect detection algorithm that introduces the SE attention module. The method is based on the YOLOX network structure. By introducing the SE attention module, important features are enhanced and unimportant features are weakened, so that the extracted features are more directional. This effectively alleviates the impact of small feature sizes on the detection accuracy of small targets, improves the model's detection accuracy for small targets, and improves multiple performance indicators with almost no increase in model parameters, while meeting the requirements of real-time detection. The experimental results show that the mAP of this algorithm on our dataset reaches 66.2, AP50 reaches 93, and the other four detection indexes are also optimal. In terms of detection speed, our algorithm reaches 31 FPS, which meets the real-time requirement. In summary, the comprehensive performance of our algorithm is optimal, and it can accomplish the task of digital printing fabric defect detection.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figure 2. Focus structure: (a) the 3-channel feature map of the input image and (b) the 12-channel feature map with the width and height information concentrated into the channels.

Figure 5. Overall structure of the SE module.

Table 1. Performance comparison experiment of different pre-trained models.

Table 2. Results of the detection accuracy comparison.

Table 3. Comparative test based on the VOC dataset.