A context-aware progressive attention aggregation network for fabric defect detection

Fabric defect detection plays a critical role in quality control in the textile manufacturing industry. Deep learning-based saliency models can quickly spot the regions that most attract human attention in a complex background, and have been successfully applied to fabric defect detection. However, most previous methods adopt multi-level feature aggregation yet ignore the complementary relationships among different features, resulting in poor representation of tiny and slender defects. To remedy these issues, we propose a novel saliency-based fabric defect detection network that exploits the complementary information between different layers to enhance the representation ability and discrimination of defect features. Specifically, a multi-scale feature aggregation unit (MFAU) is proposed to effectively characterize multi-scale contextual features. Besides, a feature fusion refinement module (FFR), composed of an attention fusion unit (AFU) and an auxiliary refinement unit (ARU), is designed to exploit complementary important information and further refine the input features, enhancing the discriminative ability of defect features. Finally, multi-level deep supervision (MDS) is adopted to guide the model to generate more accurate saliency maps. Under different evaluation metrics, our proposed method outperforms most state-of-the-art methods on our developed fabric datasets.


Introduction
Defect detection on the fabric surface is essential for the textile industry. During the production process, due to the influence of machine or human factors, various defects, such as warp, weft, or point defects, often appear on the fabric surface. 1 Traditional manual visual inspection tends to be susceptible to human factors such as inattentiveness or fatigue, resulting in low detection accuracy and efficiency. Importantly, unlike automated detection methods, manual inspection is typically based on the inspectors' practical experience, so it cannot provide a quantifiable measurement of the effectiveness of defect detection. It has been reported, based on feedback from the textile industry, that the accuracy of manual inspection only reaches 60-75%. 2 With the development of digital image processing and machine vision, a number of automatic detection methods have recently been proposed to replace manual inspection, which are mainly divided into four categories: statistical methods, 3,4 structural methods, 5,6 spectral methods, 7-10 and model-based methods. 11,12 Nevertheless, these traditional detection methods are susceptible to changes in illumination, background texture, and camera angle, which limits their usage in real-world applications.
Visual saliency detection imitates human visual characteristics 13 to determine where objects or regions in an image are most likely to attract human attention; it has attracted increasing attention and shown great success in a variety of fields, including hyper-spectral anomaly detection, 14 saliency detection, 15 and real-time wood classification. 16 As fabric defects are generally salient in fabric images compared to the complex texture background, fabric defect detection can be regarded as a salient object detection (SOD) problem, and thus several visual saliency-based fabric defect detection methods 17-19 have been developed and achieved good performance. Earlier saliency-based approaches, however, rely on hand-crafted features, such as color, intensity, and contrast, to characterize local details and global contexts, leaving them unable to capture high-level features with semantic clues, which limits their capability to locate complete defect objects in complex scenarios.
With the rapid development of convolutional neural networks (CNNs), CNN-based methods have broken the restrictions of traditional SOD methods that depend on hand-crafted features, and have pushed the performance of SOD to a new level owing to their powerful capability of extracting both high-level semantics and low-level texture details; they have been widely applied in various vision tasks, such as image classification, 20 semantic segmentation, 21 image retrieval, 22 image and video compression, 23 scene classification, 24 and object recognition. 25 Benefiting from this, some learning-based methods 25-34 have been proposed for fabric defect detection, achieving remarkable detection performance and a promising speed-accuracy trade-off. However, two main challenges remain for the saliency-based defect detection task.
First, most existing saliency models based on feature fusion mainly use multi-mode fusion of multi-level features, ignoring the complementary relationship between features at different scales and layers, which results in poor characterization of small and slender defects and leads to unsatisfactory detection results. As shown in Figure 1, the hole (see Figure 1(a)) and stain defects (see Figure 1(b)), as common defects, have regular shapes and scales and are relatively easy to detect. However, slender defects (see Figure 1(c)), small-scale defects (see Figure 1(d)), broken yarn (see Figure 1(e)), and off-yarn (see Figure 1(f)) are usually tiny and irregular in shape, making it more difficult to capture effective features compared with defects of regular shape. Second, owing to the low contrast with the background texture (see Figure 1(g) and (h)), the existing methods cannot effectively extract defect features to improve the detection accuracy.
To address the above issues, we propose a novel saliency-based context-aware progressive attention aggregation network for fabric defect detection, named CPA²Net, which achieves excellent performance in detecting all kinds of defects, especially tiny and slender ones. Specifically, a context-aware multi-scale module (CAMS), comprised of multi-scale feature aggregation units (MFAUs), is proposed to leverage a series of dilated convolutional layers to enlarge the receptive field without increasing the computational load, which is conducive to detecting defects of various scales and to the representation learning of the model. MFAU cascades multiple dilated branches, in which cross-level dense connections are added in a top-down manner to sufficiently aggregate context-aware multi-scale, multi-receptive-field features for better predictions.
However, the aggregated features obtained by MFAU may contain redundant information, which can result in noise and blurry boundaries. Furthermore, considering that defects commonly have low contrast with the background texture, it is essential to enhance the visibility and discrimination of the defects. As such, a feature fusion refinement module (FFR) is proposed to mitigate the above issues, which is mainly composed of the attention fusion unit (AFU) and the auxiliary refinement unit (ARU). Specifically, AFU is designed to reduce redundancy and adaptively select useful information, concentrating on important regions while suppressing background interference; a coordinate attention (CA) 35 module, followed by sum and concatenation operations, is adopted to embed positional information into channel attention and enhance the discriminative capability of defect features for complete predictions. In addition, low-level features commonly contain rich detailed information, while high-level features contain more global semantic clues. Therefore, to take advantage of the cross-level complementary information for optimizing the training phase, a feedback operation is performed in a top-down pathway to transfer semantic information from deeper layers to shallower layers. Then, ARU is designed to further refine and highlight the input features. More importantly, we adopt a multi-layer loss function, as done in Hou et al., 36,37 to supervise the side outputs for more accurate prediction maps.
The contributions of our work can be summarized as follows:
• A novel context-aware progressive attention aggregation network, CPA²Net, is proposed for fabric defect detection, which enhances the discriminative ability of defect feature representations for accurate detection.
• CAMS is designed to enlarge the receptive field by cascading several parallel MFAUs with dilated convolution layers, better coping with the scale variation of fabric defects, especially tiny and linear defects.
• FFR, composed of AFU and ARU, enables our model not only to combine spatial and channel information to focus on defective areas, but also to enhance the contrast between defects and background, further improving the discrimination of defect features.
• Compared with 14 state-of-the-art saliency models in terms of six evaluation metrics, our proposed method shows superior performance on our built datasets.
The rest of this paper is organized as follows. Section 2 briefly discusses related research. Section 3 presents the details of our proposed saliency-based fabric defect detection method, in which we describe the proposed CAMS, FFR, and the multi-level deep supervision (MDS) technique. Section 4 reports the experimental results, and Section 5 concludes the paper.

Traditional fabric defect detection methods
In this section, previous fabric defect detection methods are introduced from two aspects, and the advantages and disadvantages of previous work are summarized in Table 1.
Statistical methods. Statistical methods usually distinguish defective and defect-free regions by analyzing first-order and second-order statistics, such as histogram features, mathematical morphology, grey-level co-occurrence matrices, and auto-correlation functions. Hamdi et al. 38 combined a grey-level co-occurrence matrix and Euclidean distance followed by a selection threshold to achieve fabric defect detection; however, this method is not suitable for color images. Zhang and Bresee 39 combined morphology with an auto-correlation function for defect detection, which is robust to illumination and noise but suffers from higher complexity and computational cost.

Spectral methods. Spectral methods locate defects by highlighting the differences between defective and defect-free regions in the spectral domain, which works well for emphasizing defect edges. Fourier 7 and wavelet transforms 10,40 and Gabor filtering 8,9 are commonly used in spectral methods. Jing et al. 47 combined a genetic algorithm and a Gabor filter to filter patterned fabric images, and then segmented the processed images to localize the defects. Hu et al. 40 combined the wavelet transform and Fourier analysis to examine fabrics with periodic texture backgrounds through an unsupervised algorithm. However, these models only work on repetitively textured or unpatterned fabric images, and their accuracy is highly dependent on the specified parameters and filters, making them inefficient at generalizing and adapting to new situations.
Structural methods. Structural methods regard the texture as primitives, extract the structural features of the fabric texture, and infer their placement rules, so that the texture background of the entire fabric image can be inferred from a simple structural law of the background texture. Li et al. 41 proposed an algorithm with self-adaptive partition block modeling that utilizes this structural characteristic.

Model-based methods. Model-based fabric defect detection algorithms model the pattern and texture information to generate model parameters, and perform defect detection by judging whether the model parameters are satisfied; examples include Markov random fields, 42 Bayesian models, 43 and low-rank decomposition models. 48 Zhang et al. 49 segmented jacquard warp-knitted fabric images using jacquard fabric characteristics and Markov random field theory. Mottalib et al. 50 used a Bayesian model to accurately classify fabric defects based on their geometric features. Nevertheless, model-based methods are seldom utilized due to their high dependency on data and complex calculation.

Salient object detection-based fabric defect detection methods
Handcrafted feature-based salient object detection. In recent years, visual saliency has gained increasing research interest and has been successfully applied in many fields. Inspired by this, many researchers have modeled visual saliency for fabric defect detection and achieved great progress. By analyzing the saliency of local textures in context, Liu et al. 44 proposed a fabric defect detection model that performed significantly better than other models on plain fabric images. A learned dictionary was employed by Li et al. 19 to generate saliency maps, and a modified valley-emphasis method was used to segment the defective regions. Zhang et al. 51 characterized global and background features using visual saliency maps, and then utilized a support vector machine (SVM) to classify fabric defects. However, traditional visual saliency methods mainly rely on hand-crafted low-level features, which limits their capability to locate complete defect regions in cluttered backgrounds because they lack high-level semantic knowledge.

Deep-learning-based salient object detection.
With the development of convolutional neural networks (CNNs), which have a powerful capacity for extracting multi-scale features at different levels containing both rich details and rich semantic cues, CNN-based SOD methods have achieved unprecedented success and been successfully applied to fabric defect detection. Xie et al. 45 divided the detection process into model training and defect localization, in which a stacked denoising convolutional auto-encoder was used for image reconstruction in the training stage, and the detected images were divided into several blocks for localization in the positioning stage. Girshick et al. 46 proposed the region-based convolutional neural network, using candidate regions plus convolutional neural networks instead of traditional handcrafted designs to detect defects. Liu et al. 52 used convolutional neural networks to detect fabric flaws based on a point-to-point approach, which performed well on fabric images with complex background textures. The deep saliency model developed by Wang et al. 53 incorporates self-attention mechanisms into a convolutional neural network to detect fabric defects. Liu et al. 54 used a CNN-based SOD model to capture fabric features and combined it with low-rank models to reveal the defects. However, these methods only concentrate on designing a delicate structure to fuse multi-level features, ignoring how to extract powerful, discriminative features and how to fuse them effectively, which may result in undesirable prediction results.

Proposed method
The overall network architecture is shown in Figure 2, which mainly consists of four parts: 1) the initial feature extraction block (IFEB); 2) the context-aware multi-scale module (CAMS); 3) the feature fusion refinement module (FFR); and 4) multi-level deep supervision (MDS). The detailed process is as follows. First, IFEB pre-extracts low-level, mid-level, and high-level defect features at different scales, and the parameters are optimized through forward and backward propagation using the training fabric images and the corresponding ground truth. Second, CAMS characterizes the multi-scale, multi-receptive-field defect features via a series of parallel multi-scale feature aggregation units (MFAUs) with cascaded dilation layers, so that the output features of all levels have the same number of channels. Then, FFR transfers complementary and effective information from deeper layers to shallower layers through a set of attention fusion units (AFUs). To refine the defect boundaries for correctly segmenting the defect regions, the output features generated by AFU are fed into the auxiliary refinement unit (ARU). Finally, four side-output predictions are generated, and the first side output is chosen as the final prediction. Moreover, MDS is of great importance in the training phase, as it facilitates the optimization and the performance improvement of the proposed model.

Initial feature extraction block
The initial feature extraction block (IFEB) pre-extracts the multi-scale defect features at different layers, using a CNN as the initial network. CNNs have been widely adopted in the computer vision field, in which VGG 55 and ResNet 56 are the most popular backbone networks, but neither is applicable to every detection task. Specifically, ResNet has outstanding classification performance, but it has many layers and a complex structure, which makes the model more difficult to train and test due to the large number of parameters and heavy computational load. In contrast, VGG has excellent generalization capability and a relatively simple architecture, which makes it easier to train and deploy. Therefore, we choose VGG16 as our backbone network, which is constructed from 13 convolutional layers and 3 fully connected (FC) layers. In our model, we cast away the three FC layers and remove the last pooling layer of VGG16 to preserve the details of the last convolutional layer in IFEB.

Context-aware multi-scale module
Due to the complex and diverse texture of fabric and the varying sizes and irregular shapes of fabric defects, fabric defect detection is difficult and challenging. To realize accurate detection, we have to utilize as much context information as possible to cope with the scale variation of fabric defects. In a CNN, the context information is closely related to the size of the receptive field. Convolution kernels of different sizes have different receptive fields, and larger kernels incur larger computational loads. Fortunately, dilated convolution provides a promising solution, which can capture more context information without increasing the computation. Inspired by Yang et al., 57 Dong et al., 58 and Liu et al., 59 CAMS, composed of a series of parallel MFAUs, is designed to capture context-aware multi-scale defect features and effectively fuse them to cope with the scale variation of defects, especially tiny and linear defects.
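The key property used here can be verified in a few lines: a 3 × 3 convolution with dilation rate d covers a (2d + 1) × (2d + 1) window while keeping the same nine weights per channel pair, so the receptive field grows at no extra parameter cost. A minimal illustration (not the paper's code):

```python
# Minimal illustration: a dilated 3x3 conv has the same parameter count as a
# plain 3x3 conv, but a larger receptive field; padding = dilation keeps the
# output resolution unchanged.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 44, 44)
plain = nn.Conv2d(64, 64, 3, padding=1, dilation=1)
dilated = nn.Conv2d(64, 64, 3, padding=3, dilation=3)

assert plain(x).shape == dilated(x).shape  # same spatial resolution
n_plain = sum(p.numel() for p in plain.parameters())
n_dilated = sum(p.numel() for p in dilated.parameters())
print(n_plain == n_dilated)  # True: identical parameter count
```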
The detailed structure of MFAU are shown in Figure 3.

Formally, let F ∈ ℝ^{H×W×C} denote the feature generated by one of the side outputs of IFEB. We apply several dilated convolutional layers with different dilation rates d ≥ 1 to F to obtain output features D ∈ ℝ^{H×W×C}, which have the same spatial resolution as F but a larger receptive field. Specifically, there are five cascaded dilated convolution branches: a 1 × 1 convolution that is conducive to retaining the spatial information of the original feature maps, and 3 × 3 convolution layers with k different dilation rates adopted to expand the receptive field and obtain more contextual information. To integrate features with different receptive fields, a series of dense connections is added from top to bottom to enhance the feature representation learning, yielding the output features D_1, …, D_5 of the five convolution branches. The process can be described as follows:

D_1 = Conv_{1×1}(F),
D_i = DConv_{3×3}^{d_{i−1}}(Cat(F, D_1, …, D_{i−1})), i = 2, …, 5,

where Conv_{1×1} and DConv_{3×3}^{d} denote the 1 × 1 convolution layer and the 3 × 3 dilated convolution layer with dilation rate d, respectively, and Cat(·, ·) denotes the concatenation operation.

In addition, a residual connection is applied to avoid information loss. The output features of the five branches are then concatenated with the input features to effectively fuse multi-scale context information, better adapting to the scale variation of fabric defects. Finally, the fused features are passed through a convolution layer followed by a batch normalization layer (BN 60) and a nonlinear activation layer (ReLU 61). The final feature map C can be described as

C = Concate(Cat(D_1, D_2, D_3, D_4, D_5, F)),

where Concate is a combination of a convolution layer, a BN layer, and a ReLU layer, and C ∈ ℝ^{H×W×C}. Note that we set k = 4 and ϑ = {1, 3, 5, 7} in our experiments.
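The unit above can be sketched in PyTorch. This is a hedged reconstruction of an MFAU-style block from the description only (branch wiring and channel sizes are our assumptions, not the authors' released code): a 1 × 1 branch plus k = 4 dilated 3 × 3 branches with rates {1, 3, 5, 7}, densely connected in order, then fused with the input via Conv-BN-ReLU:

```python
# Hedged sketch of an MFAU-style unit (our reconstruction, not official code).
import torch
import torch.nn as nn

class MFAU(nn.Module):
    def __init__(self, c, rates=(1, 3, 5, 7)):
        super().__init__()
        self.branch0 = nn.Conv2d(c, c, 1)  # retains spatial information
        self.branches = nn.ModuleList()
        for i, d in enumerate(rates):
            # each dilated branch sees the input plus all previous outputs
            in_c = c * (i + 2)
            self.branches.append(nn.Conv2d(in_c, c, 3, padding=d, dilation=d))
        self.fuse = nn.Sequential(  # "Concate": Conv + BN + ReLU
            nn.Conv2d(c * 6, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())

    def forward(self, f):
        outs = [self.branch0(f)]
        for conv in self.branches:
            outs.append(conv(torch.cat([f] + outs, dim=1)))
        return self.fuse(torch.cat(outs + [f], dim=1))  # residual concat

x = torch.randn(1, 64, 22, 22)
print(MFAU(64)(x).shape)  # torch.Size([1, 64, 22, 22])
```

Note that concatenating the input with all five branch outputs before the final Conv-BN-ReLU mirrors both the dense connections and the residual concatenation described in the text.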

Feature fusion refinement module
The output features generated by MFAU may suffer from information redundancy, which should be removed to avoid interference, prevent inaccurate predictions, and improve detection speed. Moreover, different layers have different characteristics, so the complementary information between layers plays an important role in accurate fabric defect detection. However, how to effectively utilize this cross-scale complementary information and how to enhance the discriminative capability of defect features remain two great challenges in fabric defect detection. More importantly, the defect boundaries are usually blurry due to low contrast with the background texture, which results in incomplete or even false segmentation results. To address these issues, FFR is designed. In AFU, coordinate attention (CA 35) is first applied to the high-level and low-level features to select important information and enhance the discrimination of defect features.
Second, the outputs obtained by CA are integrated through an element-wise sum operation. Then, the integrated features are concatenated with the CA-processed high-level features and passed through convolution layers. Finally, the output feature A_i is fed into the next AFU to integrate with lower-level features, which is conducive to effectively utilizing the complementary information of different levels to refine defect features progressively. Then, ARU is designed for the refinement part; its structure is illustrated in Figure 4, and it takes A_i ∈ ℝ^{H×W×C} as input. A 3 × 3 convolution layer first reduces the number of channels of A_i from C to C/2, producing B_i ∈ ℝ^{H×W×C/2}. The channel number is then transformed from C/2 back to C through another 3 × 3 convolution layer to generate D_i ∈ ℝ^{H×W×C}. More importantly, D_i is then split into a mask W ∈ ℝ^{H×W×C/2} and a bias b ∈ ℝ^{H×W×C/2}, which are combined with B_i by element-wise multiplication and summation to generate the refined prediction maps. Finally, we squeeze the output channels to 1 via a 1 × 1 convolutional layer for subsequent supervision. This process can be expressed by

R_i = Conv7(ReLU(W ⊗ B_i ⊕ b)),

where ⊗ and ⊕ denote element-wise multiplication and element-wise summation, and Conv6 and Conv7 denote the 3 × 3 and 1 × 1 convolution layers, respectively.
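The refinement step can be sketched as follows. This is a hedged reading of the ARU description (layer names, the channel split, and the placement of ReLU are our assumptions): a 3 × 3 convolution halves the channels to C/2, another 3 × 3 convolution restores C, the result is split into a mask W and a bias b, and the refinement W ⊗ B_i ⊕ b is followed by ReLU and a 1 × 1 convolution:

```python
# Hedged sketch of an ARU-style refinement unit (our reconstruction).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARU(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv_down = nn.Conv2d(c, c // 2, 3, padding=1)  # C -> C/2
        self.conv_up = nn.Conv2d(c // 2, c, 3, padding=1)    # C/2 -> C
        self.conv_out = nn.Conv2d(c // 2, 1, 1)              # squeeze to 1

    def forward(self, a):
        b_feat = self.conv_down(a)             # B_i, C/2 channels
        d = self.conv_up(b_feat)               # D_i, C channels
        w, bias = torch.chunk(d, 2, dim=1)     # split into mask W and bias b
        refined = F.relu(w * b_feat + bias)    # element-wise refine
        return self.conv_out(refined)          # 1-channel side prediction

x = torch.randn(1, 64, 22, 22)
print(ARU(64)(x).shape)  # torch.Size([1, 1, 22, 22])
```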

Multi-level deep supervision
Recent works have pointed out that the effective integration of multi-scale features is essential for saliency detection. 62,63 Specifically, low-level features contain rich details and high-level features contain rich semantic information, but this information may be diluted or removed by the pooling layers. Therefore, to reconstruct this information, we further supervise the side outputs to compensate for the dilution and loss of information. Through interpolation, MDS generates supervised defect feature maps with the same spatial resolution as the fabric defect image, hence accelerating the feature learning process and improving defect detection performance. Moreover, we apply the binary cross-entropy loss with logits as our loss function to mitigate the class imbalance problem, which can be calculated by

l_i = −Σ_n [ y_n log σ(x_n) + (1 − y_n) log(1 − σ(x_n)) ],

where l_i denotes the loss of the i-th side-output prediction, y_n ∈ {0, 1} indicates whether pixel n belongs to the object in the ground truth, σ is the sigmoid operation, and σ(x_n) ∈ (0, 1) is the predicted probability of the pixel belonging to the salient object. Note that, different from HED 62 and DSS, 63 we only choose the first side output as our final prediction instead of integrating all side outputs.
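The supervision scheme above admits a compact sketch (a standard reading of BCE-with-logits over upsampled side outputs; equal weighting of the side losses is our assumption):

```python
# Hedged sketch of multi-level deep supervision: each side-output logit map
# is bilinearly upsampled to the ground-truth resolution and supervised with
# binary cross-entropy with logits.
import torch
import torch.nn.functional as F

def mds_loss(side_outputs, gt):
    """side_outputs: list of (N, 1, h_i, w_i) logits; gt: (N, 1, H, W) in {0, 1}."""
    total = 0.0
    for logits in side_outputs:
        up = F.interpolate(logits, size=gt.shape[-2:], mode='bilinear',
                           align_corners=False)
        total = total + F.binary_cross_entropy_with_logits(up, gt)
    return total

gt = (torch.rand(2, 1, 352, 352) > 0.5).float()
sides = [torch.randn(2, 1, 352 // s, 352 // s) for s in (1, 2, 4, 8)]
loss = mds_loss(sides, gt)
```

At inference time only the first (full-resolution) side output would be kept, matching the text.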

Experimental setup
Evaluation datasets. We evaluate our model on two fabric datasets that we built: a plain dataset and a pattern dataset. The plain dataset comprises 2200 training images and 500 testing images, and includes many slender defects with low contrast against the texture background as well as tiny defects. The pattern dataset contains 5948 training images and 500 testing images, with more complex background textures and irregularly shaped defects. For data augmentation, we adopt horizontal flipping and rotation to alleviate the over-fitting risk. Note that the plain dataset mainly contains five types of defects: indentation, crease, off-line, stains, and holes. The pattern dataset contains six types of defects: yarn shedding, yarn breakage, yarn belt, cotton ball, holes, and stains. In general, our datasets are challenging, as they contain more defects with low contrast against the background texture, more small-scale defects, and more defect types.
Evaluation metrics. To evaluate the performance of our proposed method and other saliency models, we adopt six popularly used metrics: precision-recall (PR) curves, F-measure curves, maximum F-measure (F_β), mean absolute error (MAE), structural similarity measure (S_m), and E-measure (E_m). Binary maps are compared with the corresponding ground truths to compute the PR curves; the closer a curve is to the upper-right corner, the better the performance. Specifically, binary maps and ground truths are used to compute

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),

where Precision denotes the proportion of detected salient pixels that are correct, Recall is the ratio between detected salient pixels and the salient pixels in the ground truth, and TP, FP, FN denote true positives, false positives, and false negatives, respectively. In general, the PR curves represent the average precision and recall over the saliency maps of all images in the dataset. Since both precision and recall matter in many applications, we adopt the F-measure as the overall performance metric, computed from precision and recall as

F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall),

where β² = 0.3 to stress precision more than recall, as suggested in Liu et al. 64 Note that the larger the area under the curve, the better the performance.
MAE 65 is defined as the average pixel-wise absolute difference between the predicted saliency map S and the binary ground truth L (the smaller, the better):

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − L(x, y)|,

where W and H denote the width and height of a given image, respectively, and S(x, y) and L(x, y) are the pixel values of the saliency map and the binary ground truth at (x, y).
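Minimal NumPy versions of the metrics defined above can make the definitions concrete (the single fixed threshold here is a simplification; the curves in the paper sweep many thresholds):

```python
# Simplified metric implementations matching the definitions in the text.
import numpy as np

def prf_mae(sal, gt, thresh=0.5, beta2=0.3):
    """sal: float saliency map in [0, 1]; gt: binary ground truth in {0, 1}."""
    pred = (sal >= thresh).astype(np.uint8)
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
    mae = np.mean(np.abs(sal - gt))
    return precision, recall, f_beta, mae

gt = np.zeros((4, 4)); gt[1:3, 1:3] = 1.0
p, r, f, mae = prf_mae(gt.copy(), gt)  # a perfect prediction
print(p, r, f, mae)  # all near 1, 1, 1, and exactly 0.0
```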
S-measure (S_m) evaluates the spatial structure similarity between the predicted saliency map and its corresponding ground truth, defined as

S_m = λ · S_o + (1 − λ) · S_r,

where λ ∈ [0, 1] is a balance parameter, typically set to 0.5, and S_o and S_r are the object-aware and region-aware structural similarity, respectively. E-measure (E_m) 66 evaluates two binary saliency maps and takes both pixel-level and image-level properties into account simultaneously:

E_m = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} θ(x, y),

where θ(x, y) represents the enhanced alignment matrix. A better saliency detection model should have a larger F-measure, S-measure, and E-measure, and a smaller MAE. For a fair comparison, we report the values of all metrics to comprehensively evaluate the detection performance of all saliency models.

Implementation details. A VGG16 pre-trained on ImageNet is used as the backbone network. In the training phase, we randomly initialize the weights of each convolution layer in the network from a standard normal distribution; to avoid weight values too large for stable learning, we multiply the weights by a constant of 0.01, and the biases are initialized to 0. We implement our model on the PyTorch platform and train it on an NVIDIA V100 GPU for 43 epochs. The hyperparameters are: batch size 4, epochs 43, momentum 0.9, weight decay 5e-4, and learning rate 5e-5. The following experiments are conducted with all these parameters fixed. The network is optimized with stochastic gradient descent (SGD) during training. We apply data augmentation such as random flipping and multi-scale input images to alleviate the over-fitting risk. All images are resized to 352 × 352 for both training and testing.
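The reported training configuration can be sketched directly (the model here is a stand-in placeholder; the hyperparameters and the scaled-normal initialization are taken from the text):

```python
# Sketch of the reported training setup: SGD with momentum 0.9, weight decay
# 5e-4, learning rate 5e-5; weights drawn from a standard normal scaled by
# 0.01, biases zeroed. The model is a placeholder, not CPA2Net.
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)  # stand-in for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=5e-5,
                            momentum=0.9, weight_decay=5e-4)

def init_weights(m):
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.normal_(m.weight, mean=0.0, std=1.0)
        m.weight.data.mul_(0.01)          # scale to avoid overly large weights
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)  # biases initialized to 0

model.apply(init_weights)
```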
Visual comparison. In Figure 5, we present a visual comparison between the proposed method and the other approaches on our two fabric datasets. Benefiting from the feature pyramid structure of CAMS and its powerful feature extraction capability, our model can detect defects of different shapes and sizes. It can be clearly seen that our model not only highlights the defect regions of all kinds of fabric defects, especially linear defects (see the 5th, 7th, 8th, and 11th rows), but also outlines the boundaries of defects of all sizes, such as tiny defects (1st, 2nd, 6th, 9th, 10th, 12th), medium-sized defects (3rd, 8th), and large defects (4th). In particular, as shown in the last row, some models detect defects with noise (BASNet, PoolNet, and GateNet) or cannot even identify the defects (NLDF, R³Net, and AADFNet) when faced with tiny defects with low contrast to the background. In contrast, our model can not only effectively suppress redundant information to accurately detect and locate the defects, but also outline them clearly and completely. Importantly, our model also achieves excellent detection performance in low-contrast cases (7th, 9th, 10th, 11th, 12th), where the fabric defects blend into the background texture. Our model achieves this superior performance because the proposed FFR removes redundancy, focuses on important and salient regions, and refines features, forcing the model to learn more useful features for generating accurate and complete saliency maps. In addition, MDS enhances the generality of our model, making it more robust to variations in the input images. The advantages mentioned above bring our results closer to the ground truth, making them superior to other state-of-the-art saliency detection methods.
Quantitative comparison. We compare the quantitative evaluation results with the state-of-the-art saliency detectors on our two fabric test datasets in terms of F_β, MAE, S_m, and E_m in Table 2. It is evident that our method consistently outperforms the others on both fabric datasets, showing its superiority in detecting defects. In particular, our method outperforms the others by a significant margin on the plain dataset, where the contrast between the defects and the background texture is low and there are more linear and tiny defects, showing that our method handles challenging images more effectively. In addition to these four measures, we also compare the PR curves and F-measure curves of the proposed model and the others on our two fabric datasets in Figure 6. Among all competitors, the PR curve of our approach (red) is the most outstanding. We obtain a much higher precision score than the competitors as the recall score approaches 1, demonstrating that our method produces fewer false positives. Besides, the F-measure curve is also much higher than those of the other methods across thresholds in [0, 1], which proves the effectiveness of our method. All in all, the experimental results demonstrate that our method is more robust and powerful than the other approaches on our fabric datasets (Table 3).
Model performance. Our network takes only about 0.056 seconds to process an image on an NVIDIA GTX 1080Ti GPU. For a fair comparison, all competing methods were rerun using the source code provided by the authors under the same hardware configuration. Our model runs at 18 FPS (frames per second) when processing a 352 × 352 image, which is relatively fast compared with the other models processing images of the same size. Besides, Table 4 also reports the parameter comparison of the various models, indicating that our model has the fewest parameters and minimal computational resource overhead. When processing images of the same size, however, our method is slower than real-time saliency models such as PoolNet, 51 since it aims to improve the accuracy of saliency detection. Therefore, we will consider reducing the model size and improving the running speed as one of the future research directions of our work.

Ablation study
With different model settings, a series of experiments is conducted in this section to investigate the effectiveness of the context-aware multi-scale module (CAMS), the feature fusion refinement module (FFR), and the multi-level deep supervision (MDS). All models in this section are trained on the augmented fabric datasets and share the same hyperparameters described in subsection 4.1. We first conduct a series of experiments with different popularly used backbone networks; the results are shown in Table 3. As can be seen, our selected backbone network achieves the best results on both fabric datasets. Figure 7 shows the feature maps of the different modules, which indicates that all modules are beneficial for performance improvement. Table 5 presents the effectiveness of CAMS and FFR in terms of F β , MAE, S m , and E m . From Table 5, it can be seen that our approach achieves the best performance with all components (i.e., backbone, CAMS, and FFR), which illustrates that all components of our model are necessary to produce accurate detection results. Moreover, we investigate the effectiveness of the applied multi-level deep supervision (MDS).

Effectiveness of CAMS.
We introduce the CAMS to expand the receptive field without increasing the computational load, which helps the proposed model capture more context-aware information to cope with the variation of defect scales. Furthermore, we evaluate the performance of CAMS with respect to different dilation rates ϑ and the fusion strategy in each MFAU in order to provide an even more comprehensive analysis. Note that here ϑ denotes a set of dilation rates (Table 6).

Table 2 .
Comparison of different methods on four metrics, F β (higher is better), MAE (lower is better), S m (higher is better), and E m (higher is better), on the plain and pattern datasets.

First, we investigate the effectiveness of the different dilation rates ϑ utilized in CAMS. Since the first convolution layer is designed to retain the spatial information of the original feature maps, we set its dilation rate to 1 and only change the dilation rates of the remaining convolution layers. We try five settings for ϑ: {1, 1, 1, 1}, {1, 2, 3, 4}, {1, 2, 4, 6}, {1, 3, 5, 7}, and {2, 4, 6, 8}, whose results on the plain dataset are reported in Table 6. It can be easily seen that ϑ = {1, 3, 5, 7} achieves the best performance. Although the S m score of ϑ = {1, 1, 1, 1} is 0.5% higher than that of ϑ = {1, 3, 5, 7}, ϑ = {1, 3, 5, 7} gains the best scores on all the remaining metrics compared with the other settings. As such, we choose ϑ = {1, 3, 5, 7} in all our experiments.

In addition, we study the fusion strategy of the MFAU in CAMS. The concatenation operation is a popularly utilized technique for feature aggregation, which extends the feature dimension; in this model, we use it to fuse the multi-scale multi-receptive-field features progressively. Figure 8(d) and (h) shows the visual saliency maps after the concatenation operation. It can be clearly seen that the model with the concatenation operation in MFAU can locate fabric defects accurately and depict the defect boundaries clearly. Moreover, all the experimental results presented in this paper testify to the effectiveness and superiority of the MFAU with the concatenation operation, which is capable of characterizing the contextual multi-receptive-field features of different levels for fabric defect detection. The summation operation is another commonly used feature fusion technique, which fuses the different responses of different features while preserving the original information. We change the concatenation operation to the summation operation when performing the feature aggregation in MFAU; the feature maps after the summation operation are shown in Figure 8(c) and (g). As can be seen in Figure 8, the feature maps produced by the summation operation have unclear defect boundaries and incomplete defect regions compared with those produced by the concatenation operation. Furthermore, it is clear from Table 7 that the concatenation operation achieves better results than the summation operation. Therefore, we conclude that, compared with the concatenation operation, the summation operation cannot well enhance the discrimination and representation ability of defect features across the different dilated convolution layers in MFAU. In view of the above analysis, we adopt the concatenation operation in this paper to integrate the multi-scale features of MFAU.
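The two design choices discussed above — dilated convolutions for a larger receptive field and concatenation versus summation for fusion — can be sketched schematically. The code below is illustrative only; the shapes, branch count, and values are hypothetical, not taken from the paper.

```python
import numpy as np

# Part 1: a k x k kernel with dilation rate d covers an effective extent of
# k + (k - 1) * (d - 1), so the rates in ϑ = {1, 3, 5, 7} enlarge the
# receptive field without adding any parameters or computation.
def effective_kernel(k, d):
    return k + (k - 1) * (d - 1)

print([effective_kernel(3, d) for d in (1, 3, 5, 7)])  # -> [3, 7, 11, 15]

# Part 2: the two fusion strategies compared for MFAU. Concatenation stacks
# the branch responses along the channel axis (each branch stays separate),
# while summation collapses them into a single response of the original size.
C, H, W = 8, 16, 16
branches = [np.random.rand(C, H, W) for _ in range(4)]  # four dilated branches

concat_fused = np.concatenate(branches, axis=0)  # shape (4*C, H, W)
sum_fused = np.sum(branches, axis=0)             # shape (C, H, W)

print(concat_fused.shape, sum_fused.shape)  # (32, 16, 16) (8, 16, 16)
```

The shape difference makes the trade-off concrete: concatenation preserves which receptive field each response came from at the cost of a wider feature tensor, whereas summation mixes the branches irrecoverably.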
Effectiveness of FFR. To prove the effectiveness of FFR and its two main components, we compare three variants; the results are reported in Table 8 and Figure 9. From Table 8, it can be seen that after adding AFU and ARU on the basis of CAMS, the scores of the various metrics improve steadily (the MAE score decreases, as lower is better), which proves the effectiveness of the two main components of FFR. In particular, as Figure 9 shows, the feature maps generated by the full FFR are clearer than those of FFR without ARU, with sharper boundaries and less texture information. Furthermore, from the fifth column of Figure 7, we can see that the feature maps after FFR are much clearer and more accurate compared with those in the first two columns. Therefore, we conclude that the feature fusion part of our proposed FFR attends to important defect regions and effectively aggregates cross-level features, which enables the model to learn more useful features and further enhances its feature representation capability. Additionally, the refinement part can further refine the features to generate more accurate saliency maps with clear boundaries.
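As a rough illustration of attention-gated cross-level fusion in the spirit of AFU: a low-level feature map (fine detail, noisy texture) and a high-level one (coarse semantics) are combined through a gate so that complementary information is kept and redundancy is suppressed. The actual AFU design is not reproduced here; the gate function and feature maps below are purely hypothetical stand-ins.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

C, H, W = 4, 8, 8
low = np.random.rand(C, H, W)   # stand-in for a low-level (detail) feature map
high = np.random.rand(C, H, W)  # stand-in for a high-level (semantic) feature map

# Stand-in for a learned attention map; in a real network this would come
# from trainable layers, not a fixed elementwise function.
gate = sigmoid(low + high)

# Gated combination: where the gate is high, detail dominates; elsewhere,
# semantics dominate. Shapes are preserved, unlike plain concatenation.
fused = gate * low + (1.0 - gate) * high

print(fused.shape)  # (4, 8, 8)
```

The point of the sketch is the mechanism, not the numbers: the gate lets the fusion be spatially selective, which is what allows an attention unit to emphasize defect regions while discarding redundant background texture.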

Effectiveness of MDS.
To demonstrate the effectiveness of the adopted multi-level deep supervision (MDS), we compare it with single-level deep supervision (SDS). Note that only the supervision scheme differs; the remaining steps are the same. Whereas MDS applies deep supervision to each side-output prediction map, SDS supervises the loss only on the last layer. Importantly, MDS uses more loss terms to supervise and guide our model, so it can locate the defect regions more accurately and learn useful defect features during training. We show the quantitative results and the visualized saliency maps in Table 9 and Figure 10. From Figure 10, we can clearly see that, compared with Ours(SDS), the side outputs of the four layers obtained by Ours(MDS) locate the defect regions and sketch the defect boundaries well simultaneously. These features generated by Ours(MDS) are beneficial for better defect localization. In addition, Table 9 presents the experimental results of Ours(SDS) and Ours(MDS) in terms of F β , MAE, S m , and E m scores. Obviously, Ours(MDS) achieves better results than Ours(SDS), which proves the correctness and effectiveness of the MDS.
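The contrast between SDS and MDS can be sketched as a difference in the training loss. The snippet below is schematic only: the use of binary cross-entropy, the number of side outputs, and the toy predictions are assumptions for illustration, not the authors' exact formulation.

```python
import math

def bce(pred, gt, eps=1e-7):
    """Binary cross-entropy between a side-output prediction and the ground truth."""
    return -sum(g * math.log(p + eps) + (1 - g) * math.log(1 - p + eps)
                for p, g in zip(pred, gt)) / len(pred)

# Hypothetical side-output predictions from four decoder levels (coarse to fine).
side_outputs = [
    [0.6, 0.7, 0.4, 0.3],
    [0.7, 0.8, 0.3, 0.2],
    [0.8, 0.9, 0.2, 0.1],
    [0.9, 0.95, 0.1, 0.05],
]
gt = [1, 1, 0, 0]

# SDS: the loss supervises only the last (finest) prediction map.
loss_sds = bce(side_outputs[-1], gt)

# MDS: every side output contributes a loss term, so gradients flow back
# into every decoder level rather than only through the final layer.
loss_mds = sum(bce(s, gt) for s in side_outputs)
```

Because each level receives its own supervision signal under MDS, intermediate features are pushed toward the defect regions directly, which is consistent with the sharper side outputs observed in Figure 10.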

Conclusion
In this paper, we propose a novel end-to-end saliency-based fabric defect detection network named CPA2Net, which mainly comprises two components: a context-aware multi-scale module (CAMS) and a feature fusion refinement module (FFR). CAMS, composed of several parallel multi-scale feature aggregation units (MFAU) with dilated convolution layers, is proposed to capture context-aware multi-scale multi-receptive-field information to cope with the variation of defect scales. To further enhance the discriminative capability of defect features, FFR is proposed to effectively fuse features and refine them, where the AFU with feedback mechanisms is designed to remove redundancy and exploit the complementary information of different layers, and the ARU is designed to further refine the features generated by AFU while highlighting the defect boundaries. Moreover, we adopt multi-level deep supervision (MDS) to supervise each side output for guiding training and generating more accurate and complete prediction maps. Extensive experimental results demonstrate that our proposed model achieves state-of-the-art performance in both quantitative and qualitative evaluations. More experiments will be conducted in the future to further improve the accuracy and stability of this model, or to compress it into a lightweight model while retaining its existing performance. We hope our proposed method can provide promising future research directions in fabric defect detection and other related research fields.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figure 2 .
Figure 2. Architecture of our proposed network.

Figure 3 .
Figure 3. Detailed structure of MFAU. C denotes the concatenation.

Figure 5 .
Figure 5. Visual comparisons of the proposed method and the state-of-the-art methods.

Figure 6 .
Figure 6. PR curves and F-measure curves on two fabric datasets.

Figure 7 .
Figure 7. Visual comparison of saliency maps showing the effectiveness of each module. The first column is the original image, the second column is the ground truth, and the third to last columns are the output results after FPS, CAMS, and FFR, respectively.

Figure 9 .
Figure 9. Visual feature maps of AFU with ARU and without ARU: (a) AFU only and (b) AFU + ARU.

Figure 10 .
Figure 10. Visual feature maps and saliency maps of different supervision learning techniques.

Table 1 .
Advantages and disadvantages of previous works. ...correlation among the patterned fabric image for defect detection. However, the structural methods have a low detection rate and are only suitable for fabric texture images with an extremely regular texture structure.

Table 3 .
Results comparison with different backbone networks on plain and pattern datasets.

Table 4 .
Model performance compared with other models.
FPS: frames per second.

Table 5 .
Quantitative results for different networks presented in the ablation study on plain dataset.

Table 6 .
The comparisons of parameters ϑ in MFAU on plain dataset in terms of F β , MAE, S m , and E m scores.
The best results are shown in bold.

Table 7 .
F β , MAE, S m , and E m scores on the plain dataset under different fusion strategies in the MFAU.
The best results are shown in bold.

Table 8 .
F β , MAE, S m , and E m scores on the plain dataset under different variants in the FFR.
The best results are shown in bold.

Table 9 .
F β , MAE, S m , and E m scores on the plain dataset under different supervision learning techniques.
The best results are shown in bold.