Nonlocal spatial attention module for image classification

To enhance the capability of neural networks, research on attention mechanisms has deepened. In this area, attention modules perform forward inference along the channel and spatial dimensions sequentially, in parallel, or simultaneously. However, we have found that spatial attention modules mainly apply convolution layers to generate attention maps, which aggregate feature responses based only on local receptive fields. In this article, we exploit this finding to create a nonlocal spatial attention module (NL-SAM), which collects context information from all pixels to adaptively recalibrate spatial responses in a convolutional feature map. NL-SAM overcomes the limitation of repeated local operations and exports a 2D spatial attention map to emphasize or suppress responses at different locations. Experiments on three benchmark datasets show at least 0.58% improvement on variant ResNets. Furthermore, this module is simple and can easily be integrated with existing channel attention modules, such as squeeze-and-excitation and gather-excite, surpassing these strong models at a minimal additional computational cost (0.196%).


Introduction
By interleaving a series of convolutional layers with nonlinear activation functions and downsampling operators, convolutional neural networks (CNNs) 1 are able to produce robust representations that capture hierarchical patterns and attain a global theoretical receptive field. Thus, CNNs have become the paradigm of choice in many computer vision applications, such as image classification, 2-5 object detection, 6 semantic segmentation, 7 and regression. 8,9 In recent years, attention mechanisms have offered a new remedy for feature recalibration by capturing contextual long-range interactions. The attention mechanism started from the introduction of an attention module to draw global dependencies of inputs in neural machine translation, 10 and the landmark work on self-attention modules 11 then set a new standard in this field.
The self-attention mechanism measures the compatibility of pairwise query-key relations. In this field, one approach, the nonlocal network (NLNet), 12 computes a self-attention map that models the correspondence from all positions to each query position. Meanwhile, simplified attention models use a query-independent attention map shared across all query positions. In the recently proposed convolutional block attention module (CBAM), 13 a single spatial attention map is multiplied back into the channel-attention-tuned feature maps for adaptive feature refinement. Our nonlocal spatial attention module (NL-SAM) combines the benefit of NLNet, effective modeling of global contextual information, with that of CBAM, efficient attention map generation. NL-SAM incorporates three interdependent operations: context collecting, transformation, and distribution. Context collecting performs feature aggregation to obtain highly compressed global information along the spatial dimension. Transformation provides a nonlinear way to recalibrate feature responses. Then, the attention map is distributed to each location in the convolutional feature maps.
Our NL-SAM is general and efficient in terms of added parameters: it can be integrated into any CNN architecture, either individually while keeping the network end-to-end trainable, or alongside existing channel attention modules with negligible overhead to serve as a complementary attention. We incorporate NL-SAM within ResNet 4 and validate it through experiments on the CIFAR-10, CIFAR-100, and ImageNet-1K classification datasets. In particular, as shown in Figure 1, deep CNNs embedded with our NL-SAM introduce very few additional computations while bringing notable performance gains. For example, for ResNet50 with 25.53M parameters and 6.995G floating point operations (FLOPs), NL-SAM adds only 0.03M parameters and 0.007G FLOPs while improving accuracy by 0.58%. And the gather-excite network (GENet) 14 combined with NL-SAM reaches the highest accuracy (75.6%) with only 0.085% additional FLOPs (more details are given in Table 7).
The rest of this article is organized as follows. The second section analyzes related works about attention mechanisms. The third section describes the architecture of the proposed NL-SAM and analyzes the relationship between NL-SAM and other attention modules. The fourth section verifies the effectiveness of NL-SAM throughout extensive experiments with various baseline models on multiple benchmarks. The fifth section concludes the article.

Convolution network architecture
The convolution layer has been the dominant visual feature extractor in computer vision. Recent advances in convolution networks focus on capturing long-range dependencies. 3,4 ResNet 4 solves the degradation problem caused by increased network depth; thus, it is able to deliver information between distant positions simply by increasing depth. After that, another important direction is to modify the spatial scope of aggregation by enlarging the receptive field, for example with atrous/dilated convolution. 7

Attention models
In vision, the key and query refer to visual elements, such as image pixels. Regular convolution can be viewed as a special instantiation of the spatial attention mechanism: given a query element, key elements are sampled at predetermined positional offsets. In recently proposed attention mechanisms, there are two major patterns, as follows.
Query-independent attention models. These models are irrelevant to the query content; they only capture the salient key contents that should be focused on for the task. By computing a single global attention map and sharing it across all query positions, these models are very efficient. For example, the squeeze-and-excitation network (SENet) 15 and GENet 14 rescale different channels to recalibrate channel dependencies. However, they ignore the spatial axis, which is also important for inferring accurate attention maps. The bottleneck attention module (BAM) 16 and CBAM 13 introduce spatial attention using convolution, in a similar way to the channel attention mechanism. However, in these spatial attention models, feature transformation is performed by convolution, which limits the exploitation of global context in CNNs.
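As a concrete illustration of the query-independent pattern, the SE-style channel recalibration can be sketched in a few lines of NumPy. This is a minimal sketch under assumed shapes; `w0` and `w1` are hypothetical names standing in for SENet's learned bottleneck weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(x, w0, w1):
    """SE-style channel recalibration on an (H, W, C) feature map.

    Squeeze: global average pooling over all spatial positions.
    Excite: a bottleneck MLP (w0 reduces, w1 expands) with a sigmoid gate.
    """
    z = x.mean(axis=(0, 1))        # squeeze -> (C,)
    h = np.maximum(w0 @ z, 0.0)    # reduction FC + ReLU -> (C/r,)
    gate = sigmoid(w1 @ h)         # expansion FC + sigmoid -> (C,)
    return x * gate                # rescale each channel as a whole unit
```

Note that the gate is one scalar per channel: exactly the channel-wise "rescaling" described above, with no spatial axis involved.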
Query-specific attention models. Motivated by their success in natural language processing (NLP) tasks, self-attention mechanisms have also been employed in computer vision applications, such as image recognition, 17-21 relational reasoning among objects, 22,23 image segmentation, 17,24 scene parsing, 25 and video recognition. 26 Wang et al. 27 model pixel-level pairwise relations with self-attention; despite the overheads caused by the heavy map generation process, the dual attention network 20 appends two parallel 2D and 1D attention modules on top of dilated fully convolutional networks (FCNs). 28 However, these models all share an obvious shortcoming: self-attention involves a large number of matrix multiplication operations, which increases the computational burden.

Proposed method
In this section, we first review two major attention modules, NLNet 12 and CBAM, 13 and derive a concise formula that summarizes these attention models. Then, we introduce the NL-SAM and describe its key components.
In addition, we analyze the relationship between our NL-SAM and other attention modules.

Overview of attention modules
A representative query-independent attention model is CBAM, 13 which is formulated as

$$Z' = M_c(X) \otimes X, \qquad Z = M_s(Z') \otimes Z'$$

where $X \in \mathbb{R}^{W \times H \times C}$ is the input feature block ($W$, $H$, and $C$ are the width, height, and channel number, respectively), $\otimes$ denotes elementwise multiplication, and $Z$ is the final refined output. CBAM first computes the channel attention

$$M_c(X) = \sigma\left( W_1 W_0\left( \frac{1}{N_p} \sum_{\forall j} X_j \right) + W_1 W_0\big( \max_{\forall j}(X_j) \big) \right)$$

In this function, $W_1$ and $W_0$ are multilayer perceptron weights, $\forall j$ enumerates all positions in the $j$'th feature map, and $N_p$ is the number of positions in this feature map. $\max(\cdot)$ calculates the largest value of the inputs, and $\sigma$ refers to the sigmoid function. Based on the channel attention output $Z'$, the spatial attention map is calculated as

$$M_s(Z') = \sigma\left( W^{7 \times 7}\left( \left[ \frac{1}{C} \sum_{\forall j} Z'_j \,;\; \max_{\forall j}(Z'_j) \right] \right) \right)$$

where $\forall j$ denotes $1 \le j \le C$ and $W^{7 \times 7}$ denotes a convolution with kernel size 7. NLNet, 12 a typical query-specific attention module, can be expressed as

$$Z_i = X_i + \sum_{\forall j} \frac{f(X_i, X_j)}{\mathcal{C}(X)}\, X_j$$

where $i$ is the index of query positions and $\forall j$ enumerates all possible positions; $f(X_i, X_j)$ denotes the relationship between positions $i$ and $j$, and $\mathcal{C}(X)$ is a normalization factor. After simplifying these attention models, a general attention modeling framework can be summarized as the following formula

$$Z = F\big( \delta\big( q_{\forall j}(X_j) \big),\, X \big)$$

where
i. $q_{\forall j}(X_j)$ represents a context collecting module that aggregates features at all locations into global context features through weighted calculations, and $\forall j$ traverses the channel dimension or the spatial dimension;
ii. $\delta(\cdot)$ denotes the nonlinear feature conversion method;
iii. $F(\cdot, \cdot)$ denotes the attention distribution function that broadcasts the global context features to each position.
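The general framework above can be read as three composable callables. A minimal NumPy sketch follows; the names `collect`, `transform`, and `distribute` are ours, mirroring items i-iii, and a plain sigmoid stands in for the learned layers:

```python
import numpy as np

def attention(x, collect, transform, distribute):
    """General attention framework: Z = F(delta(q(X)), X)."""
    context = collect(x)          # q: aggregate features over all positions
    attn = transform(context)     # delta: nonlinear recalibration
    return distribute(attn, x)    # F: broadcast attention back onto X

# A CBAM-flavored spatial instantiation for an (H, W, C) block:
collect = lambda x: x.mean(axis=2)              # channel-axis pooling -> (H, W)
transform = lambda c: 1.0 / (1.0 + np.exp(-c))  # sigmoid in place of learned conv
distribute = lambda a, x: x * a[..., None]      # broadcast 2D map over channels
```

The same skeleton covers both patterns: a query-independent model computes `attn` once and shares it, while a query-specific model would return one map per query position.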

Nonlocal spatial attention module
The structure of NL-SAM is shown in Figure 2; it can be roughly divided into three parts: context collecting, transformation, and distribution. According to equation (6), our NL-SAM is formulated as

$$Z = F\big( \delta_r\big( q_k(X) \big),\, X \big)$$

In detail, our NL-SAM consists of
i. channel reduction for information aggregation and local average pooling on the feature map for context collecting, as $q_k(\cdot)$ defined in equation (9);
ii. a bottleneck transformation to capture pixelwise dependencies, as $\delta_r(\cdot)$ defined in equation (10);
iii. matrix multiplication for feature fusion, as $F(\cdot, \cdot)$ defined in equation (11).
The first part is context collecting for global information aggregation. For simplicity, channel average pooling is used here. For a given feature block $X \in \mathbb{R}^{W \times H \times C}$, the output $X^{cp}$ is computed as

$$X^{cp}_{i,j} = \frac{1}{C} \sum_{c=1}^{C} X_{i,j,c}$$

On the basis of channel compression, we further utilize average pooling to aggregate the channel-pooled feature maps. Experimental results in Table 2 demonstrate that pooling adjacent pixels in the early stages where NL-SAM is inserted provides the translation invariance needed for classification. The final aggregated feature map $X^q$ is calculated by

$$X^q_{i',j'} = \frac{1}{k^2} \sum_{u=1}^{k} \sum_{v=1}^{k} X^{cp}_{i'k+u,\, j'k+v}$$

where $k$ is a hyperparameter of the pooling operation with different values at different stages, and $i'$ and $j'$ index the context collecting map, with $0 \le i' < \frac{W}{k}$ and $0 \le j' < \frac{H}{k}$. The second part, feature transformation, has the largest number of parameters. Thus, a bottleneck transformation module is applied, which significantly reduces the number of parameters from $WH \cdot WH$ to $2 \cdot WH \cdot \frac{WH}{r}$, where $r$ is the bottleneck ratio and $\frac{WH}{r}$ denotes the number of hidden neurons in the bottleneck. With the default reduction ratio $r = 8$, the parameters and computational overhead of the transformation module are reduced to $\frac{1}{4}$. More results for different values of the bottleneck ratio $r$ are given in Table 1.
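The context collecting step $q_k$ amounts to a channel mean followed by non-overlapping $k \times k$ average pooling. A NumPy sketch, under the assumption that the spatial sides are divisible by $k$:

```python
import numpy as np

def context_collect(x, k):
    """NL-SAM context collecting q_k.

    x: (H, W, C) feature block; H and W must be divisible by k.
    Returns the (H/k, W/k) aggregated context map.
    """
    xcp = x.mean(axis=2)  # channel average pooling -> (H, W)
    h, w = xcp.shape
    # average within each non-overlapping k x k block
    return xcp.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
```

The reshape-then-mean trick realizes the block average without loops: each $k \times k$ neighborhood is folded into its own pair of axes and reduced.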
Then, the 2D pooled feature map $X^q$ is flattened into a 1D vector. Each point in the feature map is treated as an independent node, and the trainable weight matrix learns, via backpropagation, the "importance" (saliency) of each point relative to the rest of the feature map. The weight matrix is query-sensitive and can learn exclusive relationships, since every query has its own attention map. This matters because of the pooling operation on the original feature map and the bottleneck reduction operation in this part. Inspired by SENet, a full connection acts as the instantiation here

$$X^d = \sigma\big( \mathrm{fc}\big( \mathrm{ReLU}\big( \mathrm{fc}(X^q) \big) \big) \big)$$

where "fc" means full connection. Finally, the 1D result is restored to 2D, yielding the transformation result $X^d \in \mathbb{R}^{\frac{W}{k} \times \frac{H}{k}}$. The main calculation cost of NL-SAM is introduced by this part. The FLOPs, determined by the width $W$ and height $H$ of the input feature map, the pooling kernel size $k$, and the bottleneck reduction ratio $r$, are on the order of $\frac{2W^2H^2}{rk^4}$ (two full connections between $\frac{WH}{k^2}$ inputs and $\frac{WH}{rk^2}$ hidden neurons).
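A sketch of the bottleneck transformation on the flattened context map; the ReLU and sigmoid activations follow the SENet-inspired description and are our assumption, and `w_down`/`w_up` are placeholders for the learned full connections:

```python
import numpy as np

def bottleneck_transform(xq, w_down, w_up):
    """NL-SAM bottleneck transformation delta_r.

    xq:     (N,) flattened pooled context map, N = (W/k) * (H/k)
    w_down: (N/r, N) reduction weights; w_up: (N, N/r) expansion weights.
    """
    h = np.maximum(w_down @ xq, 0.0)          # fc + ReLU -> (N/r,)
    return 1.0 / (1.0 + np.exp(-(w_up @ h)))  # fc + sigmoid -> (N,)
```

With two weight matrices of $N \times N/r$ entries each, the parameter count is $2N^2/r$ instead of $N^2$ for a single full connection, i.e. one quarter at $r = 8$.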
In the last part, the fusion function distributes the global context to the features of each position. The transformed feature map has a width/height that is $1/k$ of the original size, and experimental results in Table 3 show that nearest-neighbor interpolation is the best choice to restore the size. The final refined feature map is obtained by

$$Z = X'_d \otimes X$$

where $\otimes$ denotes elementwise multiplication, and $X'_d \in \mathbb{R}^{W \times H}$ is the upsampled 2D spatial attention map, of the same size as the input, which emphasizes or suppresses responses at different locations.
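Putting the three parts together, an end-to-end toy sketch of NL-SAM in NumPy (weights are placeholders for the learned full connections; the activations are assumed as above, and nearest-neighbor upsampling is realized by repeating each attention value over its k × k block):

```python
import numpy as np

def nl_sam(x, k, w_down, w_up):
    """Toy NL-SAM forward pass: collect -> transform -> distribute.

    x: (H, W, C) feature block with H, W divisible by k.
    """
    # context collecting: channel average pool, then k x k average pool
    xcp = x.mean(axis=2)
    h, w = xcp.shape
    xq = xcp.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    # bottleneck transformation on the flattened map (assumed activations)
    hid = np.maximum(w_down @ xq.ravel(), 0.0)
    xd = 1.0 / (1.0 + np.exp(-(w_up @ hid)))
    xd = xd.reshape(h // k, w // k)
    # distribution: nearest-neighbor upsample back to (H, W), then
    # elementwise multiplication with the input block
    attn = np.repeat(np.repeat(xd, k, axis=0), k, axis=1)
    return x * attn[..., None]
```

Because the upsampling repeats values, the same attention weight is shared within each k × k block, matching the weight sharing introduced by the earlier average pooling.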

Relationship to other attention modules
We make a theoretical comparison for deeper analysis of the relationship to other attention modules. To aggregate global context from all positions, SENet 15 utilizes global average pooling, which assumes that every point in a feature map contributes equally to the whole; there is no spatial attention in this module, so it can serve as the channel aggregation block of our NL-SAM. GENet 14 and BAM 16 utilize depthwise global convolution or dilated convolution to obtain a large receptive field; since convolution finally sums up to a single result, it corresponds to only one point on our 2D attention map. In the fusion stage, SENet and GENet use rescaling to recalibrate each channel as a whole unit with a single parameter; BAM 16 uses broadcast elementwise addition; CBAM 13 is the closest to ours, and the most sophisticated fusion is elementwise multiplication.
Our NL-SAM follows NLNet 12 in acquiring a query-sensitive attention map, but two features distinguish this block. First, we use dimension reduction to decouple channel attention and spatial attention. Second, we use the combination of k × k pooling and upsampling to reduce the computation of global information collection and to maintain translation invariance. Moreover, NL-SAM is a simplified variant that is better described as saliency attention than self-attention.

Experiment and analysis
In this section, we evaluate the performance of the proposed NL-SAM on a series of benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet. 29 Our experiments contain three parts. First, ablation experiments are conducted to determine the optimal values of the hyperparameters. Given limited computation resources, CIFAR-10 and CIFAR-100 are selected as the experimental datasets; our module surpasses ResNet on the CIFAR datasets in both efficiency and accuracy. Second, we analyze the complementary relation of NL-SAM to other attention modules. Finally, to be more convincing, we compare classification results with ResNet50 variants on ImageNet, evaluating the original ResNets with the official code of the TensorFlow-Slim image classification model library. 30 Some parameters are modified due to limited computing resources, especially the batch size. For fair comparison, all the relevant attention models are reimplemented in TensorFlow-Slim.

CIFAR and analysis
Implementation details. We follow the simple data augmentation strategy in the literature 4 for training: images are randomly flipped horizontally and zero-padded on each side with four pixels before taking a random 32 × 32 crop; mean and standard deviation normalization are also applied. The ResNet110 and ResNet164 preactivation residual models 5 are trained on one RTX 2080 Ti; the batch size is set to 128, momentum and weight decay are 0.9 and 0.0003, and the same weight initialization as the reference 5 is used. The initial learning rate is set to 0.1 with warm-up during the first 400 iterations, and is divided by 10 at epochs 120 and 190 of 250 total epochs.
Reduction ratio. The bottleneck compression ratio r is intended to reduce parameter redundancy. The average pooling kernel size k is held fixed while different values of r are evaluated. As given in Table 1, r = 8 achieves a good balance between accuracy and complexity, so this value is used in the rest of the experiments.
Effect of inserting stage and pooling kernel size. The influence of NL-SAM is investigated at different stages (here the term "stage" is used as defined in the literature 4 ) with different inserting stages and different values of k; the accuracy, computational cost, and parameter numbers are given in Table 2. Based on the results listed in Table 2 and the corresponding Figure 3, we make the following observations: (1) deeper stages, which carry more semantic information, benefit more from NL-SAM; (2) insertions at different stages are not mutually redundant and can further bolster accuracy; (3) pooling not only reduces the calculation cost but also preserves translation invariance in classification; since our module is position-sensitive, weight sharing within the pooling kernel is necessary to counteract this.
At the same time, the interpolation experiment is conducted in stage 2 with a pooling kernel size of 2. The results in Table 3 verify that nearest-neighbor interpolation is the best choice. In fact, due to the previous average pooling of the feature map, the same weights are naturally shared within each k × k block of $X'_d$ (equation (11)).
Integration with different architectures. We evaluate our NL-SAM on the popular ResNet backbones, plain ResNet110 5 and ResNet164 5 with bottleneck. According to the above experiments, the structure and hyperparameters are determined, as given in Table 4. NL-SAM embedded in stage 1 of the network consumes more resources, but the effect is inferior to the later stages. Thus, it is the best choice to insert NL-SAM into stage 2 and stage 3 with k set to 2, and the results are reported in Table 5. It can be found that NL-SAM consistently improves performance across different depths with a small increase in FLOPs.
Combination with other attention modules. Comparative experiments are conducted on CIFAR-100 by integrating NL-SAM with SENet and GENet as complementary spatial attention. Average pooling is used in SENet, global depthwise convolution is used in GENet, and the bottleneck reduction ratio is set to 4 with sigmoid activation in both. Figure 4 compares the parameter numbers and accuracies of the different models. It shows that NL-SAM further improves the accuracy of SENet and GENet by about 0.2% with less than 0.002M parameters added. Furthermore, we replace the spatial attention module in CBAM with NL-SAM. As the third column of the figure shows, this replacement variant outperforms the original CBAM by 0.32%, suggesting that the query-sensitive spatial attention in NL-SAM is superior to the query-independent spatial attention in CBAM.

ImageNet and analysis
Implementation details. This dataset contains 1.2 million training images and 50k validation images, each cropped to 224 × 224 pixels. ResNet50 is chosen as the base model and trained from random initialization. 5 SGD with momentum 0.9 is used as the optimizer, and the batch size is set to 128. The initial learning rate is set to 0.1 and is reduced by a factor of 10 every 30 epochs; all models are trained for 100 epochs.
Classification results on ImageNet. Based on the ablation experiments on CIFAR, the hyperparameters of ResNet50 for ImageNet are defined in Table 6. Owing to the large feature map size and the insertion-stage study on CIFAR, we add NL-SAM in stages 3, 4, and 5. We further analyze the effectiveness of NL-SAM through splitting and combination experiments. The results of the final models are given in Table 7. Our module effectively improves accuracy even on this large dataset. Compared with SENet and GENet, NL-SAM is slightly inferior but still delivers a notable improvement over the ResNet50 baseline. The best result is produced when NL-SAM is integrated with GENet. CBAM appears unstable and does not reach the baseline result, but when integrated with NL-SAM its accuracy improves by 0.22%. These experiments verify the consistent improvement of NL-SAM when integrated with channel attention modules, showing that it can be used without bells and whistles.

Analysis and discussion
We compare the visualization results of NL-SAM inserted into ResNet50 alone or combined with other modules. The Grad-CAM 31 visualization is calculated for the last convolution layer outputs; the ground-truth label is shown at the top of each input image, and P denotes the softmax score of each network for the predicted class. As shown in Figure 5, according to the P scores, ResNet50 misclassifies the Siamese cat and obtains the lowest score. ResNet50 + GENet + NL-SAM always gets the highest score, while ResNet50 + NL-SAM and ResNet50 + GENet each have their own advantages. We can clearly see that NL-SAM-integrated networks excite or suppress certain regions well. For example, they focus more precisely on the head of the merganser and pay less attention to the water, and the attention on the whole body of the Siamese cat shows the benefit of a global visual field.

Conclusion
We propose the NL-SAM, which is a new method to enhance the representation ability of the network.
Through modeling on global contextual information, our module learns where to focus or suppress, and refines intermediate features.
To verify its effectiveness, extensive experiments were conducted on three benchmark datasets. The results show that NL-SAM improves classification accuracy by 0.98% on CIFAR-10, 1.01% on CIFAR-100 when embedded in ResNet110, and 0.58% on ImageNet when embedded in ResNet50. What is more, the combination of NL-SAM and GENet improves accuracy on ImageNet by 1.05% with only 0.038G FLOPs added. For future works, we plan to devise a flexible way to switch between NL-SAM and convolution in a layer.