Detection of ship targets in photoelectric images based on an improved recurrent attention convolutional neural network

Deep learning algorithms have been increasingly used in ship image detection and classification. To improve ship detection and classification in photoelectric images, an improved recurrent attention convolutional neural network is proposed. The proposed network has a multi-scale architecture consisting of three cascading sub-networks, each with a VGG19 network for image feature extraction and an attention proposal network for locating feature areas. A scale-dependent pooling algorithm is designed to select an appropriate convolution block in the VGG19 network for classification, and a multi-feature mechanism is introduced into the attention proposal network to describe the feature regions. The VGG19 and attention proposal networks are cross-trained to accelerate convergence and to improve detection accuracy. The proposed method is trained and validated on a self-built ship database and effectively improves the detection accuracy to 86.7%, outperforming the baseline VGG19 and recurrent attention convolutional neural network methods.


Introduction
Ships can be detected in photoelectric images acquired by various analog or digital imaging systems. Traditional ship photoelectric images are mostly acquired by satellite remote sensing observation systems. However, these are vulnerable to cloud occlusion, and the observation angle is always from directly overhead, leading to deficiencies in robustness and effectiveness in ship detection. [1][2][3] With the development of unmanned aerial vehicles (UAVs), it has become more convenient to acquire photoelectric images of ships using high-performance airborne photoelectric equipment. The sharpness of the images is greatly improved, and, through flight control, images of ships from multiple angles can also be obtained. As a result of these advances, rapid and accurate identification of ships becomes possible, with potentially important applications in early warning and accident prevention. In the meantime, since there are many types of ships, further differentiation of ship types can also be important. However, it remains a challenging task to classify ship types, 4 which is essentially a fine-grained image classification problem. 5
Recently, deep learning algorithms have been increasingly used to improve ship image detection and classification accuracy. Most of these methods use convolutional neural networks (CNNs) to extract image features, 6 to locate the target position, and to identify the types of ships. These methods can be divided into two categories: strong supervision and weak supervision. 7 In the strong supervision mode, the model not only uses image labels but also relies on target bounding boxes and component points to assist feature learning. Ning et al. 8 proposed a part R-CNN (regions with convolutional neural networks) algorithm, which detects the target category by setting component modules. Wei et al. 9 proposed a mask CNN algorithm, which locates targets through a fully convolutional network (FCN) and divides them into two masks for joint discrimination.
In the weak supervision mode, the classification model relies on image labels alone, possibly with compromised performance compared with strongly supervised learning. However, no bounding box of the target is needed; thus it is feasible to collect large-scale training data, and the network is easier to train. Simon and Rodner 10 proposed the weakly supervised constellations algorithm, which obtains the feature map by convolution operations and determines the position and classification by calculating the gradient of the feature map. Lin et al. 11 proposed a bilinear CNN algorithm, using two CNNs for target location and discrimination, respectively. The two networks optimize each other and improve the overall classification effect.
In 2017, Fu et al. 12 proposed a recurrent attention CNN to solve the classification problem of fine-grained image databases. The network achieved state-of-the-art performance on three mainstream fine-grained image sets (CUB Birds, Stanford Dogs, and Stanford Cars), with 3.3%, 3.7%, and 3.8% improvements in accuracy, respectively. However, the network cannot make good use of the global information of the image, and the shape of the feature area is always the same. Therefore, in this article, an improved recurrent attention convolutional neural network (RA-CNN) is proposed to detect photoelectric ship targets. The scale-dependent pooling (SDP) and multi-feature attention proposal network (MF-APN) algorithms are used to improve the network's recognition accuracy for fine-grained images. At the same time, the use of the MF-APN algorithm increases the amount of network computation and reduces the real-time performance of detection.

CNN development
CNN has been the cornerstone of the deep learning field. Through a series of convolution and pooling operations, 13 a CNN can learn to generate the target location box and to classify the target category given a training dataset. In 1998, the LeNet-5 model was proposed by LeCun et al., 14 which established the modern structure of CNN. In 2006, Hinton and Salakhutdinov 15 proposed a multi-hidden-layer neural network with better feature learning ability, whose training complexity can be effectively alleviated by layer-wise initialization. In 2012, AlexNet won the ImageNet competition, 16 proving the practicability of deeper CNNs. Nowadays, a variety of complex CNN-based networks are designed to improve detection accuracy and shorten detection time, such as faster R-CNN, 17 Yolo, [18][19][20] and SSD (single-shot detector). 21 The VGG19 network 22 is a deep convolutional neural network (DCNN) that uses smaller convolution and pooling kernels, making it easier to adjust and apply.

Ship detection
Due to the development of CNN, many CNN-based ship identification methods have emerged in recent years. In 2017, Kang et al. 23 proposed a contextual region-based CNN, which can effectively detect ship targets in synthetic aperture radar (SAR) images. In 2018, Zhao et al. 24 proposed a coupled CNN network for small and dense ship identification, which demonstrates the ability of CNN to detect surface targets. In the same year, Oliveau and Sahbi 25 proposed semisupervised deep attribute networks, which enhanced the recognition effect of CNN on fine-grained images through shallow and deep attributes. In 2019, Zhang et al. 26 improved faster R-CNN, reduced the impact of environmental factors on target detection, and achieved good results in real-time surface target detection. In the same year, Lin et al. 27 proposed the squeeze and excitation mechanism to further improve the detection capability of faster R-CNN for ships in SAR images.

RA-CNN
RA-CNN consists of three cascading scale sub-networks with the same network structure but different and independent network parameters. Each scale sub-network contains two different kinds of networks, namely, a VGG19 network for classification and an attention proposal network (APN) for target location. At each scale level, the image features are extracted from the input image and classified by VGG19; then the APN locates the feature area based on the extracted features, and the feature area is cut out and enlarged as the input to the next scale network. By fusing the three scales' outputs, RA-CNN can determine the category of the fine-grained image. 28 The advantage of RA-CNN is that it can gradually focus on the feature area during learning through its multi-scale architecture. In addition, two loss functions are designed in RA-CNN, namely, the classification loss and the inter-scale loss. Through cross-training, the classification and target localization networks promote each other, leading to quicker convergence of the model weights.

SDP
In 2016, Yang et al. 29 proposed the SDP algorithm to improve the accuracy of CNN-based object detection. The output of the last pooling layer of a CNN is usually used to determine the type of object in the image. However, this can lead to a problem: if the target feature area is too small, multiple convolutions may diminish the discriminating power of the features and affect the final classification. The SDP algorithm extracts features from different CNN convolutional layers according to the size of the candidate region. For small candidate regions, the features are selected from the earlier convolutional layers; for large candidate regions, the features are selected from the later layers.

Method
Labeling fine-grained images requires domain knowledge and careful evaluation because of their high similarity, which can be very time-consuming and labor intensive. RA-CNN, through weakly supervised learning, needs only the category label of the image for classification and does not require corresponding bounding box information to locate the feature area. 30 However, the shape of the feature area generated by the APN in RA-CNN is fixed and thus not conducive to learning feature areas with special geometric appearances. In other words, the APN ignores the direct connection between the feature area and VGG19 after location. Therefore, in this study, we use the SDP and multi-feature area joint discrimination methods to improve the model.

Preprocessing
To standardize the input images fed into the VGG-SDP network, we adjusted the training and test images to the 224 × 224 specification to meet the input requirement of the network. The size of the image should not be less than 80 × 80; otherwise, the network detection effect will decrease. Each image is an RGB three-channel map with pixel values ranging from 0 to 255. To reduce the problem of training non-convergence caused by singular pixel values and to speed up the convergence of the initial training iterations, 31 the pixel values are mapped to [−1, 1], as shown in equation (1)

x'_{i,j} = x_{i,j} / 127.5 − 1    (1)

where i, j ∈ [0, 223], and x_{i,j} and x'_{i,j} represent the unmodified and modified pixel values, respectively.
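As a minimal sketch, the normalization of equation (1) can be written as follows. The linear mapping (divide by 127.5, subtract 1) is one standard way of sending [0, 255] to [−1, 1] and is assumed here rather than taken from the authors' code; the function name is illustrative:

```python
import numpy as np

def preprocess(image):
    """Map uint8 RGB pixel values from [0, 255] to [-1, 1], as in equation (1).
    Resizing to 224 x 224 is assumed to have been done beforehand."""
    return image.astype(np.float32) / 127.5 - 1.0

# A mid-gray image maps to roughly 0, black to -1, white to +1.
out = preprocess(np.zeros((224, 224, 3), dtype=np.uint8))
```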

Network architecture
The proposed network is designed to detect and classify ship targets using the preprocessed photoelectric images as the input. In the results, we render the target in a square box and display the ship category. The ship target recognition flowchart is shown in Figure 1. The improved RA-CNN network still adopts the original three-scale architecture. 12 We replace the target classification network VGG19 in each scale layer with a VGG-SDP network. Then, the feature location network APN is replaced with an MF-APN network. A generalized overview of the proposed network is shown in Figure 2. The process of feature extraction and target location is detailed as follows:
1. Feed the input image to the classification network v_1 in the first scale layer for feature extraction.
2. Send the output of the fifth pooling layer P_5 in the network v_1 to the location network m_1, and obtain the target position and size N.
3. Choose the most appropriate maxpooling layer's output in v_1 according to the target size N; after the fully connected operation, the softmax function is used to obtain the prediction label Y^(1) of the first scale layer.
4. Subject to the target position and size N in Step 2, the network m_1 extracts M feature maps (in this article, M is set to 3) and enlarges them to 224 × 224 as the input of the next scale layer.
5. Send the M feature maps to the classification network v_2 in the second scale layer for deeper feature extraction, and repeat Steps 2 and 3 for each feature map; then, the prediction probabilities of the M feature maps are fused into the prediction label Y^(2) of the second scale layer. The location network in the second scale layer uses only the normal APN, so M feature maps are still generated for the third scale layer.
6. Repeat Steps 2 and 3 to fuse the prediction probabilities of the feature maps into the prediction label Y^(3) of the third scale layer.
7. The final target is positioned as a square feature box in the first scale layer, and the target category is the fusion of the predicted labels of the three scale layers.
The specific flowchart is shown in Figure 2.
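The data flow of the steps above can be sketched as a runnable outline. This is a hypothetical sketch, not the authors' implementation: extract_features, locate, classify, and crop_and_resize are trivial stand-ins for VGG-SDP and (MF-)APN, and only the cascade structure (three scales, M = 3 boxes at the first scale, one box per map afterwards, fusion within and across scales) mirrors the text:

```python
import numpy as np

N_CLASSES = 8  # eight ship categories in the self-built dataset

def extract_features(x):                  # stand-in for VGG19 conv blocks
    return x.mean(axis=(0, 1))

def locate(feats, n_boxes):               # stand-in for (MF-)APN
    return [(112, 112, 56)] * n_boxes     # (t_x, t_y, t_l) square boxes

def classify(feats):                      # stand-in for SDP + FC + softmax
    logits = np.ones(N_CLASSES)
    return np.exp(logits) / np.exp(logits).sum()

def crop_and_resize(x, box, size=224):
    tx, ty, tl = box
    crop = x[ty - tl:ty + tl, tx - tl:tx + tl]
    return np.resize(crop, (size, size, 3))  # placeholder for bilinear

def detect(image, M=3):
    inputs, scale_preds = [image], []
    for s in range(3):                    # three cascading scale layers
        probs, next_inputs = [], []
        for x in inputs:
            feats = extract_features(x)
            boxes = locate(feats, M if s == 0 else 1)  # MF only at scale 1
            probs.append(classify(feats))
            next_inputs += [crop_and_resize(x, b) for b in boxes]
        scale_preds.append(np.mean(probs, axis=0))     # fuse within scale
        inputs = next_inputs
    return np.mean(scale_preds, axis=0)   # fuse the three scales' labels
```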

VGG-SDP classification
Due to its robustness and superior feature extraction capability, VGG19 is used as the basic classification network in RA-CNN. It consists of 16 convolutional layers, 5 maxpooling layers, and 3 fully connected layers, where the convolutional and maxpooling layers are divided into 5 convolutional blocks. 22 The fully connected layers use the output of P_5 for the softmax operation to obtain the target's prediction label.
The sizes of the feature areas obtained by the APN can vary. However, the classification network always determines the category based on the output of P_5. When the APN's output feature area is small, P_5 cannot describe the image features well because multiple convolutions retain less of the original information. Therefore, by incorporating the SDP method into the VGG19 network, the network can intelligently select the most appropriate convolution block's output for classification, enhancing the network's ability to learn small ship features. The network architecture diagram is shown in Figure 3.
The input image is first fully processed by the classification network. Then the MF-APN network calculates N, the size of the feature area, and the VGG-SDP network selects the best pooling result among three pooling layers according to N to represent image I for classification. The selection criteria are as follows

f(I) = { P_3(I), 0 < N ≤ 64;  P_4(I), 64 < N ≤ 128;  P_5(I), 128 < N ≤ 224 }    (2)

[Figure 2. Generalized overview of the proposed network. p_t represents the predicted probability of the real category; L_inner represents the classification loss of each scale, which is the result of the cross-entropy operation between the real category label and the predicted category label; L_scale represents the loss between adjacent scales.]
where P_3(I) and P_4(I) represent the outputs of the third and fourth pooling layers, respectively; f(·) selects the optimal pooled output according to the pixel number of target regions returned by the MF-APN network. Choosing powers of 2 as the segmentation points is helpful for network training. 29 F(·) indicates the final fully connected and softmax operations. When N is large, the output of the final P_5 should be selected to describe the features of the large target. When N is small, P_3, which retains more information, should be selected. The classification result of the image I is a fusion of the prediction labels of the three-scale VGG-SDP networks. First, each prediction label Y^(s) is normalized and put into a fully connected layer, and then the softmax function is used to obtain the final predicted classification label. 12
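The selection rule of equation (2) reduces to a simple dispatch on the region size N. The function name and string return values below are illustrative, not from the original implementation:

```python
def select_pooling_layer(n):
    """SDP selection rule of equation (2): pick the pooling block whose
    output represents a candidate region of size n (0 < n <= 224, the
    size measure returned by MF-APN)."""
    if n <= 64:
        return "P3"    # small regions: earlier layer, more spatial detail
    elif n <= 128:
        return "P4"
    else:
        return "P5"    # large regions: deepest layer
```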

MF-APN location
The original location network APN contains two fully connected layers, and the feature area is obtained from the input P_5. Once the network detects the feature area, it crops and enlarges the area to obtain the input of the next scale layer. This method uses a square box to mark the target, which is prone to include parts that do not belong to the feature or to cut off parts that do. Especially for ships, features such as the forecastle, bridge, smoke outlet, and radar antenna have specific regular geometric shapes and are mostly long rectangles, so they are easily counted as background when only square boxes are used. In this article, we use different a priori rectangular boxes to mark the target. Finally, classifying multiple feature regions and performing joint discrimination can identify the target more robustly. Therefore, a multi-feature attention proposal network (MF-APN) method is proposed.
Assume that the APN outputs t_x and t_y, the coordinates of the center point of the target area, and t_l, half the side length of the square box. N is the number of pixels in the square box, that is, the target area. W_i and H_i represent half the length and half the width of the ith prior rectangle, respectively, and the scale factor k_i represents the length-width ratio of the ith rectangle, as shown in Figure 4.
Then the following equations can be derived. Since N is the number of pixels in the square target area

N = (2 t_l)^2    (3)

The scale factor gives

W_i = k_i H_i    (4)

Assume that the area of the a priori rectangular box is equal to the area of the output square box; then

4 W_i H_i = 4 t_l^2    (5)

Substituting equation (4) into equation (5), we get new expressions for the half-length and half-width

W_i = ⌊t_l √k_i⌋,  H_i = ⌊t_l / √k_i⌋    (6)

The ⌊·⌋ function means rounding down, to prevent the square root operations from generating decimals. We use the top left and bottom right corners of the a priori rectangle to represent the rectangle; the subscript ul stands for the upper left corner and br for the lower right corner. Then the coordinates of the two points are

x_ul = t_x − W_i,  y_ul = t_y − H_i,  x_br = t_x + W_i,  y_br = t_y + H_i    (7)

Considering that the neural network backpropagation algorithm requires differentiable operations, we cannot use an ordinary hard cropping function. Therefore, we design a differentiable crop function D(·) 12

D(x, y) = [h(x − x_ul) − h(x − x_br)] · [h(y − y_ul) − h(y − y_br)]    (8)

where h(·) represents a sigmoid function

h(x) = 1 / (1 + exp(−ax))    (9)

When a is large enough, the value of h(·) approaches 1 wherever the pixel point lies inside the feature area; a = 10 is set in this article. The final intercepted target area M_i can be expressed in the following form

M_i = I ⊙ D    (10)

where the operation ⊙ represents element-wise multiplication.
The bilinear interpolation method is then used to enlarge the target areas to generate the inputs of the next scale. If multiple a priori rectangular boxes were chosen in every scale after the first, the number of feature areas would grow multiplicatively. Considering the computational cost, only one rectangular box is extracted per feature map in the second scale layer. In addition, t_l in the new scale cannot be less than one-third of that in the previous scale, preventing the feature region from becoming too small to contain the feature portion.
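The differentiable rectangle cropping above can be sketched as follows, using the sigmoid-difference mask with a = 10 as described in the text; the function names and example box parameters are illustrative:

```python
import numpy as np

def h(x, a=10.0):
    """Steep sigmoid used as a soft step function (a = 10 as in the text).
    The argument is clipped to avoid floating-point overflow in exp."""
    return 1.0 / (1.0 + np.exp(np.clip(-a * x, -60.0, 60.0)))

def soft_crop_mask(size, tx, ty, tl, k=1.0):
    """Differentiable crop mask D for one a priori rectangle:
    half-width W = floor(tl * sqrt(k)), half-height H = floor(tl / sqrt(k)).
    Returns values ~1 inside the box and ~0 outside."""
    W = int(np.floor(tl * np.sqrt(k)))
    H = int(np.floor(tl / np.sqrt(k)))
    xs = np.arange(size)[None, :]   # column coordinates
    ys = np.arange(size)[:, None]   # row coordinates
    mask_x = h(xs - (tx - W)) - h(xs - (tx + W))
    mask_y = h(ys - (ty - H)) - h(ys - (ty + H))
    return mask_x * mask_y

# The cropped region is M_i = image * mask (element-wise), per equation (10).
mask = soft_crop_mask(224, tx=112, ty=112, tl=40, k=2.0)
```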

Network loss
The improved RA-CNN loss function is still divided into two parts, the same as in the original RA-CNN: 12 namely, the intra-scale classification loss L_inner and the inter-scale loss L_scale. L_inner is a general cross-entropy loss, and L_scale is calculated as follows

L_scale = max(0, p_t^(s) − p_t^(s+1) + 0.05)

By taking the maximum value, the network is required to update when the true-class probability at the finer scale p_t^(s+1) is smaller than that at the previous scale p_t^(s), which forces the network to produce a higher prediction probability at the finer scale. The loss between scales is only updated when p_t^(s) − p_t^(s+1) + 0.05 > 0. The margin of 0.05 prevents both sides from stagnating when they are both equal to zero.
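The inter-scale loss reduces to a hinge on the true-class probabilities of adjacent scales; a minimal sketch:

```python
def inter_scale_loss(p_s, p_s1, margin=0.05):
    """Hinge-style inter-scale loss: penalize the model when the finer
    scale's true-class probability (p_s1) does not exceed the coarser
    scale's (p_s) by at least the margin."""
    return max(0.0, p_s - p_s1 + margin)
```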
Since there are M feature rectangles in the second and third scales, the final predicted category label Y^(s) is the weighted average of the predicted probabilities of the M feature rectangles. The prediction probability p_j^(s) of the jth class in Y^(s) is

p_j^(s) = Σ_{i=1}^{M} a_i p_{i,j}^(s)

where M represents the number of rectangular boxes and a_i represents the weight of the ith rectangular box, with Σ_{i=1}^{M} a_i = 1. In this article, we set M = 3 (including a square box), and the values of the scale factor k are 2, 1, and 0.5, respectively, with corresponding weights a_i.
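The weighted fusion of the M rectangles' predictions is a plain weighted average; the example weights below are illustrative, since the article's exact weight values are not reproduced here:

```python
import numpy as np

def fuse_predictions(probs, weights):
    """Weighted average of M feature rectangles' predicted probabilities.
    probs has shape (M, n_classes); weights has shape (M,) and must sum to 1."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must sum to 1"
    return weights @ probs

# Example with M = 3 rectangles over 2 classes and illustrative weights.
p = fuse_predictions([[0.7, 0.3], [0.5, 0.5], [0.6, 0.4]], [0.25, 0.5, 0.25])
```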

The hardware environment
The experimental environment is TensorFlow 1.10.0, OpenCV3, CUDA 9.0, and CUDNN 7.0. The tasks are executed on a computer with Intel Core i7-7800X with 16 GB RAM and an NVIDIA GTX1080Ti with 11 GB memory.

The dataset
Many open-source fine-grained image datasets can be found online, such as CUB Birds, Stanford Dogs, and Stanford Cars. These datasets can test the performance of fine-grained classification networks well. However, this article proposes a special algorithm for ships, and essentially no ship datasets can be found online. Therefore, a self-built ship dataset is used to verify the network performance. The dataset consists of eight categories, namely, container ships, sailboats, tugs, passenger ships (PS), tankers, ocean surveillance ships (OSS), engineering ships, and fishing boats (FB). There are 2635 samples in the dataset, with about 300 samples in each category. The number of images is still relatively small, and we will continue to expand it. You can find our dataset at https://www.dropbox.com/sh/pyyfko-tec7h9r5l/AAA8yVwhr3N7VwndU_lOCsKLa?dl=0 The ratio of samples used for training and testing is approximately 4:1. The specific distribution of the dataset is shown in Table 1. Each ship class includes the front view, rear view, side view, and side-overhead view to ensure accurate identification from all directions. Sample images are shown in Figure 5. The image resolutions are mostly between 500 × 400 and 300 × 200. The target accounts for roughly 50%-70% of the picture.

Parameter settings and evaluation metrics
In this article, we take several neural network acceleration methods to promote network convergence and accelerate network training. The drop rate of the neural network is set to 0.2, and the initial learning rate is set to 0.1. After 100,000 iterations, the learning rate is reduced to one-tenth of the original. To speed up convergence, the network will be trained 10 rounds on the ImageNet dataset in advance.
We use accuracy as the evaluation metric of neural network performance; 32 the accuracy is calculated as follows

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is true positive (a positive sample predicted correctly), TN is true negative (a negative sample predicted correctly), FP is false positive (a negative sample predicted as positive), and FN is false negative (a positive sample predicted as negative).
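The accuracy formula translates directly to code:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)
```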

Experiment 1: calculating the amount and time
For deep learning networks, computational complexity is a very important factor in evaluating the efficiency of a network. The RA-CNN network structure is effectively designed to improve fine-grained detection performance. However, due to the adoption of multiple scale layers, the computational complexity of the network is increased.

Table 1. Distribution of the dataset (samples per category).

          Container  Sailboat  Tug  PS   Tanker  OSS  Engineering  FB
Training  304        256       240  256  160     256  368          272
Test      69         72        62   60   43      64   84           69

PS: passenger ships; OSS: ocean surveillance ships; FB: fishing boats.

Meanwhile, the MF method in this article improves network accuracy by adding three feature regions at the second and third scale layers. That is, the network parameters of the second and third scale layers are increased by nearly three times, and the overall parameter amount and calculation time increase correspondingly. So it is necessary to test the impact of the network complexity. The overall network structure and parameters are shown in Table 2. 24 It is worth mentioning that the parameter values in the APN-FC1 layer are uncertain. This is because the input may come from one of three different pooling layers, so the MF-APN part prepares three input entries. The computational costs of different models are illustrated in Figure 6. The Flops 33 in the figure are used to measure the amount of network calculation, and the calculation formulas for the convolutional layer and the fully connected layer are

FLOPs_conv = 2 · H · W · C_in · C_out · H_out · W_out
FLOPs_fc = 2 · N_in · N_out

where H and W represent the size of the convolution kernel, C_in the number of input channels, C_out the number of output channels, H_out and W_out the size of the output feature map, and N_in and N_out the numbers of inputs and outputs of the fully connected layer, respectively.
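Counting one multiply and one add per kernel weight per output position is a common FLOPs convention and is assumed in this sketch (biases and activations ignored):

```python
def conv_flops(h_k, w_k, c_in, c_out, h_out, w_out):
    """Approximate FLOPs of a convolutional layer: a multiply-add pair
    for every kernel weight at every output position."""
    return 2 * h_k * w_k * c_in * c_out * h_out * w_out

def fc_flops(n_in, n_out):
    """Approximate FLOPs of a fully connected layer."""
    return 2 * n_in * n_out

# Example: the first VGG conv layer (3x3 kernels, 3 -> 64 channels, 224x224 output).
flops1 = conv_flops(3, 3, 3, 64, 224, 224)
```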
From Figure 6, it can be seen that RA-CNN takes nearly 0.7 s longer than VGG19 in single-frame image processing time due to its three-layer progressive structure. The marker size in Figure 6 is proportional to the amount of network parameters. Therefore, it can also be seen from Figure 6 that the SDP algorithm hardly increases the parameters and calculation time, while the MF algorithm adds nearly 0.6 s, and the floating-point calculation amount also increases greatly. It takes about 1.5 s for the new network to process a single frame, which is not feasible for real-time video detection. Considering that each scale's input is calculated from the previous scale, this serial structure is difficult to optimize. However, the three feature areas can be calculated independently at the same time, so the MF portion can use distributed computing to shorten the processing time to 0.92 s. From the experimental results, the network studied in this article can be used for real-time recognition of single-frame images. However, when applied to video recognition, a high frame rate cannot be guaranteed. It is still necessary to reduce the overall floating-point calculation by optimizing the network parameters.

Experiment 2: accuracy of different baseline models
The model used in this article is based on RA-CNN, while RA-CNN is based on VGG19. We propose two improvement methods, which are SDP for choosing pooling results flexibly and MF-APN for more precise category discrimination.
In this experiment, we combine the RA-CNN and two proposed methods one by one to test the effect of a single method. Finally, we compare the method of this article with baseline models. The results are shown in Table 3. The overall accuracy curve is shown in Figure 7.
It can be seen from Table 3 that the accuracy of RA-CNN is 5.8% higher than that of VGG19 as the basic detection network, indicating that RA-CNN is a better network for fine-grained image detection and classification. Using SDP and MF alone yields accuracies 0.5% and 0.9% higher than the original RA-CNN network, respectively. When using both SDP and MF, the accuracy improves by up to 2%. This implies that adaptive feature pooling and multiple feature maps can each effectively improve the accuracy, and the combination of the two methods achieves better results. It can be seen from Figure 7 that the unmodified VGG19 reaches a certain accuracy faster. In contrast, the RA-CNN-based network takes more time to train MF-APN, but the final result is much better.
In Figure 8, the effect of the network detection ship is demonstrated. In order to better display the type of ship, the feature area generated by MF-APN is used. It must be acknowledged that, during the testing process, the network will still classify some ships into the wrong category, so in the future work it is still necessary to expand the dataset and optimize the network.

Experiment 3: accuracy and attention region of different scales
The original RA-CNN model includes three scale layers to localize the feature map step by step. However, the accuracy of the third scale layer can be lower than those of the first two layers. In this experiment, we test the impact of the new methods on the accuracy of different scales. The experimental conditions are the same as those in Experiment 1. The difference is that only the individual scale layers of RA-CNN and RA-CNN + SDP + MF are tested. The results are shown in Table 4.
It can be seen from Table 4 that the accuracy of using only the second scale in the original RA-CNN is 0.9% higher than that when using only the third scale. But this gap is narrowed to 0.2% in the method proposed in this article. This proves that the new method can compensate for the accuracy impact of global information loss on a finer scale.
We also test the attention region on the trained model. Figure 9 shows the attention region of each scale of the network. Figure 9(a) shows the original picture and Figure 9(b) shows the input for the first scale  after preprocessing. From Figure 9(c), we can see that, after MF-APN, the first scale selects three attention areas. These areas will be intercepted once and the corresponding pictures in Figure 9(d) will be generated. It can be seen from the results that, after training, the network can accurately locate the target's attention region in the second scale.

Conclusion
Based on the fine-grained detection network RA-CNN, this article proposes an improved identification method for intelligently distinguishing fine-grained ship categories. As an end-to-end weakly supervised detection network, it can be trained with no bounding boxes provided. The SDP algorithm allows the classification network to adaptively select the most representative pooling layer output, which enhances the network's ability to learn local features. The MF method is proposed for the special geometric appearance characteristics of ships. By extracting multiple prior feature boxes in the image to comprehensively describe the feature regions, the algorithm avoids situations where a single feature area may contain too much unrelated background. At the same time, the weighted-average decision on the prediction label enhances the influence of the rectangular boxes. Using batch normalization and dropout, adjusting the learning rate, and using pretraining weights optimize the network model and accelerate the convergence of network training. Our experiments show that this method can quickly and accurately classify and detect ship photoelectric targets in complex backgrounds and provide early warning against collisions.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the National Science Foundation of China (Grant No. 61673259).