Explicit feature disentanglement for visual place recognition across appearance changes

In the long-term deployment of mobile robots, changing appearance brings challenges to localization. When a robot travels to the same place or restarts from an existing map, global localization is needed, where place recognition provides coarse position information. For visual sensors, appearance changes such as the transition from day to night and seasonal variation can reduce the performance of a visual place recognition system. To address this problem, we propose to learn domain-unrelated features across extreme appearance changes, where a domain denotes a specific appearance condition, such as a season or a kind of weather. We use an adversarial network with two discriminators to disentangle domain-related features and domain-unrelated features from images, and the domain-unrelated features are used as descriptors in place recognition. Given images from different domains, our network is trained in a self-supervised manner that does not require correspondences between these domains. Besides, our feature extractors are shared among all domains, making it possible to accommodate more appearance conditions without increasing model complexity. Qualitative and quantitative results on two toy cases are presented to show that our network can disentangle domain-related and domain-unrelated features from given data. Experiments on three public datasets and one proposed dataset for visual place recognition are conducted to illustrate the performance of our method compared with several typical algorithms. Besides, an ablation study is designed to validate the effectiveness of the introduced discriminators in our network. Additionally, we use a four-domain dataset to verify that the network can extend to multiple domains with one model while achieving similar performance.


Introduction
Place recognition is a vital ability for robots. Ranging from autonomous driving to flying robots for precision agriculture, different kinds of sensors are leveraged in different localization scenarios, among which cameras are gaining more and more popularity. The main advantage of imagery sensors is their low cost, compared to expensive light detection and ranging (LiDAR), inertial navigation systems (INS), etc. Visual localization has been studied for years, and many visual simultaneous localization and mapping (SLAM) systems have been proposed, 1,2 which achieve impressive performance under ideal conditions. Visual place recognition plays two roles in a SLAM system: (i) in the mapping stage, robots need to find loop closures, so as to reduce drift and build a globally consistent map, and (ii) in the localization stage, localization may sometimes fail, as in the kidnapped robot problem, which also calls for loop detection. Most loop detection algorithms firstly try to coarsely relocalize the robot against a known database using place recognition, followed by pose estimation. Thus, visual place recognition may affect both mapping accuracy and relocalization success rate.
As the community pays more and more attention to scenes with appearance changes, such as urban environments, new challenges arise for visual place recognition. A typical visual place recognition pipeline includes feature extraction, 3,4 feature matching, 5,6 and temporal fusion, 7,8 among which feature extraction is now the bottleneck. The recent success of deep learning has shown the great potential of neural networks as robust feature extractors. Thus, this article tries to improve the feature extraction module using deep learned features. The main challenge for handcrafted visual features is that they are sensitive to appearance changes, such as the shift from day to night and seasonal transitions. Some methods exploit deep learned features from supervised learning. 9,10 When the testing data are similar to the training data, these supervised methods perform very well. However, they require massive manually labeled data, which is labor-intensive and time-consuming. That may be unsuitable for some place recognition scenarios. Thus, self-supervised or unsupervised methods are preferred. The autoencoder learns features in a self-supervised way, and the output of its encoder has been shown to be suitable for place recognition. 11 In that work, the training data and testing data are quite different; therefore, it pays more attention to generalization ability. Compared to supervised learning, it sacrifices performance to obtain generalization. To find a good balance between performance and generalization, some researchers assume that training and testing data share some similar properties, such as the same place, the same sensors, and similar illumination, but without position-level alignment. 12,13 This assumption is reasonable in some applications, such as inspection robots working in the same place at different times.
ToDayGAN translates nighttime images into daytime style using a generative adversarial network (GAN), and the generated images are matched with images captured in the daytime. 12 Although nighttime images in the training and testing phases look similar, they still have subtle appearance differences because they are captured at different times. Our article also targets this setting. The main difference is that features in ToDayGAN are extracted from the translated images after image translation, while our method puts constraints directly on the features.
In this article, we set out to construct a neural network that can explicitly disentangle domain-unrelated and domain-related features of images from different domains, and the domain-unrelated features are used for place recognition. The only supervised information is the domain, where a domain denotes a specific appearance condition, such as daytime in spring or nighttime in summer. For example, by assuming that the appearance does not change too much during a short period, we can label all the images captured in that period as one domain according to the appearance condition at that time. By grouping images into different domains, the proposed network can find the invariant information among them. This idea is motivated by the hypothesis that, in the place recognition application, an image can be seen as the composition of place content and appearance content, where the place is the domain-unrelated content of a scene (e.g. corners or edges of buildings), whereas the appearance is the domain-related property (e.g. brightness of sunlight and type of season). Under this hypothesis, based on the definition of disentangled representation, 14 we disentangle place and appearance features using two encoders, which are part of an autoencoder. To ensure that the appearance feature only corresponds to domain-related content, an adversarial loss is applied on pairs of place features and appearance features. Besides, another adversarial loss is designed to constrain the place features to map to the same latent space across domains, such that the place feature does not contain domain-related information. As a result, the place feature is robust against appearance changes, and it is used as the descriptor in visual place recognition. The network is trained in a self-supervised manner without the requirement for aligned image sequences; only domain information is needed, which is also called weakly supervised in some articles.
This can be easily achieved by firstly collecting sessions of images under the same conditions and then marking each session as one domain. Besides, the network is shared across different domains, which allows our model to adapt to more domains without increasing the number of parameters. The main contributions of this article are listed as follows:
- A data modeling method is presented for visual place recognition, and a self-supervised feature learning method is proposed to disentangle the domain-unrelated and domain-related content of multiple-domain images.
- A disentangled feature learning network based on adversarial learning is proposed, which can be extended to multiple domains without increasing model complexity. This makes our network feasible for applications with limited resources.
- Two toy case studies are carried out to validate our feature disentanglement method with qualitative and quantitative results. We also try to interpret our network in this part.
- Experiments for place recognition are conducted on three public datasets and one newly proposed dataset. Our method shows favorable performance. We also release the source code for reproduction (https://github.com/dawnos/fdn-pr).
This article is an extended study based on our previous work. 15 One additional contribution is that we improve our network to achieve higher performance by reconstructing high-level image features instead of the original image. Another is that we try to interpret our network through theoretical discussion and an ablation study. Finally, more datasets and comparison methods are employed to verify the proposed disentanglement method.
The remainder of this article is organized as follows: related work is discussed and summarized in the second section. Then in the third section, we present the data model used in this article. Our method will be presented thoroughly in the fourth section. We will introduce the experiments in detail and show results in the fifth section. The conclusion will be made in the sixth section.

Handcrafted features
In the early years, handcrafted features were used for place recognition. These features can be classified into two categories, namely global features and local features. Methods based on handcrafted global features try to assign an appearance-invariant descriptor to each image directly. In some early works, histogram of oriented gradients (HOG) 24 features are used as descriptors to compute the distance between database and query images, where the gradient is able to overcome simple illumination changes. 7 Later, the GIST descriptor, which represents the high-frequency part of an image, is also leveraged because the human eye is more sensitive to it. 25 On the other hand, local-feature-based methods firstly extract a set of local features and then aggregate them into a global feature. In place recognition, performance is limited by the choice of local features, thus local features with appearance-invariant properties are preferred. Gradient-based local features such as the Scale-Invariant Feature Transform (SIFT) 3 and Oriented FAST and Rotated BRIEF (ORB) 4 are found to be robust to small appearance changes. In particular, the ORB descriptor is used in the place recognition (loop closure) module of the widely used ORB-SLAM system. 26 In addition, different aggregation algorithms are designed to generate global descriptors, such as the bag of visual words 27 and the Vector of Locally Aggregated Descriptors (VLAD). 28 Due to the limited robustness of handcrafted features, these methods are sensitive to extreme appearance changes and thus not preferable for visual place recognition under changing appearance.

Supervised features
After the great success of deep neural networks in the computer vision area in recent years, researchers have started to explore how visual place recognition can benefit from deep learning. In one of the earliest trials, different layers of AlexNet 29 are reported to have different place recognition performances. 30 The authors find that features from the middle layers are more robust against changing appearance. However, the AlexNet features are not good enough, because the network is pretrained on the ImageNet Large Scale Visual Recognition Challenge dataset, 31 which is different from the testing set. 11 To go further, different supervised methods are proposed, which constrain the extracted features of images from the same place to be similar. 9,10,32-35 One way is to use a classification network for place recognition where each place is a class, and this method can achieve comparable results. 9 The network is trained and tested on a dataset captured by static cameras at different times. This work does not need manual labeling, and it shows the potential of neural networks to distinguish places. However, as the places are fixed once training is finished, it cannot be extended to other scenes. To make the network more flexible, NetVLAD improves the competitive aggregation method VLAD 28 by using soft assignment, which makes it a differentiable module. 10 With this module, the feature network can be optimized by a triplet loss. Labeled data from Google Street View Time Machine are needed to construct the training tuples. Besides, another study assigns multiple images to one place and fuses their features together to boost performance, using two datasets aligned by the Global Positioning System (GPS). 32 In addition to improving the features, finding useful regions for place recognition can also help, such as stable regions of interest 36-38 and attention maps. 39-41
These supervised methods achieve impressive results, but the requirement for massive labeled data may be hard to fulfill in fast deployment.
It is also possible to make the extracted features more robust by postprocessing them before the feature matching module. 42,43 For example, the original AlexNet features 30 can be further improved by applying PCA and keeping only the components with small eigenvalues, because the principal components with large eigenvalues represent variations in the images. 42 Additionally, quantizing the features using hashing can speed up the matching process. 43 These methods are not competitors of our method; instead, they can be added to any feature extractor to improve performance.
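The PCA post-processing idea above can be sketched in a few lines of NumPy. This is a minimal illustration (the exact recipe of the cited work may differ): the principal components with the largest eigenvalues are assumed to carry appearance variation and are therefore discarded.

```python
import numpy as np

def keep_small_components(descriptors, n_drop):
    """Project descriptors onto principal components, then drop the
    n_drop directions with the LARGEST eigenvalues, which are assumed
    to capture appearance variation (illustrative sketch only)."""
    X = descriptors - descriptors.mean(axis=0)
    # Eigen-decomposition of the covariance matrix
    cov = X.T @ X / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    # Keep only the directions with small eigenvalues
    keep = eigvecs[:, : descriptors.shape[1] - n_drop]
    return X @ keep

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))           # mock descriptors
reduced = keep_small_components(feats, n_drop=4)
print(reduced.shape)  # (100, 12)
```

The reduced descriptors can then be matched with any distance metric, exactly as the original ones.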

Self-supervised features
Another branch of machine learning methods, namely self-supervised learning, does not rely on aligned data. The autoencoder 44 is a popular self-supervised network architecture, which is used in many place recognition studies. 11,45-47 The output features of the encoders exhibit robustness in place recognition. The robustness of the learned features can be improved by reconstructing corrupted input images 47 or by reconstructing the HOG of the input images. 11 One advantage of the latter is that the training sets are different from the testing set, demonstrating favorable generalization.
In recent years, adversarial learning is getting more and more attention in the literature on self-supervised learning. Inspired by the work of style transfer 13,48,49 using GAN, 50 some researchers try to transfer query images to match the style of database images, followed by local feature matching, global descriptor matching, or dense matching. 12,51-54 For example, ToDayGAN transfers nighttime images into daytime ones, extracts features using DenseVLAD, 55 and finally matches with daytime images in the database. 12 Each model in these methods is targeted at the two domains used in the training phase. When adding new domains, new models are needed, and the number of models increases rapidly. Another method is enhancing the pretrained NetVLAD 10 with semantic information, which is shown to be viewpoint-invariant. 56 It demonstrates that appearance-based descriptors, such as our method, can be improved to overcome changing viewpoints with this technique.

Data modeling
In this section, we formulate our problem and define disentangled representation. The input data are modeled as a multiple-domain generative process, and the feature extraction module is modeled as an inference process. These two processes are summarized in Figure 1. Based on these, we derive our definition at the end of this section.

Generative process
Images are the reflections of the world. Let W be the state space of the world. An image x ∈ X can be seen as the observation of a state in W, where X is the image space. They are connected by a generative process g : W → X, written as

x = g(w)    (1)

In our setting, images are from different domains, where each domain represents a specific type of appearance condition, such as daytime in summer or nighttime in winter. We assume that the world space W can be decomposed into a domain-unrelated space S and a domain-related space A, namely W = S × A. S contains place information (such as the structure of a building or the shape of a traffic sign) and A contains appearance information (such as sunny daytime in spring). Now, the generative process can be rewritten as

x = g(s, a)    (2)

Any image x ∈ X can be seen as the composition of a domain-unrelated state s ∈ S and a domain-related state a ∈ A under g. We assume that the states s and a are sampled independently from the latent spaces S and A, respectively.
Under the multi-domain setting, the latent space A can be divided into different subspaces, such that

A = A_1 ∪ A_2 ∪ · · · ∪ A_n    (3)

where n is the number of subspaces of A. With a_i ∈ A_i sampled from one of these subspaces, the generated image x_i can be written as

x_i = g(s, a_i)    (4)

Given a_i and a_j sampled from different subspaces, we assume that the generated images x_i and x_j follow different distributions, that is, they come from different domains. We will use the notation X_i to denote a domain (e.g. the set of images collected in spring) in the rest of the article. Accordingly, we have n domains.
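The generative model above can be simulated with a small NumPy sketch. The linear map standing in for g, the domain-dependent means, and all dimensions are hypothetical choices for illustration only; the article leaves g abstract.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, n_x = 5, 5, 50     # toy dimensions of S, A_i, and image space X

# Hypothetical linear generative map g (the article's g is abstract)
W_s = rng.normal(size=(n_x, n_s))
W_a = rng.normal(size=(n_x, n_a))

def generate(domain):
    """Sample s and a_i independently, then form x = g(s, a_i).
    Each domain gets its own mean for a_i, mimicking A = A_1 u ... u A_n."""
    s = rng.normal(size=n_s)                    # domain-unrelated state
    a = rng.normal(loc=2.0 * domain, size=n_a)  # domain-related state
    return W_s @ s + W_a @ a

x = generate(domain=0)
print(x.shape)  # (50,)
```

Images from the same place at different times would reuse s while redrawing a, matching the decomposition described above.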
Images taken from the same place at different times with different appearances share the same s but have different a. On the contrary, images collected continuously under similar appearances have identical a and different s.

Inference process
Our goal is to find two functions h_S and h_A that extract the disentangled representations ŝ_i and â_i:

ŝ_i = h_S(x_i)    (5)
â_i = h_A(x_i)    (6)

They are estimated by the inference processes h_S and h_A, and we will refer to them as the place feature and the appearance feature, respectively. Naturally, we desire the place feature to be domain-unrelated and the appearance feature to be domain-related. A feature pair z = (ŝ, â) consists of a place feature ŝ and an appearance feature â, and all such pairs z constitute the feature space Z.
As s and a_i in Eq. (4) are independent variables, ŝ_i and â_i are also desired to be disentangled. Formally, a disentangled representation requires that a change in ŝ_i only corresponds to a change in s, while a change in â_i only corresponds to a change in a_i. 14 In our setting, disentanglement requires that ŝ_i and â_i be unrelated. Additionally, these two features should have semantic meaning. For example, in our problem, it is required that ŝ_i represents the place contents, such as the buildings and traffic signs in sight, while â_i represents the appearance information, such as the weather and the season. In the ideal state, a change in ŝ_i should represent a change in place (e.g. moving 5 m forward), while a change in â_i should represent a change in appearance (e.g. shifting from sunny to rainy).

Adversarial disentangled feature learning
In this section, we present our feature disentanglement network in detail, including network architecture and the loss functions, which are summarized in Figure 2. At last, we introduce how to train the network and extend it to multiple domains.

Network architecture
The motivation of the proposed network is to extract disentangled features from given images. To achieve this, we explicitly constrain the features to fulfill the disentanglement requirement.
Firstly, our method uses an autoencoder as the feature extractor. The autoencoder is a widely used self-supervised machine learning method, in which the encoder encodes the input as a feature and the decoder tries to reconstruct the original input from that feature. Our encoder is composed of two parts, namely the place encoder E_S and the appearance encoder E_A, which are used to approximate h_S and h_A, respectively. The decoder G is used to reconstruct the input image, which ensures that the network learns meaningful features. Additionally, the dimensions of the place and appearance features are usually smaller than the input dimension, leading to a more compact representation. A pure autoencoder is not enough for disentanglement, as there is no constraint forcing ŝ and â to be disentangled. As discussed above, we also need to ensure that a change in ŝ only corresponds to a change in s, and a change in â only corresponds to a change in a. We solve this problem in two steps. Firstly, we introduce an appearance compatibility discriminator into the autoencoder, which is designed to ensure that ŝ and â are unrelated. Secondly, a place domain discriminator is designed to ensure that ŝ only carries place information, leading to the disentanglement of ŝ and â. For simplicity, we start from two domains (X_1 and X_2), where x_1 and x_2 are sampled from domains X_1 and X_2, respectively. Our autoencoder follows the widely used bottleneck architecture: the input is downsampled into two smaller feature maps by the encoders, while the decoder tries to recover the input from those two feature maps. The discriminators take combinations of those feature maps as input and produce a one-dimensional output (Figure 2). The detailed architectures for the different experiments are listed in the appendix.
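The data flow through the five modules can be sketched at the shape level. The linear maps below are mock stand-ins for the convolutional encoders, decoder, and discriminators; all dimensions are toy choices made only to show the input/output interfaces.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_s, n_a = 50, 10, 10    # toy sizes; the real modules are conv nets

W_es = rng.normal(size=(n_x, n_s))        # place encoder E_S
W_ea = rng.normal(size=(n_x, n_a))        # appearance encoder E_A
W_g = rng.normal(size=(n_s + n_a, n_x))   # decoder G
W_dapp = rng.normal(size=(n_s + n_a, 1))  # appearance compatibility discriminator
W_dpla = rng.normal(size=(2 * n_s, 1))    # place domain discriminator

x = rng.normal(size=n_x)                  # one mock "image"
s_hat, a_hat = x @ W_es, x @ W_ea         # bottleneck features
x_rec = np.concatenate([s_hat, a_hat]) @ W_g      # reconstruction from (s, a)
d_app = np.concatenate([s_hat, a_hat]) @ W_dapp   # one-dimensional output
d_pla = np.concatenate([s_hat, s_hat]) @ W_dpla   # one-dimensional output
print(x_rec.shape, d_app.shape, d_pla.shape)  # (50,) (1,) (1,)
```

Note how both discriminators consume *pairs* of bottleneck features and emit a single score, matching the one-dimensional output described above.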

Autoencoder
To measure the reconstruction quality, different distances can be used. In our previous version, we used the L2 loss for reconstruction, which can be expressed as 15

L_rec^i = E_{x_i ∼ p(x_i)} [ ||G(E_S(x_i), E_A(x_i)) − x_i||_2 ]    (7)

where x_i is an image sampled from the data distribution p(x_i) of domain X_i. For specific tasks, such as place recognition, the L2 loss is not the best choice. In this article, for the place recognition task, we try to further improve the reconstruction quality by constraining the reconstructed image to have high-level features similar to those of the original image. The high-level features are more meaningful because they carry more "perceptual" information. 57 Formally, this can be written as

L_rec^i = E_{x_i ∼ p(x_i)} [ ||F(G(E_S(x_i), E_A(x_i))) − F(x_i)||_2 ]    (8)

where F is a pretrained VGG feature extractor. 58 With this perceptual loss, the reconstructed images are more semantically meaningful than those obtained with the L2 loss, which also leads to more meaningful features. This helps boost place recognition performance, as shown in the experiments.
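A minimal sketch of the perceptual reconstruction loss follows. The function F here is a fixed random projection with a ReLU, standing in for the pretrained VGG extractor (an assumption for illustration); only the interface matches the article.

```python
import numpy as np

def F(img):
    """Stand-in for the pretrained VGG feature extractor: one fixed
    'conv + ReLU' stage built from a seeded random projection."""
    rng = np.random.default_rng(42)          # fixed weights across calls
    W = rng.normal(size=(img.size, 16))
    return np.maximum(img.ravel() @ W, 0.0)

def perceptual_loss(x_rec, x):
    """L2 distance between high-level features of the reconstruction
    and of the original image, rather than between raw pixels."""
    diff = F(x_rec) - F(x)
    return float(np.sqrt(diff @ diff))

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))
print(perceptual_loss(x, x))  # 0.0
```

A perfect reconstruction yields zero loss, while a pixel-wise L2 loss would penalize semantically irrelevant differences (e.g. small shifts) much more heavily.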

Appearance compatibility discriminator
As pointed out in the last section, disentanglement requires that ŝ only corresponds to s while â only corresponds to a. We first relax it into a "weaker" version: ŝ and â should be unrelated. As stated before, we want to constrain the features directly. We construct two tuples, (ŝ_1, â_1) and (ŝ_2, â_1), and constrain them to follow the same distribution.
To fulfill this constraint, we design a discriminator, the appearance compatibility discriminator, denoted by D_app. The input of D_app is the constructed tuples (ŝ_1, â_1) and (ŝ_2, â_1). We use an adversarial network to minimize the distance between the distributions of these two tuples. The discriminator D_app and the encoders E_S and E_A are optimized alternately: the encoders try to learn features that minimize the distance between (ŝ_1, â_1) and (ŝ_2, â_1), while the goal of D_app is to tell whether the given tuples follow the same distribution. In addition, instead of using the Jensen–Shannon divergence as the distance metric, 50 we use the Pearson χ² divergence for faster convergence, which amounts to an L2 loss without the sigmoid function. 59 During training, D_app and the encoders (E_S and E_A) are updated successively. The loss function for D_app is

L_app,1_D = E[(D_app(ŝ_1, â_1) − 1)²] + E[(D_app(ŝ_2, â_1))²]    (9)

where ŝ_1, ŝ_2, and â_1 are given by E_S(x_1), E_S(x_2), and E_A(x_1), and x_1 and x_2 are images from domains X_1 and X_2, respectively. The encoders E_S and E_A are encouraged to confuse D_app, so their loss function can be expressed as

L_app,1_Adv = E[(D_app(E_S(x_1), E_A(x_1)))²] + E[(D_app(E_S(x_2), E_A(x_1)) − 1)²]    (10)

It is worth noticing that E_S(x_1), E_S(x_2), and E_A(x_1) in Eq. (10) are in fact ŝ_1, ŝ_2, and â_1 in Eq. (9). We write them differently to emphasize that E_S is fixed when updating D_app (Eq. (9)), while D_app is fixed when updating E_S and E_A (Eq. (10)).
Equations (9) and (10) are formulated for the case where the first input image is sampled from X_1. When the first image is from X_2, we can derive L_app,2_D and L_app,2_Adv by exchanging the two domains in Eqs. (9) and (10).
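The least-squares (Pearson χ²) losses for D_app can be illustrated numerically. The discriminator outputs below are made-up toy values, with target 1 for same-image tuples (ŝ_1, â_1) and target 0 for cross-image tuples (ŝ_2, â_1); the encoder loss flips those labels, one plausible reading of the Eq. (9)/(10) pairing.

```python
import numpy as np

def d_app_loss(d_same, d_cross):
    """Discriminator update (Eq. (9) style): same-image tuples should
    score 1, cross-image tuples should score 0 (least-squares GAN)."""
    return float(np.mean((d_same - 1) ** 2) + np.mean(d_cross ** 2))

def encoder_adv_loss(d_same, d_cross):
    """Encoder update (Eq. (10) style): labels are flipped, so the
    encoders push the two tuple distributions together."""
    return float(np.mean(d_same ** 2) + np.mean((d_cross - 1) ** 2))

# Made-up discriminator outputs for a batch of four tuples
d_same = np.array([0.9, 0.8, 1.0, 0.7])    # D_app(s1, a1)
d_cross = np.array([0.1, 0.2, 0.0, 0.3])   # D_app(s2, a1)
print(round(d_app_loss(d_same, d_cross), 3))      # 0.07
print(round(encoder_adv_loss(d_same, d_cross), 3))  # 1.47
```

No sigmoid appears anywhere, which is exactly the "L2 loss without sigmoid" property that speeds up convergence.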

Place domain discriminator
With the appearance compatibility discriminator, we can only ensure that ŝ and â are unrelated. An extreme situation is that ŝ contains both place and appearance information, while â learns little information. This happens when the dimension of ŝ is large enough to encode both s and a. It often holds true because we do not know the actual dimensions of s and a, and thus have to set the dimension of ŝ larger than that of â; the reason is that place information is often much richer than appearance information in any given image. To address this problem, we construct another discriminator to ensure that ŝ only contains place information. If satisfied, combined with D_app, a change in ŝ will only correspond to a change in s. At the same time, all the appearance information will be represented in â owing to the reconstruction constraint. Thus, all the place features from X_1 and all the place features from X_2 are desired to follow the same distribution. When this holds true, neither ŝ_1 nor ŝ_2 is affected by domain-related states.
Supervised methods like NetVLAD constrain features from the same place with different appearances to be the same, 10 which requires alignment information. As we hope to avoid the dependency on aligned data, we accomplish this in a different way. Constraining ŝ_1 and ŝ_2 to follow the same distribution is feasible using adversarial learning, but it is hard to extend to multiple domains, because such a discriminator is domain-specific, and more domains require more discriminators. To overcome this, we construct the place feature pairs (ŝ_1, ŝ_1′) and (ŝ_1, ŝ_2), where ŝ_1′ is another place feature from domain X_1. In particular, we sample another image x_1′ from domain X_1 and extract its place feature ŝ_1′. Then (ŝ_1, ŝ_1′) and (ŝ_1, ŝ_2) are constrained to follow the same distribution, which implies that ŝ_1 and ŝ_2 follow the same distribution. Now we introduce the place domain discriminator, denoted by D_pla, which is designed to constrain (ŝ_1, ŝ_1′) and (ŝ_1, ŝ_2) to follow the same distribution. Similar to D_app, the discriminator D_pla and the place encoder E_S are optimized in an adversarial way. The loss function for D_pla can be written as

L_pla,1_D = E[(D_pla(ŝ_1, ŝ_1′) − 1)²] + E[(D_pla(ŝ_1, ŝ_2))²]    (11)

where ŝ_1 and ŝ_1′ are derived from E_S(x_1) and E_S(x_1′), with x_1 and x_1′ being two samples from the same domain X_1.
The discriminator D_pla can be interpreted from another perspective: it outputs 1 if the given two place features are from the same domain and 0 if they are from different domains. Again, the loss function for the encoder E_S can be expressed as

L_pla,1_Adv = E[(D_pla(E_S(x_1), E_S(x_1′)))²] + E[(D_pla(E_S(x_1), E_S(x_2)) − 1)²]    (12)

Similarly, we can derive L_pla,2_D and L_pla,2_Adv.
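The pair construction feeding D_pla can be sketched as follows. The place features are toy vectors, and the comments state the intended targets; this illustrates only the pairing scheme, not a trained discriminator.

```python
import numpy as np

def make_pla_pairs(s1, s1_prime, s2):
    """Build the two inputs of D_pla: a same-domain pair (s1, s1') and
    a cross-domain pair (s1, s2). D_pla is trained to score the first 1
    and the second 0; E_S is trained to make them indistinguishable."""
    same = np.concatenate([s1, s1_prime])   # both from domain X_1
    cross = np.concatenate([s1, s2])        # from domains X_1 and X_2
    return same, cross

s1 = np.ones(4)            # mock place feature of x_1
s1_prime = np.ones(4) * 0.9  # mock place feature of x_1' (same domain)
s2 = np.zeros(4)           # mock place feature of x_2 (other domain)
same, cross = make_pla_pairs(s1, s1_prime, s2)
print(same.shape, cross.shape)  # (8,) (8,)
```

Because D_pla only ever sees *pairs*, it compares distributions of pairs rather than the domains themselves, which is what makes it shareable across any number of domains.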

Training strategies
The discriminators (D_pla and D_app) and the encoders (E_S and E_A) are updated in turn by back-propagation; the only difference is that the encoders are updated together with the decoder G. Each training iteration consists of three steps: (i) updating D_pla, (ii) updating D_app, and (iii) updating E_S, E_A, and G. Formally, each iteration solves

min over D_pla: L_pla,1_D + L_pla,2_D    (13)
min over D_app: L_app,1_D + L_app,2_D    (14)
min over E_S, E_A, G: L_rec^1 + L_rec^2 + λ_1 (L_pla,1_Adv + L_pla,2_Adv) + λ_2 (L_app,1_Adv + L_app,2_Adv)    (15)

where λ_1 and λ_2 are hyperparameters balancing reconstruction and disentanglement, determined by grid search for each task. When training with only two domains, images x_1 and x_1′ are sampled from the first domain, while x_2 and x_2′ are from the second one. The network is trained with the Adam optimizer, 60 with β_1 = 0.9 and β_2 = 0.999. Learning rates for the autoencoder (including E_S, E_A, and G), D_pla, and D_app are set to 0.000003, 0.00001, and 0.00001, respectively. During training, the datasets are augmented to increase robustness against viewpoint changes. For each image, we randomly select four points, and the enclosed area is cropped and warped to the original size. 11 Besides, the warped image is randomly flipped horizontally.
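The viewpoint augmentation can be sketched roughly as below. This simplified version uses an axis-aligned random crop with a nearest-neighbour resize instead of the four-point warp described above, plus the random horizontal flip; it is a stand-in, not the exact augmentation of the article.

```python
import numpy as np

def augment(img, rng):
    """Simplified viewpoint augmentation: random crop, nearest-neighbour
    resize back to the original size, and a random horizontal flip."""
    h, w = img.shape
    top, left = rng.integers(0, h // 4), rng.integers(0, w // 4)
    bottom = h - rng.integers(0, h // 4)
    right = w - rng.integers(0, w // 4)
    crop = img[top:bottom, left:right]
    # Nearest-neighbour resize back to (h, w)
    rows = np.arange(h) * crop.shape[0] // h
    cols = np.arange(w) * crop.shape[1] // w
    warped = crop[np.ix_(rows, cols)]
    # Horizontal flip with probability 0.5
    return warped[:, ::-1] if rng.random() < 0.5 else warped

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))
print(augment(img, rng).shape)  # (64, 64)
```

Applying this on the fly to every training image perturbs the apparent viewpoint while leaving the domain label unchanged.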

Extension: Multiple domain case
Based on this domain-unrelated architecture, we present how to extend our network to multiple domains. Assume that there are n domains, denoted by X_1, X_2, …, X_n. In each training iteration, we firstly select two domains X_i and X_j randomly, where i, j = 1, 2, …, n and i ≠ j. Then two batches of images are randomly sampled from X_i and X_j, respectively. These images are fed into the network as the training data for this iteration. During testing, when matching between any two of those domains, images are fed into E_S to obtain place features, followed by feature matching. This extension does not require additional parameters.
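The per-iteration domain sampling can be sketched as:

```python
import random

def sample_domain_pair(n_domains, rng=random):
    """Pick two distinct domains X_i, X_j for one training iteration."""
    i, j = rng.sample(range(n_domains), 2)   # sampled without replacement
    return i, j

random.seed(0)
pairs = [sample_domain_pair(4) for _ in range(5)]
assert all(i != j for i, j in pairs)
```

Over many iterations every unordered pair of domains is visited, so a single shared model sees all cross-domain combinations.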
To see this, one should note that the autoencoder is shared across different domains. Besides, the discriminators are also shared and domain-unrelated, as they only use information between two domains. For example, if the inputs of D_pla are (ŝ_i, ŝ_i′) and (ŝ_i, ŝ_j), D_pla can be regarded as a metric measuring the distance between (ŝ_i, ŝ_i′) and (ŝ_i, ŝ_j), rather than between ŝ_i and ŝ_j. This is similar to the triplet loss, which measures the metric distance between inner and outer classes. 61 The parameters are shareable among different domains, and no additional parameters are needed when extending to more domains.
The extension means that only one model is needed in a specific scene for long-term deployment. In the beginning, we have a baseline model trained on several domains. When new data with different appearances in the same area become available, the model can be fine-tuned by retraining on the dataset augmented with the newly collected data. The retraining does not require additional parameters. In contrast, style-transfer-based methods need new models to transfer new data into known styles. When new data come periodically, this leads to a quadratically increasing number of parameters. To see this, assume that there are m domains. To transfer each domain to the others, C(m, 2) = m(m − 1)/2 models must be trained. Thus, compared to style-transfer-based methods, our method is more suitable for plugging into any long-term localization framework as the feature extractor. 62,63
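The quadratic growth in model count can be verified directly:

```python
def n_style_transfer_models(m):
    """Number of pairwise models a style-transfer approach needs for
    m domains: C(m, 2) = m * (m - 1) / 2. The proposed shared network
    always needs exactly one model, regardless of m."""
    return m * (m - 1) // 2

print([n_style_transfer_models(m) for m in (2, 4, 8)])  # [1, 6, 28]
```

Already at eight domains the pairwise approach needs 28 models against a single shared one.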

Experiments
We conduct several experiments to illustrate our method. Firstly, we validate the proposed network on two toy cases: a linear Gaussian model for quantitative analysis and colored MNIST for qualitative analysis. Then, we apply our network to visual place recognition and demonstrate its performance from several perspectives, including basic performance, an ablation study, and multiple-domain performance.

Toy case validation
Linear Gaussian. We test the network on a linear Gaussian generative process with two domains to validate whether it can produce disentangled representations as desired. The reason for choosing a linear model is that, in the linear case, correlation can be used as a quantitative metric for disentanglement. The generative process can be written as

x = W_s s + W_a a_i    (16)

where s and a_i are vectors of dimensions n_s and n_a, sampled from two normal distributions with means μ_s and μ_{a,i} and covariance matrices σ²_s and σ²_{a,i}, respectively. Elements of s are independent of each other, thus σ²_s is a diagonal matrix, as is σ²_{a,i}. The generated data point x is an n_x-dimensional vector. In our setting, n_x > n_s + n_a, thus the dimensions of the generated data are not totally independent of each other. This often happens in the natural world; for instance, pixels in an image are not independent of each other. Specifically, n_s = 5, n_a = 5, and n_x = 50.
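Under the assumptions above, the toy data and the correlation metric can be reproduced in a few lines. The linear map and all distribution parameters below are illustrative stand-ins, and the ground-truth latents play the role of ideal encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, n_x = 5, 5, 50

# Hypothetical linear generative map (the article's exact weights are arbitrary)
W = rng.normal(size=(n_s + n_a, n_x))
s = rng.normal(loc=0.0, scale=1.0, size=(1000, n_s))   # domain-unrelated latents
a = rng.normal(loc=2.0, scale=0.5, size=(1000, n_a))   # domain-related latents (one domain)
x = np.concatenate([s, a], axis=1) @ W                 # n_x-dimensional observations

# Disentanglement metric: cross-correlation between recovered features.
# With ideal encoders (here: the ground-truth latents), it is near zero
# because s and a were sampled independently.
corr = np.corrcoef(s.T, a.T)[:n_s, n_s:]
print(corr.shape)  # (5, 5)
```

In the actual experiment the same matrix is computed between the learned ŝ and â_i; entries far from zero (as in the autoencoder-only baseline) indicate entanglement.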
The encoders E_S and E_A are single-layer linear modules, while the discriminators are multiple-layer nonlinear modules with the leaky rectified linear unit (LReLU) as the activation function. 64 To demonstrate the power of the two proposed adversarial losses, we compute the correlation matrix between the place feature ŝ and the appearance feature â_i for our method (Figure 3(c)). In comparison, we train a model with only the autoencoder (Figure 3(b)). Both are trained from the same initial state (Figure 3(a)). Comparing Figure 3(b) and (c), we can see that the introduced loss functions produce less correlated ŝ and â_i. Besides, the correlation between the place feature and the appearance feature is nearly zero (Figure 3(c)), which shows that the features are disentangled very well. To further verify that D_pla is necessary, we train the network with only D_app (Figure 3(d)). From Figure 3(c) and (d), we can see that without D_pla, the correlation is slightly larger. Thus, D_pla improves the disentanglement to some extent. Figure 3(a) to (d) shows the case where n_a = n_â = 5 and n_s = n_ŝ = 5, where n_â and n_ŝ are the dimensions of the appearance feature and the place feature. As discussed above, in practice it is better to set larger n_â and n_ŝ, because we need to ensure n_â > n_a and n_ŝ > n_s while n_a and n_s are unknown. To simulate the practical case, in Figure 3(e) to (h), we set n_â = 10 and n_ŝ = 10. We obtain results similar to the former case, except that the autoencoder features become more correlated when we increase the dimension of the feature space (Figure 3(b) and (f)). When the features are overparameterized, a pure autoencoder cannot guarantee that the place feature and the appearance feature are disentangled, and our method effectively solves this problem.
Colored MNIST. The linear Gaussian is a simple toy case. To show the power of our method in a more complicated scene, we propose colored MNIST, an enhanced version of the handwritten digits dataset MNIST. 65 In this dataset, the original grayscale digits are colored randomly in seven colors to simulate different domains. The colored MNIST is constructed to fulfill our hypothesis that the input data are composed of domain-related and domain-unrelated components. Based on this, we train our network and analyze the learned features qualitatively. The feature dimensions are n_â = 8 and n_ŝ = 4 × 4 × 64 = 1024. In training, λ_1 and λ_2 are set to 0.1.
To investigate what the network has learned from this toy case, we visualize the place feature and appearance feature using t-distributed stochastic neighbor embedding (t-SNE). 66 As the dimensions of our features (1024 for ŝ and 8 for â) are too high for visualization, we employ t-SNE, which reduces high-dimensional data to two-dimensional data for visualization. In this experiment, the "place" content is the digit, while the "appearance" content is the color. Each point in Figure 4 represents one sample; in Figure 4(a), the location of a point is determined by its place feature, and it is labeled by the digit of the sample. Figure 4(b) is similar except that locations correspond to appearance features. From Figure 4(a), we can see that the place features can be easily clustered by digit and are unrelated to color. This means the place feature corresponds to the digit and is not affected by the color. Conversely, from Figure 4(b), we find that the color information is embedded in the appearance features, and the same digit with different colors can have varied appearance features. Thus, the appearance feature corresponds to the color and is not affected by the digit. From these results, we can safely conclude that our network obtains disentangled representations.
We also analyze our network in image space. As done in some image-to-image translation literature, 67 we can replace the appearance features with those from another domain before decoding, so that the decoder produces an image with a different style. We implement this in two ways. First, given two colored digits as inputs, we combine the place feature from the first digit with the appearance feature from the second digit and reconstruct a new digit through the decoder; the newly recovered digit is called the translated image. Second, the appearance feature is replaced with a vector filled with zeros, and the decoded image is called the zero-appearance image. Results are presented in Figure 4(c). We sample images from every pair of domains. As there are seven domains in the dataset, there are 7 × 7 subplots in Figure 4(c), excluding the samples from the same domain (diagonal). Each subplot can be denoted by p_{i,j}, where i = 1, 2, …, 7; j = 1, 2, …, 7; i ≠ j. Each p_{i,j} can be further divided into four parts, namely p_{i,j}(m, n), where m = 1, 2 and n = 1, 2. For example, the two input images for p_{1,1} are a blue five (p_{1,1}(1, 1)) and a white seven (p_{1,1}(1, 2)). The translated image is a white five (p_{1,1}(2, 1)), while the zero-appearance image is a green five (p_{1,1}(2, 2)). We can see that the translated image keeps the shape of the first input image, while its color is determined by the second input image. It is also found that the colors of all the zero-appearance images are the same, while their digits remain those of the original first input images. Thus, we can say that the digit of the reconstructed image is controlled by the place feature, while the color is determined by the appearance feature.
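The two decoding modes above can be sketched as follows. The encoders E_S, E_A and decoder here are toy linear stand-ins, not the paper's trained networks; the sketch only shows how the translated and zero-appearance outputs are assembled from the two feature streams.

```python
import numpy as np

rng = np.random.default_rng(2)
d_img, d_s, d_a = 20, 4, 3  # toy dimensions, chosen for illustration

# Hypothetical linear encoders/decoder standing in for E_S, E_A, and G.
P_s = rng.normal(size=(d_s, d_img))
P_a = rng.normal(size=(d_a, d_img))
W_g = rng.normal(size=(d_img, d_s + d_a))

E_S = lambda x: P_s @ x                               # place feature
E_A = lambda x: P_a @ x                               # appearance feature
decode = lambda s, a: W_g @ np.concatenate([s, a])    # decoder G

x1 = rng.normal(size=d_img)  # "image" from the first domain
x2 = rng.normal(size=d_img)  # "image" from the second domain

# Translated image: content of x1, style of x2.
translated = decode(E_S(x1), E_A(x2))
# Zero-appearance image: content of x1, appearance feature zeroed out.
zero_app = decode(E_S(x1), np.zeros(d_a))
print(translated.shape, zero_app.shape)
```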

Datasets
To validate the proposed method on the visual place recognition task, we test our network on three public datasets: the Partitioned Nordland dataset, 32 the Alderley Day/Night dataset, 7 and the RobotCar-Seasons dataset. 68 Besides, we also experiment on a new dataset, the YQ Day/Night dataset, which is collected in a campus environment with changing appearance and will be made publicly available (https://tangli.site/projects/academic/yq21).

Partitioned Nordland dataset. It is collected from a train traveling the same route (729 km) in four seasons (spring, summer, fall, and winter), with GPS data. In this article, the four seasons are treated as four domains. The original Nordland dataset is used in many studies, 69,42 but they use different partitions for training and testing. This dataset proposes a reasonable partition of Nordland, where the whole route is divided into five segments, with two used as the training set and three as the testing set. The training and testing sets have 24,570 and 3450 images per domain, respectively. The GPS information is accurate enough to align images between domains. This dataset provides images without perspective changes and with high-quality ground truth, so it is useful for validating the ability to overcome appearance changes (Figure 5(a)).
Alderley Day/Night dataset. It is captured from a camera mounted on a car and consists of two domains, one daytime and one nighttime, over an 8 km journey. As mentioned in the original article, the GPS data are not reliable enough, so the images are manually aligned frame by frame. 7 However, as the images are collected in an urban environment, the vehicle moves with lateral and heading changes, which makes the alignment less reliable. Besides, the appearance differences between the two sessions are very large, making the dataset challenging for place recognition. From Figure 5(b), one can see that it is sometimes challenging even for humans. Each domain in the dataset is split into training and testing sets with 10,007 and 4600 images, respectively. 32

RobotCar-Seasons dataset. It is based on the Oxford RobotCar dataset, which is recorded from a vehicle with six cameras under different conditions in an urban environment, along with other sensors including INS, GPS, and LiDAR. 70 RobotCar-Seasons selects a subset of the RobotCar dataset, with one reference traversal in overcast condition (overcast-reference) as the database and nine traversals in different conditions as queries. 68 The query set is further split into all-day and all-night, where all-day refers to images collected in the daytime and all-night to images collected at night. In this article, only the all-night subset is used, as we want to validate our method across different domains. Thus, two domains, namely overcast-reference and all-night, are used for training and testing. The ground truth is obtained from large-scale structure from motion, initialized with INS data. It should be mentioned that the ground truth is used only in the testing phase, not in the training process. This dataset presents challenges typical of autonomous driving, including appearance changes, perspective differences, and motion blur (Figure 5(c)).

YQ Day/Night dataset.
It is a subset of the YQ dataset. 62 The original YQ dataset has 21 sessions, from which we choose two for the place recognition task in this article. These two sessions become the two domains of this dataset, namely day and night. The first session is collected in the morning, while the second is collected in the evening (Table 1). The evening traversal is collected during the transition from day to night. Although this traversal lasts only about 19 min, it contains varying appearance; thus, it can be used to validate the robustness of algorithms against dynamic appearance changes within a short period. Sample images are shown in Figure 5(d). The ground truth is obtained from LiDAR SLAM results. We split the trajectory into two parts, with the first 60% as the training set and the last 30% as the testing set. The training set is sampled every 0.1 m, and the testing set every 2 m. As the data are recorded on a mobile robot driven by a remote controller at low speed, it can be seen as representative of mobile logistics in a small area, with appearance and perspective changes.

Evaluation metrics and training details
We choose two widely used metrics for visual place recognition in the following experiments (except for RobotCar-Seasons, which will be discussed later): area under curve (AUC) and accuracy (true positive rate). After training is done, the place features are used as global descriptors for matching. The goal of visual place recognition is to find the nearest image to a given query image x_Q in a database {x_{DB,i}}, i = 1, …, N_DB, where N_DB is the number of database images. This article adopts brute-force matching without any temporal fusion to demonstrate the discriminative ability of our features. Specifically, we extract place features for database images using the trained place encoder, namely ŝ_{DB,i} = E_S(x_{DB,i}), resulting in the database feature set {ŝ_{DB,i}}. The place feature ŝ_Q for the query image x_Q can also be extracted using E_S. Finally, the nearest image x_m is determined by the cosine distances between ŝ_Q and ŝ_{DB,i}. Before computing the distances, all the features, including ŝ_Q and ŝ_{DB,i}, are flattened into vectors. In equation (17), we normalize the input vectors, which increases the robustness of the features against illumination changes by normalizing their magnitudes.
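The matching procedure described above can be sketched as brute-force cosine retrieval. The feature shape (4 × 4 × 64) follows the text; the random database is a stand-in for E_S outputs.

```python
import numpy as np

def cosine_match(s_q, s_db):
    """Brute-force nearest-neighbor retrieval: flatten the features,
    L2-normalize them, and pick the database entry with the highest
    cosine similarity to the query."""
    s_q = s_q.ravel()
    s_q = s_q / np.linalg.norm(s_q)
    db = s_db.reshape(len(s_db), -1)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = db @ s_q  # cosine similarities to every database feature
    return int(np.argmax(sims))

# Toy database of 100 place features shaped 4x4x64 (stand-ins for E_S outputs).
rng = np.random.default_rng(3)
db = rng.normal(size=(100, 4, 4, 64))
# A scaled copy of entry 42: normalization makes the match magnitude-invariant.
query = db[42] * 2.5
print(cosine_match(query, db))  # -> 42
```

Because both sides are normalized, a query that differs from a database image only in overall intensity still matches, which is the robustness to illumination changes mentioned in the text.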
Our network is trained with λ_1 = 0.003 and λ_2 = 0.01 in all visual place recognition tasks, except in the ablation study, where one or both parameters are set to zero. All models of our method are trained for 100 epochs, except for the model on the YQ Day/Night dataset, which is trained for 50 epochs, because the number of images in YQ is small and the model overfits quickly.
Comparison methods. To illustrate the performance of our method on visual place recognition, several methods are selected for comparison. Three methods based on handcrafted features (DBoW2, HOG, and DenseVLAD) are chosen as representatives of traditional methods. 24,27,55 For supervised methods, NetVLAD and the method by Facil et al. are used to show their advantages and limitations. 10,32 As the training code of NetVLAD is publicly available, we also retrain NetVLAD on these datasets to see the improvement obtained from supervision. All settings follow the original NetVLAD article, except that when training on YQ Day/Night, the "nNegChoice," "nTestSample," and "nTestRankSample" parameters are set to 20, 20, and 100, respectively, because the dataset is small. Besides, two style-transfer-based algorithms are also selected. 12,71 The RobotCar-Seasons dataset uses different evaluation metrics: as described by the benchmark, it measures performance as the percentage of query images localized within three error tolerance thresholds. 68

Ablation study
Our method introduces two new adversarial losses to the autoencoder. To verify that the new losses work, we conduct an ablation study in this section. We select the most challenging pair in Partitioned Nordland, namely winter and spring, as target domains. The experiment follows the same settings described in the last subsection, except for the hyperparameters λ_1 and λ_2. The baseline is the autoencoder, where λ_1 and λ_2 are set to zero. To see the necessity of D_pla and D_app, D_app is first added to the baseline (λ_1 = 0.003 and λ_2 = 0), followed by D_pla (λ_1 = 0.003 and λ_2 = 0.01). Finally, augmentation is added as a comparison.
Results are displayed in Table 2. We can see that the complete network (with D_pla, D_app, and augmentation) is much better than the baseline (pure autoencoder). By adding D_app, the network gains a slight improvement (0.06 in AUC and 9.9 percentage points in accuracy), while by further adding D_pla, larger improvements are obtained (0.21 in AUC and 11.5 percentage points in accuracy). Finally, adding augmentation brings a further performance improvement. Thus, we can safely conclude that the introduced losses are effective for visual place recognition.

Two domains
This section investigates the performance of different methods in the two-domain scenario. For the Partitioned Nordland dataset, only the winter and spring sessions are chosen as the two domains.
Experimental results are listed in Tables 3 and 4. Table 3 presents the performance of different methods on three of the mentioned datasets: Partitioned Nordland, Alderley Day/Night, and YQ Day/Night. To evaluate our method on the RobotCar-Seasons dataset, we submit our results to the Visual Localization Benchmark (https://www.visuallocalization.net/benchmark). In Table 4, we also consider several methods on that benchmark as comparisons (only published place recognition algorithms are selected). From Table 3, we can see that methods based on handcrafted features are not good at place recognition tasks with extreme appearance changes. Among learning-based methods, the supervised ones outperform the self-supervised ones on some datasets (Partitioned Nordland), as expected. However, our method is better than those supervised methods on other datasets (Alderley Day/Night and YQ Day/Night). There are two possible reasons. One is that supervised methods depend on the quality of alignment; as described before, the alignment of Alderley Day/Night is imperfect in places, which may degrade the performance of supervised methods. The other is that the training and testing sets of YQ Day/Night have subtle differences, as described above, which demands generalization ability from the algorithms. Generally speaking, self-supervised methods outperform supervised methods in terms of generalization.
Compared with other self-supervised methods, our method achieves comparable results, showing the best performance on Partitioned Nordland, Alderley Day/Night, and YQ Day/Night. Combined with Table 4, we find that our method and ToDayGAN perform better than other self-supervised methods on different datasets. One main difference between our method and ToDayGAN is that in our method adversarial learning is applied directly to features, whereas ToDayGAN applies it to images. The images transferred by ToDayGAN look realistic, but sometimes they still differ from the original style image (the reader can see transferred images in the appendix). The reason is that in style transfer the same input image can generate different outputs. In particular, Partitioned Nordland has slight appearance differences within the same domain, resulting in a mismatching problem. Conversely, the appearance within each domain of RobotCar-Seasons is quite stable, which is why ToDayGAN outperforms our method on RobotCar-Seasons under some criteria.
To see the improvement brought by the perceptual loss, we also trained our network with the L2 reconstruction loss (Our-L2 in Table 3). From Table 3, we can see that the perceptual loss brings an obvious improvement over the L2 loss. This finding illustrates that the learned features can be improved by reconstructing high-level features instead of the original images.

(Figure 6 caption: (a) input images from the first domain (X_1); (b) input images from the second domain (X_2); (c) translated images from X_1 to X_2; (d) zero-appearance images of X_1, generated from the place feature of (a) and an all-zero appearance feature. Columns 1 to 5: X_1 is winter and X_2 is spring; columns 6 to 10: X_1 is spring and X_2 is winter.)
We also explore what the network learns by generating images from cross-domain features. Specifically, we feed place features from one domain (Figure 6(a)) together with appearance features from the other domain (Figure 6(b)) into our network to see what the decoder G generates (Figure 6(c)). In Figure 6(c), one can find that the place information (e.g. tracks, buildings, and traffic lights) is determined by the images in the first row (Figure 6(a)), while the appearance information (e.g. color of the ground, illumination) is determined by the second row (Figure 6(b)). This result matches our motivation: the place information is embedded in the place feature, while the appearance feature controls the appearance of the reconstructed images. To further illustrate what is learned in the place feature, we replace all the appearance features with all-zero vectors (Figure 6(d)). All the zero-appearance images appear to have a similar appearance, while their place information remains the same as in the first row (Figure 6(a)). These two findings demonstrate that the proposed method can disentangle the input image across appearance changes.

Multiple domains
One novelty of our network is that it is designed to be trainable with multiple domains without additional parameters. The benefit is that when new data from a different domain arrive, we can retrain our network without increasing model capacity. We use the Partitioned Nordland dataset to illustrate this point, as it has four domains. First, for every pair of domains, we train a network and evaluate it on the testing set; this process yields 12 models in total. Second, we train a unified model with all four domains as the training set and then evaluate it on every pair of domains; in this stage, only one model is obtained. Table 5 compares the two-domain models with the multiple-domain model. We can see that by fusing more domains, our network achieves performance comparable to the two-domain models. However, the two-domain setting needs 12 models, while the multiple-domain setting needs only one. This means our method can be extended to more domains while keeping the same model complexity, which is very useful when deploying deep learning networks in environments with changing appearance.
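The model-count argument can be made explicit with a small sketch: training one network per ordered domain pair scales quadratically with the number of domains, while the shared-encoder design always needs exactly one model.

```python
def pairwise_models(n_domains: int) -> int:
    """Number of models when one network is trained per ordered domain
    pair, matching the 12 models reported for the four Nordland seasons."""
    return n_domains * (n_domains - 1)

SHARED_MODELS = 1  # the unified multiple-domain network

for k in (2, 4, 7):
    print(f"{k} domains: {pairwise_models(k)} pairwise models vs {SHARED_MODELS} shared")
# 4 domains -> 12 pairwise models, as in the experiment
```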

Network architecture
Networks for the linear case, colored MNIST, and place recognition are listed in Tables 6 to 8, respectively. Conv-(Nn, Kk, Pp, Ss) denotes a convolution with n output channels, a kernel size of k × k, padding of p, and stride of s. Two types of pooling are used: max-pooling (MaxPool) with a kernel size of 2 × 2 and average-pooling (AvgPool) with a kernel size of 8 × 8. Linear-(n) represents a fully connected layer with n outputs. Decoders upsample feature maps using the nearest-neighbor algorithm (upsample).
Each layer can optionally be followed by a normalization function and an activation function. Normalization functions include instance normalization 72 and layer normalization. 73 Activation functions include the rectified linear unit (ReLU), LReLU (with a negative slope of 0.2), and Tanh. The concat operation fuses two feature maps. For inputs with the same spatial dimensions, such as H × W × C_1 and H × W × C_2, the output of concat is obtained by concatenating the two features along the last axis; thus, the output size is H × W × (C_1 + C_2). If the input feature maps have different spatial dimensions, such as H × W × C_1 and 1 × 1 × C_2, the smaller one is first repeated along the two spatial axes to match the larger one, resulting in an H × W × C_2 feature map; it is then concatenated with the larger one, producing an H × W × (C_1 + C_2) feature map as output.
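The concat operation described above can be sketched as follows, using H × W × C arrays; the tiling of the 1 × 1 appearance feature over the spatial grid follows the description in the text.

```python
import numpy as np

def concat(f1, f2):
    """Fuse two H x W x C feature maps along the channel axis.
    A 1 x 1 x C2 map is first tiled over the spatial grid of the
    larger map, as described in the text."""
    if f2.shape[:2] == (1, 1) and f1.shape[:2] != (1, 1):
        f2 = np.tile(f2, (f1.shape[0], f1.shape[1], 1))  # broadcast spatially
    return np.concatenate([f1, f2], axis=-1)

# Example: an 8x8x16 place feature map fused with a 1x1x4 appearance vector.
a = np.zeros((8, 8, 16))
b = np.ones((1, 1, 4))
out = concat(a, b)
print(out.shape)  # (8, 8, 20)
```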