Self-supervised learning for using overhead imagery as maps in outdoor range sensor localization

Traditional approaches to outdoor vehicle localization assume a reliable, prior map is available, typically built using the same sensor suite as the on-board sensors used during localization. This work makes a different assumption. It assumes that an overhead image of the workspace is available and uses it as a map for range-based sensor localization by a vehicle. Here, range-based sensors are radars and lidars. Our motivation is simple: off-the-shelf, publicly available overhead imagery such as Google satellite images can be a ubiquitous, cheap, and powerful tool for vehicle localization when a usable prior sensor map is unavailable, inconvenient, or expensive. The challenge to be addressed is that overhead images are not directly comparable to data from ground range sensors because of their starkly different modalities. We present a learned metric localization method that not only handles the modality difference, but is also cheap to train, learning in a self-supervised fashion without requiring metrically accurate ground truth. By evaluating across multiple real-world datasets, we demonstrate the robustness and versatility of our method for various sensor configurations in cross-modality localization, achieving localization errors on par with a prior supervised approach while requiring no pixel-wise aligned ground truth for supervision at training. We pay particular attention to the use of millimeter-wave radar, which, owing to its complex interaction with the scene and its immunity to weather and lighting conditions, makes for a compelling and valuable use case.


Introduction
The ability to localize relative to an operating environment is central to robot autonomy. Localization using range sensors, such as lidars (Levinson and Thrun, 2010; Wolcott and Eustice, 2015) and, more recently, scanning millimeter-wave radars (Park et al., 2019; Saftescu et al., 2020), is an established proposition. Both are immune to changing lighting conditions and directly measure scale, while the latter adds resilience to weather conditions. Current approaches to robot localization typically rely on a prior map built using the same sensor configuration that is carried on-board, for example a laser map for laser-based localization. This article looks at an alternative method. Public overhead imagery such as satellite images can be a reliable map source, as it is readily available and often captures information also observable, albeit perhaps in some complex or incomplete way, by sensors on the ground. We can pose the localization problem in a natural way: finding the pixel location of a sensor in an overhead (satellite) image given range data taken from the ground. The task is, however, non-trivial because of the drastic modality difference between satellite images and sparse, ground-based radar or lidar.
Recent work on learning to localize a ground scanning radar against satellite images by Tang et al. (2020b) provides a promising direction which addresses the modality difference by first generating a synthetic radar image from a satellite image. The synthetic image can then be "compared" against live radar data, expressed as 2D images from a "bird's eye" perspective, for pose estimation. Such an approach learns metric, cross-modality localization in an end-to-end fashion, and therefore does not require handcrafted features limited to a specific environment.
The method in Tang et al. (2020b) trains a multi-stage network, and needs pixel-wise aligned radar and satellite image pairs for supervision at all stages. This, in turn, relies on sub-meter and sub-degree accurate ground-truth position and heading signals, which in practice requires a high-end GPS/inertial navigation system (INS) and possibly bundle adjustment along with other on-board sensor solutions, adding considerable cost and time.
To address this issue, building on the work of Tang et al. (2020b), we propose a method for localizing against satellite imagery that is learned in a self-supervised fashion. The core idea is still to generate a synthetic image with the appearance and observed scenes of a live range sensor image, but pixel-wise aligned with the satellite image. However, we relax the requirement on pixel-wise aligned data pairs and assume only that a coarse initial pose estimate is available from a place recognition system, such that there is reasonable overlap between the live ground sensor field of view and a queried satellite image. Our method does not solve the global localization problem. Instead, given a coarse initial pose estimate from place recognition, our method solves the metric localization of a range sensor using overhead imagery, providing a refined SE(2) metric pose parametrized as $[x, y, \theta]^T$. Vitally, here we make no use of metrically accurate ground truth for training. Note also that although designed for localizing against satellite imagery, our method can naturally handle other forms of cross-modality registration, such as localizing a radar against a prior lidar map. Figure 1 shows synthetic images generated by our method used for pose estimation.
To the best of the authors' knowledge, our proposed method is the first to learn the cross-modality, metric localization of a range sensor in a self-supervised fashion. Our method is validated experimentally on multiple datasets and achieves performance on par with a state-of-the-art, supervised approach. Even though our method does not solve the global localization problem and instead relies on an external place recognition system, it can nevertheless be used to greatly refine the sensor's metric pose starting from a coarse initial pose estimate, all in the absence of any prior sensor maps.
This article is an extended version of our prior work (Tang et al., 2020a). The improvements include a more detailed explanation of the motivation for our method of localizing range sensors using satellite imagery (Section 3) and a more thorough description of our method (Section 4). For experimental validation (Section 6), we present additional qualitative results, an ablation study with reduced training data, a study on the trade-offs between network width and depth and solution quality, and an analysis of the choice of image resolution. We also introduce an introspective strategy at inference time to handle initial pose offsets larger than in the training data. Finally, we show that, with unsupervised domain adaptation, models trained using radar data can be used for localization between lidar and overhead imagery, and vice versa.

Related work
Our approach is related not only to other works in the field of localization using overhead imagery and the general theme of cross-modality localization, but also to learned methods for range sensor state estimation and unsupervised image generation. We provide a broad coverage of the most relevant research in these subjects in this section.

Localization using overhead images
Localization using aerial or overhead images has been of interest to the community for over a decade. The methods in Leung et al. (2008), Li et al. (2014), and Parsley and Julier (2010) localize a ground camera using aerial images by detecting Canny edges in the aerial imagery and matching them against lines detected by the ground camera. Several other vision-based approaches project the ground camera images to a top-down perspective via a homography, and compare against the aerial imagery by detecting lane markings (Pink, 2008), Speeded Up Robust Features (SURF) (Noda et al., 2010), or dense matching (Senlet and Elgammal, 2011). Recent work by Chebrolu et al. localizes a ground robot in a crop field by matching camera features against landmarks from an aerial map, and explicitly incorporates semantics of crops to reduce ambiguity.
Fig. 1. Given a map image of modality A (left) and a live data image of modality B (middle), we wish to find the unknown SE(2) offset between them. To do so, our method generates a synthetic image of modality B (right) that is pixel-wise aligned with the map image, but contains the same appearance and observed scenes as the live data image. Top: localizing radar data against satellite imagery. Middle: localizing lidar data against satellite imagery. Bottom: localizing radar data against prior lidar map.
Metric localization of range sensors or point-clouds against overhead imagery requires further pre-processing owing to the modality difference. Kaminsky et al. (2009) projected point-clouds into images and matched against binary edge images from overhead imagery. The method of Kaminsky et al. (2009) also constructs a ray image by raytracing each point, and introduces a free-space cost to aid the image registration. The work by de Paula Veronese et al. (2015) accumulates several lidar scans to produce dense lidar intensity images, which are then matched against satellite images utilizing normalized mutual information. Similar to Kaminsky et al. (2009), several other methods also pre-process the aerial image before matching against ground laser observations, for example using edge detection (Kümmerle et al., 2011) or semantic segmentation (Dogruer et al., 2010). In contrast to these approaches, our method directly learns the metric localization of a range sensor end-to-end, without the need for careful preprocessing or manual feature definition.
Closely related to our method is the seminal work on learning to localize a ground radar against satellite imagery by Tang et al. (2020b). As discussed previously, the method in Tang et al. (2020b) requires pixel-wise aligned ground truth for supervision, whereas our method is self-supervised.

Cross-modality localization
Other forms of cross-modality localization have also been heavily studied by the community. Several works propose to localize a forward-facing camera against a prior 3D point-cloud map (Caselitz et al., 2016; Wolcott and Eustice, 2014; Xu et al., 2017). Carle and Barfoot (2010) localized a ground laser scanner against an orbital elevation map. The works in Wang et al. (2019), Boniardi et al. (2017), Wang et al. (2017), and Mielle et al. (2019) localize an indoor lidar or stereo camera against architectural floor plans. Recently, Yin et al. (2021) proposed a joint learning system for radar place recognition using a prior lidar database, and achieved state-of-the-art results on the Oxford Radar RobotCar Dataset and MulRan Dataset (Kim et al., 2020).
OpenStreetMap is a particularly useful publicly available resource for robot localization. Brubaker et al. (2013) and Floros et al. (2013) concurrently proposed matching visual odometry paths to road layouts from OpenStreetMap for localization. Ruchti et al. (2015) proposed a road classification scheme to localize a ground lidar using OpenStreetMap. Yan et al. (2019) utilized networks pretrained for point-cloud semantic segmentation, and built a light-weight descriptor to recognize intersections and gaps, and compared against the descriptors of the operating environment built using OpenStreetMap.

Learning-based state estimation for range sensors
A number of recent works have been proposed for learning the odometry or localization of lidars. Barsan et al. (2018) represented lidar data as intensity images, and learned a deep embedding specifically for metric localization that can be used for direct comparison of live lidar data against a previously built lidar map. Other methods, such as those of Cho et al. (2019) and Li et al. (2019), instead learn deep lidar odometry by projecting lidar point-clouds into other representations before passing them through the network. Lu et al. (2019b) learned descriptors from input point-clouds, and utilized 3D CNNs for solving SE(2) metric localization by searching in a 3D cost volume. In their later work, Lu et al. (2019a) proposed a method to learn SE(3) lidar point-cloud registration end-to-end. Recently, OverlapNet (Chen et al., 2020) was proposed to learn lidar loop-closure detection based on the overlap between bird's eye view lidar images.
Learning-based methods have also been proposed for scanning frequency-modulated continuous-wave (FMCW) radar, an emerging sensor for outdoor state estimation. Aldera et al. (2019) utilized an encoder-decoder on a polar image representation of radar scans to reject superfluous points, decreasing computation time in the classical radar odometry method described by Cen and Newman (2019). Barnes et al. (2019) learned image-based radar odometry by masking out regions distracting for pose estimation. Other work has learned point-based radar odometry by detecting key points from radar images. Saftescu et al. (2020) encoded images of polar radar scans through a rotation-invariant architecture to perform topological localization (place recognition), which can then be used for querying a previously built map. These methods, however, are designed to compare data of the same sensor type, and do not address modality difference. Our approach is similar to Barsan et al. (2018), Cho et al. (2019), Aldera et al. (2019), Barnes et al. (2019), Weston et al. (2019), Saftescu et al. (2020), and Broome et al. (2020) in that we also represent range sensor data as 2D images.

Unsupervised image generation
A fundamental step in our approach is the generation of a synthetic image before pose computation, where there is no pixel-wise aligned target image for supervision. CycleGAN (Zhu et al., 2017) achieves unsupervised image-to-image transfer between two domains X and Y by learning two pairs of generators and discriminators, and enforcing cycle-consistency when an image is mapped from X to Y and back from Y to X, and vice versa. Other methods (Lee et al., 2018; Liu et al., 2017) also utilize cycle-consistency but make different assumptions on how the latent spaces of the two domains are treated. Whereas these methods are concerned with generating photo-realistic images, we are interested in the problem of metric localization. As such, we need to explicitly encourage the synthetic image to contain information appropriate for pose estimation, rather than aiming for photo-realistic reconstruction.
Several prior works are also geometry-aware. The methods by Shu et al. (2018), Wu et al. (2019), and Xing et al. (2019) use separate encoders and/or decoders to disentangle geometry and appearance. The results are networks that can separately interpolate the geometry and appearance of the output images. Similarly, our method separately encodes information about the appearance and the relative pose offset, resulting in an architecture where the two are explicitly disentangled.

Overview and motivation
We seek to solve for the SE(2) pose between a map image of modality A and a live data image of modality B. Our main focus is the case where modality A is satellite imagery and modality B is range sensor data represented as an image.
Previously, Radar-Satellite Localization Network (RSL-Net) (Tang et al., 2020b) was proposed to solve for the metric localization between matched pairs of radar and satellite images. In particular, it aims to generate a synthetic image that preserves the appearance and observed scenes of the live radar image, and is pixel-wise aligned with the paired satellite image. The synthetic image and the live radar image are then projected onto deep embeddings, where their pose offset is found by maximizing a correlation surface. We follow the same general approach, but, unlike RSL-Net, our method learns in a self-supervised fashion.

Hand-crafting features versus learning
Some of the works listed in Sections 2.1 and 2.2 can achieve decent accuracy on localizing a ground range sensor against aerial imagery. However, they typically rely on pre-processing the aerial images using hand-crafted features or transformations designed for a specific set-up and may not generalize to other sensors or different environments. For example, Kümmerle et al. (2011) focus on detecting edges from a campus dominated by buildings. de Paula Veronese et al. (2015) directly match accumulated lidar intensity images against aerial imagery without preprocessing, yet this is inappropriate for radars due to the complexity of their return signals.
Our data-driven approach instead learns to directly infer the geometric relationship across modalities, remaining free of hand-crafted features. We show in Section 6 the robustness of our method when localizing against satellite imagery in various types of scenes, including urban, residential, campus, and highway (Figure 2).

Generating images versus direct regression
A naive approach would be to take a satellite image and a live data image as inputs, and directly regress the pose. Yet, as shown in Tang et al. (2020b), this led to poor results even for the supervised case. Our hypothesis is that when the two images are starkly different in appearance and observed scenes, the problem becomes too complex for direct regression to succeed given current techniques.
Generating synthetic images prior to pose estimation brings two advantages over directly regressing the pose. First, it is a simpler and less ill-posed problem than directly regressing the pose, particularly because we can utilize the live data image to condition the generation. Second, the image generation loss is distributed over an entire image of $H \times W$ pixels, where $H$ and $W$ are height and width, instead of on just the three pose parameters that describe an SE(2) displacement ($x$, $y$, and $\theta$), introducing greater constraints during optimization.

Conditional image generation
We tackle the synthesis of an image of the live data modality B from one of the map modality A as a conditional image generation task, that is, taking both a map (e.g., satellite) image and a live data image as inputs. An alternative approach is to learn a domain adaptation directly from A to B, without conditioning on the live data image, for example, standard image-to-image transfer such as CycleGAN (Zhu et al., 2017).
In practice, the map (e.g., satellite) image is a denser representation of the environment than a frame of data captured by a range sensor. Only a fraction of the scene captured in a satellite map is present in a ground sensor field of view, so the scan appears drastically different depending on the sensor pose. In other words, the mapping from a satellite image to a range sensor image is not one-to-one, but one-to-many, as illustrated in Figure 3. Figure 4 demonstrates this concept on real data: the overlapping regions of the two satellite images are identical, whereas the two radar images observe different portions of the scene and as such appear drastically different.
With a naive image-to-image transfer approach, there is no guarantee that the generated image will contain regions of the scene that are useful for pose comparison against the live data image. Figure 5 shows examples of images generated using CycleGAN (Zhu et al., 2017), where the synthetic image highlights different scenes from those observed by the live data image. The issue with observability or occlusion can potentially be handled by ray-tracing, such as in Kaminsky et al. (2009). However, not only is this computationally expensive, it also does not apply to FMCW radars, which have multiple range returns per azimuth (see Cen and Newman (2019) for more details on the sensing characteristics of FMCW radars).
Our approach inherently addresses this problem: by conditioning the image generation on the live data image, we can encourage the synthetic image to capture regions of the scene also observed by the live data image, as shown in Sections 4 and 6. This concept is analogous to learning the mapping on the right of Figure 3, where, by using a pair of satellite and range sensor images as input, the regions of the scene to be present in the output synthetic image are no longer ambiguous, but constrained by the input range sensor image.

Self-supervised cross-modality localization
Our localization pipeline is composed of three steps: rotation inference, image generation, and pose estimation. We discuss them in detail in this section.

Rotation inference
Fig. 3. A one-to-many mapping (left) versus a one-to-one mapping (right). Left: the mapping from modality A to modality B preserves color, but is ambiguous in the orientation of the output, resulting in a one-to-many mapping, and is therefore not a function. Right: augmenting the input with an element of B offers an additional constraint on orientation, resulting in a one-to-one mapping, as the mapping is now unambiguous in both color and orientation. Note that the mapping on the right is one-to-one, but not necessarily surjective.
Fig. 5. There is no explicit constraint on which regions of the input satellite image will appear in the output synthetic image. As a result, this leads to large localization error as the synthetic image does not contain scenes observed by the live radar image.
Given a paired map (e.g., satellite) image $A \in \mathcal{A}$ and live data (e.g., radar or lidar) image $B \in \mathcal{B}$, pre-scaled to have the same resolution but with an unknown SE(2) offset, we seek to generate a synthetic image that contains the same appearance and observed scenes as $B$, but is pixel-wise aligned with $A$. Let the SE(2) pose difference between $A$ and $B$ be parametrized as $[x, y, \theta]^T$, such that by rotating $B$ by $\theta$ and then translating by $[x, y]^T$, one can pixel-wise align $B$ onto $A$. The image generation can be formulated as
$$\tilde{B}_{\theta,a} = f(A, B), \tag{1}$$
where $a = [x, y]^T$. Here $\tilde{B}_{\theta,a}$ is a generated image of modality $\mathcal{B}$ that synthesizes the input live sensor image $B$ after a rotation of $\theta$, followed by a translation of $a = [x, y]^T$. Thus, $\tilde{B}_{\theta,a}$ is pixel-wise aligned with the input map image $A$, but contains the same observed scenes as $B$. However, as originally noted by Tang et al. (2020b), the mapping in (1) is difficult to learn as the inputs $A$ and $B$ are offset by both a translation and a rotation. CNNs are inherently equivariant to translation, but not to rotation (Lenc and Vedaldi, 2015). As a result, the CNNs in the network cannot automatically utilize the mutual information between $A$ and $B$ and thereby capture their geometric relationship.
The method in Tang et al. (2020b) proposes to infer the rotation prior to image generation, namely, reducing (1) to two steps:
$$B_\theta = f_R(A, B), \tag{2}$$
$$\tilde{B}_{\theta,a} = f_G(A, B_\theta). \tag{3}$$
Here $f_R$ is a function that infers the rotation offset $\theta$ between $A$ and $B$, and outputs $B_\theta$, which is the input image $B$ rotated by $\theta$. Now, $B_\theta$ is rotation-aligned with the map frame, and therefore offset from $A$ only by a translation, which CNNs can naturally handle. $f_G$ is an image generation function that produces the synthetic image $\tilde{B}_{\theta,a}$. The experiments in Tang et al. (2020b) show that learning (2) and (3) sequentially resulted in better performance than learning (1) directly, as the former is congruous with the equivariance properties of CNNs.
In Tang et al. (2020b), the rotation inference function $f_R$ is parametrized by a deep network as shown in Figure 6, where satellite imagery and radar images are used as an example. Given a coarse initial heading estimate, the live data image $B$ is rotated a number of times with small increments to form a stack of rotated images $\{B\} = \{B_{\theta_0}, B_{\theta_1}, \ldots, B_{\theta_n}\}$, where the number of rotations $n$ and the increment are design parameters. Each rotated image is further concatenated with the map image to form a stacked tensor input of pairs of map and live data images. The network assigns a latent score per input pair, and outputs a softmaxed image from $\{B\}$ where the softmax weights are a function of the learned latent scores:
$$f_R(A, \{B\}) = \sum_{i=0}^{n} \frac{e^{z_i}}{\sum_{j=0}^{n} e^{z_j}} B_{\theta_i}, \tag{4}$$
where each $z_i$ is the associated learned scalar latent score for the pair $\{A, B_{\theta_i}\}$. A loss function enforces the output to correspond to $B$ rotated to be rotation-aligned with $A$, namely $B_\theta$. The core idea is that the network $f_R$ will assign a large softmax weight to the image from $\{B\}$ whose heading most closely aligns with the map image $A$, and small weights to all other images in $\{B\}$.
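The softmax-weighted output described above can be illustrated with a small, dependency-free sketch. It is not the authors' network: the latent scores are made-up numbers standing in for what the learned network would produce, and the helper names `softmax` and `softmax_image` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar latent scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def softmax_image(stack, scores):
    """Blend a stack of equally sized 2D images using softmax weights."""
    weights = softmax(scores)
    height, width = len(stack[0]), len(stack[0][0])
    blended = [[0.0] * width for _ in range(height)]
    for weight, image in zip(weights, stack):
        for r in range(height):
            for c in range(width):
                blended[r][c] += weight * image[r][c]
    return blended, weights

# Toy stack of three 2x2 "rotated" images (assumed data, for illustration).
stack = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0]],
]
scores = [0.0, 10.0, 0.0]  # hypothetical latent scores, one per image pair
blended, weights = softmax_image(stack, scores)
best = max(range(len(weights)), key=weights.__getitem__)
```

Because one score dominates, the blended image is almost exactly the second image in the stack, which is the behavior the loss is meant to encourage for the rotation-aligned candidate.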
If a metrically accurate heading ground truth $\theta$ is available, then one can rotate $B$ to form the ground-truth image target $B_\theta$ used for supervising the rotation inference, as in Figure 6. In this work, however, we assume this is never the case, thus the network for $f_R$ must learn to infer the rotation offset in a self-supervised fashion.
For this reason, while following the same architecture as Tang et al. (2020b), our method for inferring rotation uses a different training strategy that enables self-supervised learning. In order for the network $f_R$ to produce the correct output, it must be able to infer the rotation from the solution space $\{B\}$, despite the modality difference between map image $A$ and live data image $B$. We make the observation that if the network can infer the rotation offset from a stack of rotated live data images $\{B\}$, then, given a live data image $B_{\theta_i}$, $f_R$ should also be able to output $A_{\theta_i}$ from a stack of rotated map images $\{A\}$, where $A_{\theta_i}$ is rotation-aligned with $B_{\theta_i}$. Specifically, if we have $B_{\theta_i} = B_\theta$, then the softmaxed map image from $\{A\}$ should be $A$, as $A$ and $B_\theta$ are rotation-aligned.
As such, to learn rotation inference self-supervised, we need to pass through the network $f_R$ twice. The first pass is identical to the supervised approach in Figure 6, where we denote the output softmaxed image as $B_{\theta_i}$. Then $B_{\theta_i}$ is used as input to the second pass through network $f_R$, together with a stack of map images
$$\{A\} = \{A, A_{\phi_0}, A_{\phi_1}, \ldots, A_{\phi_m}\}. \tag{5}$$
The rotation angles $[\phi_0, \phi_1, \ldots, \phi_m]$ can be chosen randomly, and the order of $\{A\}$ is shuffled such that the original non-rotated map image $A$ can be at any index within $\{A\}$. Each image is concatenated with $B_{\theta_i}$ to form the input stack for passing through $f_R$ the second time. Note that the same network $f_R$ is used in both passes.
The network is supervised with an $L_1$ loss that enforces the output of the second pass to be the non-rotated map image $A$:
$$\mathcal{L} = \lVert f_R(B_{\theta_i}, \{A\}) - A \rVert_1, \tag{6}$$
where $B_{\theta_i}$ is the output of the first pass. Minimizing the loss in (6) in turn enforces $B_{\theta_i}$ to be $B_\theta$, as $B_\theta$ is rotation-aligned with $A$. Our approach is shown in Figure 7. We use an increment of 2° when forming the rotation stack $\{B\}$. We evaluate the effect of the rotation increment on solution error in Section 6.10 and justify our choice. The estimate for the rotation offset, $\hat{\theta}$, can then be found from the arg-softmax over the rotation stack $\{B\}$.
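The two-pass scheme can be sketched on a toy problem. The example below makes several simplifying assumptions that are not part of the method: both "modalities" are the same image, the rotation stack uses 90° steps rather than small increments, the learned network is replaced by a hypothetical stand-in `f_r` that scores candidates by negative L1 distance, and the rotation estimate is read off as the highest-weight candidate, which is only one possible reading of arg-softmax.

```python
import math
import random

def rot90(img, k):
    """Rotate a square 2D list clockwise by k * 90 degrees."""
    out = [list(row) for row in img]
    for _ in range(k % 4):
        out = [list(row) for row in zip(*out[::-1])]
    return out

def l1(a, b):
    """Mean absolute difference between two equally sized images."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def f_r(reference, stack, sharpness=50.0):
    """Stand-in for the learned network: score each candidate against the
    reference, then return the softmax-weighted image and the weights."""
    scores = [-sharpness * l1(reference, cand) for cand in stack]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    h, w = len(reference), len(reference[0])
    blended = [[sum(wt * cand[r][c] for wt, cand in zip(weights, stack))
                for c in range(w)] for r in range(h)]
    return blended, weights

A = [[(4 * r + c) / 15.0 for c in range(4)] for r in range(4)]  # "map" image
B = rot90(A, 3)        # "live" image with a heading offset unknown to f_r
steps = [0, 1, 2, 3]   # candidate rotations in units of 90 degrees

# Pass 1: softmax over the rotation stack of live images, given the map.
stack_b = [rot90(B, k) for k in steps]
b_aligned, weights_b = f_r(A, stack_b)
theta_hat_deg = 90 * steps[max(range(len(steps)), key=weights_b.__getitem__)]

# Pass 2: softmax over a shuffled stack of rotated maps, given pass-1 output.
stack_a = [rot90(A, k) for k in steps]
random.Random(0).shuffle(stack_a)
a_recovered, _ = f_r(b_aligned, stack_a)

# Self-supervised loss: the second pass should return the non-rotated map.
loss = l1(a_recovered, A)
```

No heading label is used anywhere in the loss; it is small only when the first pass has selected the candidate that is rotation-aligned with the map.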

Image generation
Given $A$ and $B_\theta$, we seek to generate a synthetic image $\tilde{B}_{\theta,a}$ as in (3), where $\tilde{B}_{\theta,a}$ is pixel-wise aligned with $A$. Tang et al. (2020b) learn the image generation function with a supervised approach, concatenating $A$ and $B_\theta$, and applying an encoder-decoder architecture, as shown in Figure 8. This is possible because a target for the synthetic image $\tilde{B}_{\theta,a}$ can be obtained by applying the ground-truth transform.
In the supervised approach in Tang et al. (2020b), a loss can be formed between the synthetic image and the target:
$$\mathcal{L} = \lVert \tilde{B}_{\theta,a} - B_{\theta,a} \rVert, \tag{8}$$
where $B_{\theta,a}$ is the target obtained with the ground-truth transform.
To generate synthetic images self-supervised, we propose an architecture we call PASED, shown in Figure 9. PASED is trained in two steps: the first is a pre-training, intra-modality process that can be supervised with ground truth image targets (top half of Figure 9), whereas the second handles cross-modality comparison (bottom half of Figure 9).
Fig. 7. Given $A$ and a rotation stack $\{B\}$, the network $f_R$ finds $B_{\theta_i}$ by taking a softmax. Then, given $B_{\theta_i}$ and a rotation stack $\{A\}$, the network outputs a softmaxed map image from $\{A\}$. A loss is applied to enforce the output of the second pass to be $A$, which in turn enforces the output of the first pass to be $B_\theta$. Here both symbols for $f_R$ in the figure refer to the same network with the same parameters, but at different forward passes.

Pre-training step.
Taking two random images $B_1$ and $B_2$ in the live data modality $\mathcal{B}$ from the training set, where $B_1$ and $B_2$ can be at arbitrary heading, we apply a known translation offset $g \in \mathbb{R}^2$ to $B_2$. This forms an image $B_2^g$ that is a shifted version of $B_2$. We pass $B_1$ through an appearance encoder $E_a$ that encodes its appearance and observed scenes. Then $B_2^g$ and $B_2$ are passed as inputs to a pose encoder $E_p$ that encodes the translation offset between the input images. The latent spaces from $E_a$ and $E_p$ are combined before passing through a decoder $D$, which outputs a synthetic image $\tilde{B}_1^g$ that is $B_1$ shifted by a translation $g$. In other words, PASED discovers the translation offset between the two images passed as input to $E_p$, and applies the latent translation encoding to the input image of $E_a$. The pre-training can be supervised as $g$ is known, thus we can shift $B_1$ by $g$ to produce the target $B_1^g$. The loss function can be formulated as
$$\mathcal{L} = \lVert \tilde{B}_1^g - B_1^g \rVert, \tag{9}$$
and the parameters for networks $E_a$, $E_p$, and $D$ can be optimized by minimizing (9). The fact that we use different images $B_1$ and $B_2$ for inputs to $E_a$ and $E_p$ ensures appearance and pose are disentangled from each other.
As shown later, this allows modules of PASED to be separated and re-combined with newly learned modules.
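How a supervised pre-training example might be assembled from two unrelated live-modality images can be sketched as follows. This is an illustrative assumption about the data preparation only: `shift` and `make_pretraining_sample` are hypothetical helpers, the shift wraps around the image border for simplicity, and the encoders and decoder themselves are not modeled.

```python
import random

def shift(img, dy, dx):
    """Circularly shift a 2D list by (dy, dx) pixels. Wrap-around is a
    simplification chosen to keep this toy example lossless."""
    h, w = len(img), len(img[0])
    return [[img[(r - dy) % h][(c - dx) % w] for c in range(w)]
            for r in range(h)]

def make_pretraining_sample(b1, b2, max_shift, rng):
    """Build one pre-training example from two live-modality images."""
    g = (rng.randint(-max_shift, max_shift),
         rng.randint(-max_shift, max_shift))
    return {
        "appearance_input": b1,              # would be fed to E_a
        "pose_inputs": (shift(b2, *g), b2),  # would be fed to E_p
        "target": shift(b1, *g),             # B_1 shifted by the known g
        "offset": g,
    }

rng = random.Random(0)
b1 = [[rng.random() for _ in range(8)] for _ in range(8)]
b2 = [[rng.random() for _ in range(8)] for _ in range(8)]
sample = make_pretraining_sample(b1, b2, max_shift=3, rng=rng)
```

The pose inputs and the appearance input come from different images, so the only information shared between the pose branch and the target is the offset itself, which is what encourages the disentanglement described above.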

Cross-modality step.
In the second step, we fix the weights of $E_a$, $E_p$, and $D$, which are optimized from the pre-training step. This narrows down the self-supervision problem to learning a cross-modality pose encoder $E_p^*$ that discovers the translation offset between an image of modality $\mathcal{A}$ and another of $\mathcal{B}$. Taking $A$ and $B_\theta$ as inputs, $E_p^*$ should encode the unknown translation offset $a$ between them. Concurrently, $B_\theta$ is fed to $E_a$ to encode its appearance, and the resulting latent space is combined with the latent space produced by $E_p^*(A, B_\theta)$, before being decoded by $D$. This encoder-decoder combination will generate a synthetic image $\tilde{B}_{\theta,a}$, which we do not have a target for:
$$\tilde{B}_{\theta,a} = D\big(E_a(B_\theta), E_p^*(A, B_\theta)\big). \tag{10}$$
We can apply a known shift to the center position of $A$ to query another map image $A'$, where $A'$ is offset from $B_\theta$ by an unknown translation $b$. Using the same encoder-decoder combination as before, we can take $A'$ and $B_\theta$ to generate a synthetic image $\tilde{B}_{\theta,b}$:
$$\tilde{B}_{\theta,b} = D\big(E_a(B_\theta), E_p^*(A', B_\theta)\big). \tag{11}$$
Furthermore, given $B_\theta$ and the networks learned from pre-training, we can easily generate $\tilde{B}_\theta$ by encoding a zero shift:
$$\tilde{B}_\theta = D\big(E_a(B_\theta), E_p(B_\theta, B_\theta)\big). \tag{12}$$
If we pass $\tilde{B}_\theta$ and $\tilde{B}_{\theta,a}$ to the pre-trained pose encoder $E_p$, then the latent space will encode a shift of $-a$. Combining this latent space with $E_a(\tilde{B}_{\theta,b})$, we can decode a synthetic image $\tilde{B}_{\theta,b-a}$:
$$\tilde{B}_{\theta,b-a} = D\big(E_a(\tilde{B}_{\theta,b}), E_p(\tilde{B}_\theta, \tilde{B}_{\theta,a})\big). \tag{13}$$
Here $b - a$ is a known value as it is the translation offset applied to $A$ to obtain $A'$. We can shift $B_\theta$ by $b - a$ to obtain $B_{\theta,b-a}$. Using $B_{\theta,b-a}$ and $B_\theta$, we can generate $\tilde{B}_{\theta,b-a}$ with pre-trained networks $E_a$, $E_p$, and $D$, shown on the bottom right of Figure 9. Specifically, this can be expressed as
$$\tilde{B}_{\theta,b-a} = D\big(E_a(B_\theta), E_p(B_{\theta,b-a}, B_\theta)\big). \tag{14}$$
We can form $\tilde{B}_{\theta,b-a}$ using two different combinations of inputs as in (13) and (14). Notably, $\tilde{B}_{\theta,b-a}$ formed using (14) only passes through networks with weights optimized from the pre-training step and fixed during the cross-modality step, and therefore can be used as a target.
Fig. 9. Top: During pre-training, we can learn an appearance encoder $E_a$, and a pose encoder $E_p$ that discovers the translation offset between an image of $\mathcal{B}$ and a shifted version of itself. Bottom: Taking $E_a$, $E_p$, and $D$ and fixing their weights, we seek to learn $E_p^*$, which discovers the translation offset between two images from different modalities. Here $E_a$, $E_p$, and $D$ can provide the necessary geometric and appearance relationships used for learning $E_p^*$ self-supervised.
A loss can then be established between the two synthetic images from (13) and (14), where the latter is a target image:
$$\mathcal{L} = \Big\lVert D\big(E_a(\tilde{B}_{\theta,b}), E_p(\tilde{B}_\theta, \tilde{B}_{\theta,a})\big) - D\big(E_a(B_\theta), E_p(B_{\theta,b-a}, B_\theta)\big) \Big\rVert. \tag{15}$$
By back-propagation, the loss in (15) optimizes the network $E_p^*$, as $\tilde{B}_{\theta,a}$ and $\tilde{B}_{\theta,b}$ are functions of $E_p^*$.
Alternatively, we can use $B_{\theta,b-a}$ as the target, but, in practice, using $\tilde{B}_{\theta,b-a}$ as in (14) led to faster convergence. For the loss in (15) to be minimized, two conditions must hold true. First, $\tilde{B}_{\theta,b}$ must have correctly encoded the appearance and observed scenes in $B_\theta$. Second, $\tilde{B}_{\theta,a}$ and $\tilde{B}_{\theta,b}$ must have the correct translations $a$ and $b$, respectively. By satisfying these two constraints we can ensure $E_p^*$ is able to discover the translation offset across modalities, and is compatible with the pre-trained networks $E_a$ and $D$ for image generation.
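Why the two routes in (13) and (14) should agree can be checked on a toy example in which translations are exact circular pixel shifts. The sketch below is illustrative only: `shift` is a hypothetical helper standing in for perfectly trained encoders and decoder, and the offsets are made-up values.

```python
def shift(img, dy, dx):
    """Circularly shift a 2D list by (dy, dx) pixels (toy stand-in for the
    learned modules; wrap-around is a simplification)."""
    h, w = len(img), len(img[0])
    return [[img[(r - dy) % h][(c - dx) % w] for c in range(w)]
            for r in range(h)]

# Toy rotation-aligned live image with distinct pixel values.
live = [[6 * r + c for c in range(6)] for r in range(6)]

a = (1, 2)    # offset to the original map image (unknown in practice)
d = (2, -1)   # known shift applied to the map centre, i.e. b - a
b = (a[0] + d[0], a[1] + d[1])

synthetic_a = shift(live, *a)   # ideal generator output for the map A
synthetic_b = shift(live, *b)   # ideal generator output for the map A'

# Route 1, as in (13): the offset from synthetic_a back to live is -a;
# apply it to synthetic_b.
route_1 = shift(synthetic_b, -a[0], -a[1])

# Route 2, as in (14): apply the known shift b - a directly to live.
route_2 = shift(live, *d)

consistency_loss = sum(abs(p - q) for r1, r2 in zip(route_1, route_2)
                       for p, q in zip(r1, r2))
```

With ideal modules the two routes coincide and the loss is zero. If the generator encoded the wrong translation for either map image, route 1 would land on a different shift and the loss would become positive, which is the signal used to train the cross-modality pose encoder.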

Pose estimation
Taking the pose-aligned synthetic image $\tilde{B}_{\theta,\alpha}$ and the rotation-aligned range sensor image $B_\theta$, we embed them into a joint space, where their translation offset is found by maximizing correlation on the learned embeddings. We denote the embedding networks for real and synthetic images by $H_B$ and $H_{\tilde{B}}$, respectively, and the learned embeddings by $B^\dagger_\theta$ and $\tilde{B}^\dagger_{\theta,\alpha}$. The correlation maximization is a parameter-free process that requires no additional learned modules, but is differentiable, allowing gradients induced by the downstream loss to propagate to upstream learned modules. In this step, we infer $\hat{\alpha} = [\hat{x}\ \hat{y}]^T$, which is our posterior estimate of the translation. Formally, this can be expressed as
$$\hat{\alpha} = [\hat{x}\ \hat{y}]^T = \arg\max_{(x,\,y)} \big(\tilde{B}^\dagger_{\theta,\alpha} \star B^\dagger_\theta\big)(x, y),$$
where $\tilde{B}^\dagger_{\theta,\alpha} \star B^\dagger_\theta$ is the discrete cross-correlation between $\tilde{B}^\dagger_{\theta,\alpha}$ and $B^\dagger_\theta$. This can be performed efficiently in the Fourier domain, as is done in prior works that use a similar approach (Barnes et al., 2019; Barsan et al., 2018; Tang et al., 2020b).
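The correlation maximization described above can be illustrated with a minimal sketch. It assumes small images stored as nested Python lists, uses a brute-force circular cross-correlation instead of the Fourier-domain computation, and operates on raw pixel values rather than learned embeddings. The names `shift_image` and `best_shift` are hypothetical and are not taken from the authors' implementation.

```python
def shift_image(img, dx, dy):
    """Circularly shift a 2D list-of-lists image by dx columns and dy rows."""
    h, w = len(img), len(img[0])
    return [[img[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def best_shift(reference, moved, max_shift):
    """Return the (dx, dy) within +/- max_shift that maximizes the correlation."""
    h, w = len(reference), len(reference[0])
    best, best_score = (0, 0), float("-inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Correlation score for this candidate shift (circular boundary).
            score = sum(
                reference[y][x] * moved[(y + dy) % h][(x + dx) % w]
                for y in range(h)
                for x in range(w)
            )
            if score > best_score:
                best, best_score = (dx, dy), score
    return best


# Toy example: three bright pixels, shifted by a known offset.
image = [[0.0] * 8 for _ in range(8)]
image[1][2], image[3][5], image[6][1] = 1.0, 2.0, 3.0
moved = shift_image(image, dx=2, dy=-1)
recovered = best_shift(image, moved, max_shift=3)
```

The exhaustive search costs one full image product per candidate shift, which is why an FFT-based correlation is preferable at the 256 × 256 image size used in the experiments.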
The embeddings are thus learned to further ensure that the synthetic image and the live image can be correlated correctly. Without ground truth $\alpha$, we can self-supervise using an approach similar to that used for learning PASED, by applying a known shift. The architecture for learning the embeddings is shown in Figure 10. As in Section 4.2, $\tilde{B}_{\theta,\beta}$ can be obtained by shifting the map image $A$ to obtain $A'$. Given the learned deep embeddings $\tilde{B}^\dagger_{\theta,\beta}$ and $B^\dagger_\theta$, the translation offset found by correlation maximization is $\hat{\beta}$. The difference of the two offsets, $\beta - \alpha$, is known and can be used to establish a loss term:
$$\mathcal{L} = \big\| (\hat{\beta} - \hat{\alpha}) - (\beta - \alpha) \big\|.$$
The overall pipeline for data flow at inference time is shown in Figure 11.

Implementation details
Here we provide details on the architecture of the various networks used in our method, and the associated hyperparameters. We make use of the following abbreviations. RP($p$): 2D reflection padding of $p$. Conv($C_{\text{in}}$, $C_{\text{out}}$, $k$, $s$, $p$): convolution with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, kernel size $k$, stride $s$, padding $p$, and bias. IN: instance normalization. ReLU: rectified linear unit. LReLU($m$): leaky ReLU with negative slope $m$. Drop($d$): dropout with ratio $d$. ConvT($C_{\text{in}}$, $C_{\text{out}}$, $k$, $s$, $p$, $p_{\text{out}}$): transposed convolution with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, kernel size $k$, stride $s$, padding $p$, output padding $p_{\text{out}}$, and bias.
The network architectures are listed in Tables 1 to 4. For comparison against the prior supervised approach, we use the same architectures where possible. We implemented the image generation network for the prior supervised approach to have the same latent space size at the bottleneck, and the same number of down-sample and up-sample layers as in our method.
Our method is implemented in PyTorch (Paszke et al., 2019). For training the rotation inference network $f_R$ and the image generation networks $E_a$, $E_p$, $E_p^*$, and $D$, we use a learning rate of $2 \times 10^{-4}$. For learning the embedding networks $H_B$ and $H_{\tilde{B}}$, we use a learning rate of $2 \times 10^{-6}$. We use Adam (Kingma and Ba, 2015) as the optimizer for all experiments. Training is terminated when the validation loss increases for more than five epochs, or when a maximum number of epochs is reached. This results in 80 to 150 epochs of training for learning $f_R$, $E_a$, $E_p$, $E_p^*$, and $D$, depending on the dataset and the specific experiment, and 10 to 20 epochs for learning $H_B$ and $H_{\tilde{B}}$. Inference runs at about 10 Hz on a single 1080 Ti GPU. We use a batch size of 32 for all experiments unless otherwise stated.
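The termination rule can be sketched as a small helper. This is only one plausible reading of the criterion: it assumes that "increases for more than five epochs" means the validation loss has failed to improve on its best value for more than `patience` consecutive epochs. The function name, the `patience` parameter, and the default `max_epochs` are hypothetical choices for illustration.

```python
def stopping_epoch(val_losses, patience=5, max_epochs=150):
    """Return the 1-based epoch at which training would stop.

    Assumption: training stops once the validation loss has not improved
    on its best value for more than `patience` consecutive epochs, or when
    `max_epochs` (or the end of the recorded losses) is reached.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Loss improves for three epochs, then plateaus at a worse value.
example_losses = [1.0, 0.9, 0.8] + [0.85] * 10
stop = stopping_epoch(example_losses)
```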

Experimental validation
We evaluate on several public, real-world datasets collected with vehicles equipped with on-board range sensors. The datasets we use come with metric ground truths that are decently accurate, though we noticed the GPS/INS solutions in certain places can drift up to a few meters.
We add large artificial pose offsets to the ground truth when querying for a satellite image, thereby simulating a realistic robot navigation scenario where the initial pose estimate can solve place recognition, but is too coarse for the robot's metric pose. Using a map (e.g., satellite) image queried at this coarse initial pose estimate, our method solves metric localization by comparing against the live sensor data. The true pose offsets are hidden during training as our method is self-supervised, and are only revealed at test time for evaluation purposes.
The artificial offset is chosen such that the initial estimate has an unknown heading error in the range $[-\pi/8, \pi/8]$. Therefore, given the initial estimate $\theta_0$, the rotation inference must search a solution space of at least $[\theta_0 - \pi/8, \theta_0 + \pi/8]$ to guarantee that the correct solution can be found. We use a pixel-wise translation error in the range $[-25, 25]$ pixels.
Depending on the resolution for a specific experiment, this corresponds to an error of at least $[-10\,\text{m}, 10\,\text{m}]$ and up to more than $[-20\,\text{m}, 20\,\text{m}]$. All experiments use input images of size $256 \times 256$.
Fig. 11. Overall data flow of our method at inference: given map image $A$ and live data image $B$, based on the initial heading estimate, we form a stack of rotated images $\{B_{\theta_0}, \ldots, B_{\theta_n}\}$, from which $f_R$ discovers $B_\theta$, which is $B$ rotated to be rotation-aligned with $A$. This process also infers the heading estimate $\hat{\theta}$. Here $A$ and $B_\theta$ are used to generate a synthetic image $\tilde{B}_{\theta,\alpha}$ that has the same appearance and observed scene as $B_\theta$ and is pose-aligned with $A$; $\tilde{B}_{\theta,\alpha}$ and $B_\theta$ are then projected to deep embeddings $\tilde{B}^\dagger_{\theta,\alpha}$ and $B^\dagger_\theta$, where the estimate of the translation offset $\hat{\alpha}$ is found by correlation maximization.

Radar localization against satellite imagery
We evaluate our method on two datasets with FMCW radar and GPS: the Oxford Radar RobotCar Dataset and the MulRan Dataset (Kim et al., 2020). The satellite images for RobotCar are queried using Google Maps Platform. 2 For MulRan they are queried using Bing Maps Platform, 3 as high-definition Google satellite imagery is unavailable at the place of interest.
The GPS/INS and range sensor data for all datasets used are timestamped. To create the ground truth, for each frame of range sensor data, we find its associated latitude, longitude, and heading from the GPS/INS data based on the time-stamp, and query a satellite image with the corresponding latitude and longitude. We also rotate the range sensor image with the ground-truth heading to generate a rotation-aligned range sensor image. To add the initial offset for simulating a coarse initial estimate, we simply shift the center of the satellite images by the translation offset and rotate the range sensor image by the heading offset when forming the range sensor-satellite pairs. We benchmark against the prior supervised method RSL-Net (Tang et al., 2020b) in our experiments, which is evaluated only on the RobotCar Dataset. Both datasets contain repeated traversals of the same routes. We separately train, validate, and test for every dataset, splitting the data as in Figure 12. For the RobotCar Dataset, we split the trajectories the same way as in Tang et al. (2020b) for a fair comparison. For the RobotCar Dataset, the training set consists of training data from sequences no. 2, no. 5, and no. 6, whereas we test on the test data from sequence no. 2. For the MulRan Dataset, we used sequences KAIST 01 and Sejong 01. The resulting test sets feature an urban environment (RobotCar), a campus (KAIST 01) and a highway (Sejong 01).
We test on every fifth frame, resulting in 201 frames from the RobotCar Dataset and 358 from the MulRan Dataset, spanning a total distance of almost 4 km. The resolution used is 0.8665 m/pixel for RobotCar and 0.7876 m/pixel for MulRan. All sensor data are pre-scaled to have the same resolution as the satellite images. The mean errors and standard deviations around the mean are reported in Table 5. Starting from a large initial offset, we can localize to an error of a few meters in translation and a few degrees in heading. Our method achieves an error on par with the supervised approach in Tang et al. (2020b), while requiring no metrically accurate ground truth for training.

Lidar localization against satellite imagery
For this experiment, we evaluate on the RobotCar Dataset, which also has two Velodyne HDL-32E lidars mounted in a tilted configuration, and on the KITTI raw dataset, which has a Velodyne HDL-64E lidar and GPS data.
For the RobotCar Dataset, the trajectories are split into training, validation, and test sets in approximately the same way as in Section 6.1. For the KITTI Dataset, the training set includes sequences 20110929_drive0071, 20110930_drive0028, and 20111003_drive0027. Sequence 20110926_drive0117 is used for validation. Finally, data in 20111003_drive0034 are split into training and test, as shown in Figure 12. To turn 3D lidar point-clouds into lidar images, the point-clouds are projected onto the x-y plane. We discard points with z values smaller than zero to remove ground points when creating the lidar images. Although such a simple approach may also remove some non-ground points, we observe that this effect is minimal. The resulting lidar images are grey-scale images where pixel values are the normalized intensity.
Table 3. Image generation for our implementation of RSL-Net for comparison.
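The projection of a lidar point-cloud to an image can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: the sensor sits at the image center, x maps to columns and y to rows (pointing up), intensities are already normalized to [0, 1], and the maximum intensity is kept when several points fall into one pixel. The function name `lidar_to_image` is hypothetical.

```python
import math


def lidar_to_image(points, size=256, resolution=0.4332):
    """Project (x, y, z, intensity) points onto the x-y plane as a grey-scale image.

    `resolution` is in meters per pixel. Points with z < 0 are discarded as
    ground points, and points that fall outside the image are ignored.
    """
    image = [[0.0] * size for _ in range(size)]
    half = size // 2
    for x, y, z, intensity in points:
        if z < 0:
            continue  # simple ground removal, as described in the text
        col = half + math.floor(x / resolution)
        row = half - math.floor(y / resolution)
        if 0 <= row < size and 0 <= col < size:
            # Assumed tie-breaking rule: keep the strongest return per pixel.
            image[row][col] = max(image[row][col], intensity)
    return image


points = [
    (0.0, 0.0, 1.0, 0.5),     # kept, lands at the image center
    (1.0, 0.0, -0.2, 0.9),    # discarded: below the sensor (z < 0)
    (0.9, 0.0, 0.3, 0.7),     # kept, one pixel to the right of the center
    (1000.0, 0.0, 1.0, 1.0),  # ignored: outside the image
]
small = lidar_to_image(points, size=8, resolution=0.5)
```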
As lidars have a shorter range than radars, we use satellite images of a greater zoom level, with a resolution of 0.4332 m/pixel for RobotCar and 0.4592 m/pixel for KITTI. The test set consists of 200 frames for RobotCar and 253 for KITTI, spanning a total distance of nearly 3 km. The test set for KITTI features a residential area. The results are reported in Table 6. The error on the RobotCar Dataset is smaller when using lidar for localization than when using radar.

Radar localization against prior lidar map
Though our method is designed for localizing against satellite imagery, we show it can also handle more standard forms of cross-modality localization. Here we build a lidar map using a prior traversal, and localize using radar from a later traversal.
We demonstrate on the RobotCar and MulRan datasets, where we use the same resolution as in Section 6.1. For RobotCar, we use ground truth to build a lidar map from sequence no. 2. Radar data in the training sections from no. 5 and no. 6, as in Figure 12, form the training set, whereas the test section from sequence no. 5 forms the test set. For MulRan, lidar maps are built from KAIST 01 and Sejong 01, and we localize using radar data from KAIST 02 and Sejong 02, which are split into training, validation, and test sets. This results in a test set consisting of 201 frames from RobotCar and 272 frames from MulRan, spanning a total distance of almost 4 km. The localization results are listed in Table 7.
This experiment is more suitable for naive image generation methods such as CycleGAN (Zhu et al., 2017) than previous experiments, because the field of view is considerably more compatible as both modalities are from range sensors. In Table 7, we list results where we replaced the image generation stage of our method by CycleGAN while keeping other modules unchanged. The localization results are, however, much worse when modality A is satellite imagery, as shown qualitatively in Figure 5.

Online pose-tracking system
In prior experiments we assumed place recognition is always available, providing a coarse initial estimate for every frame. Here we present a stand-alone pose-tracking system by continuously localizing against satellite imagery. Given a coarse initial estimate (e.g., from GPS) for the first frame, the vehicle localizes and computes its pose within the satellite map. The initial estimate for every frame onward is then set to be the computed pose of the previous frame. We only need place recognition once at the very beginning; the vehicle then tracks its pose onward without relying on any other measurements.
6.4.1. Introspection. As localizing using satellite imagery is challenging, the result will not always be accurate. Our method, however, naturally allows for introspection. A synthetic image $\tilde{B}_{\theta,\alpha}$ was generated from $A$ and $B_\theta$. We can apply a known small translation offset $\delta$ to $A$ to form $A_\delta$. Taking $A_\delta$ and $B_\theta$, we can generate $\tilde{B}_{\theta,\alpha+\delta}$. Finally, we can compute a translation offset $\hat{\delta}$ by passing $\tilde{B}_{\theta,\alpha+\delta}$ and $\tilde{B}_{\theta,\alpha}$ through the learned embeddings and maximizing correlation. Let $\delta_{\text{intro}} = \delta - \hat{\delta}$. A large value of $\delta_{\text{intro}}$ indicates that the generated images are erroneous. This allows us to examine the solution quality; our system falls back to using odometry for dead-reckoning when $\delta_{\text{intro}}$ exceeds a threshold. We do not require high-quality odometry, but rather use a naive approach that directly maximizes correlation between two consecutive frames without any learned modules. In our experiments, we set $\delta$ to $[10\ 10]^T$ and the threshold on $\delta_{\text{intro}}$ to 5. The results are shown in Figure 13. If the solution error is too large, then the initial estimate will be too far off for there to be sufficient overlap between the next queried satellite image and the live data, resulting in losing track of the vehicle. Although the solution error is large for certain frames, our system continuously localizes the vehicle for over a kilometer without completely losing track. For the lidar experiment, the solutions are sufficiently accurate to not require any odometry. Each solution only uses a single frame of data (plus the solution from the previous frame for the initial estimate), and we make no attempt at windowed/batch optimization or loop closures.
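The fallback logic of the introspection step can be sketched as below. The sketch assumes that the discrepancy between the known shift and the recovered shift is reduced to a scalar with the Euclidean norm before it is compared with the threshold; the text does not state which reduction is used. The function name and the pose representation (plain tuples) are hypothetical.

```python
import math


def choose_pose(known_shift, recovered_shift, localized_pose, odometry_pose,
                threshold=5.0):
    """Pick the pose to use for this frame based on the introspection error.

    Assumption: the introspection error is the Euclidean norm of the
    difference between the known shift applied to the map image and the
    shift recovered by correlation maximization.
    """
    error = math.hypot(known_shift[0] - recovered_shift[0],
                       known_shift[1] - recovered_shift[1])
    if error > threshold:
        # Generated images look unreliable: fall back to dead-reckoning.
        return odometry_pose, error
    return localized_pose, error


good_pose, good_error = choose_pose((10, 10), (9, 11), "localized", "odometry")
bad_pose, bad_error = choose_pose((10, 10), (2, 10), "localized", "odometry")
```

In the first call the recovered shift is close to the applied one, so the localization result is kept; in the second the discrepancy of 8 pixels exceeds the threshold and odometry is used instead.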

Ablation study
We perform an ablation study to investigate the effect of reduced training data. For radar localization against satellite imagery on the RobotCar Dataset, we trained one model using approximately the first 20% of the training data, and another using every 10th frame of the training data. The results are shown in Table 8. Notably, selecting uniformly distributed data, in contrast to using only the first 20%, leads to a more varied dataset and, as such, achieves better performance despite the lower number of samples.

Testing on different sequences
For the results in Tables 5, 6, and 7, the training and test data, though with zero spatial overlap, are from the same sequences. The RobotCar and MulRan datasets contain repeated sequences of the same trajectory. Here we show results where the test set trajectories are from different sequences, to demonstrate the capability of generalizing not only to unseen places, but also to range sensor data recorded on different days from the training data.
Fig. 13. Estimated pose (blue) versus ground-truth pose (red) for localizing a radar (left) and a lidar (right) against satellite imagery. Our system tracks the vehicle's pose over 1 km, where we occasionally fall back to odometry for the radar experiment (green). Our system is stand-alone and requires GPS only for the first frame.
We arbitrarily selected sequences no. 7 and no. 15 as the new test sequences for RobotCar, and sequences KAIST 02 and Sejong 02 for MulRan. The same test trajectories as in Figure 12 are extracted for the new test sequences. The training trajectories and sequences remain unchanged.
The localization results are listed in Table 9 and Table 10. The mean errors are fairly consistent across different sequences.

Circular initial offset
For a fair comparison against the supervised approach in Tang et al. (2020b), we assume the initial offset for both x and y is uniformly distributed in the range $[-25, 25]$ pixels, which was also employed in Tang et al. (2020b). However, a more realistic initial offset that more closely mimics the error characteristics of a GPS would be obtained by sampling the initial offset from a circular area of radius $r$ around the ground-truth position.
Without re-training, we evaluate the model performance where the initial translation offset is sampled uniformly from a circular area with a radius of 25 pixels, and centered at the ground truth position. The sampling for the initial heading offset remains unchanged. The localization results are summarized in Tables 11 and 12.
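Uniform sampling over a circular area can be sketched as follows. The square root on the radius is the standard way to obtain a uniform density over the disc's area; whether the authors sampled in this way or by another method (for example, rejection sampling) is not stated, and the function name is hypothetical.

```python
import math
import random


def sample_circular_offset(radius, rng):
    """Draw an (x, y) offset uniformly from a disc of the given radius.

    Taking the square root of a uniform variate for the radius compensates
    for the area growing with r^2, so points are not clustered at the center.
    """
    r = radius * math.sqrt(rng.random())
    phi = 2.0 * math.pi * rng.random()
    return r * math.cos(phi), r * math.sin(phi)


rng = random.Random(0)
offsets = [sample_circular_offset(25.0, rng) for _ in range(2000)]
# Roughly a quarter of the samples should fall within half the radius.
inner_fraction = sum(math.hypot(x, y) <= 12.5 for x, y in offsets) / len(offsets)
```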

Trade-offs on network width and depth
Here we show the effects of network width (number of channels in each layer) and depth (number of layers) on solution quality and the associated trade-offs, and justify choosing the architecture shown in Tables 1 to 4. In this experiment, we vary the width or depth of the networks for rotation inference and image generation, namely $f_R$, $E_a$, $E_p$, $E_p^*$, and $D$, while keeping $H_B$ and $H_{\tilde{B}}$ unchanged. The results are listed in Table 13.
First, we fix the network depth while halving or doubling the number of channels in each layer. When the width is reduced, the network representation power is noticeably affected, as indicated by an increase in solution error in all of $x$, $y$, and $\theta$. When the width is doubled, we achieve a slight reduction in the rotation error and the overall translation error. However, the total number of network parameters increases by a factor of four when the width is doubled, greatly limiting the maximum affordable batch size, and thereby increasing the training time. As a result, we opted for the architecture presented in Section 5.
Next, we keep the number of channels the same in the first layer, and study the effect of making the networks shallower or deeper (by one layer). For image generation we use an encoder-decoder architecture with stride 2 in the convolution layers (after the first layer), as shown in Table 2; thus the height and width of the representation decrease by a factor of 2 after each layer, becoming 16 × 16 at the bottleneck with an input of size 256 × 256. Intuitively, a trade-off exists: deeper networks theoretically have stronger representation power, but result in a reduced representation size at the bottleneck, making image reconstruction harder. The results in Table 13 suggest that our choice of network depth performs best among those tested, with a noticeably smaller translation error, which is affected by the quality of image generation.
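The relationship between depth and bottleneck size can be made concrete with a short calculation. The sketch assumes four stride-2 layers for the chosen architecture, which is inferred from the stated reduction from 256 to 16 rather than read from Table 2; the function name is hypothetical.

```python
def bottleneck_size(input_size, num_stride2_layers):
    """Spatial size after repeatedly halving the input with stride-2 layers."""
    size = input_size
    for _ in range(num_stride2_layers):
        size //= 2
    return size


# Assumed layer counts: one layer shallower, the chosen depth, one layer deeper.
sizes = {layers: bottleneck_size(256, layers) for layers in (3, 4, 5)}
```

Under these assumptions, a network that is one layer shallower keeps a 32 × 32 bottleneck, whereas one that is one layer deeper compresses the representation to 8 × 8, which illustrates why reconstruction becomes harder with depth.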

Choice of image resolution
In the results presented in Sections 6.1 and 6.2, we used a resolution of 0.8665 m/pixel for experiments on radar localization against satellite imagery for the RobotCar Dataset, and a resolution of 0.4332 m/pixel for lidar localization against satellite imagery. The resolutions used correspond to a "zoom level" of 17 and 18, respectively, when querying from Google Maps Platform, and are chosen based on the effective sensing range of radar and lidar.
To study the effect of image resolution on solution quality, we created another dataset for radar localization against satellite imagery from the RobotCar Dataset, where the resolution is set to 0.4332 m/pixel, effectively "zooming in" on both the radar and satellite images. A comparison between images with the refined resolution and images used in the experiment in Section 6.1 is shown in Figure 14. We consider the same translation error in pixels for the initial estimate when comparing the two resolutions. The localization results are listed in Table 14.
Overall, choosing a resolution of 0.4332 m/pixel resulted in a slightly larger mean error in $x$ and $\theta$. Although zoomed-in images offer a more refined resolution, "zooming in" without changing the size of the input images essentially discards measurements far from the sensor, as shown in Figure 14, thereby limiting the amount of information being passed through the network.
Owing to the pixel-based nature of CNNs, the upper limit our method can handle in terms of initial translation offset is also inherently in pixels. Without changing the initial pose offset expressed in pixels, models trained with higher-resolution images offer a reduction in metric error, at the cost, however, of needing a smaller metric initial offset. In real-world applications, the coarse initial offset may be large, for example in places where direct GPS signals are occluded, and as such models trained with lower-resolution images are needed to handle such a large initial offset. For a smaller metric initial offset, a model trained with higher-resolution images can be used to provide a more refined pose estimate. With this in mind, we present a "hybrid" approach that utilizes both lower- and higher-resolution images. Specifically, given an initial offset in the range $[-21.66\,\text{m}, 21.66\,\text{m}]$, which corresponds to $[-25, 25]$ pixels at a resolution of 0.8665 m/pixel, we first compute a solution using a model trained with lower-resolution images. Next, we take the estimated pose as the new initial translation offset and compute a refined solution using a model trained with higher-resolution images at 0.4332 m/pixel. This "hybrid" approach allows for the best of both worlds: the lower-resolution model allows large metric initial offsets, whereas the higher-resolution model provides a further refinement, reducing the metric error. We demonstrate this approach for radar localization against satellite imagery on the RobotCar Dataset. The results are listed in the last row of Table 14: the metric errors in both $x$ and $y$ are reduced by the sequential refinement with a higher-resolution model, compared with using a lower-resolution model only.
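The pixel-to-metric bookkeeping of the hybrid approach can be sketched as follows. The sketch assumes that the higher-resolution model estimates the residual offset that remains after the lower-resolution estimate has been applied, so that the two estimates add in metric units. The function names and the example pixel values are hypothetical; only the two resolutions come from the text.

```python
COARSE_RES = 0.8665  # meters per pixel, lower-resolution model
FINE_RES = 0.4332    # meters per pixel, higher-resolution model


def pixels_to_meters(offset_px, resolution):
    """Convert an (x, y) offset from pixels to meters."""
    return tuple(c * resolution for c in offset_px)


def hybrid_estimate(coarse_px, fine_px):
    """Combine a coarse estimate and a fine residual estimate, in meters.

    Assumption: `fine_px` is measured relative to the pose obtained from the
    coarse estimate, so the metric offsets are simply summed.
    """
    coarse_m = pixels_to_meters(coarse_px, COARSE_RES)
    fine_m = pixels_to_meters(fine_px, FINE_RES)
    return tuple(c + f for c, f in zip(coarse_m, fine_m))


# The same 25-pixel limit covers very different metric ranges.
coarse_limit_m = pixels_to_meters((25, 25), COARSE_RES)
fine_limit_m = pixels_to_meters((25, 25), FINE_RES)
total_m = hybrid_estimate(coarse_px=(20, -10), fine_px=(3, 2))
```

The 25-pixel limit corresponds to about 21.66 m for the lower-resolution model but only about 10.83 m for the higher-resolution one, which is why the coarse stage is needed before the refinement.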

Choice of rotation increment
We have chosen a rotation increment of 2° when forming the rotation stack $\{B\}$ for all of our experiments. Here we justify this choice by comparing the resulting rotation error for various increments.
Intuitively, the solution error on $\theta$ should decrease with finer increments. However, a trade-off exists: for the same range of initial heading error ($[-\pi/8, \pi/8]$ in all our experiments), the number of rotated images forming $\{B\}$, $n$, increases with smaller rotation increments. This, in turn, increases the memory usage during training. When training on two 1080 Ti GPUs, limited by the memory requirement, we could only afford a batch size of 16 when the increment is set to 1°, whereas we can use a batch size of at least 32 when the increment is 2° or larger.
As listed in Table 15, the mean error in $\theta$ when choosing 1° as the rotation increment is slightly larger than when using 2°, which we hypothesize is primarily because the batch size is too small. We used a batch size of 32 for the experiments where the increment is 2°, 3°, and 4°, and it is clear that the solution error increases with larger increments and, thus, coarser sampling. For an increment of 4° we could also afford a batch size of 64; however, the resulting $\theta$ error was higher than when using a batch size of 32.
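The growth of the rotation stack with finer increments can be illustrated with a short sketch. It assumes that the stack is symmetric about the initial heading estimate and is extended to the first multiple of the increment that covers the ±22.5° (±π/8) range; the exact construction, and hence the exact stack sizes, are not given in the text. The function name is hypothetical.

```python
import math


def rotation_stack_angles(increment_deg, half_range_deg=22.5):
    """Return candidate heading offsets (degrees) around the initial estimate.

    Assumption: offsets are multiples of the increment, symmetric about zero,
    and extended just far enough to cover +/- half_range_deg.
    """
    steps = math.ceil(half_range_deg / increment_deg)
    return [k * increment_deg for k in range(-steps, steps + 1)]


stack_sizes = {inc: len(rotation_stack_angles(inc)) for inc in (1, 2, 3, 4)}
```

Under these assumptions, halving the increment from 2° to 1° nearly doubles the number of rotated images per sample, which is consistent with the reduced batch size that could be afforded at 1°.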

Handling larger initial offset
So far, all experiments presented start from an initial translation offset in the range $[-25, 25]$ pixels, which corresponds to more than $[-20\,\text{m}, 20\,\text{m}]$ for the radar experiments in Section 6.1. In practice, the amount of offset our method can handle depends on the effective receptive field of the convolution layers in the encoder and decoder networks for image generation, namely $E_a$, $E_p^*$, and $D$. If the offset is too large, the networks may not be able to encode and decode the information needed to correctly generate $\tilde{B}_{\theta,\alpha}$. Supporting our intuition, Figure 15 shows that the solution error increases superlinearly with larger initial offset.
Our method, however, naturally allows for a strategy to deal with larger initial offsets at inference time than in the training data, without the need to re-train. At inference, rather than using just $B_\theta$ during image generation, we can apply known translation offsets $\delta_1$, $\delta_2$, $\delta_3$, and $\delta_4$ to shift the center of $B_\theta$ onto each of the four quadrants. This is depicted in Figure 16, where, as an example, we shift $B_\theta$ by $[-10, 10]$, $[10, 10]$, $[-10, -10]$, and $[10, -10]$ pixels to form $B_{\theta,\delta_1}$, $B_{\theta,\delta_2}$, $B_{\theta,\delta_3}$, and $B_{\theta,\delta_4}$, respectively. Figure 17 depicts a case where the translation offset $\alpha$ between the satellite image $A$ and $B_\theta$ is $\alpha = [35\ 35]^T$ pixels. This is larger than the range of translation offsets in the training data, which is $[-25, 25]$ pixels, shown by the dashed box around the origin. However, the offset between $B_{\theta,\delta_2}$ and $A$ is $[25\ 25]^T$ pixels, which is within the range our networks can handle, as shown in Figure 18.
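The arithmetic behind this example can be checked with a short sketch, which computes the offset that remains between the map image and each shifted version of the live image. It uses the true offset only for illustration, since that offset is unknown at inference. The sign convention (remaining offset equals the true offset minus the applied shift) is inferred from the quoted numbers, and the names are hypothetical.

```python
def residual_offset(true_offset, shift):
    """Offset remaining between the map image and the shifted live image."""
    return (true_offset[0] - shift[0], true_offset[1] - shift[1])


def within_trained_range(offset, limit=25):
    """True if both components lie inside the range seen during training."""
    return all(abs(component) <= limit for component in offset)


shifts = {
    "delta_1": (-10, 10),
    "delta_2": (10, 10),
    "delta_3": (-10, -10),
    "delta_4": (10, -10),
}
true_offset = (35, 35)
residuals = {name: residual_offset(true_offset, s) for name, s in shifts.items()}
usable = [name for name, r in residuals.items() if within_trained_range(r)]
```

For the offset of 35 pixels in both axes, only the second shift brings the remaining offset back to 25 pixels in both axes; the other three leave at least one component at 45 pixels.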
Forming $E_p^*(A, B_\theta)$ and $E_a(B_\theta)$ during image generation might lead to incorrect results when generating $\tilde{B}_{\theta,\alpha}$, as the offset between $A$ and $B_\theta$ is larger than what $E_p^*$ is trained for. However, we can also generate $\tilde{B}_{\theta,\alpha}$ using $E_p^*(A, B_{\theta,\delta_2})$ and $E_a(B_{\theta,\delta_2})$. Such a combination does not suffer from the issue with large offsets discussed above.
The remaining question is then which shifted image to choose from $B_{\theta,\delta_1}$, $B_{\theta,\delta_2}$, $B_{\theta,\delta_3}$, and $B_{\theta,\delta_4}$. In Figure 19, as an example, we show that generating $\tilde{B}_{\theta,\alpha}$ using $E_p^*(A, B_{\theta,\delta_1})$ and $E_a(B_{\theta,\delta_1})$ will also be problematic, as this combination also suffers from the issue with large offsets. The selection cannot be made ahead of the image generation, as $\alpha$ is unknown. However, our method naturally allows for a strategy to select the correct quadrant to shift $B_\theta$ onto, without needing to know $\alpha$ a priori, using an introspective method similar in spirit to that presented in Section 6.4.1.
We can generate five versions of $\tilde{B}_{\theta,\alpha}$ using $B_\theta$, $B_{\theta,\delta_1}$, $B_{\theta,\delta_2}$, $B_{\theta,\delta_3}$, and $B_{\theta,\delta_4}$, and introspect the quality of each $\tilde{B}_{\theta,\alpha}$. To do so, we apply a known shift $\gamma$ to $A$ to query for another image $A_\gamma$, and we also shift each $B_{\theta,\delta_i}$ by $\gamma$ to form $B_{\theta,\delta_i+\gamma}$ (or $B_{\theta,\gamma}$ for $B_\theta$), as in Figures 18 and 19. For each shift $\delta_i$ (and zero shift for $B_\theta$), we can take the combination $E_p^*(A_\gamma, B_{\theta,\delta_i+\gamma})$ and $E_a(B_{\theta,\delta_i+\gamma})$ to generate $\tilde{B}_{\theta,\alpha+\gamma}$, which should be pixel-wise aligned with $A_\gamma$. If generating $\tilde{B}_{\theta,\alpha}$ is problematic due to large offsets, then so will be generating $\tilde{B}_{\theta,\alpha+\gamma}$, as shown in Figure 19. On the other hand, if the networks can correctly produce $\tilde{B}_{\theta,\alpha}$, they can also correctly produce $\tilde{B}_{\theta,\alpha+\gamma}$, as shown in Figure 18.
For each shift $\delta_i$, we can compute a translation offset $\hat{\gamma}$ using $\tilde{B}_{\theta,\alpha+\gamma}$ and $\tilde{B}_{\theta,\alpha}$, along with an error term $\epsilon = \|\hat{\gamma} - \gamma\|$. Of the five pairs of synthetic images, the one that results in the smallest $\epsilon$ is used and passed downstream to solve for $\hat{\alpha}$. This forms an introspective approach for handling initial offsets larger than what the models are trained for, by augmenting the original input $B_\theta$ with various shifted versions. Table 16 shows results on the RobotCar Dataset for radar localization against satellite imagery, where the initial offset is now in the range of $[-35, 35]$ pixels. We also include additional results where the networks are trained using the same offset range as at inference for comparison. Taking a model trained for an offset in the range $[-25, 25]$ pixels and evaluating directly with offsets in the range $[-35, 35]$ pixels, we obtained higher errors compared with Section 6.1. However, with our input-augmentation approach of shifting $B_\theta$ and generating $\tilde{B}_{\theta,\alpha}$ multiple times, we can handle larger initial offsets without significantly sacrificing accuracy. This method, however, introduces additional computational cost, as multiple forward passes of image generation are needed.
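The selection step can be sketched as an argmin over the introspection errors of the five candidates. The sketch assumes the Euclidean norm for the error term and takes the recovered shifts as given inputs; in the actual pipeline they would come from correlation maximization on the generated images. The candidate names, the example values, and the function name are hypothetical.

```python
import math


def select_candidate(known_shift, recovered_shifts):
    """Return the candidate whose recovered shift is closest to the known one.

    `recovered_shifts` maps a candidate name to the (x, y) shift recovered
    from its pair of synthetic images. The Euclidean norm is an assumption.
    """
    errors = {
        name: math.hypot(shift[0] - known_shift[0], shift[1] - known_shift[1])
        for name, shift in recovered_shifts.items()
    }
    best = min(errors, key=errors.get)
    return best, errors[best]


known_shift = (10, 10)
recovered_shifts = {
    "unshifted": (3, 18),
    "delta_1": (-4, 12),
    "delta_2": (10, 9),   # closest to the known shift
    "delta_3": (25, -7),
    "delta_4": (14, 2),
}
chosen, chosen_error = select_candidate(known_shift, recovered_shifts)
```

Only the synthetic image generated from the chosen candidate would then be passed downstream for the final translation estimate.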

Modality transfer
An interesting application would be to train models using data for localization between one type of range sensor (e.g., radar) and satellite imagery, and evaluate for localization between another type of range sensor (e.g., lidar) and satellite imagery. This is particularly useful if we wish to evaluate using a specific type of range sensor at inference, but do not have the associated training data with approximately known trajectories to query for satellite images.
Here we demonstrate how this can be achieved using unsupervised domain adaptation by image-to-image transfer. Specifically, we consider CycleGAN (Zhu et al., 2017) for transferring between radar scan images and lidar scan images. For transferring between images of single scans of radar and lidar data, we also implement the approach in Weston et al. (2020), where range sensor data are converted into a polar coordinate representation prior to the domain adaptation. Figure 20 shows examples of radar and lidar images and their synthetic counterparts generated using our implementation of CycleGAN, for both polar and Cartesian coordinate representations.
Fig. 19. The resulting synthetic image will still be erroneous if an incorrect quadrant is selected. Here the offset between $B_{\theta,\delta_1}$ and $A$ is larger than what the networks can handle. In this case, generating $\tilde{B}_{\theta,\alpha}$ and $\tilde{B}_{\theta,\alpha+\gamma}$ will both be problematic due to the issue with offsets.
Table 16. Radar localization against satellite imagery evaluated on the test set of RobotCar, where the initial translation offset is large at inference. We also included results from Table 5 …
…, output of rotation inference $B_\theta$ (c), embedding $B^\dagger_\theta$ (d), pixel-wise aligned ground truth $B_{\theta,\alpha}$ (e), synthetic image $\tilde{B}_{\theta,\alpha}$ (f), and embedding $\tilde{B}^\dagger_{\theta,\alpha}$ (g). From top to bottom: radar localization against satellite imagery evaluated on RobotCar (rows 1-2) and MulRan (rows 3-4), lidar localization against satellite imagery evaluated on RobotCar (rows 5-6) and KITTI (rows 7-8), and radar localization against lidar map evaluated on MulRan (row 9) and RobotCar (row 10).
At inference, given networks $E_a$, $E_p^*$, $D$, $H_B$, and $H_{\tilde{B}}$ trained for radar localization against satellite imagery, and a lidar image $B_l$, we can generate $\tilde{B}_r = f_{L \to R}(B_l)$, where $\tilde{B}_r$ is a synthetic radar image and $f_{L \to R}$ is a network that transforms a lidar image into a synthetic radar image, trained using the objectives of CycleGAN (Zhu et al., 2017).
We can then use $\tilde{B}_r$ as the input for inferring rotation, conditional image generation, and learning pose estimation using the pipeline detailed in Section 4.
The same can be done in the other direction: given networks trained for lidar localization against satellite imagery, we perform radar localization against satellite imagery at inference time. Table 17 summarizes the results for modality transfer. Networks trained with one type of range sensor suffer from large localization errors when directly evaluated with another type of sensor. The errors are greatly reduced after applying domain adaptation and using transferred images as input. As shown in Figure 20, applying domain adaptation in the polar coordinate representation led to more visually realistic synthetic images when transferring from lidar to radar data, and to the smallest localization errors for networks trained on radar data and tested with lidar data. Applying domain adaptation in the Cartesian coordinate representation, on the other hand, resulted in smaller errors for the reverse experiment. We use radar and lidar images of the same resolution in this experiment, and do not consider modality transfers that also involve a change in resolution.

Further qualitative results
Additional qualitative results are presented in Figure 21, showing various stages of our method for different modalities and datasets.

Conclusion and future work
We present self-supervised learning to address cross-modality, metric localization between satellite imagery and onboard range sensors, without metrically accurate ground truth for training. Our approach utilizes a multi-stage network that solves for the rotation and translation offsets separately through the generation of synthetic range sensor images as an intermediate step. Our method is validated across a large number of experiments for multiple modes of localization, with results on par with a prior supervised approach. A coarse initial pose estimate is needed for our method to compute metric localization. Therefore, a natural extension would then be to solve place recognition for a range sensor within a large satellite map as a prior step to metric localization.