Visual localization plays an important role in Augmented Reality (AR) applications, enabling AR devices to obtain their 6-DoF pose within a pre-built map so that virtual content can be rendered in real scenes. However, most existing approaches cannot perform novel view rendering and require large storage for maps. To overcome these limitations, we propose an efficient visual localization method capable of high-quality rendering with fewer parameters. Specifically, our approach leverages 3D Gaussian primitives as the scene representation. To ensure precise 2D-3D correspondences for pose estimation, we develop an unbiased 3D scene-specific descriptor decoder for Gaussian primitives, distilled from a constructed feature volume. Additionally, we introduce a salient 3D landmark selection algorithm that chooses a suitable subset of primitives for localization based on saliency scores. We further regularize key Gaussian primitives to prevent anisotropic artifacts, which also improves localization performance. Extensive experiments on two widely used datasets demonstrate that our method achieves rendering and localization performance superior or comparable to state-of-the-art implicit-based visual localization approaches.
Method
Reconstruction Process
We incrementally initialize the Gaussian primitives; each primitive is associated with a position $\mu$, rotation $q$, scale $s$, opacity $\sigma$, color $c$, and 3D landmark score $a$. For key Gaussian primitives, we apply soft isotropy and scale regularization to mitigate anisotropic artifacts. The color loss $\mathcal{L}_{c}$, depth loss $\mathcal{L}_d$, 3D landmark loss $\mathcal{L}_m$, and regularization loss $\mathcal{L}_{reg}$ are used to optimize the properties of each primitive via differentiable rasterization.
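To make the parameterization concrete, below is a minimal sketch assuming a PyTorch implementation. The field layout, the loss weights (`w_d`, `w_m`, `w_reg`), and the exact form of the soft-isotropy term are illustrative stand-ins, not the paper's released code.

```python
# Hypothetical sketch of per-primitive parameters and the combined
# reconstruction loss; names and weights are illustrative.
import torch
import torch.nn as nn

class GaussianPrimitives(nn.Module):
    def __init__(self, num_points: int):
        super().__init__()
        self.mu    = nn.Parameter(torch.zeros(num_points, 3))  # position
        self.q     = nn.Parameter(torch.zeros(num_points, 4))  # rotation (quaternion)
        self.s     = nn.Parameter(torch.zeros(num_points, 3))  # scale (log-space)
        self.sigma = nn.Parameter(torch.zeros(num_points, 1))  # opacity (logit)
        self.c     = nn.Parameter(torch.zeros(num_points, 3))  # color
        self.a     = nn.Parameter(torch.zeros(num_points, 1))  # 3D landmark score

def isotropy_reg(s_key: torch.Tensor) -> torch.Tensor:
    # Soft isotropy: penalize deviation of each key primitive's scales from
    # their per-primitive mean, discouraging elongated (anisotropic) Gaussians.
    return (s_key - s_key.mean(dim=-1, keepdim=True)).abs().mean()

def total_loss(l_color, l_depth, l_landmark, l_reg,
               w_d=0.1, w_m=0.1, w_reg=0.01):
    # Weighted sum of the four losses optimized through differentiable
    # rasterization; the weights here are placeholders.
    return l_color + w_d * l_depth + w_m * l_landmark + w_reg * l_reg
```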
Descriptor Learning
The pipeline of our unbiased 3D primitive descriptor learning. We first encode images with a 2D CNN model (SuperPoint) to obtain multi-view feature maps, then construct a 3D scene feature volume using the depth and pose information. To enhance the representation ability of the 3D feature decoder, we use multi-resolution parametric encoding to aid 3D scene-specific descriptor learning. In addition, we sample descriptors only on the scene surface for effective distillation.
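The sketch below illustrates the distillation idea under simplifying assumptions: 2D CNN features are lifted to surface points using depth and pose, and the 3D descriptor decoder (a multi-resolution parametric encoding followed by a small MLP, passed in as `decoder`) is supervised only at those points. The helper names, the cosine-distance objective, and the full-resolution feature map are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch: distill 2D features (e.g., SuperPoint) onto the scene
# surface, then supervise the 3D decoder only at those surface points.
import torch
import torch.nn.functional as F

def unproject(depth, pose_c2w, K):
    """Lift every pixel with valid depth to a 3D world-space surface point."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    valid = z > 0
    uv1 = torch.stack([u.flatten().float(), v.flatten().float(),
                       torch.ones(H * W)], dim=0)        # (3, H*W) homogeneous pixels
    cam = (torch.linalg.inv(K) @ uv1) * z                # camera-space points
    world = pose_c2w[:3, :3] @ cam + pose_c2w[:3, 3:]    # world-space points
    return world.T[valid]                                # (N, 3)

def distill_step(decoder, feat_map, depth, pose_c2w, K):
    # Surface points and their matching 2D feature vectors; for simplicity we
    # assume feat_map is (C, H, W) at full image resolution.
    pts = unproject(depth, pose_c2w, K)                      # (N, 3)
    target = feat_map.flatten(1).T[depth.flatten() > 0]      # (N, C)
    pred = decoder(pts)                                      # (N, C)
    # Cosine-distance distillation loss (one plausible choice).
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```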
Experiments
Visual Localization
Visual Localization Performance: We report median translation and rotation errors (cm, degrees) on the Replica and 12-Scenes datasets.
Novel View Synthesis
Novel View Synthesis Performance: We report PSNR, SSIM, and LPIPS metrics on the Replica dataset.
AR Applications
AR Applications: We show two different AR applications on scene $\texttt{Room 0}$ from the Replica dataset. $\textbf{(1) Insert objects}$: we insert virtual AR objects and an $\texttt{IEEE VR}$ text logo into the real scene. $\textbf{(2) Physical collision}$: we place a virtual blanket in the scene and let it fall naturally to simulate physical collision between our reconstructed scene geometry and the virtual blanket. As can be seen, our approach renders high-quality views and handles collision and occlusion between real and virtual content very well.
Localization Process
Localization Process: We show the $\textcolor{red}{\text{estimated pose}}$, $\textcolor{green}{\text{2D-3D correspondences}}$, and $\textcolor{black}{\text{GT pose}}$ in different colors.
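As a hedged sketch of the pose solver implied by this figure: given 2D keypoints matched to 3D landmark positions, the 6-DoF camera pose can be estimated with PnP + RANSAC. OpenCV's `solvePnPRansac` is one standard choice; the paper's exact solver, thresholds, and refinement steps are not specified here.

```python
# Sketch of 2D-3D pose estimation via PnP + RANSAC (illustrative parameters).
import cv2
import numpy as np

def estimate_pose(pts2d: np.ndarray, pts3d: np.ndarray, K: np.ndarray):
    """pts2d: (N, 2) pixel coords; pts3d: (N, 3) landmark positions; K: (3, 3)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        reprojectionError=3.0, iterationsCount=1000,
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                 # rotation vector -> matrix
    T_w2c = np.eye(4)
    T_w2c[:3, :3], T_w2c[:3, 3] = R, tvec.ravel()
    return T_w2c                               # world-to-camera 6-DoF pose
```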
@article{splatloc,
  author  = {Zhai, Hongjia and Zhang, Xiyu and Zhao, Boming and Li, Hai and He, Yijia and Cui, Zhaopeng and Bao, Hujun and Zhang, Guofeng},
  title   = {SplatLoc: 3D Gaussian Splatting-based Visual Localization for Augmented Reality},
  journal = {arXiv preprint arXiv:2409.14067},
  year    = {2024},
}