VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition

Technical University of Munich, Munich Center for Machine Learning, Microsoft
arXiv 2024

Voxel-Cross-Pixel (VXP) effectively maps data from different modalities into the same shared feature space, enabling more robust and flexible place recognition. The videos above show different types of retrievals performed by VXP.

Abstract

Recent works on global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed for image-based or LiDAR-based modalities. However, accurate image-LiDAR global place recognition is non-trivial, since extracting consistent and robust global descriptors from different domains (2D images and 3D point clouds) is challenging. To address this issue, we propose a novel Voxel-Cross-Pixel (VXP) approach, which establishes voxel-pixel correspondences in a self-supervised manner and brings both modalities into a shared feature space. Specifically, VXP is trained in two stages: the first explicitly exploits local feature correspondences, and the second enforces similarity of global descriptors. Extensive experiments on three benchmarks (Oxford RobotCar, ViViD++ and KITTI) demonstrate that our method surpasses state-of-the-art cross-modal retrieval by a large margin. The code will be publicly available.
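
Once image and LiDAR descriptors live in the same space, retrieval reduces to nearest-neighbor search across modalities. The following is a minimal Python sketch of that lookup, not the authors' implementation: it assumes global descriptors are fixed-length vectors that are L2-normalized so retrieval becomes cosine-similarity search, and the encoder outputs, dimensions, and variable names are hypothetical placeholders.

import numpy as np

def build_database(descriptors: np.ndarray) -> np.ndarray:
    """L2-normalize an (N, D) matrix of global descriptors so that
    dot products equal cosine similarity."""
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    return descriptors / np.clip(norms, 1e-12, None)

def retrieve(query: np.ndarray, database: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of the top-k database entries most similar to the query.
    Because both sides live in the same shared space, the query may come from
    one modality (e.g., an image) while the database comes from another
    (e.g., LiDAR submaps)."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = database @ q  # cosine similarity against every entry
    return np.argsort(-scores)[:top_k]

# Hypothetical usage: in practice the descriptors would come from the
# trained image and voxel encoders; random vectors stand in here.
rng = np.random.default_rng(0)
lidar_db = build_database(rng.standard_normal((1000, 256)))  # 1000 LiDAR descriptors
image_query = rng.standard_normal(256)                       # one image descriptor
print(retrieve(image_query, lidar_db, top_k=5))

Normalizing once at database-build time keeps the per-query cost to a single matrix-vector product, which is why shared-space retrieval scales well to large maps.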

Robust Retrieval at Night

In this situation (query: night, database: evening), 2D-2D place recognition fails. However, 3D-2D place recognition still works robustly.

Different Lighting Conditions

Sometimes, 2D-2D place recognition fails to retrieve the most precise image location in the database (query: day1, database: evening). However, 2D-3D place recognition can provide more accurate and consistent retrievals even in the presence of lighting changes.

BibTeX

@article{li2024vxp,
  title={VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition},
  author={Li, Yun-Jin and Gladkova, Mariia and Xia, Yan and Wang, Rui and Cremers, Daniel},
  journal={arXiv preprint arXiv:2403.14594},
  year={2024}
}