MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

arXiv 2025


Xingyi He1   Hao Yu1,2   Sida Peng1   Dongli Tan1   Zehong Shen1   Hujun Bao1†   Xiaowei Zhou1†

1State Key Lab of CAD&CG, Zhejiang University   2Shandong University

TL;DR


(1) The paper focuses on finding pixel correspondences between image pairs captured under different imaging principles.
(2) We propose a large-scale pre-training framework that utilizes cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks.
(3) The pre-trained models achieve a significant boost on various unseen cross-modality image registration tasks, demonstrating strong generalizability.

Abstract



Capabilities of the image matching model pre-trained by our framework. Green lines indicate the identified corresponding pixel locations between images. Using the same network weights, our model exhibits impressive generalization across a wide range of unseen real-world cross-modality matching tasks, benefiting diverse applications in disciplines such as (a) medical image analysis, (b) histopathology, (c) remote sensing, and autonomous systems including (d) UAV positioning and (e) autonomous driving, among others.


Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. In recent years, deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large amounts of correspondences. However, when dealing with images captured under different imaging modalities that result in significant appearance changes, the performance of these algorithms often deteriorates due to the scarcity of annotated cross-modal training data. This limitation hinders applications in various fields that rely on multiple image modalities to obtain complementary information. To address this challenge, we propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks. Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weight, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advancement significantly enhances the applicability of image matching technologies across various scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence (AI) analysis and beyond.

Method



Preliminary: We first introduce the two types of transformer-based detector-free matching architectures, semi-dense and dense, that serve as base models for our pre-training framework.
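
As a quick illustration of the semi-dense design, the snippet below sketches the coarse matching stage shared by LoFTR-style matchers: transformer coarse features are correlated, a dual-softmax converts the correlation into match confidences, and mutual nearest neighbors above a threshold become coarse matches (later refined to sub-pixel accuracy by a fine-level module). This is a minimal sketch for intuition, not the authors' implementation; the function name, temperature, and confidence threshold are placeholder choices.

# Illustrative sketch of the coarse stage of a semi-dense detector-free matcher.
import torch

def coarse_match(feat0: torch.Tensor, feat1: torch.Tensor,
                 temperature: float = 0.1, conf_thr: float = 0.2):
    """feat0: [N0, C], feat1: [N1, C] -- flattened coarse feature grids."""
    feat0 = torch.nn.functional.normalize(feat0, dim=-1)
    feat1 = torch.nn.functional.normalize(feat1, dim=-1)
    sim = feat0 @ feat1.T / temperature                 # [N0, N1] similarity
    conf = sim.softmax(dim=0) * sim.softmax(dim=1)      # dual-softmax confidence
    # mutual nearest neighbors with sufficient confidence
    mask = (conf == conf.max(dim=1, keepdim=True).values) \
         & (conf == conf.max(dim=0, keepdim=True).values) \
         & (conf > conf_thr)
    i0, i1 = mask.nonzero(as_tuple=True)                # indices of coarse matches
    return i0, i1, conf[i0, i1]

# toy usage with random features
f0, f1 = torch.randn(4800, 256), torch.randn(4800, 256)
idx0, idx1, scores = coarse_match(f0, f1)

Dense matchers such as ROMA instead regress a dense warp together with a per-pixel certainty map, from which correspondences are sampled.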



Pipeline. The proposed large-scale universal cross-modality pre-training framework consists of two components:
(1) A multi-resource dataset mixture engine that generates image pairs with ground-truth matches by combining the strengths of several data types:
(i) multi-view image datasets with known geometry, where ground-truth matches are obtained by warping pixels into the other view using depth maps;
(ii) video sequences, where the continuity of consecutive frames is exploited to construct point trajectories in a coarse-to-fine manner and then build training pairs with pseudo ground-truth matches between distant frames;
(iii) image warping, where sampled transformations construct synthetic image pairs with perspective changes from large-scale single-image datasets (see the sketch after this list).
(2) Cross-modality training pairs are then generated to train matching models to learn fundamental image structures and geometric information: image generation models produce pixel-aligned images in other modalities, which are substituted for the original images in the training pairs.
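
To make the geometry concrete, here is a minimal sketch, assuming posed RGB-D inputs (depth maps, intrinsics K0/K1, and relative pose T_0to1) for source (i) and a plain RGB image for source (iii), of how ground-truth correspondences can be derived by depth-based warping and by sampling a random homography. The function names, depth-consistency threshold, and jitter range are illustrative choices, not the authors' data engine.

# Illustrative sketches of two ground-truth sources used by the mixture engine.
import cv2
import numpy as np

def warp_with_depth(uv0, depth0, depth1, K0, K1, T_0to1, rel_thr=0.05):
    """uv0: [N,2] pixel coords in view 0; returns warped coords in view 1
    and a validity mask from a depth-consistency (occlusion) check."""
    d0 = depth0[uv0[:, 1].astype(int), uv0[:, 0].astype(int)]          # sample depth
    pts0 = np.concatenate([uv0, np.ones((len(uv0), 1))], axis=1) * d0[:, None]
    cam0 = np.linalg.inv(K0) @ pts0.T                                   # back-project
    cam1 = T_0to1[:3, :3] @ cam0 + T_0to1[:3, 3:4]                      # transform to view 1
    proj = K1 @ cam1
    uv1 = (proj[:2] / proj[2:]).T                                       # project to pixels
    in_img = (uv1[:, 0] >= 0) & (uv1[:, 0] < depth1.shape[1] - 1) \
           & (uv1[:, 1] >= 0) & (uv1[:, 1] < depth1.shape[0] - 1) \
           & (d0 > 0) & (cam1[2] > 0)
    valid = np.zeros(len(uv0), dtype=bool)
    idx = np.where(in_img)[0]
    d1 = depth1[uv1[idx, 1].astype(int), uv1[idx, 0].astype(int)]
    # reject occluded points whose projected depth disagrees with depth1
    valid[idx] = np.abs(cam1[2, idx] - d1) / np.maximum(d1, 1e-6) < rel_thr
    return uv1, valid

def synth_pair_by_homography(img, max_offset=0.25):
    """Sample a random homography and warp a single image into a synthetic pair."""
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = (np.random.rand(4, 2) - 0.5) * 2 * max_offset * np.float32([w, h])
    H = cv2.getPerspectiveTransform(src, src + jitter.astype(np.float32))
    warped = cv2.warpPerspective(img, H, (w, h))
    return warped, H   # ground-truth matches follow uv1 = H(uv0)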

Training & Evaluation Metrics


Training: Our data engine comprises a mixture of datasets, including MegaDepth, ScanNet++, BlendedMVS, DL3DV, SA-1B, and Google Landmarks, as well as synthetic cross-modality pairs, including visible-depth, visible-thermal, and day-night pairs. Pre-training is performed on 16 NVIDIA A100-80G GPUs with a batch size of 64. Models are trained from scratch; training takes approximately 4.3 days for ELoFTR and 6 days for ROMA. For all experiments in this paper, we use the same pre-trained weights for each method during evaluation.
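
As an illustration of how such a dataset mixture can be sampled during training, the sketch below builds a weighted mixture loader in PyTorch; build_mixture_loader and the per-dataset weights are hypothetical and do not reflect the paper's actual sampling ratios.

# Hypothetical sketch of mixing multiple data sources during pre-training.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def build_mixture_loader(datasets: dict, weights: dict, batch_size: int = 64):
    """datasets: name -> torch Dataset; weights: name -> relative sampling weight."""
    mixed = ConcatDataset(list(datasets.values()))
    # weight each individual sample so a dataset's total mass equals its weight
    per_sample = torch.cat([
        torch.full((len(ds),), weights[name] / len(ds))
        for name, ds in datasets.items()
    ])
    sampler = WeightedRandomSampler(per_sample, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler, num_workers=8)

# usage (datasets assumed to already yield cross-modality training pairs):
# loader = build_mixture_loader({"megadepth": md, "sa1b": sa},
#                               {"megadepth": 0.5, "sa1b": 0.5})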

Quantitative and Qualitative Comparisons

More Statistical Comparisons



Citation


@inproceedings{he2025matchanything,
  title={MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training},
  author={He, Xingyi and Yu, Hao and Peng, Sida and Tan, Dongli and Shen, Zehong and Bao, Hujun and Zhou, Xiaowei},
  booktitle={arXiv},
  year={2025}
}