This paper addresses large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and an inability to capture global contextual cues. In contrast, humans naturally exploit a global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for improved reconstruction accuracy and consistency. The representation is realized through a set of lightweight neural sub-networks that are rapidly adapted at test time via self-supervised objectives, substantially increasing memory capacity without significant computational overhead. Experiments on multiple large-scale benchmarks, including the KITTI Odometry and Oxford Spires datasets, demonstrate the effectiveness of our approach on ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction quality while maintaining efficiency.
Interactive Examples
Explore reconstructed 3D point clouds from large-scale scenes. Drag to orbit and scroll to zoom.
Note: the viewer does not work when opened via file://; orbit, zoom, and pose scrubbing require serving the directory over HTTP instead of opening the file directly:

cd scripts/gfm/scal3r/web
python3 -m http.server 8765

Scal3R extends VGGT with a test-time-trainable Global Context Memory so long RGB sequences can be processed chunk-by-chunk without losing sequence-wide context.
Given a large set of input RGB images, directly applying VGGT is infeasible due to the quadratic complexity of attention. VGGT-Long mitigates this by partitioning the input sequence into overlapping chunks and aligning adjacent results, but it still cannot exploit long-range contextual information and remains sensitive to local inconsistencies.
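The chunk-and-align strategy above can be sketched in a few lines. The function and its parameters (`chunk_size`, `overlap`) are illustrative assumptions, not the values used by VGGT-Long or Scal3R; the point is only that adjacent chunks share frames so their predictions can later be aligned in a common coordinate frame.

```python
def make_chunks(n_frames, chunk_size, overlap):
    """Partition frame indices [0, n_frames) into overlapping chunks.

    Adjacent chunks share `overlap` frames, giving the alignment step
    common observations to register neighboring reconstructions.
    """
    assert 0 <= overlap < chunk_size
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start < n_frames:
        end = min(start + chunk_size, n_frames)
        chunks.append(list(range(start, end)))
        if end == n_frames:
            break
        start += stride
    return chunks

# 10 frames, chunks of 4 with an overlap of 2:
chunks = make_chunks(n_frames=10, chunk_size=4, overlap=2)
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each chunk is small enough for full attention, but information still cannot flow beyond the shared frames, which is the limitation Scal3R targets.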
Inspired by Test-Time Training, Scal3R inserts Global Context Memory (GCM) modules after the global attention layers of VGGT. Each GCM is implemented with lightweight Adaptive Memory Units that are rapidly adapted through self-supervised updates, allowing the model to compress and retain long-range context while preserving VGGT's strong geometric reasoning capabilities.
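As a rough intuition for how a test-time-trainable memory can compress a long sequence (the actual GCM architecture is not specified here), the sketch below implements a classic fast-weight associative memory: a weight matrix updated at inference time by gradient steps on a self-supervised reconstruction loss. The class name, the delta-rule objective, and all hyperparameters are illustrative assumptions, not Scal3R's design.

```python
class FastWeightMemory:
    """Toy associative memory: W maps key vectors to value vectors.

    update() takes one gradient step on the self-supervised loss
    ||W k - v||^2, so repeated writes compress (key, value) pairs
    into W's fixed-size weights instead of storing them explicitly.
    """

    def __init__(self, dim, lr=0.5):
        self.dim = dim
        self.lr = lr
        self.W = [[0.0] * dim for _ in range(dim)]

    def read(self, key):
        # Matrix-vector product W @ key.
        return [sum(self.W[i][j] * key[j] for j in range(self.dim))
                for i in range(self.dim)]

    def update(self, key, value):
        # Gradient step: W += lr * (v - W k) k^T (delta rule).
        pred = self.read(key)
        err = [value[i] - pred[i] for i in range(self.dim)]
        for i in range(self.dim):
            for j in range(self.dim):
                self.W[i][j] += self.lr * err[i] * key[j]
        return err
```

The memory footprint stays constant no matter how many frames are written, which is what makes this kind of module attractive for arbitrarily long sequences.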
During training and inference, the input sequence is divided into overlapping chunks and distributed across GPUs for parallel processing. We further introduce Global Context Synchronization (GCS), which all-reduces the adaptive memory updates across chunks so each local chunk can benefit from sequence-wide observations. This improves local accuracy, strengthens cross-chunk consistency, and enables scalable kilometer-scale 3D reconstruction from RGB-only videos.
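The synchronization step can be sketched as follows, simulated in-process rather than with a real multi-GPU collective (in practice this would map onto something like `torch.distributed.all_reduce`). Averaging the per-chunk memory deltas is an assumed reduction for illustration; the function names are hypothetical.

```python
def all_reduce_mean(deltas):
    """Average per-worker update vectors, mimicking an all-reduce.

    deltas: one update vector per chunk/GPU. Returns the averaged
    update that every worker applies, so all replicas of the memory
    stay identical and reflect sequence-wide observations.
    """
    n = len(deltas)
    dim = len(deltas[0])
    return [sum(d[i] for d in deltas) / n for i in range(dim)]

def synchronized_step(memories, deltas):
    """Apply the shared averaged delta to every chunk's memory copy."""
    avg = all_reduce_mean(deltas)
    return [[m + a for m, a in zip(mem, avg)] for mem in memories]

# Two chunks each propose a local memory update; after the reduce,
# both copies hold the same state and encode both observations.
mems = synchronized_step([[0.0, 0.0], [0.0, 0.0]],
                         [[2.0, 0.0], [0.0, 4.0]])
# -> [[1.0, 2.0], [1.0, 2.0]]
```

Because every chunk applies the same reduced update, a landmark seen only in one chunk still influences the memory state used by all the others.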
Scal3R maintains globally consistent trajectories even over multi-kilometer drives, while baselines accumulate severe drift.
Drag the slider to compare 3D reconstruction quality between Scal3R and baselines.
Our Global Context Synchronization mechanism shares information across all chunks, eliminating boundary artifacts and ensuring global consistency.
Scal3R scales linearly with sequence length while maintaining stable accuracy.
This work was partially supported by National Key R&D Program of China (No. 2024YFB2809105), NSFC (No. U24B20154), and Information Technology Center and State Key Lab of CAD&CG, Zhejiang University. We thank Tianyuan Zhang for helpful discussions on LaCT and Dongli Tan for valuable discussions. We also thank Haotong Lin for providing the captured video demo of Zijingang Campus, Zhejiang University, and the VGGT-Long authors for providing the cyberpunk recording demo video.
@misc{xie2026scal3rscalabletesttimetraining,
title={Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction},
author={Tao Xie and Peishan Yang and Yudong Jin and Yingfeng Cai and Wei Yin and Weiqiang Ren and Qian Zhang and Wei Hua and Sida Peng and Xiaoyang Guo and Xiaowei Zhou},
year={2026},
eprint={2604.08542},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.08542},
}