This paper addresses large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and an inability to capture global contextual cues. In contrast, humans naturally exploit a global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for improved reconstruction accuracy and consistency. The representation is realized through a set of lightweight neural sub-networks that are rapidly adapted at test time via self-supervised objectives, substantially increasing memory capacity without significant computational overhead. Experiments on multiple large-scale benchmarks, including the KITTI Odometry and Oxford Spires datasets, demonstrate the effectiveness of our approach on ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction quality while maintaining efficiency.
Interactive Examples
Explore reconstructed 3D point clouds from large-scale scenes. Drag to orbit and scroll to zoom.
Note: the viewer does not work when opened via file://; orbit, zoom, and pose scrubbing require serving the directory over HTTP instead of opening the file directly:

cd scripts/gfm/scal3r/web
python3 -m http.server 8765

Scal3R extends VGGT with a test-time-trainable Global Context Memory so long RGB sequences can be processed chunk-by-chunk without losing sequence-wide context.
Given a large set of input RGB images, directly applying VGGT is infeasible due to the quadratic complexity of attention. VGGT-Long mitigates this by partitioning the input sequence into overlapping chunks and aligning adjacent results, but it still cannot exploit long-range contextual information and remains sensitive to local inconsistencies.
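The chunk-and-align strategy above can be sketched in a few lines. The function and its parameters (`chunk_size`, `overlap`) are illustrative assumptions, not the values used by VGGT-Long or Scal3R; the point is only that adjacent chunks share frames so their predictions can later be aligned in a common coordinate frame.

```python
def make_chunks(n_frames, chunk_size, overlap):
    """Partition frame indices [0, n_frames) into overlapping chunks.

    Adjacent chunks share `overlap` frames, giving the alignment step
    common observations to register neighboring reconstructions.
    """
    assert 0 <= overlap < chunk_size
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start < n_frames:
        end = min(start + chunk_size, n_frames)
        chunks.append(list(range(start, end)))
        if end == n_frames:
            break
        start += stride
    return chunks

# 10 frames, chunks of 4 with an overlap of 2:
chunks = make_chunks(n_frames=10, chunk_size=4, overlap=2)
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each chunk is small enough for full attention, but information still cannot flow beyond the shared frames, which is the limitation Scal3R targets.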
Inspired by Test-Time Training, Scal3R inserts Global Context Memory (GCM) modules after the global attention layers of VGGT. Each GCM is implemented with lightweight Adaptive Memory Units that are rapidly adapted through self-supervised updates, allowing the model to compress and retain long-range context while preserving VGGT's strong geometric reasoning capabilities.
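As a rough intuition for how a test-time-trainable memory can compress a long sequence (the actual GCM architecture is not specified here), the sketch below implements a classic fast-weight associative memory: a weight matrix updated at inference time by gradient steps on a self-supervised reconstruction loss. The class name, the delta-rule objective, and all hyperparameters are illustrative assumptions, not Scal3R's design.

```python
class FastWeightMemory:
    """Toy associative memory: W maps key vectors to value vectors.

    update() takes one gradient step on the self-supervised loss
    ||W k - v||^2, so repeated writes compress (key, value) pairs
    into W's fixed-size weights instead of storing them explicitly.
    """

    def __init__(self, dim, lr=0.5):
        self.dim = dim
        self.lr = lr
        self.W = [[0.0] * dim for _ in range(dim)]

    def read(self, key):
        # Matrix-vector product W @ key.
        return [sum(self.W[i][j] * key[j] for j in range(self.dim))
                for i in range(self.dim)]

    def update(self, key, value):
        # Gradient step: W += lr * (v - W k) k^T (delta rule).
        pred = self.read(key)
        err = [value[i] - pred[i] for i in range(self.dim)]
        for i in range(self.dim):
            for j in range(self.dim):
                self.W[i][j] += self.lr * err[i] * key[j]
        return err
```

The memory footprint stays constant no matter how many frames are written, which is what makes this kind of module attractive for arbitrarily long sequences.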
During training and inference, the input sequence is divided into overlapping chunks and distributed across GPUs for parallel processing. We further introduce Global Context Synchronization (GCS), which all-reduces the adaptive memory updates across chunks so each local chunk can benefit from sequence-wide observations. This improves local accuracy, strengthens cross-chunk consistency, and enables scalable kilometer-scale 3D reconstruction from RGB-only videos.
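The synchronization step can be sketched as follows, simulated in-process rather than with a real multi-GPU collective (in practice this would map onto something like `torch.distributed.all_reduce`). Averaging the per-chunk memory deltas is an assumed reduction for illustration; the function names are hypothetical.

```python
def all_reduce_mean(deltas):
    """Average per-worker update vectors, mimicking an all-reduce.

    deltas: one update vector per chunk/GPU. Returns the averaged
    update that every worker applies, so all replicas of the memory
    stay identical and reflect sequence-wide observations.
    """
    n = len(deltas)
    dim = len(deltas[0])
    return [sum(d[i] for d in deltas) / n for i in range(dim)]

def synchronized_step(memories, deltas):
    """Apply the shared averaged delta to every chunk's memory copy."""
    avg = all_reduce_mean(deltas)
    return [[m + a for m, a in zip(mem, avg)] for mem in memories]

# Two chunks each propose a local memory update; after the reduce,
# both copies hold the same state and encode both observations.
mems = synchronized_step([[0.0, 0.0], [0.0, 0.0]],
                         [[2.0, 0.0], [0.0, 4.0]])
# -> [[1.0, 2.0], [1.0, 2.0]]
```

Because every chunk applies the same reduced update, a landmark seen only in one chunk still influences the memory state used by all the others.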
Scal3R maintains globally consistent trajectories even over multi-kilometer drives, while baselines accumulate severe drift.
Drag the slider to compare 3D reconstruction quality between Scal3R and baselines.
Our Global Context Synchronization mechanism shares information across all chunks, eliminating boundary artifacts and ensuring global consistency.
Scal3R scales linearly with sequence length while maintaining stable accuracy.
This work was partially supported by National Key R&D Program of China (No. 2024YFB2809105), NSFC (No. U24B20154), and Information Technology Center and State Key Lab of CAD&CG, Zhejiang University. We thank Tianyuan Zhang for helpful discussions on LaCT and Dongli Tan for valuable discussions. We also thank Haotong Lin for providing the captured video demo of Zijingang Campus, Zhejiang University, and the VGGT-Long authors for providing the cyberpunk recording demo video.
@misc{xie2026scal3rscalabletesttimetraining,
title={Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction},
author={Tao Xie and Peishan Yang and Yudong Jin and Yingfeng Cai and Wei Yin and Weiqiang Ren and Qian Zhang and Wei Hua and Sida Peng and Xiaoyang Guo and Xiaowei Zhou},
year={2026},
eprint={2604.08542},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.08542},
}