Archon

A Unified Multimodal Model for Holistic Digital Human Generation

¹State Key Lab of CAD&CG, Zhejiang University  ²Google  ³Google DeepMind
*Equal contribution. Corresponding author.
CVPR 2026

TL;DR: Archon is a unified multimodal model that reasons over description, script, speech, animation, semantic video, image, and video, enabling any-to-any generation and editing.

Framework

Archon unified multimodal framework

Archon unifies description, script, speech, animation, semantic video, image, and RGB video in one autoregressive multimodal model. Modality-specific tokenizers map each signal into a shared discrete space, while a semantic-driven video decoder converts compact semantic representations into high-quality digital human videos for generation and editing.
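As a rough illustration of what a shared discrete space can look like, the sketch below gives each modality its own contiguous ID range inside one vocabulary. The codebook sizes and the helper names (`build_offsets`, `to_shared`, `from_shared`) are hypothetical; the page does not state how Archon lays out its vocabulary.

```python
# Hypothetical codebook sizes, chosen only for illustration.
CODEBOOK_SIZES = {
    "text": 32000,
    "speech": 1024,
    "animation": 512,
    "semantic_video": 2048,
    "image": 8192,
}


def build_offsets(sizes):
    """Assign each modality a contiguous ID range in one vocabulary."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start


OFFSETS, VOCAB_SIZE = build_offsets(CODEBOOK_SIZES)


def to_shared(modality, local_ids):
    """Map tokenizer-local IDs to IDs in the shared vocabulary."""
    size = CODEBOOK_SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"{i} is outside the {modality} codebook")
    return [OFFSETS[modality] + i for i in local_ids]


def from_shared(token_id):
    """Recover (modality, local ID) from a shared-vocabulary ID."""
    for name, start in OFFSETS.items():
        if start <= token_id < start + CODEBOOK_SIZES[name]:
            return name, token_id - start
    raise ValueError(f"{token_id} is outside the shared vocabulary")
```

With this layout a single embedding table and output head can cover every modality, and the decoder side can tell from an ID alone which modality-specific detokenizer to call.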



What do we explore?

Archon explores how a single multimodal model can reason across multiple heterogeneous modalities, including text (description, script), audio (speech), animation (identity, expression, pose), semantic video, image, and video.

01

Any-to-Any Multimodal Modeling

How to fuse multiple modalities into one unified model while enabling flexible any-to-any generation?

  • A shared discrete token space from modality-specific tokenizers
  • Unified multimodal autoregressive model
  • Train on 72 multimodal tasks
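One way to picture any-to-any training is that every task is the same next-token problem with a different split between conditioning and target modalities. The sketch below is an assumption about the sequence layout, not the authors' format: the boundary tags, the `build_sequence` helper, and the loss mask convention are all hypothetical.

```python
def build_sequence(sample, inputs, outputs):
    """Flatten one sample into a token sequence plus a loss mask.

    sample:  dict mapping modality name -> list of token IDs
    inputs:  modalities used as conditioning (no loss)
    outputs: modalities the model must generate (loss applied)
    """
    seq, loss_mask = [], []
    for names, supervised in ((inputs, 0), (outputs, 1)):
        for name in names:
            # Hypothetical boundary tags mark where each modality starts and ends.
            span = [f"<{name}>", *sample[name], f"</{name}>"]
            seq.extend(span)
            loss_mask.extend([supervised] * len(span))
    return seq, loss_mask


# The same sample can serve different tasks by changing the split.
sample = {"speech": [7, 8], "animation": [3]}
speech_to_anim = build_sequence(sample, ["speech"], ["animation"])
anim_to_speech = build_sequence(sample, ["animation"], ["speech"])
```

Under this view, adding a task means choosing a new input/output split rather than adding a new model head.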

02

Video Parameterization

How to efficiently tokenize continuous video signals without producing so many tokens that the sequence exceeds the model's context limit?

  • Memory-efficient video discretization
  • Semantic-driven video diffusion decoder

03

Reliable Cross-Modal Generation

How to reduce uncertainty and improve quality in complex cross-modal generations (e.g., speech-to-video)?

  • Thinking in modalities for reliable chains of modalities

Video

Any-to-Any Generation

Description → Script + Speech + Animation + Segmentation + Video

We demonstrate the generation of script, speech, animation, segmentation, and video driven solely by descriptions. The input description is displayed above each example. The generated script is shown below each video. The video composite presents the generated animation, segmentation, and final video arranged from left to right.




Description + Script → Speech + Animation + Segmentation + Video

We showcase results where descriptions and scripts are employed to generate speech, animation, segmentation, and video. The corresponding input description and script are provided above each example. The video composite displays the generated animation, segmentation, and final video arranged from left to right.




Speech → Description + Script + Animation + Segmentation + Video

We demonstrate results where speech is used to generate description, script, animation, segmentation, and video. The inferred description and script are displayed below each example. The video composite visualizes the generated animation, segmentation, and final video arranged from left to right.




Animation → Segmentation + Video

We present results where animation serves as the condition to generate segmentation and video. The composite video displays the input animation, generated segmentation, and final video arranged from left to right.




Segmentation → Video

We present results where segmentation is used to generate video. Each demo displays the input segmentation and the synthesized video side by side (left to right).




Video (silent) → Description + Speech + Animation + Segmentation

We showcase results for video understanding, video dubbing, animation tracking, and video segmentation. From an input video, we parse the corresponding description, speech, and animation, while obtaining segmentation via an off-the-shelf model. The inferred description is displayed below each example. The visual composite presents the input video, extracted animation, and video segmentation arranged from left to right.




Any Modality Editing

Script Editing

We showcase script editing. We modify the script of the original video (left) to generate an edited video (right) that articulates the new script while faithfully preserving the original appearance and voice.




Description Editing

We present results for video editing via description. We modify the description of the original video to generate an edited video with a new appearance. When identity-defining attributes are altered (e.g., gender swap), we simultaneously adapt the voice to match the new identity (see second row). Notably, all unedited attributes and the original script are strictly preserved.




Animation Editing (Face Reenactment)

We present results for animation editing (face reenactment). We employ a reference video (left) to drive the motion of the original video. The resulting edited video (right) adopts the reference animation while retaining the original subject's appearance.




Comparisons

We compare speech-driven video generation against state-of-the-art methods. From left to right, the videos display: Ground Truth, AniPortrait, EchoMimic, Hallo3, and Ours.




BibTeX

@inproceedings{bao2026archon,
  title={Archon: A Unified Multimodal Model for Holistic Digital Human Generation},
  author={Bao, Chong and Liu, Shichen and Yu, Lijun and Futschik, David and Moschoglou, Stylianos and Srivastava, Shefali and Bai, Ziqian and Tan, Feitong and Zhang, Guofeng and Cui, Zhaopeng and Fanello, Sean and Zhang, Yinda},
  booktitle={The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
  year={2026}
}