Framework

Archon unifies description, script, speech, animation, semantic video, image, and RGB video in one autoregressive multimodal model. Modality-specific tokenizers map each signal into a shared discrete space, while a semantic-driven video decoder converts compact semantic representations into high-quality digital human videos for generation and editing.

What do we explore?

Archon explores how a single multimodal model can reason across multiple heterogeneous modalities, including text (description, script), audio (speech), animation (identity, expression, pose), semantic video, image, and video.

01

Any-to-Any Multimodal Modeling

How to fuse multiple modalities into one unified model while enabling flexible any-to-any generation?

A shared discrete token space from modality-specific tokenizers
Unified multimodal autoregressive model
Train on 72 multimodal tasks
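
One common way to build a shared discrete token space is to give each modality-specific codebook its own contiguous range of ids in a single vocabulary. The sketch below illustrates that idea only; the codebook sizes, modality names, and helper functions are hypothetical and are not taken from Archon.

```python
from typing import Dict, Tuple

# Hypothetical codebook sizes; the actual tokenizers and sizes are not stated here.
CODEBOOK_SIZES: Dict[str, int] = {
    "text": 32000,
    "speech": 1024,
    "animation": 512,
    "semantic_video": 2048,
    "image": 8192,
}


def build_offsets(sizes: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Assign each modality a contiguous id range; return the offsets and total size."""
    offsets: Dict[str, int] = {}
    start = 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start


OFFSETS, VOCAB_SIZE = build_offsets(CODEBOOK_SIZES)


def to_shared(modality: str, local_id: int) -> int:
    """Map a modality-local code index to its id in the shared vocabulary."""
    if not 0 <= local_id < CODEBOOK_SIZES[modality]:
        raise ValueError(f"{local_id} is outside the {modality} codebook")
    return OFFSETS[modality] + local_id


def from_shared(shared_id: int) -> Tuple[str, int]:
    """Recover (modality, local index) from a shared-vocabulary id."""
    for name, start in OFFSETS.items():
        if start <= shared_id < start + CODEBOOK_SIZES[name]:
            return name, shared_id - start
    raise ValueError(f"{shared_id} is outside the shared vocabulary")
```

With disjoint ranges like these, a single autoregressive model can emit any modality, and the id alone identifies which decoder should consume each token.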

02

Video Parameterization

How to efficiently tokenize continuous video signals without producing more tokens than the context limit allows?

Memory-efficient video discretization
Semantic-driven video diffusion decoder
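
To see why video token counts become a problem, the back-of-the-envelope sketch below compares a dense patch-grid tokenization with a compact per-frame representation. Every number in it (clip length, frame rate, resolution, patch size, tokens per frame, context limit) is a hypothetical value chosen for illustration and does not describe Archon's actual configuration.

```python
def patch_grid_tokens(seconds: int, fps: int, height: int, width: int, patch: int) -> int:
    """Token count when every frame is split into a dense grid of patches."""
    frames = seconds * fps
    return frames * (height // patch) * (width // patch)


def compact_tokens(seconds: int, fps: int, tokens_per_frame: int) -> int:
    """Token count when each frame is summarized by a small fixed number of tokens."""
    return seconds * fps * tokens_per_frame


CONTEXT_LIMIT = 32768  # hypothetical context window

# Hypothetical 5-second clip at 25 fps and 512x512 resolution with 16x16 patches.
dense = patch_grid_tokens(seconds=5, fps=25, height=512, width=512, patch=16)
# Same clip with a hypothetical budget of 32 tokens per frame.
compact = compact_tokens(seconds=5, fps=25, tokens_per_frame=32)

print(dense, dense > CONTEXT_LIMIT)      # 128000 True
print(compact, compact > CONTEXT_LIMIT)  # 4000 False
```

Under these assumed numbers, the dense grid overflows the context window several times over, while the compact representation leaves room for the other modalities in the same sequence.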

03

Reliable Cross-Modal Generation

How to reduce uncertainty and improve quality in complex cross-modal generations (e.g., speech-to-video)?

Thinking in modalities for reliable chains of modalities
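
A chain of modalities can be pictured as generating the requested outputs in a fixed order, so that later modalities are conditioned on earlier ones instead of being produced directly from the input. The sketch below is an assumption-laden illustration: the ordering is read off the demo headings on this page, and `plan_chain` is a hypothetical helper, not part of Archon.

```python
from typing import List, Set

# Assumed generation order, taken from how the demo headings list their outputs.
CHAIN_ORDER: List[str] = [
    "description", "script", "speech", "animation", "segmentation", "video",
]


def plan_chain(given: Set[str], targets: Set[str]) -> List[str]:
    """Return the target modalities in generation order, skipping those already given."""
    unknown = (given | targets) - set(CHAIN_ORDER)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    return [m for m in CHAIN_ORDER if m in targets and m not in given]


speech_chain = plan_chain(
    given={"speech"},
    targets={"video", "segmentation", "animation", "script", "description"},
)
print(speech_chain)
# ['description', 'script', 'animation', 'segmentation', 'video']
```

In this picture, a speech-to-video request first produces a description and script, then animation and segmentation, and only then the final video.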

Video

Any-to-Any Generation

Description → Script + Speech + Animation + Segmentation + Video

We demonstrate the generation of script, speech, animation, segmentation, and video driven solely by descriptions. The input description is displayed above each example, and the generated script is shown below each video. Each video visualization shows the generated animation, segmentation, and final video arranged from left to right.

Description (input):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["brown"], "hair_style": ["short", "wavy"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "dress_shirt", "lower_body": "trousers", "footwear": "none", "accessories": ["tie", "none"], "dominant_colors": ["white", "red", "blue"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_and_gesturing_with_hands"], "emotion": ["serious_and_engaged"], "energy_level": ["consistent"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "brow_furrowing"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively", "nods_rhythmically"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "orange_and_white_wall_with_blurred_plant_and_metal_fencing", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): Driving cross country is something everyone should do.

Description (input):

{"appearance": {"gender": "female", "age_group": "adult", "ethnicity": ["hispanic", "middle_eastern"], "body_build": ["average"], "hair_color": ["black"], "hair_style": ["curly", "long"], "facial_features": ["makeup", "none_discernible"], "clothing": {"upper_body": "long-sleeved black shirt", "lower_body": "none", "footwear": "none", "accessories": ["earrings", "none"], "dominant_colors": ["black"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "earrings", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "smile", "overall_impression": ["speaking to the camera", "hand gestures"], "emotion": ["confident", "engaging"], "energy_level": ["medium"], "mouth_action": ["wide_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "raising_outer_eyebrows"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively", "nods_rhythmically"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "green wall with a white mantle, two framed pictures on the wall", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): Technically, you could do it that way, but it is not recommended.

Description (input):

{"appearance": {"gender": "male", "age_group": "senior", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["gray", "white"], "hair_style": ["short", "straight"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "suit_jacket_and_dress_shirt", "lower_body": "trousers", "footwear": "none", "accessories": ["tie", "none"], "dominant_colors": ["blue", "white", "gray"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_directly_to_to_camera", "slight_head_movements", "occasional_smiles"], "emotion": ["serious", "calm", "confident"], "energy_level": ["consistently_low-key"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["brow_lowering", "raising_inner_eyebrows"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively", "amplitude_and_tempo_of_movements", "directional_changes_pitch/yaw/roll", "transitions_between_stillness_and_motion"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead", "fixed_forward_focus"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["microphone", "none"]}, "environment": {"lighting_conditions": "artificial_light", "background_description": "solid_black_background", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): Growing your social media following fast is easier with this strategy.

Description (input):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["bald"], "hair_style": ["bald"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "dress_shirt", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["white", "purple", "gold"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_directly_to_camera"], "emotion": ["calm", "neutral"], "energy_level": ["low-key"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["none"], "blink_frequency": "medium", "head_action": ["mostly_centered"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "white_brick_wall_and_fire_decoration", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): On this level without taking any damage is the ultimate gamer flex.

Description (input):

{"appearance": {"gender": "female", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["brown"], "hair_style": ["long", "wavy"], "facial_features": ["glasses", "makeup", "none_discernible"], "clothing": {"upper_body": "blouse, jacket", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["black", "red", "pink"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking to the camera"], "emotion": ["serious and sincere"], "energy_level": ["consistent"], "mouth_action": ["narrow_opening", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "blurred background with cabinets, a lamp, and a picture frame", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "office"}}

Script (output): Googled questions about me are getting answered right now in this video.

Description (input):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["brown"], "hair_style": ["short", "spiky"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "clerical_collar", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["black", "white"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_to_camera"], "emotion": ["calm", "serious"], "energy_level": ["consistently_low-key"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "solid_orange_background", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): You just have to take a break and recharge your batteries.
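
The descriptions above are JSON objects with three top-level keys: `appearance`, `action`, and `environment`. Many list fields carry placeholder entries such as `"none"` next to real values. The sketch below shows one way to read such a description in Python; it uses an abbreviated copy of the first example, and the `PLACEHOLDERS` set and both helper functions are hypothetical conveniences, not part of any released tooling.

```python
import json
from typing import Any, Dict, List

# Abbreviated copy of the first description on this page; most fields are omitted.
RAW = """
{"appearance": {"gender": "male", "age_group": "adult",
                "hair_color": ["brown"],
                "clothing": {"upper_body": "dress_shirt",
                             "accessories": ["tie", "none"]}},
 "action": {"activity_type": "speaking",
            "emotion": ["serious_and_engaged"]},
 "environment": {"lighting_conditions": "bright", "context": "indoor"}}
"""

# Values that appear to act as "nothing to report" markers in the examples (assumption).
PLACEHOLDERS = {"none", "none_discernible", ""}


def clean_list(values: List[str]) -> List[str]:
    """Drop placeholder entries from a list-valued field."""
    return [v for v in values if v not in PLACEHOLDERS]


def summarize(description: Dict[str, Any]) -> Dict[str, Any]:
    """Pull a few fields out of the nested description."""
    appearance = description["appearance"]
    return {
        "who": f'{appearance["age_group"]} {appearance["gender"]}',
        "accessories": clean_list(appearance["clothing"]["accessories"]),
        "emotion": clean_list(description["action"]["emotion"]),
        "context": description["environment"]["context"],
    }


summary = summarize(json.loads(RAW))
print(summary)
# {'who': 'adult male', 'accessories': ['tie'], 'emotion': ['serious_and_engaged'], 'context': 'indoor'}
```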

Description + Script → Speech + Animation + Segmentation + Video

We showcase results where descriptions and scripts are employed to generate speech, animation, segmentation, and video. The corresponding input description and script are provided above each example. The video composite displays the generated animation, segmentation, and final video arranged from left to right.

Description (input):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["blonde"], "hair_style": ["short", "wavy"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "dress_shirt", "lower_body": "trousers", "footwear": "none", "accessories": ["none"], "dominant_colors": ["gray", "blue", "white"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_while_sitting"], "emotion": ["serious"], "energy_level": ["low-key"], "mouth_action": ["narrow_opening", "relaxed_lips"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "wooden_wall", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (input): And there are people who say there's no such thing about audience

Description (input):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["black"], "hair_style": ["short", "wavy"], "facial_features": ["beard", "mustache", "none_discernible"], "clothing": {"upper_body": "t-shirt", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["black"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "frown", "overall_impression": ["man_is_speaking_to_the_camera", "man_is_gesturing_with_his_right_hand"], "emotion": ["serious_and_concerned"], "energy_level": ["moderate"], "mouth_action": ["narrow_opening", "lip_protrusion_or_retraction", ""], "eyebrow_action": ["brow_furrowing", "brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "shelf_with_movies_and_a_beanbag_on_the_floor_behind_the_person_behind_the_person_behind_the_man_and_a_white_wall", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "home"}}

Script (input): The last that's left of them. It wasn't, though, when they charged in the Battle of Winter

Description (input):

{"appearance": {"gender": "female", "age_group": "young_adult", "ethnicity": ["caucasian"], "body_build": ["slim"], "hair_color": ["blonde"], "hair_style": ["long", "wavy"], "facial_features": ["makeup", "none_discernible"], "clothing": {"upper_body": " \u2047 maclouse", "lower_body": "none", "footwear": "none", "accessories": ["earrings", "none"], "dominant_colors": ["black"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "earrings", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "open mouth", "overall_impression": ["speaking to the camera", "gesturing with hands"], "emotion": ["calm and engaged"], "energy_level": ["medium"], "mouth_action": ["wide_opening", "lip_protrusion_or_retraction", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "raising_outer_eyebrows"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively", "nods_rhythmically"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "artificial_light", "background_description": "wooden planks", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (input): That if you are a new fan to Watchmen, that you can be a new

Description (input):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["gray"], "hair_style": ["short"], "facial_features": ["beard", "none_discernible"], "clothing": {"upper_body": "turtleneck sweater and jacket", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["black"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["looking down", "looking up", "looking to the side"], "emotion": ["subdued and thoughtful"], "energy_level": ["consistently low-key"], "mouth_action": ["relaxed_lips", "narrow_opening"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["downward", "straight_ahead", "lateral_glances"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "blurred background with a painting and kitchen appliances", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "home"}}

Script (input): The intimate story, the character story, what links these.

Description (input):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["black"], "hair_style": ["curly", "long", "other"], "facial_features": ["beard", "mustache", "none_discernible"], "clothing": {"upper_body": "jacket", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["black", "gray"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_to_camera"], "emotion": ["subdued_and_thoughtful"], "energy_level": ["consistently_low-key"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "dim", "background_description": "white_wall_and_window_with_blinds", "time_of_day": "evening", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (input): just as selective as you would be posting your own work.

Description (input):

{"appearance": {"gender": "female", "age_group": "young_adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["brown"], "hair_style": ["long", "wavy", "bangs"], "facial_features": ["makeup", "none_discernible"], "clothing": {"upper_body": "t-shirt", "lower_body": "none", "footwear": "none", "accessories": ["jewelry_e.g._earrings", "none"], "dominant_colors": ["white", "brown"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "smile", "overall_impression": ["smiling and engaging", "looking at the camera"], "emotion": ["happy", "friendly", "engaging"], "energy_level": ["medium"], "mouth_action": ["wide_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "raising_outer_eyebrows"], "blink_frequency": "medium", "head_action": ["mostly_centered", "nods_rhythmically"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "white wall with pink and blue hexagonal lights, a bookshelf with books and stuffed animals, a plant", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (input): I'm going to walk you through the entire process.

Speech → Description + Script + Animation + Segmentation + Video

We demonstrate results where speech is used to generate description, script, animation, segmentation, and video. The inferred description and script are displayed below each example. The video composite visualizes the generated animation, segmentation, and final video arranged from left to right.

Description (output):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["brown"], "hair_style": ["short", "wavy"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "dress_shirt", "lower_body": "trousers", "footwear": "none", "accessories": ["tie", "none"], "dominant_colors": ["white", "red", "blue"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_and_gesturing_with_hands"], "emotion": ["serious_and_engaged"], "energy_level": ["consistent"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "brow_furrowing"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively", "nods_rhythmically"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "orange_and_white_wall_with_blurred_plant_and_metal_fencing", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): The director of the film that I want to talk about tonight's split M Night Shyamalan saw that

Description (output):

{"appearance": {"gender": "female", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["black"], "hair_style": ["long", "straight"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "blouse", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["blue"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "smile", "overall_impression": ["speaking directly to the camera", "slight head movements", "facial expressions change with speech"], "emotion": ["friendly and engaging", "positive and enthusiastic"], "energy_level": ["medium"], "mouth_action": ["wide opening", "lip corners pulled up", "synchronization and precision with spoken words"], "eyebrow_action": ["raising inner eyebrows", "raising outer eyebrows"], "blink_frequency": "medium", "head_action": ["mostly centered", "slight nods", "tilts inquisitively"], "eye_state": ["fully wide alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "plain wall with a framed picture and a patterned curtain in the background", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): Okay, today's tip is for you if you want to know how to get more auditions from your submissions.

Description (output):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["gray"], "hair_style": ["short"], "facial_features": ["beard", "none_discernible"], "clothing": {"upper_body": "turtleneck sweater and jacket", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["black"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["looking down", "looking up", "looking to the side"], "emotion": ["subdued and thoughtful"], "energy_level": ["consistently low-key"], "mouth_action": ["relaxed_lips", "narrow_opening"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["downward", "straight_ahead", "lateral_glances"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "blurred background with a painting and kitchen appliances", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "home"}}

Script (output): the intimate story, the character story, what links these people.

Description (output):

{"appearance": {"gender": "female", "age_group": "senior", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["gray", "white"], "hair_style": ["short", "wavy"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "jacket", "lower_body": "none", "footwear": "none", "accessories": ["jewelry_e.g._necklace/earrings/bracelet/ring", "none"], "dominant_colors": ["brown", "beige"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "earrings", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking", "looking_directly_at_camera"], "emotion": ["serious", "thoughtful"], "energy_level": ["low-key"], "mouth_action": ["narrow_opening", "relaxed_lips"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "indoor_stadium_with_baseball_field_in_background", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): that in 1969 he told an academic

Description (output):

{"appearance": {"gender": "female", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["blonde"], "hair_style": ["short", "straight"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "tank_top", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["black"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking to the camera", "hand gestures"], "emotion": ["calm and conversational"], "energy_level": ["consistently low-key"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "raising_outer_eyebrows"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "artificial_light", "background_description": "multiple_screens_displaying_aurora_borealis_and_bookshelves_with_boxes_and_books", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): That the concept is the same, but the story

Description (output):

{"appearance": {"gender": "male", "age_group": "young_adult", "ethnicity": ["caucasian", "unknown"], "body_build": ["average"], "hair_color": ["brown"], "hair_style": ["short", "wavy"], "facial_features": ["beard", "veveveed_face"], "clothing": {"upper_body": "t-shirt", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["gray"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "frown", "overall_impression": ["looking_down", "speaking"], "emotion": ["subdued_and_thoughtful"], "energy_level": ["consistently_low-key"], "mouth_action": ["narrow_opening", "relaxed_lips"], "eyebrow_action": ["brow_furrowing"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["downward"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "dim", "background_description": "black_background", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Script (output): Blunt, I did not know what was going to work, then

Animation → Segmentation + Video

We present results where animation serves as the condition to generate segmentation and video. The composite video displays the input animation, generated segmentation, and final video arranged from left to right.

Segmentation → Video

We present results where segmentation is utilized to generate video. Each demo displays the input segmentation and the synthesized video side-by-side (left to right).

Video (silent) → Description + Speech + Animation + Segmentation

We showcase results for video understanding, video dubbing, animation tracking, and video segmentation. From an input video, we parse the corresponding description, speech, and animation, while obtaining segmentation via an off-the-shelf model. The inferred description is displayed below each example. The visual composite presents the input video, extracted animation, and video segmentation arranged from left to right.

Description (output):

{"appearance": {"gender": "female", "age_group": "senior", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["gray", "white"], "hair_style": ["short", "wavy"], "facial_features": ["glasses", "none_discernible"], "clothing": {"upper_body": "jacket", "lower_body": "none", "footwear": "none", "accessories": ["none"], "dominant_colors": ["red", "black"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_with_serious_tone"], "emotion": ["serious", "thoughtful"], "energy_level": ["consistent"], "mouth_action": ["narrow_opening", "lip_protrusion_or_retraction", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "blurred_background_with_furniture", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Description (output):

{"appearance": {"gender": "female", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["blonde"], "hair_style": ["wavy", "long"], "facial_features": ["makeup", "none_discernible"], "clothing": {"upper_body": "blouse", "lower_body": "none", "footwear": "none", "accessories": ["jewelry_e.g._necklace", "none"], "dominant_colors": ["pink", "silver"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_directly_to_the_camera", "slight_head_movements"], "emotion": ["serious", "concerned"], "energy_level": ["medium"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "brow_furrowing"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively", "amplitude_and_tempo_of_movements", "directional_changes_pitch/yaw/roll", "transitions_between_stillness_and_motion"], "eye_state": ["partially_narrowed_relaxed"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "blurred_background_with_a_door_and_a_chair", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "home"}}

Description (output):

{"appearance": {"gender": "female", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["brown", "gray"], "hair_style": ["short", "curly", "wavy"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "sweater", "lower_body": "none", "footwear": "none", "accessories": ["jewelry_e.g._necklace/earrings/bracelet/ring", "none"], "dominant_colors": ["red", "black"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "earrings", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "smile", "overall_impression": ["speaking directly to the camera", "slight head movements", "occasional blinking"], "emotion": ["calm and informative", "engaged and thoughtful"], "energy_level": ["consistently low-key"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "brow_furrowing"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively", "nods_rhythmically"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "plain gray background", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Description (output):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["brown"], "hair_style": ["short", "straight"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "suit jacket, dress shirt", "lower_body": "trousers", "footwear": "none", "accessories": ["tie", "none"], "dominant_colors": ["gray", "orange", "blue"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "smile", "overall_impression": ["speaking_directly_to_camera", "slight_head_movements"], "emotion": ["positive", "engaged"], "energy_level": ["medium"], "mouth_action": ["wide_opening", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "artificial_light", "background_description": "bookshelf with various items", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Description (output):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["gray"], "hair_style": ["short", "straight"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "suit_jacket", "lower_body": "trousers", "footwear": "none", "accessories": ["tie", "none"], "dominant_colors": ["brown", "white", "blue"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_directly_to_camera"], "emotion": ["calm", "professional"], "energy_level": ["medium"], "mouth_action": ["narrow_opening", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["raising_inner_eyebrows", "brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered", "nods_rhythmically"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "bright", "background_description": "office_setting_with_framed_documents_on_bookshelves", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "office"}}

Description (output):

{"appearance": {"gender": "male", "age_group": "adult", "ethnicity": ["caucasian"], "body_build": ["average"], "hair_color": ["brown"], "hair_style": ["short"], "facial_features": ["none_discernible"], "clothing": {"upper_body": "jacket", "lower_body": "unknown", "footwear": "none", "accessories": ["none"], "dominant_colors": ["brown", "gray", "blue"]}, "physical_attributes": {"visible_tattoos": "none", "visible_piercings": "none", "distinctive_marks": "none", "posture": "upright", "gait": "not_applicable", "physical_aids": ["none"]}}, "action": {"activity_type": "speaking", "expression": "closed mouth", "overall_impression": ["speaking_to_camera"], "emotion": ["neutral"], "energy_level": ["low-key"], "mouth_action": ["narrow_opening", "relaxed_lips", "synchronization_and_precision_with_spoken_words"], "eyebrow_action": ["brow_lowering"], "blink_frequency": "medium", "head_action": ["mostly_centered", "tilts_inquisitively"], "eye_state": ["fully_wide_alert"], "gaze_direction": ["straight_ahead"], "nonverbal_habits": ["none"], "interactions": ["none"], "props_used": ["none"]}, "environment": {"lighting_conditions": "artificial_light", "background_description": "blurred_blue_background", "time_of_day": "unknown", "weather_conditions": "indoor_not_applicable", "context": "indoor"}}

Any Modality Editing

Script Editing

We showcase script editing: replacing the script of the original video (left) yields an edited video (right) that speaks the new script while faithfully preserving the original appearance and voice.

Original Script: I do. I have a big family, and I have a lot of little nieces and nephews and cousins.

Edited Script: I can use my voice in different characters.

Original Script: is your algorithm's uncanny ability to specifically classify the human.

Edited Script: Just was at a standstill after a while. There wasn't really much evidence.

Original Script: Jump Street picks up with Schmidt and Janko going undercover to

Edited Script: you know it's kind of it's a new aesthetic but you know people who were fans

Original Script: studios but what I can tell you is this has

Edited Script: is that act of applying our consciousness to solving the next problem.

Original Script: Everything just was at a standstill after a while. There wasn't really much evidence.

Edited Script: Here's an easy one for both of you. Okay Google, Alexa

Original Script: Sandra Bullock is giving back in a big way with the help of her eight-year-old daughter,

Edited Script: Just a girl standing in front of a boy, just.

Editing using Description

We present results for video editing via description: modifying the description of the original video produces an edited video with a new appearance. When identity-defining attributes are altered (e.g., a gender swap), the voice is simultaneously adapted to match the new identity (see second row). All unedited attributes and the original script are strictly preserved.

Left - Original Description: "appearance.age_group": adult
Right - Edited Description: "appearance.age_group": young_adult

Left - Original Description: "appearance.clothing.dominant_colors": ["blue"]
Right - Edited Description: "appearance.clothing.dominant_colors": ["white"]

Left - Original Description: "appearance.age_group": young_adult, "appearance.clothing.accessories": ["earrings"], "appearance.clothing.dominant_colors": ["black", "red"], "appearance.clothing.upper_body": lace print blouse
Right - Edited Description: "appearance.age_group": adult, "appearance.clothing.accessories": ["none"], "appearance.clothing.dominant_colors": ["purple"], "appearance.clothing.upper_body": dress_shirt

Original Description: "appearance.gender": male

Edited Description: "appearance.gender": female

Original Description: "appearance.gender": female

Edited Description: "appearance.gender": male

Original Description: "appearance.gender": female

Edited Description: "appearance.gender": male
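
The edits above are written as dotted paths into the nested description (for example, "appearance.clothing.dominant_colors"). As a minimal sketch of that bookkeeping only, and not of how Archon itself consumes descriptions, the snippet below applies one such edit to a trimmed-down description and lists which paths changed. The helper names `set_by_path` and `changed_paths` and the reduced `original` dictionary are hypothetical, introduced here purely for illustration.

```python
import copy


def set_by_path(description, dotted_path, value):
    """Return a copy of `description` with the field at `dotted_path` replaced.

    Hypothetical helper: the dotted notation mirrors the edit labels above.
    """
    edited = copy.deepcopy(description)  # leave the original description untouched
    node = edited
    *parents, leaf = dotted_path.split(".")
    for key in parents:
        node = node[key]  # raises KeyError if an intermediate key is missing
    if leaf not in node:
        raise KeyError(dotted_path)  # only existing attributes may be edited
    node[leaf] = value
    return edited


def changed_paths(original, edited, prefix=""):
    """List the dotted paths whose values differ between two descriptions."""
    diffs = []
    for key in sorted(set(original) | set(edited)):
        path = f"{prefix}.{key}" if prefix else key
        a, b = original.get(key), edited.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            diffs.extend(changed_paths(a, b, path))  # recurse into nested groups
        elif a != b:
            diffs.append(path)
    return diffs


# A reduced description with the same nesting as the examples on this page.
original = {
    "appearance": {
        "gender": "female",
        "age_group": "adult",
        "clothing": {"upper_body": "blouse", "dominant_colors": ["blue"]},
    },
    "environment": {"context": "indoor"},
}

edited = set_by_path(original, "appearance.clothing.dominant_colors", ["white"])
print(changed_paths(original, edited))  # ['appearance.clothing.dominant_colors']
```

Copying before editing keeps the original description available for comparison, which makes it easy to confirm that only the intended attributes differ between the two descriptions.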

Animation Editing (Face Reenactment)

We present results for animation editing (face reenactment). We employ a reference video (left) to drive the motion of the original video. The resulting edited video (right) adopts the reference animation while retaining the original subject's appearance.

Left: Reference Video, Right: Edited Video

Comparisons

We present comparisons of speech-driven video generation against state-of-the-art methods. From left to right, the videos display: Ground Truth, AniPortrait, EchoMimic, Hallo3, and Ours.

Methods: Ground Truth, AniPortrait, EchoMimic, Hallo3, Ours

BibTeX

@inproceedings{bao2026archon,
  title={Archon: A Unified Multimodal Model for Holistic Digital Human Generation},
  author={Bao, Chong and Liu, Shichen and Yu, Lijun and Futschik, David and Moschoglou, Stylianos and Srivastava, Shefali and Bai, Ziqian and Tan, Feitong and Zhang, Guofeng and Cui, Zhaopeng and Fanello, Sean and Zhang, Yinda},
  booktitle={The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
  year={2026}
}