Lift ImAge-Editing as General 3D Priors for Open-world ManiPulation

Anonymous Authors

Abstract

Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action (VLA) models, often struggle with novel tasks and unseen environments. A promising alternative is to learn generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning over language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation.

Key Insight: Image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. We propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations, enabling robust generalization across diverse manipulation tasks from monocular RGB-D observations and promptable instructions.

How It Works

1. Image Editing

Given a monocular RGB-D observation and a language instruction, an image-editing model generates the target post-manipulation state, depicting where and how objects should be rearranged.

2. Depth Lifting

Both the current and edited images are lifted into pixel-aligned 3D point clouds using monocular depth estimation, with mask-based cropping to preserve spatial detail.
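The lifting step described above is standard pinhole back-projection. The sketch below is illustrative, not the paper's actual API; the function name, intrinsics layout, and mask convention are assumptions.

```python
import numpy as np

def lift_depth_to_points(depth, fx, fy, cx, cy, mask=None):
    """Back-project a depth map into a pixel-aligned 3D point cloud.

    depth: (H, W) array of metric depths.
    fx, fy, cx, cy: pinhole camera intrinsics.
    mask: optional (H, W) boolean array cropping to the object of interest.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)  # (H, W, 3), pixel-aligned
    if mask is not None:
        points = points[mask]              # (N, 3) cropped cloud
    return points
```

Running the same projection on both the current and the edited image yields two clouds whose pixels correspond, which is what the registration step consumes.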

3. Cross-State Registration

A robust registration pipeline with point cloud filtering, semantic matching via DINOv3 features, and scale alignment extracts precise 6-DoF inter-object transformations.
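Once semantic matching has produced point correspondences between the current and edited clouds, the transformation with scale can be estimated in closed form. A minimal sketch using Umeyama's method follows; the function name and interface are illustrative assumptions, and the paper's full pipeline additionally includes point cloud filtering and DINOv3-based matching not shown here.

```python
import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Estimate a similarity transform (s, R, t) mapping src -> dst
    from matched 3D correspondences (Umeyama's method).

    src, dst: (N, 3) arrays of corresponding points, e.g. matched
    between the current and edited point clouds.
    Returns scale s, rotation R (3x3), translation t (3,).
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)       # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                       # guard against reflections
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src if with_scale else 1.0
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The recovered scale absorbs the metric ambiguity of monocular depth between the two states, while (R, t) gives the 6-DoF inter-object transformation.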

4. Goal-Conditioned Execution

The extracted transformation is converted into target poses with edit-informed grasp filtering and heuristic primitive-based motion planning for robust robotic execution.
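Conceptually, the execution step moves candidate grasp poses through the extracted transform and discards those that become infeasible in the goal state. The toy sketch below assumes 4x4 homogeneous poses and uses a simple tilt threshold as a stand-in for the paper's edit-informed grasp filtering; all names and the filtering rule are illustrative assumptions.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack rotation R (3x3) and translation t (3,) into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform_grasps(T_rel, grasps):
    """Map candidate grasp poses (list of 4x4) through the extracted
    inter-object transform to obtain target poses in the edited state."""
    return [T_rel @ g for g in grasps]

def filter_by_approach(grasps, max_tilt_deg=45.0):
    """Toy stand-in for edit-informed filtering: keep grasps whose
    approach axis (local +z) stays within max_tilt_deg of world-down."""
    down = np.array([0.0, 0.0, -1.0])
    kept = []
    for g in grasps:
        approach = g[:3, :3] @ np.array([0.0, 0.0, 1.0])
        if float(approach @ down) >= np.cos(np.deg2rad(max_tilt_deg)):
            kept.append(g)
    return kept
```

The surviving poses would then be handed to the heuristic primitive-based motion planner for execution.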

Method

Given an RGB-D observation and a language instruction, LAMP uses image-editing to generate the target state, then lifts 2D spatial cues into 3D inter-object transformations through cross-state point cloud registration for execution.

LAMP Method Pipeline

Figure 1: Overview of LAMP. The image-editing model generates an edited state from the current observation and language instruction. Cross-state point cloud registration extracts the inter-object 3D transformation, which is converted into target poses for robotic execution.

Experimental Results

We evaluate LAMP on 13 real-world manipulation tasks spanning insertion, covering, assembly, stacking, articulated manipulation, and more. LAMP significantly outperforms prior methods across all task categories.

🔧 Fine-Grained Assembly & Insertion

LAMP recovers precise 6-DoF inter-object transformations for high-precision tasks — insertion, covering, stacking, and assembly — where millimeter-level accuracy is essential.

Coin Insertion
Toast Insertion
Pencil Insertion
Block Assembly
Ring Stacking
Lid Covering
Pen-cap Covering
Teapot Covering
💬 Language-Guided Promptable Manipulation

Different language instructions yield different manipulation behaviors on the same scene — the edited image prior faithfully reflects the semantic intent, guiding distinct 3D transformations.

Instruction A: "Place the pear upright onto the plate" — the pear is oriented vertically on its base.
Instruction B: "Place the pear lying on its side on the plate" — the pear is oriented horizontally on the surface.
1. Move the white bowl onto the blue bowl
2. Move the green bowl onto the white bowl
3. Move the pink bowl onto the green bowl
⚙️ Articulated Object Manipulation

For articulated objects like drawers and toasters, LAMP treats the static housing as the passive object and extracts the transformation of the movable part from edited images.

"Slide down the toaster lever"
"Pour the tea"
🔗 Long-Horizon Multi-Step Tasks

LAMP chains multiple sub-tasks sequentially — each step conditions on the outcome of the previous one, demonstrating coherent multi-step planning and execution.

1. Open the drawer
2. Pick up the duck
3. Place the duck inside the drawer
4. Close the drawer

1. Grasp the first egg
2. Place it into the designated carton slot
3. Repeat for all remaining eggs in sequence

1. Pick up the brush
2. Insert the brush into the sauce bottle
3. Apply the sauce onto the toast
4. Place the brush into the bowl

BibTeX

@inproceedings{lamp,
  title={LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation},
  author={Anonymous},
  booktitle={},
  year={}
}