Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning over language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation.
Given a monocular RGB-D observation and a language instruction, an image-editing model generates the target post-manipulation state, depicting where and how objects should be rearranged.
Both the current and edited images are lifted into pixel-aligned 3D point clouds using monocular depth estimation, with mask-based cropping to preserve spatial detail.
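The lifting step is standard pinhole back-projection of a depth map into a pixel-aligned point cloud. The sketch below illustrates this with an assumed interface (function name, argument layout, and the optional mask argument are illustrative, not LAMP's actual API):

```python
import numpy as np

def depth_to_pointcloud(depth, K, mask=None):
    """Back-project a depth map into a pixel-aligned 3D point cloud.

    depth : (H, W) depth in meters (sensor depth or a monocular estimate).
    K     : (3, 3) camera intrinsics.
    mask  : optional (H, W) boolean object mask for cropping.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]                 # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]                 # Y = (v - cy) * Z / fy
    pts = np.stack([x, y, z], axis=-1)              # (H, W, 3), pixel-aligned
    if mask is not None:
        pts = pts[mask]                             # crop to the object region
    return pts.reshape(-1, 3)
```

Because the cloud stays pixel-aligned, a 2D object mask from the current or edited image selects exactly the 3D points belonging to that object.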
A robust registration pipeline with point cloud filtering, semantic matching via DINOv3 features, and scale alignment extracts precise 6-DoF inter-object transformations.
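Once semantic matching has produced point correspondences between the current and edited clouds, the rigid transformation with scale can be recovered in closed form. A minimal sketch using the Umeyama/Kabsch SVD solution (this stands in for the full pipeline, which additionally performs filtering and robust matching):

```python
import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Estimate a similarity transform (s, R, t) with dst ≈ s * R @ src + t
    from matched 3D correspondences (Umeyama, 1991).

    src, dst : (N, 3) matched points, e.g. from feature correspondences.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d                 # centered point sets
    cov = xd.T @ xs / len(src)                      # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                                # avoid reflections
    R = U @ S @ Vt
    if with_scale:
        s = (D * S.diagonal()).sum() / xs.var(0).sum()
    else:
        s = 1.0
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The scale term compensates for the metric ambiguity of depth estimated from the edited image; with `with_scale=False` the same solver returns a pure 6-DoF rigid transform.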
The extracted transformation is converted into target poses with edit-informed grasp filtering and heuristic primitive-based motion planning for robust robotic execution.
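Since a grasp rigidly attached to the object moves with it, the target gripper pose is the composition of the extracted object transform with the grasp pose. The sketch below shows this composition plus one plausible edit-informed filter, keeping grasps whose approach axis stays near-vertical after the transform; the axis convention, tilt threshold, and function names are assumptions for illustration, not LAMP's actual criteria:

```python
import numpy as np

def target_gripper_pose(T_obj, T_grasp):
    """Object moves by T_obj, so a grasp attached to it moves to
    T_obj @ T_grasp (all 4x4 homogeneous matrices)."""
    return T_obj @ T_grasp

def filter_grasps(T_obj, grasps, max_tilt_deg=45.0):
    """Keep grasps whose approach axis (gripper z, column 2 of the
    rotation) still points downward within max_tilt_deg after applying
    the extracted transform; return their target poses."""
    kept = []
    for T_g in grasps:
        T_target = T_obj @ T_g
        approach = T_target[:3, 2]                  # approach axis in world frame
        # tilt from straight-down (0, 0, -1), world z pointing up
        tilt = np.degrees(np.arccos(np.clip(-approach[2], -1.0, 1.0)))
        if tilt <= max_tilt_deg:
            kept.append(T_target)
    return kept
```

Filtering in the target frame rather than the current frame is the key point: a grasp that is reachable now may become infeasible after the edit-specified motion, and vice versa.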
Given an RGB-D observation and a language instruction, LAMP uses an image-editing model to generate the target state, then lifts 2D spatial cues into 3D inter-object transformations through cross-state point cloud registration for execution.
Figure 1: Overview of LAMP. The image-editing model generates an edited state from the current observation and language instruction. Cross-state point cloud registration extracts the inter-object 3D transformation, which is converted into target poses for robotic execution.
We evaluate LAMP on 13 real-world manipulation tasks spanning insertion, covering, assembly, stacking, articulated manipulation, and more. LAMP significantly outperforms prior methods across all task categories.
LAMP recovers precise 6-DoF inter-object transformations for high-precision tasks — insertion, covering, stacking, and assembly — where millimeter-level accuracy is essential.
Different language instructions yield different manipulation behaviors on the same scene — the edited image prior faithfully reflects the semantic intent, guiding distinct 3D transformations.
For articulated objects like drawers and toasters, LAMP treats the static housing as the passive object and extracts the transformation of the movable part from edited images.
LAMP chains multiple sub-tasks sequentially — each step conditions on the outcome of the previous one, demonstrating coherent multi-step planning and execution.