A Retrieval-Augmented Framework Enabling Spatial-Awareness in VLM for Object-Centric Robot Manipulation

Researchers

Prof. Qi Dou
Mr. Kai Chen

Introduction

Connecting the semantic reasoning of Vision-Language Models (VLMs) to the precise geometric demands of robotic manipulation remains a fundamental challenge. While VLMs can interpret high-level commands, they lack the intrinsic spatial intelligence required for tasks demanding precise object placement, orientation, and physical reasoning. Here, we introduce Retrieval-Augmented Manipulation (RAM), an object-centric framework that endows general-purpose vision foundation models with the spatial reasoning necessary for robust manipulation. RAM bridges the semantic-to-geometric gap by grounding abstract concepts in an explicit, object-centric 3D representation. This grounded information is then provided as augmented context to the VLM, enabling it to decompose complex instructions into a sequence of spatially precise and physically plausible sub-goals. We demonstrate that RAM, in a zero-shot setting on a real-world robot, can execute these sub-goals to fulfil complex spatial language instructions, complete spatially aware manipulation under the guidance of a single 2D image, and adaptively replan tasks by reasoning about physical constraints such as object size and collisions. Quantitative evaluations on the CO3D datasets also validate that RAM's core vision module generalizes to novel object categories and is robust to significant variations in shape and to occlusions. By providing a structured bridge between semantic intent and geometric execution, RAM represents a critical step toward developing more physically intelligent and general-purpose robotic systems.

The Main Impact

1

We propose RAM, a Retrieval-Augmented Manipulation framework that augments VLM-based robot planning with an explicit external knowledge base of category-level, object-centric spatial priors, instead of trying to implicitly bake all spatial knowledge into the VLM parameters.

Overview of the Retrieval-Augmented Manipulation (RAM) framework

Illustration and results of RAM in image-guided spatially-aware manipulation tasks

2

We build an extensible object-centric engine containing per-category canonical templates annotated with geometry- and manipulation-relevant priors, and transfer these priors to the specific object instance in the scene using a generalizable vision foundation model.

3

We show that RAM improves spatially-aware manipulation by enabling the VLM to generate spatially precise, physically plausible substeps with explicit spatial constraints that can be optimized into executable trajectories. We validate this through experiments on 14 spatially-aware manipulation scenarios spanning instruction following, image-guided manipulation, and complex spatial reasoning.
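
The retrieve-then-transfer idea behind the first two contributions can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the RAM implementation: the names `CanonicalTemplate`, `Pose`, `KNOWLEDGE_BASE`, `retrieve`, and `transfer_priors` are hypothetical, the priors are assumed to be named 3D points in a canonical object frame, and the transfer is assumed to be a similarity transform (rotation, scale, translation) given by an estimated object pose. In RAM the transfer is performed by a vision foundation model, whose details are not described here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class CanonicalTemplate:
    """Hypothetical per-category template with annotated priors."""
    category: str
    priors: Dict[str, Vec3]  # named points in the canonical object frame (assumed)


@dataclass(frozen=True)
class Pose:
    """Assumed instance pose: canonical frame -> scene frame."""
    rotation: Mat3
    translation: Vec3
    scale: float = 1.0


# Toy knowledge base; the entries and coordinates are made up for illustration.
KNOWLEDGE_BASE: Dict[str, CanonicalTemplate] = {
    "mug": CanonicalTemplate(
        category="mug",
        priors={
            "handle_grasp": (0.06, 0.0, 0.05),
            "opening_center": (0.0, 0.0, 0.10),
        },
    ),
}


def retrieve(category: str) -> CanonicalTemplate:
    """Look up the canonical template for an object category."""
    return KNOWLEDGE_BASE[category]


def transfer_priors(template: CanonicalTemplate, pose: Pose) -> Dict[str, Vec3]:
    """Map each canonical prior point into the scene: p' = scale * R @ p + t."""
    scene_priors: Dict[str, Vec3] = {}
    for name, point in template.priors.items():
        rotated = [sum(r * x for r, x in zip(row, point)) for row in pose.rotation]
        scene_priors[name] = tuple(
            pose.scale * c + t for c, t in zip(rotated, pose.translation)
        )
    return scene_priors


# Example: a mug rotated 90 degrees about the z-axis and placed at (0.5, 0.2, 0.0).
mug_pose = Pose(
    rotation=((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    translation=(0.5, 0.2, 0.0),
)
scene_priors = transfer_priors(retrieve("mug"), mug_pose)
```

The resulting `scene_priors` dictionary is the kind of grounded, object-centric context that could be serialized into the VLM prompt so that generated substeps refer to concrete 3D locations rather than abstract part names.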
 

Illustration and results of RAM in dexterous manipulation on HKCLR-Robot