Rethinking Intermediate Representation for VLM-based Robot Manipulation

Researchers

Prof. Philip Fu
Mr. Weiliang Tang
Ms. Jialin Gao

Introduction

Vision-Language Models (VLMs) have become an essential component of robust robot manipulation. However, using them to translate human instructions into robot actions often forces a difficult trade-off between how well the VLM understands the intermediate representation (VLM-comprehensibility) and how well the system generalizes to new tasks (action-generalizability). Inspired by context-free grammar, we present Semantic Assembly (SEAM), a novel intermediate representation that bridges this gap. Our key idea is to decompose actions into a concise, semantically rich vocabulary and a VLM-friendly grammar, allowing the VLM to assemble instructions for diverse, unseen tasks without tedious manual redesign or model re-training. This work will be presented at CVPR 2026.
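
To make the vocabulary-plus-grammar idea concrete, here is a minimal sketch in Python. It is not the SEAM representation itself: the verbs, part names, the `verb(object.part)` syntax, and the function `parse_program` are all hypothetical, chosen only to show how a small fixed vocabulary and a simple grammar let a program for a new task be assembled and checked without redesigning the representation.

```python
import re

# Hypothetical vocabulary for illustration only; the actual SEAM vocabulary
# is not specified in this summary.
VERBS = {"grasp", "move_to", "rotate", "release"}
PARTS = {"handle", "lid", "rim", "body"}

# Illustrative grammar (assumed, not taken from the paper):
#   program -> step+
#   step    -> VERB "(" OBJECT "." PART ")"
STEP_RE = re.compile(r"(\w+)\((\w+)\.(\w+)\)")


def parse_program(text):
    """Parse one step per line into (verb, object, part) tuples.

    Raises ValueError if a line violates the grammar or uses a word
    outside the vocabulary.
    """
    steps = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between steps
        match = STEP_RE.fullmatch(line)
        if match is None:
            raise ValueError(f"not a valid step: {line!r}")
        verb, obj, part = match.groups()
        if verb not in VERBS:
            raise ValueError(f"unknown verb: {verb!r}")
        if part not in PARTS:
            raise ValueError(f"unknown part: {part!r}")
        steps.append((verb, obj, part))
    return steps


# A program that a VLM might emit for "put the mug on the shelf".
program = """
grasp(mug.handle)
move_to(shelf.body)
release(mug.handle)
"""
steps = parse_program(program)
```

In this sketch the object name is left open while verbs and parts come from a closed set, so the same few words can be recombined for objects that were never seen before. A real system would then map each parsed step to a motion primitive.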

The Main Impact

1

SEAM is a new paradigm for VLM-based robot control. Its key contribution is the SEAM representation, which achieves high VLM-comprehensibility and strong action-generalizability at the same time, a balance that prior methods have not reached.

The overall SEAM pipeline. Given the current observation and the task instruction, SEAM first generates (a) the Semantic Assembly representation, using the designed vocabulary and grammar, and (b) a translated intermediate representation. Next, we retrieve the corresponding support images and support masks from (c) the Retrieval-Augmented Generation (RAG) database and (d) segment the target object parts in the scene. Finally, we solve for the gripper’s trajectories to support (e) robotic execution.

SEAM produces accurate instance masks on unseen objects, outperforming state-of-the-art methods
2

We also developed a retrieval-augmented few-shot learning pipeline for fine-grained, open-vocabulary object part segmentation. Extensive real-world experiments demonstrate that SEAM improves robotic manipulation success rates by 15% over state-of-the-art methods, and that our system achieves the fastest inference time for open-vocabulary segmentation.
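
The retrieval step of such a pipeline can be sketched as a nearest-neighbour lookup over a small support database. The sketch below is illustrative only: the database entries, the three-dimensional embeddings, the `mask_id` field, and the use of cosine similarity are all assumptions, not details taken from the SEAM system. In practice the embeddings would come from an image or text encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query, database, k=2):
    """Return the k database entries most similar to the query embedding."""
    ranked = sorted(
        database,
        key=lambda entry: cosine(query, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical support database: each entry pairs a part label with an
# embedding and an identifier for its stored support mask.
database = [
    {"label": "mug handle", "embedding": [1.0, 0.0, 0.0], "mask_id": 0},
    {"label": "drawer handle", "embedding": [0.8, 0.6, 0.0], "mask_id": 1},
    {"label": "bottle cap", "embedding": [0.0, 0.0, 1.0], "mask_id": 2},
]

# Toy query embedding for the part named in the instruction.
support = retrieve([0.9, 0.1, 0.0], database, k=2)
labels = [entry["label"] for entry in support]
```

The retrieved entries would then serve as the few-shot examples that guide segmentation of the target part in the current scene.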

3

We successfully validated our approach on the UR5 robot platform across eight diverse tasks, from precise insertion to articulated object manipulation.

Execution sequence on eight different tasks