Vision-Language Models (VLMs) have become essential for robust robot manipulation. However, using them to translate human instructions into robot actions often forces a difficult trade-off between how well the VLM understands the instructions (VLM-comprehensibility) and how well the system generalizes to new tasks (action-generalizability). Inspired by context-free grammars, we present Semantic Assembly (SEAM), a novel intermediate representation that bridges this gap. The key idea is to decompose actions into a concise, semantically rich vocabulary and a VLM-friendly grammar, allowing the VLM to assemble instructions for diverse, unseen tasks without tedious manual redesign or model retraining. This work will be presented at CVPR 2026.
Rethinking Intermediate Representation for VLM-based Robot Manipulation
Researchers
Introduction
The Main Impact
SEAM is a new paradigm for VLM-based robot control. Its key contribution is the SEAM representation, which simultaneously enables high VLM-comprehensibility and strong action-generalizability, a balance not achieved by any prior method.
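To make the idea of a small vocabulary plus a grammar concrete, here is a minimal sketch of how a VLM-generated action string could be checked and parsed. The verbs in `VOCABULARY`, the `verb(object.part)` syntax, and the names `Step` and `parse` are all hypothetical illustrations; they are not the vocabulary or grammar actually used by SEAM, which this page does not spell out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical action vocabulary (NOT SEAM's actual vocabulary).
VOCABULARY = {"grasp", "move_to", "rotate", "release"}

# Hypothetical grammar, in EBNF:
#   program := step (";" step)*
#   step    := VERB "(" target ")"
#   target  := NAME ("." NAME)?        e.g. "mug" or "mug.handle"
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[();.])")


@dataclass
class Step:
    verb: str
    obj: str
    part: Optional[str] = None


def tokenize(text: str) -> List[str]:
    """Split the input into names and punctuation, rejecting anything else."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text: str) -> List[Step]:
    """Recursive-descent parse of a program into a list of Step objects."""
    tokens = tokenize(text)
    pos = 0

    def peek() -> str:
        return tokens[pos] if pos < len(tokens) else "end of input"

    def expect(symbol: str) -> None:
        nonlocal pos
        if peek() != symbol:
            raise ValueError(f"expected {symbol!r}, found {peek()!r}")
        pos += 1

    def name() -> str:
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isidentifier():
            raise ValueError(f"expected a name, found {peek()!r}")
        pos += 1
        return tokens[pos - 1]

    steps = []
    while True:
        verb = name()
        if verb not in VOCABULARY:
            raise ValueError(f"unknown action {verb!r}")
        expect("(")
        obj, part = name(), None
        if peek() == ".":
            pos += 1
            part = name()
        expect(")")
        steps.append(Step(verb, obj, part))
        if pos == len(tokens):
            return steps
        expect(";")


plan = parse("grasp(mug.handle); move_to(shelf); release(mug)")
```

Because every output must parse under the grammar and use only known verbs, malformed or out-of-vocabulary VLM outputs can be rejected before anything reaches the robot, while new tasks are expressed by recombining the same small set of building blocks.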

The overall SEAM pipeline. Given the current observation and the task instruction, SEAM first generates (a) the Semantic Assembly Representation with the designed vocabulary and grammar and (b) a translated intermediate representation. Next, we retrieve the corresponding support images and support masks from (c) the Retrieval-Augmented Generation (RAG) database and (d) segment the target object parts in the scene. Finally, we solve for the gripper’s trajectories for (e) robotic execution.

We also developed a retrieval-augmented few-shot learning pipeline for fine-grained, open-vocabulary object part segmentation. Extensive real-world experiments demonstrate that SEAM improves robotic manipulation success rates by 15% over state-of-the-art methods, and that our system achieves the fastest inference time for open-vocabulary segmentation.
We validated our approach on a UR5 robot across eight diverse tasks, ranging from precise insertion to articulated object manipulation.
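The retrieval step of the retrieval-augmented segmentation pipeline described above can be illustrated with a minimal sketch. It assumes that each database entry stores an embedding alongside a support image and mask, and that retrieval ranks entries by cosine similarity; the `SupportEntry` schema, the toy three-dimensional embeddings, the file paths, and the `retrieve` function are hypothetical and are not taken from the SEAM implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SupportEntry:
    """One hypothetical database record: a labeled part with its support data."""
    label: str
    embedding: Sequence[float]
    image_path: str
    mask_path: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float], database: List[SupportEntry], k: int = 2) -> List[SupportEntry]:
    """Return the k entries whose embeddings are most similar to the query."""
    ranked = sorted(database, key=lambda e: cosine(query, e.embedding), reverse=True)
    return ranked[:k]


# Toy database with made-up embeddings and placeholder paths.
database = [
    SupportEntry("mug.handle", [0.9, 0.1, 0.0], "support/mug.png", "support/mug_mask.png"),
    SupportEntry("drawer.handle", [0.7, 0.7, 0.0], "support/drawer.png", "support/drawer_mask.png"),
    SupportEntry("bottle.cap", [0.0, 0.2, 0.9], "support/bottle.png", "support/bottle_mask.png"),
]

# In a real system the query embedding would come from encoding the part name
# or an image crop; here it is a hand-written vector.
support = retrieve([1.0, 0.0, 0.0], database, k=2)
```

In a full system, the retrieved support images and masks would then be passed as few-shot examples to a segmentation model, which is omitted here.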



