QUICK REVIEW

[Paper Review] CLIPort: What and Where Pathways for Robotic Manipulation

Mohit Shridhar, Lucas Manuelli|arXiv (Cornell University)|Sep 24, 2021

Multimodal Machine Learning Applications|65 references|99 citations

TL;DR

CLIPort introduces a two-stream, language-conditioned manipulation framework that fuses a semantic stream from CLIP with a spatial Transporter-based stream to ground language in fine-grained actions, achieving strong few-shot and multi-task generalization both in simulation and on real robots.

ABSTRACT

How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data, however these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better or comparable to single-task policies.

Motivation & Objective

Ground abstract semantic concepts (what) to precise spatial actions (where) for manipulation.
Enable language-conditioned control that transfers concepts across tasks.
Achieve data-efficient learning with few demonstrations and support multi-task learning.
Demonstrate transfer from simulated to real-world robotics with minimal data.

Proposed method

Adopt a two-stream architecture: semantic stream conditioned by pre-trained CLIP features and a spatial stream that handles RGB-D input.
Formulate manipulation as pick-and-place affordance predictions using Transporter-style FCNs for pick and place Q-functions.
Condition the semantic stream on CLIP language encodings and tile the language features into decoder layers.
Train via imitation learning from demonstrations with cross-entropy losses over pixelwise action maps.
Use a two-step action primitive, parameterized by a pick pose and a place pose of the end effector, with translationally equivariant networks.
Extend to multi-task and unseen attribute generalization by randomizing tasks and attributes across demonstrations.
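
The language tiling and the pixelwise cross-entropy loss listed above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`tile_and_fuse`, `pixelwise_cross_entropy`, `best_pixel`), the toy 2x2 feature map, and the choice of an element-wise product as the fusion operator are assumptions made for illustration, and plain Python lists stand in for real tensors.

```python
import math

def tile_and_fuse(features, lang_vec):
    """Tile a C-dim language vector over an H x W x C feature map.

    Fusion by element-wise product is an assumption of this sketch.
    """
    return [[[f * l for f, l in zip(pixel, lang_vec)] for pixel in row]
            for row in features]

def pixelwise_cross_entropy(logits, target):
    """Softmax over all H*W pixels, then the negative log-likelihood
    of the demonstrated pixel `target` = (row, col)."""
    flat = [v for row in logits for v in row]
    m = max(flat)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in flat))
    r, c = target
    return log_z - logits[r][c]

def best_pixel(logits):
    """Return (row, col) of the highest-scoring pixel (the predicted action)."""
    coords = ((r, c) for r in range(len(logits)) for c in range(len(logits[0])))
    return max(coords, key=lambda rc: logits[rc[0]][rc[1]])

# Hypothetical 2 x 2 feature map with 2 channels per pixel.
features = [[[1.0, 5.0], [2.0, 0.0]],
            [[0.5, 9.0], [4.0, 1.0]]]
# Hypothetical language vector that keeps channel 0 and suppresses channel 1.
lang_vec = [1.0, 0.0]

fused = tile_and_fuse(features, lang_vec)
# Collapse channels into one score per pixel (a stand-in for the decoder head).
logits = [[sum(pixel) for pixel in row] for row in fused]

pick = best_pixel(logits)
loss = pixelwise_cross_entropy(logits, target=(1, 1))
print(pick, round(loss, 3))
```

In this toy setup the language vector decides which channel contributes to the per-pixel score, so changing `lang_vec` changes the predicted pixel without touching the visual features. The loss treats the whole action map as a single categorical distribution over pixels, which is how a demonstrated pick or place location can serve as a one-hot label.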

Experimental results

Research questions

RQ1: How effective is the language-conditioned two-stream architecture for fine-grained manipulation versus single-stream or baseline approaches?
RQ2: Can a single multi-task model generalize across multiple language-conditioned tasks including unseen attributes?
RQ3: To what extent do semantic attributes (colors, shapes, object categories) generalize to seen and unseen scenarios?
RQ4: How well does the approach transfer from simulation to real-world robotic manipulation with limited data?

Key findings

Two-stream CLIPort outperforms Transporter-only and CLIP-only baselines, achieving high success with fewer demonstrations (e.g., single-task CLIPort exceeds 90% with 100 demonstrations).
A multi-task CLIPort model trained on 10 tasks can match or outperform single-task models on many tasks, showing effective cross-task generalization.
For seen attributes, CLIPort (single) performs well; for unseen attributes, grounding is harder but explicit transfer in multi-task settings (CLIPort multi-attr) improves performance substantially.
In real-world robot experiments, a multi-task model trained with about 179 image-action pairs achieves meaningful success across 9 tasks, with performance around 70% on simple tasks.
Unseen attributes lead to lower performance overall, but benefits emerge when semantic concepts transfer across tasks (e.g., seeing pink blocks in one task helps solve unseen-color instances of another).
The framework demonstrates data efficiency in few-shot settings and supports training a single policy for multiple tasks that is competitive with or superior to single-task policies.

This review was created by AI and reviewed by human editors.