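[Paper Explained] Vision-Language Models as Success Detectors
The paper proposes SuccessVQA, a framework that fine-tunes Flamingo (a vision-language model) by recasting success detection as a visual question answering task, and that achieves zero-shot generalisation to language and visual variations across multiple domains.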
Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our initial results encourage further work in real world success detection and reward modelling.
Research Motivation and Objectives
- Motivate robust, generalisable success detectors as rewards or evaluators for agents.
- Leverage large pretrained vision-language models to generalise across language and visual variations.
- Unify success detection across diverse domains using a single training framework.
- Demonstrate the benefits of SuccessVQA in three settings: the simulated IA Playroom, real-world robotic manipulation, and Ego4D egocentric videos.
Proposed Method
- Formulate success detection as a visual question answering (VQA) task called SuccessVQA.
- Fine-tune Flamingo (3B) by updating vision components while keeping language components frozen.
- Create SuccessVQA datasets by segmenting trajectories into clips and annotating success points from human labels.
- Generate questions from task templates or narrations and label answers as Yes/No depending on success frames.
- Evaluate on three domains with both in-distribution and out-of-distribution language/vision variations.
- Compare against bespoke domain-specific success detectors as baselines.
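The dataset-construction steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the `SuccessVQAExample` type, the `build_examples` helper, the question template, and the fixed clip stride are all hypothetical. It also assumes that every clip starts at the beginning of the episode and is labelled "Yes" once the human-annotated success frame falls inside it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SuccessVQAExample:
    """One (clip, question, answer) triple. Hypothetical structure."""
    start: int      # first frame index of the clip (inclusive)
    end: int        # end frame index of the clip (exclusive)
    question: str
    answer: str     # "Yes" or "No"


def build_examples(
    num_frames: int,
    task: str,
    success_frame: Optional[int],
    clip_stride: int = 4,
    template: str = "Did the agent successfully {task}?",  # assumed wording
) -> List[SuccessVQAExample]:
    """Split one trajectory into growing clips and label each one.

    Assumption: a clip is positive if the annotated success frame
    lies before the clip's end; episodes without a success
    annotation (success_frame=None) yield only "No" answers.
    """
    question = template.format(task=task)
    examples = []
    for end in range(clip_stride, num_frames + 1, clip_stride):
        succeeded = success_frame is not None and success_frame < end
        examples.append(
            SuccessVQAExample(0, end, question, "Yes" if succeeded else "No")
        )
    return examples


if __name__ == "__main__":
    for ex in build_examples(12, "lift the red block", success_frame=6):
        print(ex.start, ex.end, ex.question, ex.answer)
```

With a 12-frame episode and success annotated at frame 6, the clips ending at frames 4, 8, and 12 receive the answers No, Yes, and Yes respectively. Framing the label as a text answer is what lets the same fine-tuning setup be reused across domains by changing only the question.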
Experimental Results
Research Questions
- RQ1: Can a Flamingo-based success detector generalise to unseen language expressions of tasks?
- RQ2: How robust is SuccessVQA to unseen visual variations (camera viewpoints, distractors) in robotic and real-world settings?
- RQ3: Does SuccessVQA outperform bespoke reward models in out-of-distribution scenarios?
- RQ4: Is SuccessVQA capable of handling in-the-wild, egocentric video data for success detection?
Main Findings
| Model | Test 1 (unseen episodes) | Test 2 (unseen behaviour) | Test 3 (unseen tasks) |
|---|---|---|---|
| bespoke SD | 80.6% | 85.4% | 49.9% |
| FT Flamingo 3B | 83.4% | 85.0% | 59.3% |
- Fine-tuned Flamingo closely matches bespoke SD on unseen episodes (83.4% vs. 80.6%) and unseen behaviours (85.0% vs. 85.4%) in IA Playroom.
- On unseen tasks (Test 3), FT Flamingo 3B outperforms the bespoke model by 9.4 percentage points in episode-level accuracy (59.3% vs. 49.9%).
- Flamingo-based detectors show greater robustness to viewpoint changes and distractors than bespoke models, often remaining within a few percentage points of Test 1 accuracy.
- Initial Ego4D experiments indicate the task is very challenging, but provide a promising direction for real-world success detection with more work.
- Across domains, a single multimodal backbone with minimal task-specific changes can achieve competitive performance against domain-specific reward models.
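As a quick arithmetic check on the IA Playroom table, the sketch below recomputes the per-split gap between the two models. The numbers are copied from the table; the `results` dictionary and the `gaps` helper are illustrative names, not anything from the paper.

```python
# Accuracies (in %) copied from the IA Playroom table above.
results = {
    "bespoke SD": {"Test 1": 80.6, "Test 2": 85.4, "Test 3": 49.9},
    "FT Flamingo 3B": {"Test 1": 83.4, "Test 2": 85.0, "Test 3": 59.3},
}


def gaps(table, model="FT Flamingo 3B", baseline="bespoke SD"):
    """Return model-minus-baseline accuracy in percentage points per split."""
    return {
        split: round(table[model][split] - table[baseline][split], 1)
        for split in table[model]
    }


print(gaps(results))
# {'Test 1': 2.8, 'Test 2': -0.4, 'Test 3': 9.4}
```

The two in-distribution splits differ by under three points in either direction, while the unseen-task split accounts for nearly all of the advantage.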
This explainer was generated by AI and reviewed by a human editor.