QUICK REVIEW

[Paper Review] A Benchmark for Systematic Generalization in Grounded Language Understanding

Laura Ruis, Jacob Andreas|arXiv (Cornell University)|Mar 11, 2020
Topic Modeling | References: 37 | Cited by: 45
One-Sentence Summary

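This paper introduces gSCAN, a grounded extension of SCAN for evaluating a broad range of systematic compositional generalization in a grid world; the results show that the baseline models perform poorly on most generalization splits.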

ABSTRACT

Humans easily interpret expressions that describe unfamiliar situations composed from familiar parts ("greet the pink brontosaurus by the ferris wheel"). Modern neural networks, by contrast, struggle to interpret novel compositions. In this paper, we introduce a new benchmark, gSCAN, for evaluating compositional generalization in situated language understanding. Going beyond a related benchmark that focused on syntactic aspects of generalization, gSCAN defines a language grounded in the states of a grid world, facilitating novel evaluations of acquiring linguistically motivated rules. For example, agents must understand how adjectives such as 'small' are interpreted relative to the current world state or how adverbs such as 'cautiously' combine with new verbs. We test a strong multi-modal baseline model and a state-of-the-art compositional method finding that, in most cases, they fail dramatically when generalization requires systematic compositional rules.

Motivation and Objectives

  • Motivate the need for systematic compositional generalization in grounded language understanding beyond non-grounded benchmarks like SCAN.
  • Introduce grounded SCAN (gSCAN) to evaluate various linguistically motivated generalization phenomena.
  • Provide a dataset design, world-state grounding, and evaluation splits to stress-test compositionality under grounding.
  • Assess baseline multimodal seq2seq models and GECA against eight generalization tasks to reveal current limitations.

Proposed Method

  • Extend SCAN by grounding language in a 2D grid-world state to create actionable instructions.
  • Define a world-state representation as a tensor Xs of size d×d×c with object properties (color, shape, size) and agent pose.
  • Use a multimodal seq2seq baseline that encodes commands with a BiLSTM and world state with a CNN, with a dual-attentional decoder producing action sequences.
  • Incorporate eight systematic generalization splits (compositional and length generalization) with controlled train/test differences.
  • Evaluate baseline and GECA (Good-enough Compositional Data Augmentation) on exact-match accuracy across splits.
  • Publicly release code and data generation for reproducibility.
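The d×d×c world-state tensor described above can be made concrete with a small sketch. This is only an illustration, assuming a one-hot channel layout per grid cell (color, shape, size, an agent flag, and the agent's direction). The vocabularies, the channel ordering, and the helper name `encode_world` are hypothetical and are not taken from the paper or its released code.

```python
# Hypothetical vocabularies; the benchmark's real attribute sets may differ.
COLORS = ["red", "green", "blue", "yellow"]
SHAPES = ["circle", "square", "cylinder"]
SIZES = [1, 2, 3, 4]
DIRECTIONS = ["north", "east", "south", "west"]

# Channel count c: one-hot color + shape + size, one agent flag, one-hot direction.
C = len(COLORS) + len(SHAPES) + len(SIZES) + 1 + len(DIRECTIONS)


def encode_world(d, objects, agent_pos, agent_dir):
    """Return a d x d x C nested list describing the grid (assumed layout).

    objects: list of (row, col, color, shape, size) tuples.
    agent_pos: (row, col); agent_dir: one of DIRECTIONS.
    """
    grid = [[[0] * C for _ in range(d)] for _ in range(d)]
    for row, col, color, shape, size in objects:
        cell = grid[row][col]
        cell[COLORS.index(color)] = 1
        cell[len(COLORS) + SHAPES.index(shape)] = 1
        cell[len(COLORS) + len(SHAPES) + SIZES.index(size)] = 1
    # Agent channels sit after all object-property channels.
    agent_offset = len(COLORS) + len(SHAPES) + len(SIZES)
    agent_row, agent_col = agent_pos
    grid[agent_row][agent_col][agent_offset] = 1
    grid[agent_row][agent_col][agent_offset + 1 + DIRECTIONS.index(agent_dir)] = 1
    return grid


world = encode_world(
    d=4,
    objects=[(0, 1, "red", "square", 2), (3, 3, "yellow", "circle", 4)],
    agent_pos=(2, 0),
    agent_dir="east",
)
```

A layout like this keeps every cell the same width, which is what lets a CNN encoder slide over the grid; empty cells are simply all-zero vectors.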

Experimental Results

Research Questions

  • RQ1: Can neural models generalize to novel object-property combinations (e.g., color-shape pairs) not seen during training?
  • RQ2: Do models show context-sensitive and relational generalization (e.g., size adjectives relative to world state)?
  • RQ3: Can models apply adverbs and modifiers in novel compositions (e.g., cautiously with new verbs) under grounding?
  • RQ4: How do models handle novel action-length generalization and grounding-based perturbations in instruction meaning?
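The compositional splits behind RQ1 rest on withholding specific attribute combinations from training while keeping each attribute individually familiar. A minimal sketch of that idea follows, assuming each example records the color and shape of its target object. The record fields and the function `split_by_held_out_pair` are hypothetical, and the benchmark's actual hold-out criteria are likely more specific than this simple filter.

```python
def split_by_held_out_pair(examples, held_out):
    """Send every example whose target matches the held-out (color, shape) to test."""
    train, test = [], []
    for ex in examples:
        pair = (ex["target_color"], ex["target_shape"])
        (test if pair == held_out else train).append(ex)
    return train, test


# Toy examples with assumed field names, for illustration only.
examples = [
    {"command": "walk to the yellow square", "target_color": "yellow", "target_shape": "square"},
    {"command": "walk to the yellow circle", "target_color": "yellow", "target_shape": "circle"},
    {"command": "push the red square", "target_color": "red", "target_shape": "square"},
]
train, test = split_by_held_out_pair(examples, held_out=("yellow", "square"))

# Both parts of the held-out pair still occur in training, just never together.
train_colors = {ex["target_color"] for ex in train}
train_shapes = {ex["target_shape"] for ex in train}
```

The point of checking `train_colors` and `train_shapes` is that a compositional test is only meaningful when the model has seen "yellow" and "square" separately; otherwise the split measures unknown-word handling rather than recombination.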

Key Findings

Exact-match accuracy (%) per split:

Split               | Baseline      | GECA
A: Random           | 97.69 ± 0.22  | 87.6 ± 1.19
B: Yellow squares   | 54.96 ± 39.39 | 34.92 ± 39.30
C: Red squares      | 23.51 ± 21.82 | 78.77 ± 6.63
D: Novel direction  | 0.00 ± 0.00   | 0.00 ± 0.00
E: Relativity       | 35.02 ± 2.35  | 33.19 ± 3.69
F: Class inference  | 92.52 ± 6.75  | 85.99 ± 0.85
G: Adverb k = 1     | 0.00 ± 0.00   | 0.00 ± 0.00
G: Adverb k = 5     | 0.47 ± 0.14   | -
G: Adverb k = 10    | 2.04 ± 0.95   | -
G: Adverb k = 50    | 4.63 ± 2.08   | -
H: Adverb to verb   | 22.70 ± 4.59  | 11.83 ± 0.31
I: Length           | 2.10 ± 0.05   | -

  • The baseline multimodal seq2seq model fails on most gSCAN splits, reaching high accuracy only on the random split (A, 97.69%) and on class inference (F, 92.52%).
  • GECA helps substantially on one split, red squares (C, 23.51% to 78.77%), but scores below the baseline on A, B, E, F, and H, indicating limited transfer to grounded generalization.
  • Zero-shot grounding of novel color/shape combinations (e.g., a red square referred to by its color) remains challenging for the baseline, with high variance across runs on splits B and C.
  • Novel direction (D) and adverb learning from a single example (G, k = 1) are hardest: both the baseline and GECA score 0% exact match. Relative size references (E) stay near 35% for both methods.
  • Performance degrades with longer target sequences and with adverbs that alter action sequences in non-local, context-dependent ways.
  • Overall, gSCAN reveals substantial gaps in current neural models' ability to learn systematic compositional rules under grounding.
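The exact-match metric behind these findings counts a prediction as correct only when the whole action sequence matches the target. A minimal sketch follows, assuming action sequences are lists of string tokens and assuming the ± values in the results denote spread across repeated runs, which this summary does not state. The function names and action tokens are illustrative and not taken from the released code.

```python
from statistics import mean, stdev


def exact_match_accuracy(predictions, targets):
    """Percentage of predicted action sequences identical to their targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for p, t in zip(predictions, targets) if list(p) == list(t))
    return 100.0 * hits / len(targets)


def summarize_runs(accuracies):
    """Mean and sample standard deviation over per-run accuracies (assumed aggregation)."""
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean(accuracies), spread


# Hypothetical action tokens; the second prediction has one extra step, so it counts as wrong.
targets = [["walk", "walk", "turn left"], ["turn right", "walk"]]
predictions = [["walk", "walk", "turn left"], ["turn right", "walk", "walk"]]
acc = exact_match_accuracy(predictions, targets)
```

Because a single wrong or extra action makes the whole sequence count as a miss, this metric is unforgiving on long targets, which is consistent with the low score on the length split.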


This review was generated by AI and checked by human editors.