QUICK REVIEW

[Paper Review] A Benchmark for Systematic Generalization in Grounded Language Understanding

Laura Ruis, Jacob Andreas | arXiv (Cornell University) | Mar 11, 2020

Topic Modeling | References: 37 | Citations: 45

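One-Line Summary

tldr: This paper proposes gSCAN, a grounded extension of SCAN that evaluates a broad range of systematic compositional generalization in a grid-world setting, and shows that baseline models fail on most of the generalization splits.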

ABSTRACT

Humans easily interpret expressions that describe unfamiliar situations composed from familiar parts ("greet the pink brontosaurus by the ferris wheel"). Modern neural networks, by contrast, struggle to interpret novel compositions. In this paper, we introduce a new benchmark, gSCAN, for evaluating compositional generalization in situated language understanding. Going beyond a related benchmark that focused on syntactic aspects of generalization, gSCAN defines a language grounded in the states of a grid world, facilitating novel evaluations of acquiring linguistically motivated rules. For example, agents must understand how adjectives such as 'small' are interpreted relative to the current world state or how adverbs such as 'cautiously' combine with new verbs. We test a strong multi-modal baseline model and a state-of-the-art compositional method finding that, in most cases, they fail dramatically when generalization requires systematic compositional rules.

Motivation and Objectives

Motivate the need for systematic compositional generalization in grounded language understanding beyond non-grounded benchmarks like SCAN.
Introduce grounded SCAN (gSCAN) to evaluate various linguistically motivated generalization phenomena.
Provide a dataset design, world-state grounding, and evaluation splits to stress-test compositionality under grounding.
Assess baseline multimodal seq2seq models and GECA against eight generalization tasks to reveal current limitations.

Proposed Method

Extend SCAN by grounding language in a 2D grid-world state to create actionable instructions.
Define a world-state representation as a tensor Xs of size d×d×c with object properties (color, shape, size) and agent pose.
Use a multimodal seq2seq baseline that encodes commands with a BiLSTM and world state with a CNN, with a dual-attentional decoder producing action sequences.
Incorporate eight systematic generalization splits (compositional and length generalization) with controlled train/test differences.
Evaluate baseline and GECA (Good-enough Compositional Data Augmentation) on exact-match accuracy across splits.
Publicly release code and data generation for reproducibility.
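The d×d×c world-state representation described above can be illustrated with a small sketch. This is not the authors' encoding: the attribute vocabularies, the channel layout (one-hot color, shape, and size, followed by an agent flag and a one-hot agent direction), and the helper name `encode_world` are all assumptions made for illustration.

```python
# Hypothetical attribute vocabularies; the real gSCAN vocabulary may differ.
COLORS = ["red", "green", "blue", "yellow"]
SHAPES = ["circle", "square", "cylinder"]
SIZES = [1, 2, 3, 4]
DIRECTIONS = ["east", "south", "west", "north"]

# Assumed channel layout per cell:
# [color one-hot | shape one-hot | size one-hot | agent flag | agent direction one-hot]
OBJECT_DIM = len(COLORS) + len(SHAPES) + len(SIZES)
C = OBJECT_DIM + 1 + len(DIRECTIONS)


def one_hot(item, vocab):
    """Return a one-hot list marking the position of item in vocab."""
    return [1 if item == v else 0 for v in vocab]


def encode_world(d, objects, agent_pos, agent_dir):
    """Build a d x d x C nested list from object placements and the agent pose."""
    grid = [[[0] * C for _ in range(d)] for _ in range(d)]
    for (row, col), (color, shape, size) in objects.items():
        cell = one_hot(color, COLORS) + one_hot(shape, SHAPES) + one_hot(size, SIZES)
        grid[row][col][:OBJECT_DIM] = cell
    a_row, a_col = agent_pos
    grid[a_row][a_col][OBJECT_DIM] = 1  # agent is present in this cell
    grid[a_row][a_col][OBJECT_DIM + 1:] = one_hot(agent_dir, DIRECTIONS)
    return grid


world = encode_world(
    d=6,
    objects={(1, 2): ("red", "square", 2), (4, 4): ("yellow", "circle", 4)},
    agent_pos=(0, 0),
    agent_dir="east",
)
print(len(world), len(world[0]), len(world[0][0]))
```

In a setup like the baseline's, a tensor of this shape would be what the CNN encoder consumes, while the command is encoded separately by the BiLSTM.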

Experimental Results

Research Questions

RQ1: Can neural models generalize to novel object-property combinations (e.g., color-shape pairs) not seen during training?
RQ2: Do models show context-sensitive and relational generalization (e.g., size adjectives relative to world state)?
RQ3: Can models apply adverbs and modifiers in novel compositions (e.g., cautiously with new verbs) under grounding?
RQ4: How do models handle novel action-length generalization and grounding-based perturbations in instruction meaning?
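To make RQ1 concrete, the following sketch shows one way a held-out composition split could be built: every example whose target has a held-out (color, shape) pair goes to the test set, while each attribute value on its own still appears in training. The example records, field names, and the function `split_by_composition` are hypothetical, and the rule is a simplification of the splits defined in the paper.

```python
# Hypothetical held-out combination; the individual values "red" and "square"
# must still occur in training, only their combination is withheld.
HELD_OUT_PAIRS = {("red", "square")}


def split_by_composition(examples, held_out_pairs):
    """Send examples whose target (color, shape) pair is held out to the test set."""
    train, test = [], []
    for ex in examples:
        pair = (ex["target_color"], ex["target_shape"])
        (test if pair in held_out_pairs else train).append(ex)
    return train, test


# Illustrative examples only, not taken from the gSCAN data.
examples = [
    {"command": "walk to the red square", "target_color": "red", "target_shape": "square"},
    {"command": "walk to the red circle", "target_color": "red", "target_shape": "circle"},
    {"command": "push the blue square", "target_color": "blue", "target_shape": "square"},
    {"command": "pull the red square cautiously", "target_color": "red", "target_shape": "square"},
]

train, test = split_by_composition(examples, HELD_OUT_PAIRS)

# Sanity check: each held-out attribute value is still seen in training.
train_colors = {ex["target_color"] for ex in train}
train_shapes = {ex["target_shape"] for ex in train}
covered = all(c in train_colors and s in train_shapes for c, s in HELD_OUT_PAIRS)
print(len(train), len(test), covered)
```

A model that has learned "red" and "square" as independent concepts should handle the test set; one that has memorized seen combinations will not.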

Key Findings

Split	Baseline (exact match %, mean ± std)	GECA (exact match %, mean ± std)
A: Random	97.69 ± 0.22	87.6 ± 1.19
B: Yellow squares	54.96 ± 39.39	34.92 ± 39.30
C: Red squares	23.51 ± 21.82	78.77 ± 6.63
D: Novel direction	0.00 ± 0.00	0.00 ± 0.00
E: Relativity	35.02 ± 2.35	33.19 ± 3.69
F: Class inference	92.52 ± 6.75	85.99 ± 0.85
G: Adverb k = 1	0.00 ± 0.00	0.00 ± 0.00
G: Adverb k = 5	0.47 ± 0.14	-
G: Adverb k = 10	2.04 ± 0.95	-
G: Adverb k = 50	4.63 ± 2.08	-
H: Adverb to verb	22.70 ± 4.59	11.83 ± 0.31
I: Length	2.10 ± 0.05	-

Baseline multimodal seq2seq models fail on most gSCAN splits, achieving high accuracy only on the random split (A, 97.69%) and the class-inference split (F, 92.52%).
GECA helps on only one split: C (red squares), where it raises exact match from 23.51% to 78.77%. On every other split where it was evaluated it matches or underperforms the baseline, indicating limited transfer to grounded generalization.
Zero-shot grounding of novel color/shape combinations (splits B and C) remains challenging for the baseline, with low and highly variable accuracy (54.96 ± 39.39 and 23.51 ± 21.82), highlighting grounding-driven generalization gaps.
Novel direction (D) and adverb learning with k = 1 (G) are the hardest cases: both the baseline and GECA score 0% exact match. Relative size references (E) reach only about 35% for the baseline and 33% for GECA.
Performance collapses on longer target sequences (I: 2.10%) and on adverbs that alter action sequences in non-local, context-dependent ways (G: at most 4.63%, even at k = 50).
Overall, gSCAN reveals substantial gaps in current neural models' ability to learn systematic compositional rules under grounding.
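The table above reports exact-match accuracy as mean ± standard deviation. A minimal sketch of how such numbers might be computed follows, assuming the spread is taken over repeated training runs and that the sample standard deviation is used; both are assumptions, and the action sequences and per-run accuracies are made-up illustrations rather than results from the paper.

```python
from statistics import mean, stdev


def exact_match_accuracy(predictions, targets):
    """Percentage of predicted action sequences that equal the target exactly."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    hits = sum(pred == tgt for pred, tgt in zip(predictions, targets))
    return 100.0 * hits / len(targets)


# Made-up action sequences for illustration.
targets = [
    ["walk", "walk", "turn left", "walk"],
    ["turn right", "walk"],
    ["push"],
    ["walk", "walk"],
]
predictions = [
    ["walk", "walk", "turn left", "walk"],  # exact match
    ["turn right", "walk", "walk"],         # one extra action: counts as a miss
    ["push"],                               # exact match
    ["walk"],                               # too short: counts as a miss
]

acc = exact_match_accuracy(predictions, targets)

# Hypothetical per-run accuracies, aggregated as mean and sample standard deviation.
run_accuracies = [50.0, 75.0, 25.0]
summary = (mean(run_accuracies), stdev(run_accuracies))
print(acc, summary)
```

Because a single wrong or missing action makes the whole sequence count as a miss, exact match is a strict metric, which partly explains the near-zero scores on the length and adverb splits.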


This review was generated by AI and checked by a human editor.