QUICK REVIEW

[Paper Review] Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

Daniel Keysers, Nathanael Schärli|arXiv (Cornell University)|Dec 20, 2019
Topic Modeling | References: 39 | Cited by: 141
One-sentence summary

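This paper formalizes Distribution-Based Compositionality Assessment (DBCA), builds the CFQ dataset for measuring compositional generalization, and shows that standard models perform poorly as the compound divergence between train and test sets grows, even when the atom distributions are kept similar.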

ABSTRACT

State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.

Motivation and objectives

  • Define a principled method to assess compositional generalization using train/test splits that maximize compound divergence while keeping atom distributions similar.
  • Introduce CFQ, a large realistic NLQ→SPARQL dataset designed for compositionality evaluation.
  • Provide a framework to construct and compare compositionality splits across datasets (CFQ and SCAN).
  • Analyze baseline neural architectures on these splits to quantify their compositional generalization capabilities.

Proposed method

  • Introduce Distribution-Based Compositionality Assessment (DBCA) to quantify atom and compound divergences between train and test sets.
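  • Represent each example in terms of atoms (the primitive rules used to generate it) and compounds (combinations of rule applications, taken as subgraphs of the example's rule-application DAG). Measure the divergence between the train and test distributions as one minus the Chernoff coefficient, with α = 0.5 (the Bhattacharyya coefficient) for atoms and α = 0.1 for compounds; compound frequencies come from weighted subgraph distributions.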
  • Construct CFQ via automatic, rule-based generation with an explicit DAG of rule applications to track atoms/compounds.
  • Use an iterative greedy algorithm to create train/test splits that reach a target compound divergence while keeping atom divergence constrained (≤ 0.02).
  • Provide comparisons to other compositional splits (e.g., output/input length, pattern-based splits) and analyze across CFQ and SCAN.
  • Evaluate three baselines (LSTM+attention, Transformer, Universal Transformer) on CFQ and SCAN under various divergence-based splits.
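
The two divergence measures in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the Chernoff coefficient has the form C_α(P‖Q) = Σ_k p_k^α · q_k^(1−α) with P taken from the train set, and it counts compounds with plain frequencies instead of the weighted subgraph scheme the paper uses. The function names, the `atoms`/`compounds` dictionary keys, and the toy examples are all hypothetical.

```python
from collections import Counter


def distribution(examples, key):
    """Normalized frequency of the items stored under `key` across examples.

    Simplification: plain counts, not the paper's weighted subgraph frequencies.
    """
    counts = Counter(item for ex in examples for item in ex[key])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: v / total for k, v in counts.items()}


def chernoff_divergence(p, q, alpha):
    """Return 1 - C_alpha(P || Q), where C_alpha = sum_k p_k^alpha * q_k^(1 - alpha).

    Items missing from either distribution contribute zero to the sum.
    """
    shared = p.keys() & q.keys()
    return 1.0 - sum(p[k] ** alpha * q[k] ** (1.0 - alpha) for k in shared)


def atom_divergence(train, test):
    # alpha = 0.5 gives the symmetric Bhattacharyya coefficient.
    return chernoff_divergence(
        distribution(train, "atoms"), distribution(test, "atoms"), 0.5
    )


def compound_divergence(train, test):
    # alpha = 0.1 makes the measure asymmetric between train and test.
    return chernoff_divergence(
        distribution(train, "compounds"), distribution(test, "compounds"), 0.1
    )


# Hypothetical toy split: same atoms on both sides, no shared compounds.
train_set = [
    {"atoms": ["who", "direct", "produce"], "compounds": ["who+direct"]},
    {"atoms": ["who", "direct", "produce"], "compounds": ["who+produce"]},
]
test_set = [
    {"atoms": ["who", "direct", "produce"], "compounds": ["direct+produce"]},
]

print("atom divergence:", atom_divergence(train_set, test_set))
print("compound divergence:", compound_divergence(train_set, test_set))
```

In this toy split the atom distributions match while the train and test sets share no compounds, which is the regime the DBCA splits aim for. A split search such as the greedy procedure described above could use `compound_divergence` as its objective and `atom_divergence` as its constraint, though the search itself is not shown here.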

Experimental results

Research questions

  • RQ1: How to quantify the suitability of a split for measuring compositional generalization (DBCA principles)?
  • RQ2: What is the impact of maximizing compound divergence while keeping atom divergence low on model performance?
  • RQ3: Do state-of-the-art architectures generalize compositionally on realistic benchmarks like CFQ and SCAN?
  • RQ4: Can CFQ and the proposed splits reveal robustness gaps in neural models for semantic parsing and navigation tasks?

Key findings

  • Baseline architectures (LSTM+attention, Transformer, Universal Transformer) fail to generalize compositionally on the CFQ maximum compound divergence (MCD) splits (mean accuracy < 20%).
  • There is a strong negative correlation between compound divergence and accuracy across all models and tasks.
  • CFQ and SCAN splits with maximum compound divergence but low atom divergence are harder than random splits or other traditional splits.
  • On CFQ, all three models exceed 95% accuracy on random splits (~97–99%) but fall to ~14.9–18.9% on MCD splits.
  • Compound divergence is a strong predictor of test accuracy, more so than simple length-based or pattern-based split criteria.
  • CFQ provides richer compositional annotations and more diverse query patterns than prior semantic-parsing datasets, enabling robust compositionality analysis.

This review was generated by AI and reviewed by human editors.