QUICK REVIEW

[Paper Review] CLEVRER-Humans: Describing Physical and Causal Events the Human Way

Jiayuan Mao, Xuelin Yang|arXiv (Cornell University)|Oct 5, 2023

Multimodal Machine Learning Applications | Cited by 8

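One-Sentence Summary

CLEVRER-Humans is a human-annotated dataset of physical events and their causal relationships. It extends CLEVRER with dense, diverse, human-written event descriptions and graded causal judgments, collected through a three-stage data collection pipeline.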

ABSTRACT

Building machines that can reason about physical events and their causal relationships is crucial for flexible interaction with the physical world. However, most existing physical and causal reasoning benchmarks are exclusively based on synthetically generated events and synthetic natural language descriptions of causal relationships. This design brings up two issues. First, there is a lack of diversity in both event types and natural language descriptions; second, causal relationships based on manually-defined heuristics are different from human judgments. To address both shortcomings, we present the CLEVRER-Humans benchmark, a video reasoning dataset for causal judgment of physical events with human labels. We employ two techniques to improve data collection efficiency: first, a novel iterative event cloze task to elicit a new representation of events in videos, which we term Causal Event Graphs (CEGs); second, a data augmentation technique based on neural language generative models. We convert the collected CEGs into questions and answers to be consistent with prior work. Finally, we study a collection of baseline approaches for CLEVRER-Humans question-answering, highlighting the great challenges set forth by our benchmark.

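Motivation and Goals

Advance human-centered evaluation of physical and causal reasoning in videos, moving beyond heuristic rules.
Create diverse, human-annotated descriptions of physical events to study grounded language understanding and causal relationships.
Provide a dense, human-annotated causal graph representation (CEGs) that can be converted into question-answer pairs for benchmarking.
Propose an efficient data collection pipeline that combines iterative cloze annotation with neural description augmentation.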

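Proposed Method

Introduce Causal Event Graphs (CEGs), whose nodes are event descriptions and whose edges are human-judged causal influences with graded scores.
Use an iterative event cloze task to expand event descriptions starting from seed CLEVRER events (Stage I).
Train a neural, trajectory-based generator to expand single-object and pairwise event descriptions (Stage II).
Apply post-processing and human filtering to ensure quality, diversity, and consistency with the video trajectories.
Condense the expanded data into dense CEGs through human edge annotation (Stage III).
Convert CEGs into CLEVRER-compatible QA pairs, forming multiple-choice questions by sampling correct and incorrect options.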
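
To make the CEG representation and its conversion to multiple-choice questions more concrete, here is a minimal Python sketch. It is not the authors' code: the class name `CausalEventGraph`, the function `ceg_to_questions`, the two score thresholds, and the question template are all assumptions introduced for illustration. The only details taken from the summary are that nodes are event descriptions, edges carry graded human causal scores on a 1-5 scale, and options are labeled correct or incorrect.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CausalEventGraph:
    """Hypothetical CEG container: nodes are event descriptions,
    edges map (cause, effect) index pairs to a graded score in [1, 5]."""
    video_id: str
    events: List[str] = field(default_factory=list)
    edges: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def add_event(self, description: str) -> int:
        """Append an event node and return its index."""
        self.events.append(description)
        return len(self.events) - 1

    def add_edge(self, cause: int, effect: int, score: float) -> None:
        """Store a human causal rating for the pair cause -> effect."""
        if not 1.0 <= score <= 5.0:
            raise ValueError("score must lie in [1, 5]")
        self.edges[(cause, effect)] = score


def ceg_to_questions(ceg, positive_threshold=4.0, negative_threshold=2.0):
    """Build one multiple-choice question per effect event.

    Both thresholds are assumed values, not taken from the paper:
    scores at or above `positive_threshold` become correct options,
    scores at or below `negative_threshold` become incorrect options,
    and scores in between are skipped as ambiguous.
    """
    questions = []
    for effect_idx, effect in enumerate(ceg.events):
        options = []
        for (cause_idx, target_idx), score in sorted(ceg.edges.items()):
            if target_idx != effect_idx:
                continue
            if score >= positive_threshold:
                options.append((ceg.events[cause_idx], True))
            elif score <= negative_threshold:
                options.append((ceg.events[cause_idx], False))
        if options:
            questions.append({
                "question": f"Which of the following is responsible for: {effect}?",
                "options": options,
            })
    return questions


# Toy example with made-up events and ratings.
ceg = CausalEventGraph(video_id="demo")
a = ceg.add_event("the red cube hits the blue sphere")
b = ceg.add_event("the green cylinder enters the scene")
c = ceg.add_event("the blue sphere rolls away")
ceg.add_edge(a, c, 4.6)  # strong causal judgment
ceg.add_edge(b, c, 1.4)  # weak causal judgment
ceg.add_edge(a, b, 3.0)  # ambiguous, dropped by the thresholds
questions = ceg_to_questions(ceg)
```

In this toy graph only the third event yields a question, with one correct and one incorrect option; the ambiguous middle-scored edge contributes nothing.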

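Experimental Results

Research Questions

RQ1: How do people describe and judge causal relationships between physical events in videos, beyond heuristic rules?
RQ2: Can a dense, human-annotated causal graph framework (CEGs) be converted into a robust QA dataset for video reasoning?
RQ3: Can a neural description generator combined with limited human annotation produce diverse, high-quality event descriptions and causal annotations at scale?
RQ4: What challenges arise when transferring human-annotated causal judgments to machine reasoning models?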

Key Findings

Model	Training	Per-Option (CLEVRER)	Per-Question (CLEVRER)	Per-Option (CLEVRER-Humans)	Per-Question (CLEVRER-Humans)
Best Guess	N/A	50.2	16.5	50.7	31.6
Lang-Only	Scratch	59.7	13.6	51.9 (±1.09)	30.4 (±1.90)
NS-DR [7]	Pretrain	87.6	79.6	51.0	32.0
VRDP [47]	Pretrain	96.3	91.9	50.9	31.6
CNN+LSTM	Pretrain	62.0	17.5	50.3	30.0
CNN+LSTM	Scratch	N/A	N/A	51.7 (±0.64)	34.2 (±1.69)
CNN+LSTM	Pretrain+Finetune	62.0	17.5	51.5 (±2.35)	30.8 (±0.69)
CNN+BERT	Pretrain	55.1	11.5	52.9	32.0
CNN+BERT	Scratch	N/A	N/A	52.0 (±2.34)	30.2 (±2.41)
CNN+BERT	Pretrain+Finetune	N/A	N/A	50.1 (±0.68)	30.4 (±3.09)
ALOE [43]	Pretrain	98.5	96.0	54.0	26.9
ALOE [43]	Scratch	N/A	N/A	51.8 (±1.00)	31.7 (±0.79)
ALOE [43]	Pretrain+Finetune	98.5	96.0	52.7 (±1.36)	32.1 (±1.36)
Human	N/A	N/A	N/A	84.5	71.4
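
The table reports two accuracy columns per dataset. Assuming they follow the usual CLEVRER convention (per-option accuracy scores each answer option independently, while per-question accuracy requires every option of a question to be classified correctly), the two metrics can be sketched as below. The function names and the toy predictions are illustrative only.

```python
def per_option_accuracy(predictions, labels):
    """Fraction of individual options classified correctly."""
    correct = 0
    total = 0
    for pred, gold in zip(predictions, labels):
        for p, g in zip(pred, gold):
            correct += int(p == g)
            total += 1
    return correct / total


def per_question_accuracy(predictions, labels):
    """Fraction of questions whose options are all classified correctly."""
    hits = sum(
        1 for pred, gold in zip(predictions, labels) if list(pred) == list(gold)
    )
    return hits / len(labels)


# Toy data: three questions with 3, 2, and 4 options respectively.
labels = [[True, False, False], [True, True], [False, True, False, False]]
predictions = [[True, False, True], [True, True], [False, False, False, False]]

option_acc = per_option_accuracy(predictions, labels)      # 7 of 9 options
question_acc = per_question_accuracy(predictions, labels)  # 1 of 3 questions
```

Because a single wrong option fails the whole question, per-question accuracy is never higher than per-option accuracy, which is consistent with the gap between the two columns in every row of the table.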

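CLEVRER-Humans comprises 1,108 videos, 8,581 descriptions, and 21,167 edge annotations, which yield 1,076 QA pairs after processing.
The dataset features dense CEGs (on average 4.71 nodes and 12.7 edges per video) and a vocabulary of 219 words covering 31 distinct verbs, substantially expanding the event diversity of CLEVRER.
Human causal judgments differ markedly from CLEVRER's heuristic labels, although in some cases they are closer to a counterfactual baseline; humans rate causality on a graded 1-5 scale.
Models perform far worse on CLEVRER-Humans than on CLEVRER, highlighting challenges in diversity and data efficiency and the need for better transfer and physics-informed modeling.
The authors demonstrate a data collection pipeline that combines iterative cloze annotation with neural description generation to improve data efficiency.
The evaluation shows that no existing model clearly outperforms the best-guess baseline on CLEVRER-Humans, underscoring the difficulty of causal reasoning against human labels.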

This review was generated by AI and reviewed by human editors.