QUICK REVIEW

[Paper Review] Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image.

Jae Sung Park, Chandra Bhagavatula|arXiv (Cornell University)|Apr 22, 2020

Multimodal Machine Learning Applications | References: 9 | Cited by: 4

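One-Sentence Summary

This paper proposes VisualComet, a framework for visual commonsense reasoning that predicts past events, future events, and present intents from a single image, supported by a large-scale dataset of over 1.4 million annotated textual inferences across 60,000 images with short video summaries and person-grounding links. Its main contribution is showing that integrated visual-textual commonsense reasoning outperforms non-integrative alternatives.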

ABSTRACT

Even from a single frame of a still image, people can reason about the dynamic story of the image before, after, and beyond the frame. For example, given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past, the intent of that man at the moment is to stay alive, and he will need help in the near future or else he will get washed away. We propose VisualComet, the novel framework of visual commonsense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present. To support research toward visual commonsense reasoning, we introduce the first large-scale repository of Visual Commonsense Graphs that consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 60,000 images, each paired with short video summaries of before and after. In addition, we provide person-grounding (i.e., co-reference links) between people appearing in the image and people mentioned in the textual commonsense descriptions, allowing for tighter integration between images and text. We establish strong baseline performances on this task and demonstrate that integration between visual and textual commonsense reasoning is the key and wins over non-integrative alternatives.

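Motivation and Objectives

Enable reasoning about the dynamic storyline of a still image (events before, during, and after the frame), going beyond static visual perception.
Address the lack of a large-scale, structured dataset that supports visual commonsense reasoning with temporal and social context.
Develop a framework that integrates visual and textual commonsense reasoning to improve inference accuracy.
Strengthen cross-modal alignment by establishing person-grounding links between people in the image and their mentions in the text.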

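Proposed Method

Build a large-scale dataset of over 1.4 million textual commonsense inferences across 60,000 images, each paired with short video summaries of what happens before and after.
Annotate each image with three types of inference: events before the image, events after the image, and the present intents of the people in the scene.
Introduce person-grounding annotations that link people in the image to their mentions in the text, supporting cross-modal coreference resolution.
Design a joint visual-textual reasoning model that combines image features with textual commonsense knowledge to predict the dynamic storyline.
Train and evaluate models on the VisualComet benchmark, comparing integrated visual-textual reasoning against non-integrative baselines.
Use attention mechanisms and a multimodal Transformer to fuse visual and textual representations and improve inference performance.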
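To make the annotation scheme above more concrete, here is a minimal sketch of how one record might be represented, with the three inference types and person-grounding links. The class names, field names, and the "[1]" mention convention are assumptions made for illustration and are not the dataset's actual schema; the example sentences paraphrase the drowning-man scenario from the abstract, and the identifier and bounding box are made up.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Person:
    """A person detected in the image (hypothetical schema)."""
    tag: int                                # number used in text mentions, e.g. "[1]"
    box: tuple[float, float, float, float]  # bounding box as (x1, y1, x2, y2)


@dataclass
class VisualCommonsenseRecord:
    """One image with its three kinds of commonsense inference (hypothetical schema)."""
    image_id: str
    people: list[Person]
    event: str                                       # what is happening in the frame
    before: list[str] = field(default_factory=list)  # events that might have happened earlier
    intent: list[str] = field(default_factory=list)  # what the people want at present
    after: list[str] = field(default_factory=list)   # events that might happen next

    def grounded_people(self, text: str) -> list[Person]:
        """Resolve "[n]" mentions in the text to Person entries of this image."""
        by_tag = {p.tag: p for p in self.people}
        tags = [int(t) for t in re.findall(r"\[(\d+)\]", text)]
        return [by_tag[t] for t in tags if t in by_tag]


record = VisualCommonsenseRecord(
    image_id="example-0001",                                 # made-up identifier
    people=[Person(tag=1, box=(40.0, 60.0, 180.0, 240.0))],  # made-up bounding box
    event="[1] is struggling to stay afloat in water",
    before=["[1] fell into the water"],
    intent=["[1] wants to stay alive"],
    after=["[1] will need help or get washed away"],
)

# Show which image regions each inference sentence refers to.
for sentence in record.before + record.intent + record.after:
    tags = [p.tag for p in record.grounded_people(sentence)]
    print(f"{sentence!r} -> people {tags}")
```

Keeping the person tag inside the sentence means a model or an evaluator can map every textual mention back to an image region, which is the kind of image-text integration the person-grounding annotations are meant to support.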

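Experimental Results

Research Questions

RQ1: Can a model accurately predict the events before and after a still image from visual and textual cues?
RQ2: How effective is joint visual and textual commonsense reasoning compared with single-modality reasoning?
RQ3: To what extent does person grounding improve the accuracy of dynamic storyline prediction in visual commonsense tasks?
RQ4: How does including before/after video summaries in reasoning affect commonsense inference performance?
RQ5: Does a large-scale, structured visual commonsense reasoning dataset significantly improve performance on reasoning benchmarks?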

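Key Findings

The proposed VisualComet framework performs strongly on visual commonsense reasoning tasks, showing that joint visual-textual reasoning significantly outperforms non-integrative baselines.
Integrating visual and textual commonsense reasoning yields measurable gains in predicting past events, future events, and intents.
Person grounding significantly strengthens the alignment between image entities and textual descriptions, which in turn improves inference accuracy.
The large-scale dataset of over 1.4 million inferences across 60,000 images provides a robust benchmark for future research on visual commonsense reasoning.
Including before/after video summaries significantly improves the model's understanding of temporal dynamics in still images.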


This review was generated by AI and reviewed by human editors.