QUICK REVIEW

[Paper Review] Deconfounded Image Captioning: A Causal Retrospect

Xu Yang, Hanwang Zhang | arXiv (Cornell University) | Mar 9, 2020

Multimodal Machine Learning Applications | References: 104 | Cited by: 38

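One-Sentence Summary

The paper uses causal inference to analyze dataset bias in image captioning and proposes DICv1.0, a deconfounded captioning framework that applies backdoor and front-door adjustments to improve CIDEr-D scores.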

ABSTRACT

Dataset bias in vision-language tasks is becoming one of the main problems which hinders the progress of our community. Existing solutions lack a principled analysis about why modern image captioners easily collapse into dataset bias. In this paper, we present a novel perspective: Deconfounded Image Captioning (DIC), to find out the answer of this question, then retrospect modern neural image captioners, and finally propose a DIC framework: DICv1.0 to alleviate the negative effects brought by dataset bias. DIC is based on causal inference, whose two principles: the backdoor and front-door adjustments, help us review previous studies and design new effective models. In particular, we showcase that DICv1.0 can strengthen two prevailing captioning models and can achieve a single-model 131.1 CIDEr-D and 128.4 c40 CIDEr-D on Karpathy split and online split of the challenging MS COCO dataset, respectively. Interestingly, DICv1.0 is a natural derivation from our causal retrospect, which opens promising directions for image captioning.

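Motivation and Objectives

Determine how dataset bias, acting through confounders in vision-language data, distorts what an image captioner learns.
Use causal inference (the backdoor and front-door adjustments) to develop principled deconfounding methods that learn the true causal effect of the image on the caption.
Propose the DICv1.0 framework, which strengthens existing captioning models by mitigating bias.
Review the major image captioning models from a causal perspective to guide model design.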

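Proposed Method

Model the bias as confounders D (and S) that affect both the image features X and the caption L.
Use the backdoor adjustment to compute P(L|do(X)) by averaging over the confounder: P(L|do(X)) = sum_d P(L|X,d) P(d).
Use the front-door adjustment to handle unobserved confounders through a mediator Z: P(L|do(X)) = sum_z P(z|X) sum_x' P(L|z,x') P(x'), where x' ranges over all values of X.
Instantiate DICv1.0 by choosing a mediator Z, a commonsense structure drawn from ConceptNet, together with a backdoor deconfounding vocabulary S, so that both adjustments can be carried out.
Apply DICv1.0 to the Up-Down and AoANet captioning models to improve CIDEr-D: Up-Down rises from 126.4 to 129.5 and AoANet from 128.7 to 131.1 on the Karpathy split of MS COCO, with 128.4 c40 CIDEr-D on the online test split.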
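
The backdoor and front-door formulas above can be checked numerically on a small discrete model. The following is a minimal sketch, not the paper's implementation: it assumes a toy causal graph D -> X, D -> L, X -> Z, Z -> L with binary variables, and every probability table and function name (naive, backdoor, frontdoor, truth) is hypothetical and chosen only for illustration. It compares the plain conditional P(L=1|X=x) with both adjustments and with the interventional value computed directly from the known mechanisms.

```python
import numpy as np

# Hypothetical toy parameters (all variables binary); not taken from the paper.
p_d = np.array([0.7, 0.3])                  # P(D=d)
p_x_given_d = np.array([[0.8, 0.2],         # P(X=x | D=d), indexed [d, x]
                        [0.3, 0.7]])
p_z_given_x = np.array([[0.9, 0.1],         # P(Z=z | X=x), indexed [x, z]
                        [0.2, 0.8]])
p_l1_given_zd = np.array([[0.1, 0.5],       # P(L=1 | Z=z, D=d), indexed [z, d]
                          [0.6, 0.9]])

# Full joint P(d, x, z, l) implied by the graph D->X, D->L, X->Z, Z->L.
p_l_given_zd = np.stack([1 - p_l1_given_zd, p_l1_given_zd], axis=-1)  # [z, d, l]
joint = np.einsum("d,dx,xz,zdl->dxzl",
                  p_d, p_x_given_d, p_z_given_x, p_l_given_zd)


def naive(joint, x):
    """Plain conditional P(L=1 | X=x); biased by the confounder D."""
    p_xl = joint.sum(axis=(0, 2))                                  # [x, l]
    return float(p_xl[x, 1] / p_xl[x].sum())


def backdoor(joint, x):
    """sum_d P(L=1 | X=x, d) P(d); requires D to be observed."""
    marg_d = joint.sum(axis=(1, 2, 3))                             # [d]
    p_dxl = joint.sum(axis=2)                                      # [d, x, l]
    p_l1_given_xd = p_dxl[:, x, 1] / p_dxl[:, x, :].sum(axis=1)    # [d]
    return float((p_l1_given_xd * marg_d).sum())


def frontdoor(joint, x):
    """sum_z P(z | x) sum_x' P(L=1 | z, x') P(x'); uses only X, Z, L."""
    p_xzl = joint.sum(axis=0)                                      # D marginalized out
    marg_x = p_xzl.sum(axis=(1, 2))                                # [x']
    p_z_x = p_xzl[x].sum(axis=1) / marg_x[x]                       # [z]
    p_l1_given_xz = p_xzl[:, :, 1] / p_xzl.sum(axis=2)             # [x', z]
    inner = (p_l1_given_xz * marg_x[:, None]).sum(axis=0)          # [z]
    return float((p_z_x * inner).sum())


def truth(x):
    """Interventional P(L=1 | do(X=x)) computed from the known mechanisms."""
    return float(np.einsum("d,z,zd->", p_d, p_z_given_x[x], p_l1_given_zd))


for x in (0, 1):
    print(f"x={x}  naive={naive(joint, x):.4f}  backdoor={backdoor(joint, x):.4f}  "
          f"frontdoor={frontdoor(joint, x):.4f}  truth={truth(x):.4f}")
```

In this toy setting both adjustments recover the interventional value while the plain conditional does not. The front-door function never reads D, which mirrors the case where the confounder is unobserved and only the mediator is available.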

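Experimental Results

Research Questions

RQ1: How does dataset bias distort learning of the true causal effect of the image on the caption?
RQ2: Can the causal adjustments (backdoor and front-door) be applied in practice to deconfound modern captioning models?
RQ3: Does the DICv1.0 framework improve standard captioning models on benchmark datasets?
RQ4: What role do mediators, such as a structured vocabulary or commonsense triplets, play in deconfounded captioning?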

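Key Findings

DICv1.0 deconfounds image captioning by applying the backdoor and front-door adjustments to compute the interventional distribution P(L|do(X)).
Applying DICv1.0 to Up-Down and AoANet improves CIDEr-D on MS COCO: 126.4 → 129.5 and 128.7 → 131.1, respectively; the AoANet-based model also reaches 128.4 c40 CIDEr-D on the online test server.
The framework uses a mediator Z (a commonsense structure) and a backdoor deconfounding vocabulary S to mitigate the bias arising from the confounders D and S.
When the backdoor adjustment is impractical because the confounders are complex and unobserved, the front-door adjustment still allows deconfounding.
The work offers a causal retrospect of the major captioning models, which serves as a reference for designing causally deconfounded captioners.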


This review was generated by AI and reviewed by a human editor.