QUICK REVIEW

[Paper Review] Auto-Encoding Scene Graphs for Image Captioning

Xu Yang, Kaihua Tang|arXiv (Cornell University)|Dec 6, 2018

Multimodal Machine Learning Applications|References: 46|Cited by: 26

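One-Sentence Summary

The paper proposes a new framework, the Scene Graph Auto-Encoder (SGAE), which injects language inductive bias into image captioning by auto-encoding scene graphs of natural-language sentences to learn a shared dictionary. By using scene graphs as a symbolic intermediate representation and sharing the learned dictionary between the vision and language domains, SGAE improves the model's reasoning and generalization, achieving a state-of-the-art CIDEr-D score of 127.8 on the MS-COCO Karpathy split in the single-model setting.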

ABSTRACT

We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and contextual inference in discourse. For example, when we see the relation `person on bike', it is natural to replace `on' with `ride' and infer `person riding bike on a road' even the `road' is not evident. Therefore, exploiting such bias as a language prior is expected to help the conventional encoder-decoder models less likely overfit to the dataset bias and focus on reasoning. Specifically, we use the scene graph --- a directed graph ($\mathcal{G}$) where an object node is connected by adjective nodes and relationship nodes --- to represent the complex structural layout of both image ($\mathcal{I}$) and sentence ($\mathcal{S}$). In the textual domain, we use SGAE to learn a dictionary ($\mathcal{D}$) that helps to reconstruct sentences in the $\mathcal{S} \rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline, where $\mathcal{D}$ encodes the desired language prior; in the vision-language domain, we use the shared $\mathcal{D}$ to guide the encoder-decoder in the $\mathcal{I} \rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline. Thanks to the scene graph representation and shared dictionary, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, e.g., our SGAE-based single-model achieves a new state-of-the-art $127.8$ CIDEr-D on the Karpathy split, and a competitive $125.5$ CIDEr-D (c40) on the official server even compared to other ensemble models.

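Research Motivation and Objectives

Address the limitations of end-to-end encoder-decoder models in generating descriptive, human-like captions by introducing a language inductive bias.
Bridge the gap between visual perception and language composition by using scene graphs as a symbolic, structured representation of both images and sentences.
Learn a shared, trainable dictionary that encodes a language prior from text-only scene graph reconstruction and can be transferred to vision-language tasks.
Improve reasoning and reduce overfitting to dataset bias by exploiting the contextual inference and collocation patterns found in language data.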

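Proposed Method

The method uses a scene graph (G) to represent both images and sentences as directed graphs with object, attribute, and relationship nodes.
A Scene Graph Auto-Encoder (SGAE) is trained on the self-reconstruction pipeline S → G → D → S, where D is a trainable dictionary used to re-encode node features so that they capture the language inductive bias.
The dictionary D is shared with the vision-language pipeline I → G → D → S, which transfers the language prior to image captioning.
A multi-modal graph convolutional network (GCN) refines the scene graph features in the image-to-caption pipeline, integrating visual cues that are missing because of imperfect detection.
The framework integrates with a pre-trained visual encoder and an RNN-based language decoder, and is trained with a reinforcement learning strategy for sequence-level optimization.
The shared dictionary D acts as a working memory that decouples symbolic reasoning from visual perception and reduces the domain gap between feature representations.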
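
To make the role of the dictionary D more concrete, the following is a minimal Python sketch of how a node feature might be re-encoded against D. It assumes a soft-attention form, alpha = softmax(D^T x) followed by x_hat = sum_k alpha_k d_k, which this summary does not spell out. The `SceneGraph` class, the `re_encode` function, the 3-atom dictionary, and all feature values are hypothetical and chosen only for illustration; in SGAE the dictionary is learned during training.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Toy directed scene graph with object, attribute, and relationship nodes."""
    objects: List[str] = field(default_factory=list)
    # Each attribute is (object index, attribute word).
    attributes: List[Tuple[int, str]] = field(default_factory=list)
    # Each relation is (subject index, predicate word, object index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def re_encode(x: List[float], dictionary: List[List[float]]):
    """Re-encode x as an attention-weighted sum of dictionary atoms.

    Assumed form (not stated in this summary):
        alpha = softmax(D^T x),  x_hat = sum_k alpha_k * d_k
    """
    # Dot product of x with every atom gives one attention score per atom.
    scores = [sum(xi * di for xi, di in zip(x, atom)) for atom in dictionary]
    alpha = softmax(scores)
    # Weighted sum of atoms, computed one feature dimension at a time.
    x_hat = [
        sum(a * atom[j] for a, atom in zip(alpha, dictionary))
        for j in range(len(x))
    ]
    return x_hat, alpha


# Toy graph for the abstract's example relation "person on bike".
graph = SceneGraph(
    objects=["person", "bike"],
    attributes=[(1, "red")],      # made-up attribute for illustration
    relations=[(0, "on", 1)],
)

# Hypothetical dictionary: 3 atoms in a 2-dimensional feature space.
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Made-up embedding for the relationship node "on".
node_feature = [2.0, 0.0]

x_hat, alpha = re_encode(node_feature, D)
print("attention over atoms:", alpha)
print("re-encoded feature:", x_hat)
```

Because the same D would be applied to node features from both sentence graphs and image graphs, the re-encoded features land in a common space spanned by the dictionary atoms, which is one way to read the "working memory" description above.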

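Experimental Results

Research Questions

RQ1: Can language inductive bias, such as collocation patterns and contextual inference, be extracted and transferred effectively to improve image captioning quality?
RQ2: Does the shared dictionary learned from text-only scene graph auto-encoding improve zero-shot or few-shot generalization on vision-language tasks?
RQ3: Compared with end-to-end models, does introducing a symbolic scene graph representation produce more descriptive and contextually coherent captions?
RQ4: When SGAE is used, how does the quality of the visual scene graph detector affect the performance of the final captioning model?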

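Key Findings

The SGAE-based single model achieves a new state-of-the-art CIDEr-D score of 127.8 on the MS-COCO Karpathy split.
Even with a smaller batch size (100) and fewer training epochs, the model still outperforms the GCN-LSTM model, which was trained with a batch size of 1,024 for 250 epochs.
The fused variant, SGAE fuse, achieves a CIDEr-D score of 125.5 (c40) on the official MS-COCO test server, which is competitive even against ensemble models.
Human evaluation shows that captions generated with dictionary D are markedly more descriptive than those generated without it, supporting the effectiveness of the learned inductive bias.
Sentence reconstruction ablations indicate that using dictionary D regularizes the model and improves generalization, even though it slightly lowers raw reconstruction accuracy.
The results indicate that the quality of the visual scene graph detector is a key bottleneck: even with a strong language prior, a low-quality G limits the performance gains.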


This review was generated by AI and checked by a human editor.