QUICK REVIEW

[Paper Review] Object Relational Graph with Teacher-Recommended Learning for Video Captioning

Ziqi Zhang, Yaya Shi|arXiv (Cornell University)|Feb 26, 2020

Multimodal Machine Learning Applications | References: 44 | Cited by: 39

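One-sentence summary

This paper proposes an object relational graph encoder that performs relational reasoning with a GCN, together with a teacher-recommended learning strategy that uses an external language model to improve video captioning performance.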

ABSTRACT

Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation due to the neglect of interaction between object, and sufficient training for content-related words due to long-tailed problems. In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model. The ELM generates more semantically similar word proposals which extend the ground-truth words used for training to deal with the long-tailed problem. Experimental evaluations on three benchmarks: MSVD, MSR-VTT and VATEX show the proposed ORG-TRL system achieves state-of-the-art performance. Extensive ablation studies and visualizations illustrate the effectiveness of our system.

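Research motivation and goals

Enrich the visual representation with object interactions across frames to improve video captioning quality.
Address the long-tailed word distribution by bringing linguistic knowledge from an external language model into training.
Develop a training strategy that combines visual relational reasoning with teacher-guided language learning to improve generalization.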

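Proposed method

Build a learnable object relational graph (ORG) that uses a GCN to model spatio-temporal interactions between objects.
Implement two graph variants: a partial ORG (P-ORG) that connects objects within a frame and a complete ORG (C-ORG) that connects objects across the whole video, keeping only the top-k connections.
Introduce Teacher-Recommended Learning (TRL), which uses an External Language Model (ELM) to generate soft targets and enriches training with a wider range of word proposals.
Train the caption model with a joint loss that combines cross-entropy on the hard targets with KL divergence against the ELM soft targets (L = lambda * L_KL + (1-lambda) * L_CE).
Describe a hierarchical decoder with temporal-spatial attention for word generation, which combines global and local context features.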
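The joint loss in the list above can be sketched for a single word position as follows. This is a minimal illustration, not the authors' implementation: the function name `trl_loss`, the default `lam=0.5`, the KL direction (teacher relative to student), and the assumption that the ELM soft target is already a normalized distribution over the same vocabulary are all choices made here for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def trl_loss(student_logits, hard_target, teacher_probs, lam=0.5, eps=1e-12):
    """Per-word loss L = lam * L_KL + (1 - lam) * L_CE.

    student_logits: caption-model scores over the vocabulary
    hard_target:    index of the ground-truth word
    teacher_probs:  soft target from the ELM (assumed normalized)
    lam:            mixing weight (hypothetical default)
    """
    p = softmax(student_logits)
    # Cross-entropy against the one-hot ground-truth word.
    ce = -math.log(p[hard_target] + eps)
    # KL(teacher || student); the direction is an assumption of this sketch.
    kl = sum(q * (math.log(q + eps) - math.log(p[i] + eps))
             for i, q in enumerate(teacher_probs) if q > 0)
    return lam * kl + (1 - lam) * ce

# Toy vocabulary of 4 words; the teacher spreads mass over words 1 and 2.
loss = trl_loss([2.0, 1.0, 0.5, -1.0], hard_target=1,
                teacher_probs=[0.1, 0.5, 0.4, 0.0], lam=0.5)
print(loss)
```

Setting `lam=0` reduces this to ordinary cross-entropy training, while `lam=1` trains on the teacher's soft targets alone, so the weight controls how much the recommended words influence the update.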

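Experimental results

Research questions

RQ1: How does object-level relational reasoning improve the visual representation for video captioning?
RQ2: Can external linguistic knowledge be integrated effectively into the caption model to alleviate the long-tailed word distribution?
RQ3: What effect does combining ORG-based relational encoding with TRL have on standard video captioning benchmarks?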

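Key findings

The ORG encoder improves object representations by using a GCN to model interactions between objects (P-ORG and C-ORG).
TRL uses an offline ELM (such as BERT) to provide soft targets, which alleviates the long-tailed word problem and increases linguistic diversity in the captions.
The combined ORG-TRL system achieves state-of-the-art performance on the MSVD, MSR-VTT, and VATEX benchmarks.
Ablation studies show that C-ORG gives the best results with the top-k setting (k=5), and that TRL consistently improves performance.
Qualitative results show richer, more detailed captions that capture object relations and actions.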
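To make the ORG findings above concrete, here is a rough sketch of one graph-convolution step over object features with top-k neighbor selection. It is an illustration under stated assumptions rather than the paper's exact encoder: the dot-product affinity between two linear projections, the row-wise softmax over the kept edges, and the names `org_layer`, `W_phi`, `W_psi`, and `W_r` are all introduced here for the example. Passing the objects of a single frame corresponds loosely to P-ORG, and passing all objects of a video to C-ORG.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def topk_softmax(row, k):
    """Softmax over the k largest scores in a row; other edges get weight 0."""
    keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
    m = max(row[j] for j in keep)
    exps = {j: math.exp(row[j] - m) for j in keep}
    total = sum(exps.values())
    return [exps[j] / total if j in exps else 0.0 for j in range(len(row))]

def org_layer(R, W_phi, W_psi, W_r, k):
    """One relational step: R_hat = A_hat @ R @ W_r.

    R is an (objects x features) matrix; the W_* weights would be learned
    in a real model and are fixed by hand in this sketch.
    """
    # Pairwise affinity between projected object features.
    scores = matmul(matmul(R, W_phi), transpose(matmul(R, W_psi)))
    # Keep the k strongest connections per object and normalize them.
    A_hat = [topk_softmax(row, k) for row in scores]
    return matmul(A_hat, matmul(R, W_r)), A_hat

# Three objects with 2-d features; identity weights keep the toy readable.
R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
R_hat, A_hat = org_layer(R, I2, I2, I2, k=2)
print(A_hat)
print(R_hat)
```

Each output row is a weighted mix of the features of that object's k strongest neighbors, which is the sense in which the graph enriches each object with information from related objects. The toy uses k=2 only because it has three objects; the summary reports k=5 as the best setting for C-ORG.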


This review was generated by AI and reviewed by human editors.