QUICK REVIEW

[Paper Review] Training Recurrent Answering Units with Joint Loss Minimization for VQA

Hyeonwoo Noh, Bohyung Han|arXiv (Cornell University)|Jun 12, 2016

Multimodal Machine Learning Applications|References: 26|Cited by: 69

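One-Sentence Summary

The paper proposes a recurrent visual question answering (VQA) model whose answering units share parameters, and improves performance by jointly minimizing the loss over multiple reasoning steps. By early-stopping overfitting units during training and using only the first unit at inference time, the method achieves state-of-the-art performance on the VQA dataset without data augmentation and outperforms multi-step models that use a fixed number of steps.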

ABSTRACT

We propose a novel algorithm for visual question answering based on a recurrent deep neural network, where every module in the network corresponds to a complete answering unit with attention mechanism by itself. The network is optimized by minimizing loss aggregated from all the units, which share model parameters while receiving different information to compute attention probability. For training, our model attends to a region within image feature map, updates its memory based on the question and attended image feature, and answers the question based on its memory state. This procedure is performed to compute loss in each step. The motivation of this approach is our observation that multi-step inferences are often required to answer questions while each problem may have a unique desirable number of steps, which is difficult to identify in practice. Hence, we always make the first unit in the network solve problems, but allow it to learn the knowledge from the rest of units by backpropagation unless it degrades the model. To implement this idea, we early-stop training each unit as soon as it starts to overfit. Note that, since more complex models tend to overfit on easier questions quickly, the last answering unit in the unfolded recurrent neural network is typically killed first while the first one remains last. We make a single-step prediction for a new question using the shared model. This strategy works better than the other options within our framework since the selected model is trained effectively from all units without overfitting. The proposed algorithm outperforms other multi-step attention based approaches using a single step prediction in VQA dataset.
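The per-step procedure in the abstract (attend, update memory, answer, compute a loss at each step) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the dot-product attention, the `tanh` memory update, the initialization of memory from the question, and the names `answering_unit`, `unrolled_joint_loss`, and `answer_weights` are all assumptions made for illustration. The only property taken from the abstract is that every step reuses the same parameters and that the losses of all steps are summed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def answering_unit(memory, question, regions, answer_weights):
    """One answering step: attend, update memory, score answers.

    All vectors are plain lists of floats of the same length.
    `answer_weights` (one row per candidate answer) is a hypothetical
    stand-in for the shared model parameters.
    """
    # Attention: score each image region against the current memory.
    attention = softmax([dot(memory, r) for r in regions])
    # Attended image feature: attention-weighted sum of region features.
    attended = [sum(a * r[i] for a, r in zip(attention, regions))
                for i in range(len(memory))]
    # Memory update (assumed form): combine question and attended feature.
    new_memory = [math.tanh(q + v) for q, v in zip(question, attended)]
    # Answer distribution computed from the updated memory.
    answer_probs = softmax([dot(w, new_memory) for w in answer_weights])
    return new_memory, answer_probs, attention


def unrolled_joint_loss(question, regions, answer_weights, target, num_steps):
    """Run `num_steps` units with the same parameters and sum their losses."""
    memory = list(question)  # assumed initialization
    total = 0.0
    for _ in range(num_steps):
        memory, probs, _ = answering_unit(
            memory, question, regions, answer_weights)
        total += -math.log(probs[target])  # cross-entropy at this step
    return total
```

Because the same `answer_weights` are passed to every step, a gradient of `unrolled_joint_loss` with respect to them would accumulate contributions from all units, which is the sense in which the first unit can learn from the later ones.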

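Motivation and Objectives

Address the difficulty of fixing the number of reasoning steps in advance for VQA, since the best number of steps differs from question to question.
Improve generalization and accuracy on VQA by jointly optimizing multiple answering units that share parameters.
Develop a training strategy that prevents overfitting in later reasoning steps while retaining what the earlier, more robust units have learned.
Enable efficient single-step prediction that draws on the knowledge of all units, through joint loss minimization and progressive early stopping.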

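Proposed Method

The model uses a recurrent architecture in which each answering unit processes image and question features, applies an attention mechanism to focus on relevant image regions, and updates its memory state.
All answering units share the same model parameters but receive different contextual information: later units receive the memory state produced by the preceding steps, which enables multi-step reasoning.
Training minimizes a joint loss that aggregates the losses of all units, so every unit contributes to the update of the shared parameters.
Early stopping is applied per unit: as soon as a unit's validation accuracy starts to drop, training of that unit is stopped to prevent overfitting.
At inference time, only the first answering unit is used for prediction, because it is the most robust and has been trained with knowledge from all the other units.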
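Because the units are optimized jointly across different reasoning depths, the number of steps best suited to each question is handled implicitly rather than specified in advance.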
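The per-unit early stopping and the joint loss described above can be sketched as a small bookkeeping class. This is a hedged illustration, not the paper's training code: the class `UnitEarlyStopper`, the function `joint_loss`, and the rule "deactivate a unit the first time its validation accuracy falls below its best value so far" are assumptions chosen to mirror the description.

```python
class UnitEarlyStopper:
    """Tracks which unrolled answering units still contribute to the loss."""

    def __init__(self, num_units):
        self.best_acc = [float("-inf")] * num_units
        self.active = [True] * num_units

    def update(self, val_acc):
        """Deactivate each unit whose validation accuracy has dropped.

        `val_acc` holds one validation accuracy per unit, measured after
        the latest training epoch.
        """
        for t, acc in enumerate(val_acc):
            if not self.active[t]:
                continue  # a stopped unit stays stopped
            if acc < self.best_acc[t]:
                self.active[t] = False  # unit t started to overfit
            else:
                self.best_acc[t] = acc
        return self.active


def joint_loss(step_losses, active):
    """Sum the per-step losses of the units that are still being trained."""
    return sum(loss for loss, on in zip(step_losses, active) if on)
```

For example, with three units, calling `update([0.50, 0.52, 0.53])` and then `update([0.55, 0.54, 0.51])` leaves `active` as `[True, True, False]`, and `joint_loss([1.0, 2.0, 4.0], active)` then returns `3.0`: the third unit no longer adds to the objective, while the first unit keeps training.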

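Experimental Results

Research Questions

RQ1: Can a recurrent VQA model with parameter-sharing answering units improve performance through joint optimization over multiple reasoning steps?
RQ2: Does early-stopping overfitting units during training improve generalization in single-step inference?
RQ3: Can a single answering unit trained with a multi-step joint loss outperform models with a fixed, predefined reasoning depth?
RQ4: How does the progressive early-stopping strategy affect the model's ability to handle questions that require different numbers of reasoning steps?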

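Key Findings

The proposed Ours_FULL method reaches 63.2% test-dev accuracy on the VQA dataset with VGG-16 features, outperforming other multi-step attention-based models.
Going from Ours_SS, the single-step baseline, to Ours_FULL, which adds the joint loss and early stopping, improves accuracy by 2.3 percentage points, a notable gain for VQA.
With ResNet-101 features, the model reaches 67.3% accuracy on test-dev and 61.0% on test-standard, indicating that it scales well with a stronger image encoder.
Visualizations show that Ours_FULL attends to semantically relevant image regions, whereas Ours_SS is often distracted by irrelevant objects, suggesting that Ours_FULL learns better attention.
The first answering unit, which is used for inference, performs best: thanks to early stopping, it is trained with knowledge from all units without overfitting.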


This review was generated by AI and reviewed by a human editor.