QUICK REVIEW

[Paper Review] Pythia v0.1: the Winning Entry to the VQA Challenge 2018

Yu Jiang, Vivek Natarajan | arXiv (Cornell University) | Jul 26, 2018

Multimodal Machine Learning Applications | References: 16 | Cited by: 165

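ONE-SENTENCE SUMMARY

Pythia v0.1 is a modular VQA framework that improves the up-down attention model through architectural changes, a revised learning rate schedule, feature fine-tuning, data augmentation, and diverse ensembling, reaching state-of-the-art results on VQA v2.0.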

ABSTRACT

This document describes Pythia v0.1, the winning entry from Facebook AI Research (FAIR)'s A-STAR team to the VQA Challenge 2018. Our starting point is a modular re-implementation of the bottom-up top-down (up-down) model. We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on VQA v2.0 dataset -- from 65.67% to 70.22%. Furthermore, by using a diverse ensemble of models trained with different features and on different datasets, we are able to significantly improve over the 'standard' way of ensembling (i.e. same model with different random seeds) by 1.31%. Overall, we achieve 72.27% on the test-std split of the VQA v2.0 dataset. Our code in its entirety (training, evaluation, data-augmentation, ensembling) and pre-trained models are publicly available at: https://github.com/facebookresearch/pythia

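MOTIVATION AND GOALS

Develop Pythia, a modular platform for VQA research.
Show how targeted architectural and training changes improve VQA accuracy.
Demonstrate that data augmentation and fine-tuned features improve performance.
Explore the gains from grid features and from ensemble diversity beyond the standard same-model, different-seed ensemble.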

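PROPOSED METHOD

Re-implement the bottom-up top-down (up-down) attention model as a modular framework.
Replace the gated tanh with weight normalization followed by ReLU; fuse modalities with a Hadamard (element-wise) product and use a sigmoid classifier.
Use 300D GloVe embeddings, a GRU-based question encoder, and a question attention module.
Train with Adamax, using a warm-up learning rate schedule and step-wise learning rate decay.
Fine-tune the bottom-up features using a Detectron FPN-based detector, and use 2048D fc6/fc7 features.
Augment the training data with Visual Genome and VisDial; mirror images and swap the "left" and "right" tokens in the associated text; combine grid features with 100 bounding-box proposals.
Build two kinds of ensembles: (i) the same model with different seeds; (ii) diverse models trained with different features and data sources.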
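
Two of the training-side steps above, the warm-up plus step-decay learning rate schedule and the left/right token swap used when mirroring images, are simple enough to sketch in plain Python. This is a minimal illustration under stated assumptions, not the Pythia implementation: the function names `warmup_step_lr` and `mirror_text` are hypothetical, and every default value (base rate, warm-up length, decay milestones) is a placeholder rather than a setting taken from the paper.

```python
import re


def warmup_step_lr(step, base_lr=0.01, warmup_steps=1000, warmup_start=0.2,
                   decay_steps=(5000, 7000), decay_factor=0.1):
    """Learning rate at iteration `step`: linear warm-up, then step decay.

    All defaults are illustrative assumptions, not values from the paper.
    """
    if step < warmup_steps:
        # Ramp the multiplier linearly from warmup_start up to 1.0.
        alpha = step / warmup_steps
        return base_lr * (warmup_start + (1.0 - warmup_start) * alpha)
    # Apply decay_factor once for every milestone already reached.
    passed = sum(1 for milestone in decay_steps if step >= milestone)
    return base_lr * decay_factor ** passed


_SWAP = {"left": "right", "right": "left"}


def mirror_text(text):
    """Swap the whole words 'left' and 'right' in a question or answer."""
    def repl(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        # Keep a leading capital, e.g. "Left" becomes "Right".
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b(left|right)\b", repl, text, flags=re.IGNORECASE)


if __name__ == "__main__":
    for s in (0, 500, 1000, 5000, 7000):
        print(s, warmup_step_lr(s))
    print(mirror_text("What is to the left of the dog?"))
```

A plain word swap like this also flips "right" when it means "correct", so a real pipeline would likely need to filter such questions. The mirrored text is only valid when paired with the horizontally flipped image (or its flipped features).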

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can modularizing VQA research into interchangeable components improve reuse and performance?
RQ2: What is the impact of architectural tweaks (activation, fusion), learning rate schedules, and feature fine-tuning on VQA accuracy?
RQ3: Do data augmentation and additional grid-based image features improve performance beyond bottom-up features alone?
RQ4: Does diverse-model ensembling outperform ensembles built from identical architectures with different seeds?

Key Findings

Baseline up-down achieved 65.32% test-dev and 65.67% test-std.
Adaptations to the architecture raised test-dev to 66.91% (no test-std reported).
Learning schedule improvements raised test-dev to 68.05%.
Fine-tuning bottom-up features raised test-dev to 68.49%.
Data augmentation raised test-dev to 69.24%.
Grid features raised test-dev to 69.81%.
Using 100 object proposals raised test-dev to 70.01% and test-std to 70.24%.
Ensembling 30 diverse models yielded 72.18% test-dev and 72.27% test-std (state-of-the-art).
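
The contribution of each step can be read off the test-dev numbers listed above by subtracting consecutive rows. The short script below does that arithmetic. The row labels and the helper name `incremental_gains` are illustrative, and the last row compares the ensemble against the best single model rather than adding another single-model change.

```python
# Test-dev accuracies (%) copied from the findings listed above.
steps = [
    ("up-down baseline", 65.32),
    ("architecture changes", 66.91),
    ("learning schedule", 68.05),
    ("fine-tuned features", 68.49),
    ("data augmentation", 69.24),
    ("grid features", 69.81),
    ("100 proposals", 70.01),
    ("30-model diverse ensemble", 72.18),
]


def incremental_gains(rows):
    """Return (name, gain over the previous row) for every row after the first."""
    return [(name, round(acc - prev_acc, 2))
            for (_, prev_acc), (name, acc) in zip(rows, rows[1:])]


gains = incremental_gains(steps)
for name, gain in gains:
    print(f"{name:28s} +{gain:.2f}")
```

On these figures the single-model changes add up to 4.69 points on test-dev (65.32% to 70.01%), with the architecture and learning schedule changes contributing the largest shares (1.59 and 1.14 points), and the diverse ensemble adds a further 2.17 points.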

This review was generated by AI and reviewed by a human editor.