QUICK REVIEW

[Paper Review] Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Kexin Yi, Jia-Jun Wu | arXiv (Cornell University) | Oct 4, 2018

Multimodal Machine Learning Applications | References: 48 | Cited by: 234

One-line summary

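NS-VQA combines neural scene parsing with a symbolic program executor and reasons over a structured scene representation, reaching near-perfect accuracy on CLEVR while enabling data-efficient learning with interpretable reasoning.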

ABSTRACT

We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs on a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it performs well after learning on a small number of training data; it can also encode an image into a compact representation, requiring less storage than existing methods for offline question answering. Third, symbolic program execution offers full transparency to the reasoning process; we are thus able to interpret and diagnose each execution step.

Motivation and objectives

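Motivates separating visual perception and language understanding from reasoning in VQA.
Proposes a neural-symbolic architecture that parses the image into a structural scene representation and the question into an executable symbolic program.
Demonstrates the data efficiency, memory efficiency, and interpretability of symbolic execution on CLEVR and related datasets.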

Proposed method

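The scene parser (de-renderer) uses Mask R-CNN to generate object proposals and predict attributes, then applies a ResNet-34 backbone to the cropped regions to estimate spatial attributes.
The question parser (program generator) is an attention-based seq2seq model (bidirectional LSTM encoder, LSTM decoder with attention) that maps a question to a hierarchical program.
The program executor applies a set of deterministic functional modules to the structural scene representation, following the generated program, to produce the answer.
Training consists of supervised pretraining of the question parser on a small number of (question, program) pairs, followed by REINFORCE fine-tuning on (question, answer) pairs.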
The executable program is fully symbolic and transparent: modules run in sequence starting from a scene token, and if execution hits an error, the output is sampled at random.
Memory efficiency comes from the compact structural representation, which takes less than 100 bytes per image, far less than attention-based baselines need.
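
The execution step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the module names (`filter`, `unique`, `relate_left`, `query`, `count`, `exist`), the scene fields, the rule that a smaller `x` means "left", and the fallback `ANSWER_POOL` are all assumptions made for illustration. It shows only the general idea of running deterministic modules in sequence over a structured scene, with a random answer when execution fails.

```python
import random
from typing import Dict, List, Union

Obj = Dict[str, Union[int, float, str]]

# Hypothetical scene-parser output: one attribute record per detected object.
SCENE: List[Obj] = [
    {"id": 0, "shape": "cube", "color": "red", "material": "rubber", "size": "large", "x": -1.0},
    {"id": 1, "shape": "sphere", "color": "blue", "material": "metal", "size": "small", "x": 0.5},
    {"id": 2, "shape": "cube", "color": "blue", "material": "metal", "size": "small", "x": 2.0},
]

# Hypothetical answer pool for the random fallback on execution errors.
ANSWER_POOL = ["yes", "no", "0", "1", "2", "3", "red", "blue", "rubber", "metal"]


class ExecutionError(Exception):
    """A module received input of the wrong type or an ambiguous object set."""


def _as_set(state):
    if not isinstance(state, list):
        raise ExecutionError("expected a set of objects")
    return state


def _as_object(state):
    if not isinstance(state, dict):
        raise ExecutionError("expected a single object")
    return state


def run_module(op, arg, state, scene):
    """Apply one deterministic module to the current state."""
    if op == "scene":
        return list(scene)
    if op == "filter":
        key, value = arg
        return [o for o in _as_set(state) if o[key] == value]
    if op == "unique":
        objs = _as_set(state)
        if len(objs) != 1:
            raise ExecutionError("expected exactly one object")
        return objs[0]
    if op == "relate_left":  # assumption: smaller x means further left
        anchor = _as_object(state)
        return [o for o in scene if o["x"] < anchor["x"]]
    if op == "query":
        return _as_object(state)[arg]
    if op == "count":
        return len(_as_set(state))
    if op == "exist":
        return len(_as_set(state)) > 0
    raise ExecutionError(f"unknown module: {op}")


def execute(program, scene, rng=None):
    """Run (op, arg) steps in order; sample a random answer on any error."""
    rng = rng if rng is not None else random.Random(0)
    state = None
    try:
        for op, arg in program:
            state = run_module(op, arg, state, scene)
        if state is None or isinstance(state, (list, dict)):
            raise ExecutionError("program did not end in an answer")
    except ExecutionError:
        return rng.choice(ANSWER_POOL)
    if isinstance(state, bool):  # check bool before int: bool is an int subclass
        return "yes" if state else "no"
    return str(state)


# "How many blue cubes are there?"
how_many_blue_cubes = [
    ("scene", None),
    ("filter", ("color", "blue")),
    ("filter", ("shape", "cube")),
    ("count", None),
]
print(execute(how_many_blue_cubes, SCENE))  # 1
```

Because every intermediate `state` is a plain Python value, each step can be printed and inspected, which is the kind of step-by-step transparency the review attributes to symbolic execution.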

Experimental results

Research questions

RQ1: Can a neural-symbolic VQA system disentangle perception, language understanding, and reasoning while preserving accuracy?
RQ2: How data-efficient is a largely symbolic reasoning pipeline when learning from limited program annotations?
RQ3: Does symbolic execution improve interpretability and generalization to unseen attribute combinations and human-generated questions?
RQ4: Can the approach generalize to new visual domains (e.g., Minecraft) and maintain reasoning capabilities?

Key findings

Method	Count	Exist	Compare Number	Compare Attribute	Query Attribute	Overall
Humans (Johnson et al., 2017b)	86.7	96.6	86.4	96.0	95.0	92.6
CNN+LSTM+SAN (Johnson et al., 2017b)	59.7	77.9	75.1	70.8	80.9	73.2
N2NMN * (Hu et al., 2017)	68.5	85.7	84.9	88.7	90.0	83.7
Dependency Tree (Cao et al., 2018)	81.4	94.2	81.6	97.1	90.5	89.3
CNN+LSTM+RN (Santoro et al., 2017)	90.1	97.8	93.6	97.1	97.9	95.5
IEP * (Johnson et al., 2017b)	92.7	97.1	98.7	98.9	98.1	96.9
CNN+GRU+FiLM (Perez et al., 2018)	94.5	99.2	93.8	99.0	99.2	97.6
DDRprog * (Suarez et al., 2018)	96.5	98.8	98.4	99.0	99.1	98.3
MAC (Hudson and Manning, 2018)	97.1	99.5	99.1	99.5	99.5	98.9
TbD+reg+hres * (Mascharka et al., 2018)	97.6	99.2	99.4	99.6	99.5	99.1
NS-VQA (ours, 90 programs)	64.5	87.4	53.7	77.4	79.7	74.4
NS-VQA (ours, 180 programs)	85.0	92.9	83.4	90.6	92.2	89.5
NS-VQA (ours, 270 programs)	99.7	99.9	99.9	99.8	99.8	99.8

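NS-VQA reaches near-perfect accuracy on CLEVR (99.8% overall with 270 program annotations), outperforming prior methods.
Offline question answering needs far less memory: the structural representation takes under 100 bytes per image, versus roughly 20 KB for attention-based methods.
NS-VQA's program accuracy rises with the number of ground-truth programs used for training (e.g., 88% program accuracy with 500 annotations, near perfect with 9K).
With limited supervision, it generalizes to unseen attribute combinations (CLEVR-CoGenT) and to human-written questions (CLEVR-Humans).
NS-VQA extends to a different visual domain (Minecraft) with comparable reasoning ability, though occlusion remains a challenge.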
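
To make the memory figure concrete, here is a rough Python sketch of how a structured scene could fit in under 100 bytes. The layout is hypothetical and not taken from the paper: it assumes CLEVR's attribute vocabulary (2 sizes, 2 materials, 3 shapes, 8 colors), at most 10 objects per scene, and three coordinates stored as half-precision floats, and it leaves out any other fields (such as pose) that a real parser might store.

```python
import struct

# CLEVR attribute vocabularies; the bit layout below is an assumption.
SIZES = ["small", "large"]                      # 1 bit
MATERIALS = ["rubber", "metal"]                 # 1 bit
SHAPES = ["cube", "sphere", "cylinder"]         # 2 bits
COLORS = ["gray", "red", "blue", "green",
          "brown", "purple", "cyan", "yellow"]  # 3 bits

# One attribute byte plus three float16 coordinates, no padding: 7 bytes.
RECORD = struct.Struct("<B3e")


def pack_object(obj):
    """Pack the four categorical attributes into one byte, then append x, y, z."""
    attrs = (SIZES.index(obj["size"])
             | (MATERIALS.index(obj["material"]) << 1)
             | (SHAPES.index(obj["shape"]) << 2)
             | (COLORS.index(obj["color"]) << 4))
    return RECORD.pack(attrs, obj["x"], obj["y"], obj["z"])


def unpack_object(blob):
    """Inverse of pack_object for a single 7-byte record."""
    attrs, x, y, z = RECORD.unpack(blob)
    return {
        "size": SIZES[attrs & 1],
        "material": MATERIALS[(attrs >> 1) & 1],
        "shape": SHAPES[(attrs >> 2) & 3],
        "color": COLORS[(attrs >> 4) & 7],
        "x": x, "y": y, "z": z,
    }


def pack_scene(objects):
    """Concatenate fixed-size object records into one byte string."""
    return b"".join(pack_object(o) for o in objects)


example = {"size": "large", "material": "metal", "shape": "cylinder",
           "color": "cyan", "x": 1.5, "y": -2.0, "z": 0.25}
scene_bytes = pack_scene([example] * 10)
print(len(scene_bytes))  # 70
```

Under these assumptions a full 10-object scene takes 70 bytes, which is consistent in order of magnitude with the under-100-byte figure reported above; the comparison with the roughly 20 KB for attention-based methods is taken from the review, not derived here.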

This review was generated by AI and checked by a human editor.