QUICK REVIEW

[Paper Review] Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Kexin Yi, Jia-Jun Wu | arXiv (Cornell University) | Oct 4, 2018

Multimodal Machine Learning Applications | References: 48 | Cited by: 234

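One-Sentence Summary

NS-VQA combines neural scene parsing with a symbolic program executor to reason over structured scene representations, achieving near-perfect accuracy on CLEVR together with interpretable reasoning and data-efficient learning.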

ABSTRACT

We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs on a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it performs well after learning on a small number of training data; it can also encode an image into a compact representation, requiring less storage than existing methods for offline question answering. Third, symbolic program execution offers full transparency to the reasoning process; we are thus able to interpret and diagnose each execution step.

Motivation and Objectives

Motivate disentangling visual perception and language understanding from reasoning in VQA.
Propose a neural-symbolic architecture that parses scenes and questions into a symbolic program for execution.
Demonstrate data efficiency, memory efficiency, and interpretability of symbolic execution on CLEVR and related datasets.

Proposed Method

Scene parser (de-renderer) uses Mask R-CNN to generate object proposals and predict attributes, then uses a ResNet-34 backbone on cropped segments for spatial attributes.
Question parser (program generator) is an attention-based seq2seq model (bidirectional LSTM encoder, LSTM decoder with attention) that maps questions to hierarchical programs.
Program executor applies a deterministic set of functional modules to the structural scene representation according to the generated program to produce answers.
Training involves supervised pretraining of the question parser on a small set of (question, program) pairs, followed by REINFORCE fine-tuning on (question, answer) pairs.
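The executable program is fully symbolic and transparent: modules are arranged in a sequence that starts from a scene token, and when a type mismatch occurs between adjacent modules the executor raises an error flag and randomly samples an answer from the possible outputs of the final module.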
Memory efficiency is achieved by using compact structural representations (less than 100 bytes per image) compared to attention-based baselines.
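
The executor behaviour described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the module names (`filter_color[blue]`, `count`, and so on), the three-object scene, and the answer spaces are all hypothetical, chosen only to show a deterministic module sequence that starts at a scene token and falls back to random sampling on a type mismatch.

```python
import random

# Hypothetical structural scene representation: one record per object,
# as a scene parser might emit. Values are illustrative only.
SCENE = [
    {"shape": "cube", "color": "red", "size": "large", "material": "rubber"},
    {"shape": "sphere", "color": "blue", "size": "small", "material": "metal"},
    {"shape": "cube", "color": "blue", "size": "small", "material": "metal"},
]

ERROR = object()  # sentinel raised when a module receives the wrong input type


def m_scene(_inp, scene):
    # The scene token ignores its input and emits the full object list.
    return list(scene)


def make_filter(attr, value):
    def module(inp, _scene):
        if not isinstance(inp, list):
            return ERROR  # type mismatch: a filter needs an object list
        return [obj for obj in inp if obj[attr] == value]
    return module


def m_count(inp, _scene):
    return len(inp) if isinstance(inp, list) else ERROR


def m_exist(inp, _scene):
    if not isinstance(inp, list):
        return ERROR
    return "yes" if inp else "no"


# Hypothetical module table and per-module answer spaces.
MODULES = {
    "scene": m_scene,
    "filter_color[blue]": make_filter("color", "blue"),
    "filter_shape[cube]": make_filter("shape", "cube"),
    "count": m_count,
    "exist": m_exist,
}
ANSWER_SPACE = {"count": list(range(11)), "exist": ["yes", "no"]}


def execute(program, scene, rng=random):
    """Run the program tokens in order; assumes the last token is an answer module."""
    state = None
    for token in program:
        state = MODULES[token](state, scene)
        if state is ERROR:
            # Fall back to a random answer from the final module's output space.
            return rng.choice(ANSWER_SPACE[program[-1]])
    return state


print(execute(["scene", "filter_color[blue]", "count"], SCENE))  # 2
print(execute(["scene", "filter_color[blue]", "filter_shape[cube]", "exist"], SCENE))  # yes
```

Because every intermediate `state` is a plain Python value, each step can be printed and inspected, which is the kind of step-by-step transparency the summary attributes to symbolic execution.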

Experimental Results

Research Questions

RQ1: Can a neural-symbolic VQA system disentangle perception, language understanding, and reasoning while preserving accuracy?
RQ2: How data-efficient is a largely symbolic reasoning pipeline when learning from limited program annotations?
RQ3: Does symbolic execution improve interpretability and generalization to unseen attribute combinations and human-generated questions?
RQ4: Can the approach generalize to new visual domains (e.g., Minecraft) and maintain reasoning capabilities?

Key Findings

Methods	Count	Exist	Compare Number	Compare Attribute	Query Attribute	Overall
Humans (Johnson et al., 2017b)	86.7	96.6	86.4	96.0	95.0	92.6
CNN+LSTM+SAN (Johnson et al., 2017b)	59.7	77.9	75.1	70.8	80.9	73.2
N2NMN * (Hu et al., 2017)	68.5	85.7	84.9	88.7	90.0	83.7
Dependency Tree (Cao et al., 2018)	81.4	94.2	81.6	97.1	90.5	89.3
CNN+LSTM+RN (Santoro et al., 2017)	90.1	97.8	93.6	97.1	97.9	95.5
IEP * (Johnson et al., 2017b)	92.7	97.1	98.7	98.9	98.1	96.9
CNN+GRU+FiLM (Perez et al., 2018)	94.5	99.2	93.8	99.0	99.2	97.6
DDRprog * (Suarez et al., 2018)	96.5	98.8	98.4	99.0	99.1	98.3
MAC (Hudson and Manning, 2018)	97.1	99.5	99.1	99.5	99.5	98.9
TbD+reg+hres * (Mascharka et al., 2018)	97.6	99.2	99.4	99.6	99.5	99.1
NS-VQA (ours, 90 programs)	64.5	87.4	53.7	77.4	79.7	74.4
NS-VQA (ours, 180 programs)	85.0	92.9	83.4	90.6	92.2	89.5
NS-VQA (ours, 270 programs)	99.7	99.9	99.9	99.8	99.8	99.8

NS-VQA achieves near-perfect accuracy on CLEVR (up to 99.8% with 270 program annotations), surpassing all prior methods listed in the table.
The method requires significantly less memory for offline QA (structural representations <100 bytes per image vs. ~20KB for attention-based methods).
NS-VQA recovers underlying programs with high accuracy, especially as the number of pretraining programs increases (e.g., 88% program accuracy with 500 annotations; near-perfect with 9K).
The model generalizes to unseen attribute combinations (CLEVR-CoGenT) and human-generated questions (CLEVR-Humans) with limited supervision.
NS-VQA extends to different scene contexts (Minecraft) with comparable reasoning ability, though occlusion remains a challenge.
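
To make the memory figure above concrete, here is a back-of-the-envelope Python sketch of how a structural scene could fit in under 100 bytes. The byte layout is an assumption made for illustration (the summary does not specify the encoding): four categorical attributes are bit-packed into one byte and a 3D position is stored as three half-precision floats. The attribute vocabularies and the 10-object scene are likewise hypothetical.

```python
import struct

# Hypothetical CLEVR-style attribute vocabularies.
SIZES = ["small", "large"]
SHAPES = ["cube", "sphere", "cylinder"]
MATERIALS = ["rubber", "metal"]
COLORS = ["gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"]

# Assumed layout: 1 attribute byte + 3 float16 coordinates = 7 bytes per object.
RECORD = "<B3e"


def pack_object(obj):
    # Bit layout: size (1 bit) | shape (2 bits) | material (1 bit) | color (3 bits).
    attrs = (
        SIZES.index(obj["size"])
        | (SHAPES.index(obj["shape"]) << 1)
        | (MATERIALS.index(obj["material"]) << 3)
        | (COLORS.index(obj["color"]) << 4)
    )
    return struct.pack(RECORD, attrs, *obj["position"])


def pack_scene(objects):
    return b"".join(pack_object(obj) for obj in objects)


def unpack_scene(blob):
    objects = []
    for attrs, x, y, z in struct.iter_unpack(RECORD, blob):
        objects.append({
            "size": SIZES[attrs & 1],
            "shape": SHAPES[(attrs >> 1) & 3],
            "material": MATERIALS[(attrs >> 3) & 1],
            "color": COLORS[(attrs >> 4) & 7],
            "position": (x, y, z),
        })
    return objects


# A made-up scene with 10 objects.
scene = [
    {"size": SIZES[i % 2], "shape": SHAPES[i % 3], "material": MATERIALS[i % 2],
     "color": COLORS[i % 8], "position": (i * 0.5, -1.5, 0.25)}
    for i in range(10)
]
blob = pack_scene(scene)
print(len(blob))  # 70
```

Under these assumptions a 10-object scene occupies 70 bytes, which is consistent in order of magnitude with the "<100 bytes per image" figure. Image feature maps, by contrast, have to store dense floating-point tensors.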


This review was generated by AI and reviewed by a human editor.