QUICK REVIEW

[Paper Review] An Analysis of Visual Question Answering Algorithms

Kushal Kafle, Christopher Kanan | arXiv (Cornell University) | Mar 28, 2017

Multimodal Machine Learning Applications | 35 references | 22 citations

TL;DR

This paper introduces the Task-Driven Image Understanding Challenge (TDIUC), a new VQA benchmark with over 1.6M questions across 12 categories, including absurd questions (questions that are meaningless for the given image) that test whether a model reasons about image content. It proposes evaluation metrics that compensate for over-represented question types, shows that simple models can surpass more complex ones by exploiting dataset bias, and reports that attention mechanisms help most on object-localization tasks such as color and counting.

ABSTRACT

In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different methods. In this paper, we analyze existing VQA algorithms using a new dataset. It contains over 1.6 million questions organized into 12 different categories. We also introduce questions that are meaningless for a given image to force a VQA system to reason about image content. We propose new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms. We analyze the performance of both baseline and state-of-the-art VQA models, including multi-modal compact bilinear pooling (MCB), neural module networks, and recurrent answering units. Our experiments establish how attention helps certain categories more than others, determine which models work better than others, and explain how simple models (e.g. MLP) can surpass more complex models (MCB) by simply learning to answer large, easy question categories.

Motivation & Objective

Address the critical issue of dataset bias in existing VQA benchmarks, which inflates performance scores and hinders fair comparison of algorithms.
Develop a new VQA dataset (TDIUC) with 12 explicitly defined question types to enable fine-grained analysis of algorithm capabilities.
Introduce evaluation metrics that compensate for over-represented question types and imbalanced answer distributions to improve fairness in performance assessment.
Investigate whether VQA models can detect absurd questions and differentiate between valid and invalid image-question pairs.
Analyze the impact of attention mechanisms and model architecture on performance across diverse question types.

Proposed method

Created TDIUC, a new VQA dataset with 1.6 million questions grouped into 12 distinct categories based on visual reasoning tasks.
Incorporated 'absurd questions' (questions that are meaningless for a given image) to evaluate whether models reason about image content rather than rely on linguistic patterns.
Proposed new evaluation metrics: mean-per-type accuracy, which averages accuracy over the 12 question types rather than over all questions (reported as both an arithmetic and a harmonic mean), and normalized variants that also compensate for imbalanced answer distributions within each type.
Balanced the distribution of 'yes/no' answers in object presence questions to assess the impact of label imbalance on model generalization.
Trained and evaluated multiple models (MLP, MCB, MCB-A, RAU, and NMN) on both the full TDIUC and subsets of it to compare performance across question types.
Used attention mechanisms (e.g., in MCB-A and RAU) to localize relevant image regions and improve performance on object-dependent question types.
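The bias-compensating metrics listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the `(question_type, truth, prediction)` record layout, and the toy data are all assumptions made for the example. It computes simple accuracy, the arithmetic and harmonic means of per-type accuracy, and a normalized variant that averages accuracy over the unique ground-truth answers within each type.

```python
from collections import defaultdict
from statistics import harmonic_mean, mean


def per_type_accuracy(records):
    """Return {question_type: accuracy} from (question_type, truth, prediction) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, truth, pred in records:
        total[qtype] += 1
        correct[qtype] += int(truth == pred)
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def normalized_per_type_accuracy(records):
    """Average accuracy over unique ground-truth answers within each question type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, truth, pred in records:
        total[(qtype, truth)] += 1
        correct[(qtype, truth)] += int(truth == pred)
    by_type = defaultdict(list)
    for (qtype, truth), count in total.items():
        by_type[qtype].append(correct[(qtype, truth)] / count)
    return {qtype: mean(accs) for qtype, accs in by_type.items()}


def summarize(records):
    """Compare simple accuracy with the two mean-per-type summaries."""
    values = list(per_type_accuracy(records).values())
    return {
        "simple": sum(truth == pred for _, truth, pred in records) / len(records),
        "arithmetic_mpt": mean(values),
        "harmonic_mpt": harmonic_mean(values),
    }


# Toy data (made up for illustration): a large, easy type and a small, hard one.
records = (
    [("object_presence", "yes", "yes")] * 6
    + [("object_presence", "no", "yes")] * 2
    + [("counting", "2", "2")] * 1
    + [("counting", "3", "2")] * 1
)
summary = summarize(records)
normalized = normalized_per_type_accuracy(records)
```

On this toy data, simple accuracy is dominated by the larger question type, while the per-type means weight both types equally and the harmonic mean is pulled further toward the weaker type. The normalized variant additionally penalizes a model that always predicts the most common answer within a type.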

Experimental results

Research questions

RQ1: To what extent does dataset bias in existing VQA benchmarks hinder fair comparison of algorithm performance?
RQ2: Can VQA models effectively detect absurd questions that are invalid for a given image, indicating true reasoning rather than pattern matching?
RQ3: Which question types benefit most from attention mechanisms, and how does attention improve performance on specific visual reasoning tasks?
RQ4: Why do simpler models like MLP outperform more complex models like MCB in some cases, and is this due to dataset bias?
RQ5: How does balancing answer distributions (e.g., 50% 'yes', 50% 'no' for object presence) affect model generalization and performance on rare question types?
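To make the 50/50 balancing in RQ5 concrete, one possible way to even out a skewed answer distribution is to downsample the majority answer. This review does not state how TDIUC's balance was obtained, so the downsampling strategy, the `balance_by_answer` helper, and the sample dictionaries below are assumptions for illustration only.

```python
import random
from collections import defaultdict


def balance_by_answer(samples, seed=0):
    """Downsample so every answer appears as often as the rarest answer."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["answer"]].append(sample)
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical object-presence questions skewed 80/20 toward "yes".
samples = (
    [{"question": f"Is there a dog? #{i}", "answer": "yes"} for i in range(80)]
    + [{"question": f"Is there a cat? #{i}", "answer": "no"} for i in range(20)]
)
balanced = balance_by_answer(samples)
```

With a balanced subset, a model that always answers 'yes' scores only 50% on object presence, so the question type can no longer be solved by learning the answer prior alone.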

Key findings

The Q+I model achieved 48% accuracy on activity recognition when trained without absurd questions, but only 24% when trained with them, indicating poor discrimination between real and absurd questions.
The MCB model achieved 68.83% accuracy on the full TDIUC dataset, outperforming simpler models like MLP (62.44%) and Q+I (61.34%), but the Q+I model surpassed MCB on certain categories due to overfitting to high-frequency, easy questions.
Attention mechanisms (MCB-A) significantly improved performance on object-localization tasks: color recognition (+12.5%), attribute recognition (+10.3%), and counting (+11.2%) compared to non-attentive MCB.
Balancing 'yes/no' answer distributions in object presence questions improved MCB-A's performance from 11.2% (for 'no' answers) to 92.26% after retraining on TDIUC, demonstrating that bias in training data severely limits generalization.
The RAU model showed strong performance in detecting absurd questions and achieved 68.83% accuracy on the full TDIUC, outperforming NMN, which struggled due to errors in S-expression parsing of complex questions.
Models trained on datasets with imbalanced question types (e.g., COCO-VQA) perform poorly on rare question types like 'Why' and 'Where', even when they achieve high overall accuracy, highlighting the limitations of standard evaluation metrics.
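As a rough intuition for why attention helps on object-localization questions, the sketch below applies generic dot-product soft attention over image-region features. It is not the attention used in MCB-A or RAU; the two-dimensional features, the `attend` helper, and the scoring function are assumptions chosen to keep the example small.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(question_vec, region_feats):
    """Weight image regions by dot-product relevance to the question, then pool them."""
    scores = [
        sum(q * r for q, r in zip(question_vec, feats)) for feats in region_feats
    ]
    weights = softmax(scores)
    dim = len(region_feats[0])
    pooled = [
        sum(weight * feats[d] for weight, feats in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, pooled


# Hypothetical features: region 0 matches the question, regions 1 and 2 do not.
question_vec = [1.0, 0.0]
region_feats = [[5.0, 0.0], [0.0, 5.0], [0.0, 1.0]]
weights, pooled = attend(question_vec, region_feats)
```

Here almost all of the attention weight falls on the region aligned with the question, so the pooled feature reflects that region rather than an average of the whole image. That matters when the answer depends on one object's color or on counting specific instances.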

This review was created by AI and reviewed by human editors.