QUICK REVIEW

[Paper Review] Masked Non-Autoregressive Image Captioning

Junlong Gao, Meng Xi | arXiv (Cornell University) | Jun 3, 2019

Multimodal Machine Learning Applications | 25 references | 25 citations

TL;DR

This paper proposes masked non-autoregressive decoding for image captioning: a masked language model is trained on input sequences masked at several ratios, and at inference captions are generated in parallel over several stages, from a fully masked sequence to a fully unmasked one. According to the authors, generating salient visual content first and refining the language afterwards speeds up inference, reduces error accumulation, preserves semantic content better, and yields more diverse captions than autoregressive and standard non-autoregressive baselines.

ABSTRACT

Existing captioning models often adopt the encoder-decoder architecture, where the decoder uses autoregressive decoding to generate captions, such that each token is generated sequentially given the preceding generated tokens. However, autoregressive decoding results in issues such as sequential error accumulation, slow generation, improper semantics and lack of diversity. Non-autoregressive decoding has been proposed to tackle slow generation for neural machine translation but suffers from the multimodality problem due to the indirect modeling of the target distribution. In this paper, we propose masked non-autoregressive decoding to tackle the issues of both autoregressive decoding and non-autoregressive decoding. In masked non-autoregressive decoding, we mask the input sequences at several ratios during training, and generate captions in parallel in several stages from a totally masked sequence to a totally non-masked sequence in a compositional manner during inference. Experimentally, our proposed model can preserve semantic content more effectively and can generate more diverse captions.

Motivation & Objective

To address sequential error accumulation and slow inference in autoregressive image captioning.
To overcome the multimodality problem in non-autoregressive decoding by modeling the target distribution more directly.
To improve caption diversity and semantic richness by decoupling visual and linguistic generation stages.
To enable faster, more accurate caption generation through a multi-stage, masked inference process.

Proposed method

The model is a masked language model trained on input sequences masked at multiple ratios (e.g., 0.4, 0.6, 0.8, 1.0).
During inference, the model generates captions in multiple stages, starting from a fully masked sequence and progressively reducing masking to produce a complete caption.
Each stage uses a bidirectional transformer decoder to refine the caption based on both visual features and the partially generated sequence.
The method employs a compositional generation process: early stages focus on salient visual content, while later stages refine linguistic structure and semantics.
The masked-input strategy is inspired by BERT and is intended to model the target distribution more directly than standard non-autoregressive decoding does.
The final caption is generated through iterative refinement, where each stage improves upon the previous one using the same encoder-decoder architecture with masked inputs.
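
The staged procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `[MASK]` token name, the `mask_caption` and `staged_decode` helpers, the random choice of masked positions, and the `predict` callable (a stand-in for the trained captioning model) are all assumptions, as is the decreasing inference schedule 1.0, 0.8, 0.6, 0.4, which simply reverses the training ratios quoted above.

```python
import random

MASK = "[MASK]"  # assumed name for the mask token


def mask_caption(tokens, ratio, rng):
    """Replace round(ratio * len(tokens)) randomly chosen positions with MASK."""
    n_mask = round(ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]


def staged_decode(predict, length, ratios=(1.0, 0.8, 0.6, 0.4), rng=None):
    """Run one round of staged decoding over a fixed caption length.

    `predict` is a hypothetical stand-in for the trained model: it receives a
    partially masked token list and returns a full-length list of tokens.
    """
    rng = rng or random.Random(0)
    caption = [MASK] * length  # stage 1 starts from a fully masked sequence
    for ratio in ratios:
        # Re-mask the previous stage's output at the current (lower) ratio,
        # then let the model fill every masked position in parallel.
        model_input = mask_caption(caption, ratio, rng)
        caption = predict(model_input)
    return caption
```

In a real system `predict` would also condition on image features; it is left abstract here so that only the masking schedule is shown.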

Experimental results

Research questions

RQ1: Can a masked non-autoregressive decoding strategy reduce error propagation and improve inference speed in image captioning?
RQ2: Can staged, multi-ratio masking improve semantic accuracy and diversity compared to standard autoregressive or non-autoregressive methods?
RQ3: Does a first-visual-then-linguistic generation process lead to better preservation of salient visual content in generated captions?
RQ4: Can the model effectively model the true target distribution despite the indirect supervision in non-autoregressive settings?

Key findings

The proposed method achieves a BLEU-4 score of 83.86 and a CIDEr score of 91.62 on the MS-COCO test set, outperforming autoregressive baselines.
The model generates more diverse captions, with a unique caption percentage of 12.53% and vocabulary usage of 11.62%, indicating broader lexical coverage.
Performance improves from stage to stage in both rounds of inference; round 2, which takes the outputs of round 1 as its input, gives better results than round 1 at the cost of only one additional round.
Longer sequence lengths improve SP scores, indicating better semantic coverage, while intermediate lengths yield optimal CD scores for syntactic and semantic correctness.
The method reduces reliance on frequent n-grams from training data, leading to more semantically accurate and less repetitive captions.
The results suggest that masked non-autoregressive decoding mitigates the multimodality problem and enables faster, more accurate caption generation.
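
The diversity figures quoted above (unique caption percentage and vocabulary usage) can be computed along the following lines. The review does not define either metric, so the definitions here are assumptions: "unique" is taken to mean distinct among the generated captions, and "vocabulary usage" the share of a given vocabulary that appears in at least one generated caption. The function names are hypothetical, and the paper may define both quantities differently.

```python
def unique_caption_percentage(captions):
    """Percentage of generated captions that are distinct from one another."""
    if not captions:
        return 0.0
    return 100.0 * len(set(captions)) / len(captions)


def vocabulary_usage(captions, vocabulary):
    """Percentage of the vocabulary that occurs in at least one caption."""
    vocab = set(vocabulary)
    if not vocab:
        return 0.0
    # Whitespace tokenization is an assumption made for this sketch.
    used = {word for caption in captions for word in caption.split()}
    return 100.0 * len(used & vocab) / len(vocab)
```

Both functions return percentages, so their outputs are on the same scale as the 12.53% and 11.62% figures reported above.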

This review was created by AI and reviewed by human editors.