QUICK REVIEW

[Paper Review] CPTR: Full Transformer Network for Image Captioning

Wei Liu, Sihan Chen|arXiv (Cornell University)|Jan 26, 2021
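Multimodal Machine Learning Applications|References: 20|Cited by: 108
One-Sentence Summary

CPTR replaces the CNN encoder with a full Transformer: it sequentializes the raw image into patch tokens, models global context from the first encoder layer onward, and achieves strong results on MSCOCO.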

ABSTRACT

In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose CaPtion TransformeR (CPTR) which takes the sequentialized raw images as the input to Transformer. Compared to the "CNN+Transformer" design paradigm, our model can model global context at every encoder layer from the beginning and is totally convolution-free. Extensive experiments demonstrate the effectiveness of the proposed model and we surpass the conventional "CNN+Transformer" methods on the MSCOCO dataset. Besides, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder thanks to the full Transformer architecture.

Motivation and Objectives

  • Rethink image captioning as a sequence-to-sequence task using a full Transformer encoder.
  • Eliminate convolution from the encoder by feeding it the image as a sequence of patches.
  • Demonstrate global context modeling at all encoder layers and analyze attention patterns.
  • Show that the decoder’s words-to-patches cross-attention effectively guides caption generation.

Proposed Method

  • Divide the input image into fixed-size patches (e.g., 16x16) and flatten to form a patch sequence.
  • Apply a linear patch embedding and learnable 1D position embeddings to feed the Transformer encoder.
  • Use an encoder with stacked self-attention and feed-forward layers to model long-range dependencies from the patch sequence.
  • In the decoder, add sinusoidal positional embeddings to the word embeddings, then apply masked self-attention over the words and cross-attention over the encoder outputs.
  • Train with cross-entropy loss and fine-tune with self-critical training for improved captioning performance.
  • Evaluate with standard metrics (BLEU, METEOR, ROUGE, CIDEr) on MSCOCO; report ablations on pretraining, image resolution, and decoder settings.
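
As a rough illustration of the input side of the steps above, the sketch below splits a toy image into flattened patches and computes a sinusoidal position vector for the decoder's word positions. It is a minimal sketch, not the authors' implementation: the names `patchify` and `sinusoidal_position` are hypothetical, the sinusoidal formula is assumed to be the one from the original Transformer, and the learned parts (the linear patch projection and the learnable 1D position embeddings) are left out because their weights come from training.

```python
import math


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append every channel of this pixel
            patches.append(flat)
    return patches


def sinusoidal_position(pos, dim):
    """Assumed standard encoding: sine on even indices, cosine on odd indices."""
    vec = []
    for i in range(dim):
        # Indices 2k and 2k+1 share the frequency 1 / 10000^(2k / dim).
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


# Toy 4x4 RGB image whose pixel value encodes its (row, column) position.
image = [[[10 * r + c] * 3 for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)

print(len(tokens), len(tokens[0]))   # 4 patches, 2 * 2 * 3 = 12 values each
print(tokens[0])                     # top-left patch
print(sinusoidal_position(0, 4))     # position 0 gives [0.0, 1.0, 0.0, 1.0]
```

With a 16x16 patch size on an RGB image, each flattened patch would hold 16 * 16 * 3 = 768 values before the linear embedding is applied.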

Experimental Results

Research Questions

  • RQ1: Can a fully Transformer-based encoder (without convolution) effectively model image context for captioning by directly processing patch sequences?
  • RQ2: Does processing raw image patches with self-attention enable better global context modeling than CNN-based encoders in image captioning?
  • RQ3: How do patch-level self-attention and words-to-patches cross-attention influence caption quality?
  • RQ4: What impact do pretraining, input resolution, and decoder configuration have on CPTR’s performance?

Key Findings

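Method  B-1   B-2   B-3   B-4   M     R     C
CPTR    81.7  66.6  52.2  40.0  29.1  59.4  129.4
ETA     81.5  -     -     39.3  -     58.9  126.6
ORT     80.5  -     -     38.6  -     58.4  128.3
(B-n = BLEU-n, M = METEOR, R = ROUGE, C = CIDEr; a dash marks a value not given here.)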
  • CPTR achieves a higher CIDEr score (129.4) than many CNN-based and CNN+Transformer baselines on the MSCOCO Karpathy test split.
  • On the online COCO test server, CPTR attains a CIDEr of 129.4, outperforming several CNN+RNN and CNN+Transformer methods.
  • Initializing the encoder from a ViT model pretrained on ImageNet-21K and fine-tuned on ImageNet 2012 yields notable CIDEr gains over training from scratch.
  • Increasing input resolution from 224x224 to 384x384 with 16x16 patches substantially improves CIDEr (e.g., 116.5 when fine-tuned with pretraining).
  • Attention-map visualizations show that encoder self-attention captures both local and global context, starting from the earliest layers.
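
The resolution finding has a simple arithmetic side that can be sketched as follows: with 16x16 patches, a larger input gives the encoder a longer patch sequence. The helper name `num_patches` is hypothetical, and the pairwise count is only a rough proxy for self-attention cost. The sketch does not explain the CIDEr gain; it only shows how the sequence length grows.

```python
def num_patches(side, patch=16):
    """Number of non-overlapping patch x patch tiles in a side x side image."""
    if side % patch:
        raise ValueError("image side must be a multiple of the patch size")
    per_side = side // patch
    return per_side * per_side


counts = {side: num_patches(side) for side in (224, 384)}
for side, n in counts.items():
    # n * n is the number of patch pairs one self-attention map has to score.
    print(f"{side}x{side}: {n} patches, {n * n} attention pairs")
# 224x224: 196 patches, 38416 attention pairs
# 384x384: 576 patches, 331776 attention pairs
```

Going from 224x224 to 384x384 roughly triples the number of patch tokens (196 to 576), so each encoder layer attends over a noticeably finer-grained view of the image, at a correspondingly higher compute cost.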


This review was generated by AI and reviewed by a human editor.