QUICK REVIEW

[Paper Review] CPTR: Full Transformer Network for Image Captioning

Wei Liu, Sihan Chen|arXiv (Cornell University)|Jan 26, 2021

Multimodal Machine Learning Applications|References: 20|Cited by: 108

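ONE-SENTENCE SUMMARY

CPTR replaces the CNN encoder with a full Transformer: it serializes the raw image into patch tokens, models global context from the first encoder layer onward, and achieves strong results on MSCOCO.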

ABSTRACT

In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose CaPtion TransformeR (CPTR) which takes the sequentialized raw images as the input to Transformer. Compared to the "CNN+Transformer" design paradigm, our model can model global context at every encoder layer from the beginning and is totally convolution-free. Extensive experiments demonstrate the effectiveness of the proposed model and we surpass the conventional "CNN+Transformer" methods on the MSCOCO dataset. Besides, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder thanks to the full Transformer architecture.

RESEARCH MOTIVATION AND OBJECTIVES

Rethink image captioning as a sequence-to-sequence task using a full Transformer encoder.
Eliminate convolution from the encoder by treating the image as a sequence of patches.
Demonstrate global context modeling at all encoder layers and analyze attention patterns.
Show that the decoder’s words-to-patches cross-attention effectively guides caption generation.

PROPOSED METHOD

Divide the input image into fixed-size patches (e.g., 16x16) and flatten to form a patch sequence.
Apply a linear patch embedding and learnable 1D position embeddings to feed the Transformer encoder.
Use an encoder with stacked self-attention and feed-forward layers to model long-range dependencies from the patch sequence.
In the decoder, add sinusoidal position embeddings to the word embeddings, then apply masked self-attention followed by cross-attention over the encoder outputs.
Train with cross-entropy loss and fine-tune with self-critical training for improved captioning performance.
Evaluate with standard metrics (BLEU, METEOR, ROUGE, CIDEr) on MSCOCO; report ablations on pretraining, image resolution, and decoder settings.
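
The patch-sequence input and the decoder's positional signals listed above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' implementation: the function names (`patchify`, `sinusoidal_positions`, `causal_mask`), the toy image size, and the row-major patch order are assumptions made here, the sinusoidal formula is the standard Transformer one, and the learned linear projection and learnable 1D position embeddings are left out.

```python
import math

def patchify(image, patch=16):
    """Split an H x W x C nested-list image into flattened patch vectors.

    Patches are emitted in row-major order (an assumption of this sketch);
    each vector has patch * patch * C entries.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vector = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    vector.extend(image[row][col])  # append all channels
            sequence.append(vector)
    return sequence

def sinusoidal_positions(length, dim):
    """Fixed sine/cosine position encodings (standard Transformer formula)."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def causal_mask(length):
    """mask[i][j] is True when word i may attend to word j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

# Toy 32 x 32 RGB image: 4 patches, each with 16 * 16 * 3 values.
toy_image = [[[0.0, 0.0, 0.0] for _ in range(32)] for _ in range(32)]
patches = patchify(toy_image, patch=16)
```

In a full model, each flattened patch vector would be multiplied by a learned projection matrix and summed with a learned position embedding before entering the encoder; the causal mask would be applied inside the decoder's masked self-attention.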

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a fully Transformer-based encoder (without convolution) effectively model image context for captioning by directly processing patch sequences?
RQ2: Does processing raw image patches with self-attention enable better global context modeling than CNN-based encoders in image captioning?
RQ3: How do patch-level self-attention and words-to-patches cross-attention influence caption quality?
RQ4: What impact do pretraining, input resolution, and decoder configuration have on CPTR’s performance?

Key Findings

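Method	B-1	B-2	B-3	B-4	M	R	C
CPTR	81.7	66.6	52.2	40.0	29.1	59.4	129.4
ETA	81.5	-	-	39.3	-	58.9	126.6
ORT	80.5	-	-	38.6	-	58.4	128.3

B-n = BLEU-n, M = METEOR, R = ROUGE, C = CIDEr. A "-" marks a value not present in this summary.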

CPTR achieves a higher CIDEr score (129.4) than many CNN-based and CNN+Transformer baselines on the MSCOCO Karpathy test split.
On the online COCO test server, CPTR attains a CIDEr of 129.4, outperforming several CNN+RNN and CNN+Transformer methods.
Initializing the encoder from ViT weights pretrained on ImageNet-21K and fine-tuned on ImageNet 2012 yields notable CIDEr gains over training from scratch.
Increasing the input resolution from 224x224 to 384x384 with 16x16 patches substantially improves CIDEr (e.g., 116.5 when fine-tuned from pretrained weights).
Attention-map visualizations show that encoder self-attention captures both local and global context, starting from the earliest layers.
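
The resolution ablation directly changes the encoder's sequence length, which a short calculation makes concrete. The helper name `num_patches` is hypothetical, and the cost estimate assumes standard full self-attention over non-overlapping 16x16 patches.

```python
def num_patches(side, patch=16):
    """Number of non-overlapping patch tokens for a square image."""
    if side % patch:
        raise ValueError("image side must be divisible by the patch size")
    return (side // patch) ** 2

low = num_patches(224)   # 14 * 14 = 196 tokens
high = num_patches(384)  # 24 * 24 = 576 tokens

# Full self-attention scores every token pair, so the attention-score
# matrix grows with the square of the sequence length.
pairwise_growth = (high / low) ** 2
print(low, high, round(pairwise_growth, 2))  # prints: 196 576 8.64
```

Under these assumptions the higher resolution gives the encoder roughly three times as many tokens, at the price of an attention-score matrix more than eight times larger per layer.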


This review was generated by AI and reviewed by human editors.