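[Paper Walkthrough] CPTR: Full Transformer Network for Image Captioning
CPTR replaces the CNN encoder with a full Transformer: it sequentializes the raw image into patch tokens, models global context from the first encoder layer onward, and achieves strong results on MSCOCO.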
In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose CaPtion TransformeR (CPTR) which takes the sequentialized raw images as the input to Transformer. Compared to the "CNN+Transformer" design paradigm, our model can model global context at every encoder layer from the beginning and is totally convolution-free. Extensive experiments demonstrate the effectiveness of the proposed model and we surpass the conventional "CNN+Transformer" methods on the MSCOCO dataset. Besides, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder thanks to the full Transformer architecture.
Motivation and Objectives
- Rethink image captioning as a sequence-to-sequence task using a full Transformer encoder.
- Eliminate convolution in the encoder by sequentially processing image patches.
- Demonstrate global context modeling at all encoder layers and analyze attention patterns.
- Show that the decoder’s words-to-patches cross-attention effectively guides caption generation.
Proposed Method
- Divide the input image into fixed-size patches (e.g., 16x16) and flatten to form a patch sequence.
- Apply a linear patch embedding and learnable 1D position embeddings to feed the Transformer encoder.
- Use an encoder with stacked self-attention and feed-forward layers to model long-range dependencies from the patch sequence.
- In the decoder, employ masked self-attention and cross-attention over encoder outputs with sinusoidal word positions.
- Train with cross-entropy loss and fine-tune with self-critical training for improved captioning performance.
- Evaluate with standard metrics (BLEU, METEOR, ROUGE, CIDEr) on MSCOCO; report ablations on pretraining, image resolution, and decoder settings.
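The first and fourth steps above (patch sequentialization and sinusoidal word positions) can be sketched in a few lines. This is a minimal illustration in pure Python, not the authors' implementation: the function names `patchify` and `sinusoidal_positions` are hypothetical, the toy image is 4x4 with 2x2 patches instead of 16x16, and the learned parts (the linear patch embedding and the learnable 1D position embeddings) are left out because they are trained parameters.

```python
import math


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors.

    Patches are read row by row (top-left to bottom-right); each patch
    becomes one vector of length patch * patch * C.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append all channels of one pixel
            tokens.append(vec)
    return tokens


def sinusoidal_positions(length, dim):
    """Fixed sinusoidal position encodings (standard Transformer formula)."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            # Even and odd indices share one frequency: 10000^(2*(i//2)/dim).
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Toy 4x4 RGB image whose first two channels encode (row, col).
toy = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(toy, patch=2)
word_positions = sinusoidal_positions(length=3, dim=4)
print(len(patches), len(patches[0]))
```

With a real 384x384 RGB input and 16x16 patches, the same routine would yield 576 tokens of length 768 each; a learned linear layer would then project every token to the model width before the encoder.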
Experimental Results
Research Questions
- RQ1: Can a fully Transformer-based encoder (without convolution) effectively model image context for captioning by directly processing patch sequences?
- RQ2: Does processing raw image patches with self-attention enable better global context modeling than CNN-based encoders in image captioning?
- RQ3: How do patch-level self-attention and words-to-patches cross-attention influence caption quality?
- RQ4: What impact do pretraining, input resolution, and decoder configuration have on CPTR’s performance?
Key Findings
| Method | B-1 | B-2 | B-3 | B-4 | M | R | C |
|---|---|---|---|---|---|---|---|
| CPTR | 81.7 | 66.6 | 52.2 | 40.0 | 29.1 | 59.4 | 129.4 |
| ETA | 81.5 | - | - | 39.3 | - | 58.9 | 126.6 |
| ORT | 80.5 | - | - | 38.6 | - | 58.4 | 128.3 |
- On the MSCOCO Karpathy test split, CPTR reaches a CIDEr of 129.4, higher than the CNN+Transformer baselines ETA (126.6) and ORT (128.3).
- On the online COCO test server, CPTR attains a CIDEr of 129.4, outperforming several CNN+RNN and CNN+Transformer methods.
- Initializing the encoder from a ViT pretrained on ImageNet-21K and fine-tuned on ImageNet 2012 yields notable CIDEr gains over training from scratch.
- Increasing input resolution from 224x224 to 384x384 with 16x16 patches substantially improves CIDEr (e.g., 116.5 when fine-tuned with pretraining).
- Attention visualizations show that encoder self-attention captures both local and global context, starting from the early layers.
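The resolution finding has a simple arithmetic side: with 16x16 patches, a larger input means a longer patch sequence for the encoder. The sketch below works out the token counts for the two resolutions mentioned above. The helper name `num_patches` is hypothetical, and the cost comparison only counts entries of the attention score matrix; it is not a measurement from the paper.

```python
def num_patches(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


low = num_patches(224, 224)   # 14 * 14 tokens
high = num_patches(384, 384)  # 24 * 24 tokens

# Self-attention scores every token against every other token, so each
# head's score matrix has n * n entries for a sequence of n tokens.
ratio = (high * high) / (low * low)
print(low, high, round(ratio, 2))  # prints: 196 576 8.64
```

Moving from 224x224 to 384x384 therefore roughly triples the number of patch tokens and grows the per-head attention matrix by about 8.6x, which is the compute price paid for the finer-grained input.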
This walkthrough was generated by AI and reviewed by human editors.