QUICK REVIEW

[Paper Review] Container: Context Aggregation Network

Peng Gao, Jiasen Lu|arXiv (Cornell University)|Jun 2, 2021
Data Stream Mining Techniques | References: 69 | Cited by: 41
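One-Sentence Summary

Container unifies the CNN, Transformer, and MLP paradigms through static and dynamic affinity matrices to perform multi-head context aggregation, achieving strong image-classification performance and remaining competitive on downstream tasks with efficient training.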

ABSTRACT

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the Container (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions à la Transformers while still exploiting the inductive bias of the local convolution operation leading to faster convergence speeds, often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named Container-Light, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 pts respectively, compared to a ResNet-50 backbone with a comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework. Code is released at https://github.com/allenai/container.

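Motivation and Objectives

  • Provide a unified view that treats CNN, Transformer, and MLP architectures as variants of context aggregation.
  • Introduce the Container building block, which mixes static and dynamic affinity matrices to capture long-range context efficiently.
  • Demonstrate the performance of Container and Container-Light on ImageNet, object detection, instance segmentation, and self-supervised learning.
  • Show the convergence-speed and data-efficiency advantages over pure Transformer backbones.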

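Proposed Method

  • Define a general context-aggregation framework that uses an affinity matrix A to capture neighborhood relationships.
  • Show how Transformers, depthwise convolutions, and MLP-Mixer fit into the framework as special cases with different affinity matrices.
  • Introduce Container as a learnable mixture of dynamic (A(X)) and static (A) affinities, weighted by learnable coefficients (alpha, beta).
  • Provide Container-Light, which switches off the dynamic affinity in the early stages so that it can handle high-resolution downstream tasks.
  • Describe a four-stage base architecture with patch embedding, in which each block has two sub-modules (spatial aggregation and channel fusion).
  • Evaluate on ImageNet, object detection (RetinaNet, Mask R-CNN, DETR), and self-supervised learning (DINO).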
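
The affinity-matrix view in the list above can be illustrated with a small sketch. It assumes a single head, a 1-D token sequence, no query/key/value projections, no residual connection, and fixed (not learned) alpha and beta, so it only shows the aggregation step Y = (alpha * A(X) + beta * A) V rather than the paper's full block. The function names are hypothetical, and the static affinity here shares one convolution kernel across all channels for brevity, whereas a depthwise convolution would use one kernel per channel.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, so every row of the affinity matrix sums to 1."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def dynamic_affinity(q, k):
    """Transformer-style affinity softmax(Q K^T / sqrt(d)); it depends on the input."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])

def conv_affinity(n, kernel):
    """Static affinity equal to a zero-padded 1-D convolution: a banded matrix
    whose entry (i, j) holds the kernel weight for offset j - i."""
    half = len(kernel) // 2
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for offset, w in enumerate(kernel):
            j = i + offset - half
            if 0 <= j < n:
                a[i][j] = w
    return a

def container_aggregate(v, a_dynamic, a_static, alpha, beta):
    """Mix the two affinities and aggregate: Y = (alpha * A(X) + beta * A) V."""
    mixed = [[alpha * d + beta * s for d, s in zip(d_row, s_row)]
             for d_row, s_row in zip(a_dynamic, a_static)]
    return matmul(mixed, v)

# Toy input (assumed values): 4 tokens with 2 channels each; Q = K = V = X.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
a_dyn = dynamic_affinity(x, x)
a_sta = conv_affinity(len(x), [0.25, 0.5, 0.25])

y_attention = container_aggregate(x, a_dyn, a_sta, alpha=1.0, beta=0.0)  # attention only
y_conv = container_aggregate(x, a_dyn, a_sta, alpha=0.0, beta=1.0)       # convolution only
y_mixed = container_aggregate(x, a_dyn, a_sta, alpha=0.5, beta=0.5)      # mixture of both
```

Setting alpha to 0 leaves only the static term, which loosely mirrors how Container-Light switches off the dynamic affinity in early stages; setting beta to 0 leaves plain self-attention.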

Experimental Results

Research Questions

  • RQ1: Can a unified affinity-based context-aggregation block reproduce or surpass CNN/Transformer/MLP backbones across vision tasks?
  • RQ2: Does combining static and dynamic affinity matrices yield superior performance and convergence vs. using either alone?
  • RQ3: How do Container and Container-Light perform on classification and high-resolution downstream tasks compared to state-of-the-art backbones?
  • RQ4: What data efficiency and convergence benefits arise from the proposed framework?
  • RQ5: What qualitative patterns emerge in learned static affinities across layers?

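Key Findings

| Category | Network | Top-1 accuracy (%) | Params | FLOPs | Throughput | Input size | NAS |
| Container | Container | 82.7 | 22.1 M | 8.1 G | 347.8 | 224^2 | |
| Container-Light | Container-Light | 82.0 | 20.0 M | 3.2 G | 1156.9 | 224^2 | |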
  • Container achieves 82.7% Top-1 accuracy on ImageNet with 22M parameters, outperforming DeiT-S by 2.8 points.
  • Container reaches 79.9% Top-1 in 200 epochs, whereas DeiT-S needs 300 epochs to reach the same accuracy.
  • Container-Light enables strong downstream performance, e.g., RetinaNet 43.8 mAP, Mask-RCNN 45.1 mAP (box) and 41.3 mAP (mask) with ResNet-50–equivalent compute.
  • Container-Light improves DETR and SMCA-DETR variants over ResNet-50 baselines (e.g., 38.9 mAP with DETR-Container-Light).
  • In self-supervised learning (DINO), Container-Light outperforms DeiT in kNN accuracy across training epochs (e.g., 71.5 vs. 69.6 at 100 epochs).
  • Static-affinity extensions (Container-Pam) provide small but consistent gains, and locality emerges in early layers resembling convolutions.
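
The trade-off between the two variants, and the ResNet-50 baselines implied by the abstract, follow from simple arithmetic on the reported figures. The sketch below only re-derives ratios and differences from numbers quoted in this review; the implied baselines are obtained by subtraction and are not values reported here directly, and the dictionary keys are illustrative names.

```python
# Figures copied from the results table and the abstract of this review.
models = {
    "Container": {"top1": 82.7, "params_m": 22.1, "flops_g": 8.1, "throughput": 347.8},
    "Container-Light": {"top1": 82.0, "params_m": 20.0, "flops_g": 3.2, "throughput": 1156.9},
}
full, light = models["Container"], models["Container-Light"]

# How much cheaper and faster Container-Light is, and what it costs in accuracy.
flops_ratio = full["flops_g"] / light["flops_g"]
speedup = light["throughput"] / full["throughput"]
top1_gap = full["top1"] - light["top1"]

# (reported Container-Light score, reported gain over the ResNet-50 backbone)
downstream = {
    "DETR box mAP": (38.9, 6.6),
    "RetinaNet box mAP": (43.8, 7.3),
    "Mask R-CNN box mAP": (45.1, 6.9),
    "Mask R-CNN mask mAP": (41.3, 6.6),
}

# Implied ResNet-50 baseline = reported score minus reported gain.
implied_baseline = {
    name: round(score - gain, 1) for name, (score, gain) in downstream.items()
}

print(f"FLOPs ratio: {flops_ratio:.2f}x, throughput speedup: {speedup:.2f}x, "
      f"Top-1 gap: {top1_gap:.1f} pts")
for name, value in implied_baseline.items():
    print(f"{name}: implied ResNet-50 baseline {value}")
```

On these numbers, Container-Light gives up 0.7 points of Top-1 accuracy for roughly 2.5x fewer FLOPs and about 3.3x higher throughput.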


This review was generated by AI and reviewed by human editors.