QUICK REVIEW

[Paper Review] Mobile-Former: Bridging MobileNet and Transformer

Yinpeng Chen, Xiyang Dai | arXiv (Cornell University) | Aug 12, 2021

Advanced Neural Network Applications | Cited by 36

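One-Sentence Summary

Mobile-Former runs MobileNet and a lightweight transformer in parallel, connected by a two-way bridge. It reaches higher ImageNet accuracy at similar or lower FLOPs and outperforms the MobileNetV3 and DETR baselines on object detection.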

ABSTRACT

We present Mobile-Former, a parallel design of MobileNet and transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and transformer at global interaction. And the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformer, the transformer in Mobile-Former contains very few tokens (e.g. 6 or fewer tokens) that are randomly initialized to learn global priors, resulting in low computational cost. Combining with the proposed light-weight cross attention to model the bridge, Mobile-Former is not only computationally efficient, but also has more representation power. It outperforms MobileNetV3 at low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, Mobile-Former achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 but saving 17% of computations. When transferring to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP in RetinaNet framework. Furthermore, we build an efficient end-to-end detector by replacing backbone, encoder and decoder in DETR with Mobile-Former, which outperforms DETR by 1.1 AP but saves 52% of computational cost and 36% of parameters.

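Research Motivation and Objectives

Motivate an efficient architecture that combines the local feature processing of CNNs with the global interaction of transformers in a parallel design.
Introduce a lightweight two-way bridge that fuses local and global features at very low computational cost.
Show that a small token-based transformer brings significant gains even in the low-FLOP regime without excessive cost.
Demonstrate improvements on ImageNet classification and on object detection, including an end-to-end DETR-style pipeline.
Run ablation studies to understand how the tokens, their dimension, and dynamic ReLU contribute to Mobile-Former.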

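Proposed Method

Present Mobile-Former as a parallel architecture that stacks MobileNet blocks alongside a small-token transformer (M <= 6, d <= 192) with learnable global tokens.
Introduce a lightweight cross-attention bridge for both Mobile -> Former and Former -> Mobile interaction, removing the projections on the Mobile side to save computation (keys and values in Mobile -> Former, queries in Former -> Mobile).
Define a Mobile-Former block as four sub-modules: a Mobile sub-block, a Former sub-block, and two cross-attention bridges (Mobile -> Former and Former -> Mobile).
Use a spatially aware dynamic ReLU in the Mobile branch whose parameters are generated from the global tokens, including an enhancement in the end-to-end detector head that uses all tokens to generate the parameters.
Provide network variants (Mobile-Former-26M to Mobile-Former-508M) and detail the 294M-FLOP configuration, which uses six global tokens of dimension 192, on ImageNet and COCO.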
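The two-way bridge described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names, the weight shapes, the residual updates, and the sizes (49 positions, 32 channels) are assumptions chosen for readability. What it shows is that the learned projections touch only the M tokens, while the L = H x W local features enter the attention unprojected.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def mobile_to_former(x, z, w_q, w_o):
    """Mobile -> Former: the tokens query the local feature map.

    x:   (L, C) flattened local features, used directly as keys and values
    z:   (M, d) global tokens
    w_q: (d, C) query projection, applied to the tokens only
    w_o: (C, d) output projection back to the token dimension
    """
    q = z @ w_q                                      # (M, C)
    attn = softmax(q @ x.T / np.sqrt(x.shape[1]))    # (M, L)
    return z + (attn @ x) @ w_o                      # (M, d), residual update

def former_to_mobile(x, z, w_k, w_v):
    """Former -> Mobile: the local features query the tokens.

    w_k, w_v: (d, C) key and value projections, applied to the tokens only;
    the local features are used directly as queries.
    """
    k = z @ w_k                                      # (M, C)
    v = z @ w_v                                      # (M, C)
    attn = softmax(x @ k.T / np.sqrt(x.shape[1]))    # (L, M)
    return x + attn @ v                              # (L, C), residual update

# Illustrative sizes (assumptions): a 7 x 7 feature map with 32 channels,
# and six tokens of dimension 192 as in the 294M configuration.
rng = np.random.default_rng(0)
L, C, M, d = 49, 32, 6, 192
x = rng.standard_normal((L, C))
z = rng.standard_normal((M, d))

z_new = mobile_to_former(x, z,
                         0.05 * rng.standard_normal((d, C)),
                         0.05 * rng.standard_normal((C, d)))
x_new = former_to_mobile(x, z_new,
                         0.05 * rng.standard_normal((d, C)),
                         0.05 * rng.standard_normal((d, C)))
print(z_new.shape, x_new.shape)  # (6, 192) (49, 32)
```

Every learned matrix here multiplies a matrix with M rows, so the projection cost grows with the token count and not with the number of spatial positions.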

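Experimental Results

Research Questions

RQ1: At low FLOPs on ImageNet, can a parallel MobileNet-transformer design with a lightweight two-way bridge outperform conventional CNNs and ViTs?
RQ2: When fused with MobileNet through an efficient bridge, is a transformer with only a few tokens sufficient to model global interaction?
RQ3: How do the number of tokens and the token dimension affect accuracy and efficiency in Mobile-Former?
RQ4: Can Mobile-Former serve as an efficient backbone for RetinaNet and for an end-to-end DETR-style detector, improving AP while reducing computational cost?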
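For RQ3, a back-of-the-envelope count helps show why a handful of tokens is cheap. The sketch below uses the standard cost model for one self-attention layer (four d x d projections plus the score and weighted-sum products) and compares six tokens with a 14 x 14 patch grid. The function name and the patch-grid size are illustrative assumptions, and the resulting figures are not taken from the paper.

```python
def attention_macs(n_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for one self-attention layer.

    Counts the Q, K, V and output projections (4 * n * d^2) plus the
    score matrix and the weighted sum (2 * n^2 * d). Softmax, bias and
    normalization costs are ignored.
    """
    projections = 4 * n_tokens * dim * dim
    mixing = 2 * n_tokens * n_tokens * dim
    return projections + mixing

dim = 192
few_tokens = attention_macs(6, dim)            # six global tokens
patch_tokens = attention_macs(14 * 14, dim)    # hypothetical 14 x 14 patch grid

print(few_tokens)                              # 898560
print(patch_tokens)                            # 43653120
print(round(patch_tokens / few_tokens, 1))     # 48.6
```

Under this model the quadratic term is negligible for six tokens, so the cost is dominated by the projections and scales roughly linearly with the token count.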

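Key Findings

Mobile-Former achieves 77.9% top-1 accuracy on ImageNet at 294M FLOPs, surpassing MobileNetV3 by 1.3% while saving 17% of the computation.
In object detection, a Mobile-Former backbone improves RetinaNet AP by 8.6 points over MobileNetV3 at similar cost.
An end-to-end detector that replaces the backbone, encoder, and decoder of DETR with Mobile-Former scores 1.1 AP higher than DETR while using 52% fewer FLOPs and 36% fewer parameters.
Across the 25M to 500M FLOP range, Mobile-Former consistently outperforms efficient CNNs and vision transformers.
Ablations show that even a single global token delivers strong performance, with gains continuing until they saturate at 6 tokens (d=192).
The spatially aware dynamic ReLU and the adaptive position embedding bring significant gains on COCO detection (the three-component ablation shows cumulative improvement).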
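The dynamic ReLU referred to in the method and in the last finding replaces the fixed ReLU with a piecewise-linear function whose coefficients depend on a global token. The sketch below is a simplified channel-wise variant under stated assumptions: the two-layer MLP weights `w1` and `w2`, the `tanh` bounding, and the initialization at plain ReLU are illustrative choices, and the spatially aware variant that uses all tokens is not shown.

```python
import numpy as np

def dynamic_relu(x, token, w1, w2, k=2):
    """Piecewise-linear activation with token-conditioned coefficients.

    x:     (L, C) local features
    token: (d,) one global token
    w1:    (d, hidden) and w2: (hidden, 2 * k * C), a hypothetical two-layer MLP
    Returns y[l, c] = max over j of (slope[j, c] * x[l, c] + intercept[j, c]).
    """
    n_channels = x.shape[1]
    hidden = np.maximum(token @ w1, 0.0)
    # Bounded residuals in [-1, 1]: theta[0] adjusts slopes, theta[1] intercepts.
    theta = np.tanh(hidden @ w2).reshape(2, k, n_channels)
    # Start from plain ReLU (slopes 1, 0, ... and zero intercepts),
    # then let the token perturb the coefficients.
    init_slopes = np.zeros((k, 1))
    init_slopes[0] = 1.0
    slopes = init_slopes + theta[0]                                  # (k, C)
    intercepts = theta[1]                                            # (k, C)
    pieces = slopes[None] * x[:, None, :] + intercepts[None]         # (L, k, C)
    return pieces.max(axis=1)                                        # (L, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((49, 32))          # illustrative 7 x 7 map, 32 channels
token = rng.standard_normal(192)           # one global token, d = 192
w1 = 0.05 * rng.standard_normal((192, 48))
w2 = 0.05 * rng.standard_normal((48, 2 * 2 * 32))

y = dynamic_relu(x, token, w1, w2)
print(y.shape)  # (49, 32)
```

With `w2` set to zero the residuals vanish and the function reduces to an ordinary ReLU, which is a convenient sanity check for this kind of parameterization.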


This review was generated by AI and reviewed by human editors.