QUICK REVIEW

[Paper Review] Scaling Vision with Sparse Mixture of Experts

Carlos Riquelme, Joan Puigcerver|arXiv (Cornell University)|Jun 10, 2021

Domain Adaptation and Few-Shot Learning | References: 60 | Cited by: 29

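One-Sentence Summary

Introduces Vision MoE (V-MoE), a sparse variant of the Vision Transformer that replaces some MLP blocks with Mixture-of-Experts layers, yielding large-scale vision models that match dense models at lower inference cost and scale to 15B parameters.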

ABSTRACT

Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.

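Research Motivation and Goals

Investigate whether sparse Mixture-of-Experts layers can scale vision models effectively.
Show that V-MoE matches or exceeds dense ViT performance while reducing inference cost.
Develop routing and capacity strategies that stabilize training and improve transfer.
Introduce Batch Prioritized Routing to adapt compute per image across a batch.
Demonstrate that vision models of up to 15B parameters achieve strong ImageNet performance.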

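Proposed Method

Replace selected ViT MLP blocks with sparse MoE layers, so that each token is routed to a small set of experts.
Route with g(x) = TOP_k(softmax(Wx + ε)), which assigns each token to k experts (k is typically 1 or 2).
Add noise ε to the router and cap each expert's buffer at a capacity B_e to balance expert load during training.
Set the expert buffer capacity through a capacity ratio C, and add auxiliary losses that encourage load balancing.
Train on large-scale noisy data (JFT-300M), and evaluate transfer via linear probes and full fine-tuning on ImageNet and VTAB.
Introduce Batch Prioritized Routing, which prioritizes tokens within a batch and allows low-utility tokens to be skipped at inference time.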
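
The routing rule and the capacity ratio described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the `noise_std` parameter, and the toy weight matrix are hypothetical, and the capacity formula B_e = round(k·N·P·C / E) (N images per batch, P tokens per image, E experts) is assumed here because this summary does not state it.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(x, W, k=2, noise_std=0.0, rng=None):
    """Sketch of g(x) = TOP_k(softmax(Wx + eps)) for a single token.

    W has one row per expert. Returns [(expert_index, gate_weight), ...]
    for the k largest gate weights, in descending order.
    noise_std is a hypothetical knob for the scale of eps.
    """
    if rng is None:
        rng = random.Random(0)
    logits = [
        sum(w * v for w, v in zip(row, x)) + rng.gauss(0.0, noise_std)
        for row in W
    ]
    gates = softmax(logits)
    top = sorted(range(len(gates)), key=lambda e: gates[e], reverse=True)[:k]
    return [(e, gates[e]) for e in top]


def expert_capacity(k, num_images, tokens_per_image, capacity_ratio, num_experts):
    """Assumed buffer size per expert: B_e = round(k * N * P * C / E)."""
    return round(k * num_images * tokens_per_image * capacity_ratio / num_experts)


# Toy router with 3 experts and 2-dimensional tokens (illustrative values).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
routes = route_token([2.0, 1.0], W, k=2)
capacity = expert_capacity(
    k=2, num_images=8, tokens_per_image=196, capacity_ratio=1.0, num_experts=32
)
```

With noise disabled, the toy token's logits are 2, 1, and 3, so it is routed to expert 2 first and expert 0 second. Under the assumed formula, halving C halves each expert's buffer, which is how a single ratio can control how many tokens are processed.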

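Experimental Results

Research Questions

RQ1: Can sparse MoE layers in a Vision Transformer reach competitive accuracy while reducing compute?
RQ2: How do routing, capacity control, and noise affect V-MoE training stability and performance?
RQ3: What benefits do Batch Prioritized Routing and adjustable capacity offer for the compute-performance trade-off at inference time?
RQ4: How well do V-MoE models transfer to downstream tasks in few-shot and fine-tuning settings?
RQ5: How far can V-MoE scale in parameter count, and what ImageNet accuracy can it reach?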

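Main Findings

V-MoE variants match or exceed dense ViT performance while using roughly half the inference compute.
A 15B-parameter V-MoE model (V-MoE-15B) reaches 90.35% on ImageNet when fully fine-tuned.
Batch Prioritized Routing reduces training FLOPs by about 20% and enables per-image compute trade-offs at inference time.
V-MoE models pre-trained on JFT-300M transfer well in both few-shot and full fine-tuning settings.
V-MoE models allow k and the capacity ratio C to be adjusted at inference time, saving substantial compute with little effect on performance.
The largest V-MoE model (15B) is competitive with state-of-the-art results on ImageNet and shows that vision model capacity can be scaled.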
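
To make the Batch Prioritized Routing finding concrete, the following minimal sketch contrasts positional (first-come, first-served) buffer filling with priority-ordered filling when expert capacity is scarce. It is an illustration under stated assumptions, not the paper's code: the `dispatch` helper is hypothetical, and using each token's largest gate weight as its priority score is an assumption made for this example.

```python
def dispatch(routes, capacity, prioritized=False):
    """Fill per-expert buffers of size `capacity`.

    routes[t] is [(expert, gate_weight), ...] for token t, sorted by
    descending weight. Returns (buffers, dropped), where buffers maps
    expert -> list of token ids and dropped lists the tokens that no
    expert had room for.
    """
    order = list(range(len(routes)))
    if prioritized:
        # Assumed priority score: the token's largest gate weight.
        # The sort is stable, so ties keep their positional order.
        order.sort(key=lambda t: routes[t][0][1], reverse=True)

    buffers = {}
    served = set()
    k = max((len(r) for r in routes), default=0)
    # Serve every token's first choice before any second choice.
    for choice in range(k):
        for t in order:
            if choice >= len(routes[t]):
                continue
            expert, _ = routes[t][choice]
            slots = buffers.setdefault(expert, [])
            if len(slots) < capacity:
                slots.append(t)
                served.add(t)

    dropped = [t for t in range(len(routes)) if t not in served]
    return buffers, dropped


# Four tokens that all prefer expert 0, with room for only two of them.
routes = [[(0, 0.4)], [(0, 0.9)], [(0, 0.6)], [(0, 0.8)]]
vanilla_buffers, vanilla_dropped = dispatch(routes, capacity=2)
bpr_buffers, bpr_dropped = dispatch(routes, capacity=2, prioritized=True)
```

In this toy batch, positional filling keeps tokens 0 and 1 and drops tokens 2 and 3, while prioritized filling keeps the two highest-scoring tokens (1 and 3). Both variants process the same number of tokens; the difference is which tokens are skipped once buffers are full.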


This review was generated by AI and checked by human editors.