QUICK REVIEW

[Paper Walkthrough] Mitigating the Bandwidth Wall via Data-Streaming System-Accelerator Co-Design

Qunyou Liu, Marina Zapater|arXiv (Cornell University)|Mar 19, 2026

Parallel Computing and Optimization Techniques | Cited by 0

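One-Sentence Summary

This paper presents MatrixFlow, a 16×16 systolic-array matrix accelerator that streams page-aligned 4 KB tiles from host memory over PCIe DMA and is integrated with Gem5-AcceSys for system-accelerator co-design of transformer inference. By balancing data movement, compute, and the memory hierarchy, it improves end-to-end performance substantially (up to 22× over a CPU-only baseline) without requiring large on-chip SRAM or ISA changes.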

ABSTRACT

Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by paged data movement and interconnect bandwidth rather than raw MAC count. This work proposes a unified system-accelerator co-design approach for transformer inference that jointly optimizes a matrix accelerator and its system integration through paged streaming dataflows and explicit overlap of compute and transfer. On the hardware side, we introduce MatrixFlow, a loosely coupled 16x16 systolic-array accelerator with a page-aligned block matrix multiplication method using 4 KB tiles, a small on-chip buffer of about 20 KB, and a pipelined schedule of DMA, compute, and DMA-out to utilize interconnect bandwidth efficiently. On the system side, we develop Gem5-AcceSys, an extension of the gem5 full-system simulator that explores standard interconnects such as PCIe and configurable memory hierarchies including Direct Memory, Direct Cache, and Device Memory modes with SMMU/TLB effects. We evaluate the co-design using gem5 simulations on representative transformer models including BERT and ViT across multiple data types and system setups. Results show up to 22x end-to-end speedup over a CPU-only baseline and 5x to 8x gains over state-of-the-art loosely and tightly coupled accelerators. We further show that a standard PCIe-based host-memory design can achieve about 80 percent of the performance of on-device HBM. Overall, paged streaming and pipeline overlap, rather than large local SRAMs, are the most effective levers for efficient transformer inference under realistic system constraints.

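Motivation and Goals

Address the bandwidth and data-movement bottlenecks of transformer inference, looking beyond a compute-only view.
Propose a dataflow- and system-aware accelerator design that maximizes streaming throughput while minimizing on-chip storage.
Combine a lightweight PCIe-based accelerator with a full-system simulator to capture realistic interconnect and memory effects.
Jointly optimize the software runtime, interconnect, and memory hierarchy to keep the matrix engine highly utilized.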

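Proposed Method

Introduce MatrixFlow: a 16×16 systolic-array accelerator with three 4 KB SRAM buffers and page-aligned 4 KB tiles for A, B, and C.
Stream data directly from host memory over PCIe DMA, using the SMMU for virtual-to-physical address translation.
Develop Gem5-AcceSys to model the PCIe interconnect, DMA engines, the SMMU, and a Linux driver for end-to-end evaluation.
Use a page-tiled data layout, with A stored row-major and B stored in row strips, so that each tile moves in a single-page DMA burst and TLB overhead is reduced.
Evaluate under the Direct Memory (DM), Direct Cache (DC), and Device Memory (DevMem) modes to study how data movement and locality affect performance.
Compare against a CPU baseline and existing loosely and tightly coupled accelerators in gem5 simulation.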
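
The page-aligned tiling described above can be illustrated with a small sketch. It is not the paper's implementation: the square tile shape, the FP32 element size, the helper names (tile_side, pack_tiles, blocked_matmul), and the page counter are all assumptions made for illustration. The idea it shows is that each tile is stored contiguously and fills exactly one 4 KB page, so one tile corresponds to one single-page transfer.

```python
import math

PAGE_BYTES = 4096  # 4 KB page size stated in the summary


def tile_side(elem_bytes):
    """Side of a square tile that fills exactly one page (square tiles are an assumption)."""
    elems = PAGE_BYTES // elem_bytes
    side = math.isqrt(elems)
    if side * side != elems:
        raise ValueError("no square tile fills a page exactly for this element size")
    return side


def pack_tiles(M, t):
    """Split a square matrix (list of rows) into t x t tiles, each a flat row-major list (one page)."""
    nb = len(M) // t
    return {
        (bi, bj): [M[bi * t + r][bj * t + c] for r in range(t) for c in range(t)]
        for bi in range(nb)
        for bj in range(nb)
    }


def tile_matmul_acc(a, b, acc, t):
    """acc += a @ b for flat row-major t x t tiles (stands in for the systolic array)."""
    for i in range(t):
        for k in range(t):
            aik = a[i * t + k]
            for j in range(t):
                acc[i * t + j] += aik * b[k * t + j]


def blocked_matmul(A, B, t):
    """Blocked C = A @ B; assumes the matrix size is a multiple of t.

    Returns C and the number of input pages fetched (a hypothetical counter).
    """
    n = len(A)
    nb = n // t
    At, Bt = pack_tiles(A, t), pack_tiles(B, t)
    C = [[0] * n for _ in range(n)]
    pages_in = 0
    for bi in range(nb):
        for bj in range(nb):
            acc = [0] * (t * t)  # one output tile accumulates on chip
            for bk in range(nb):
                tile_matmul_acc(At[bi, bk], Bt[bk, bj], acc, t)
                pages_in += 2  # one A page and one B page per step
            for r in range(t):
                C[bi * t + r][bj * t:(bj + 1) * t] = acc[r * t:(r + 1) * t]
    return C, pages_in
```

Under these assumptions an FP32 tile is 32×32 and an INT8 tile is 64×64. FP16 gives 2048 elements per page, which has no square shape, so a rectangular tile would be needed there.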

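Experimental Results

Research Questions

RQ1: How does streaming, page-aligned data movement affect the transformer inference throughput of a matrix accelerator?
RQ2: Can a loosely coupled accelerator with minimal on-chip storage reach high utilization under an optimized system design?
RQ3: What performance trade-offs do the DM, DC, and DevMem memory access modes show on end-to-end transformer workloads?
RQ4: How far can full-system hardware-software co-design close the gap between a CPU baseline and dedicated accelerators on BERT and ViT models?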

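Key Findings

End-to-end inference runs up to 22× faster than a CPU-only baseline.
MatrixFlow achieves 5× to 8× gains over state-of-the-art loosely and tightly coupled accelerators.
A standard PCIe-based host-memory design reaches about 80% of the performance of on-device HBM.
Paged streaming and pipeline overlap, rather than large local SRAMs, are the most effective levers for efficient transformer inference under realistic system constraints.
A 16×16 INT8/FP16/FP32 tensor engine with about 20 KB of on-chip SRAM can approach peak performance when dataflow and interconnect are co-optimized.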
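
The benefit of overlapping DMA-in, compute, and DMA-out can be made concrete with a simple timing model. This is a sketch, not the paper's model: it assumes constant per-tile stage times, one resource per stage, inbound and outbound transfers that can proceed at the same time, and enough buffering that a stage never stalls on a full buffer. The function names and the numbers in the usage note are hypothetical.

```python
def serial_time(n_tiles, t_in, t_compute, t_out):
    """No overlap: every tile pays for all three stages back to back."""
    return n_tiles * (t_in + t_compute + t_out)


def pipelined_time(n_tiles, t_in, t_compute, t_out):
    """Closed form for a 3-stage pipeline with constant stage times.

    The first tile fills the pipeline; after that, one tile completes every
    max(stage time), so the slowest stage sets the steady-state rate.
    """
    return (t_in + t_compute + t_out) + (n_tiles - 1) * max(t_in, t_compute, t_out)


def simulate_pipeline(n_tiles, t_in, t_compute, t_out):
    """Step through the tiles to cross-check the closed form."""
    in_done = compute_done = out_done = 0
    for _ in range(n_tiles):
        in_done += t_in  # DMA-in runs back to back
        compute_done = max(compute_done, in_done) + t_compute  # needs tile and free array
        out_done = max(out_done, compute_done) + t_out  # needs result and free link
    return out_done
```

For example, with 100 tiles and hypothetical stage times of 2, 3, and 2 time units, serial_time gives 700 units while pipelined_time gives 304. In this model the total approaches the cost of the slowest stage alone as the tile count grows, so once transfers are overlapped, a larger local buffer does not shorten the steady state.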

This walkthrough was generated by AI and reviewed by human editors.