QUICK REVIEW

[Paper Review] From Buffers to Registers: Unlocking Fine-Grained FlashAttention with Hybrid-Bonded 3D NPU Co-Design

Jinxin Yu, Yudong Pan|arXiv (Cornell University)|Feb 11, 2026

Interconnection Networks and Systems | Citations: 0

One-line summary

The paper presents 3D-Flow, a hybrid-bonded 3D-stacked NPU co-design that enables register-to-register dataflow for FlashAttention, achieving bubble-free vertical pipelines and significant energy and speed improvements over 2D/3D baselines.

ABSTRACT

Transformer-based models dominate modern AI workloads but exacerbate memory bottlenecks due to their quadratic attention complexity and ever-growing model sizes. Existing accelerators, such as Groq and Cerebras, mitigate off-chip traffic with large on-chip caches, while algorithmic innovations such as FlashAttention fuse operators to avoid materializing large attention matrices. However, as off-chip traffic decreases, our measurements show that on-chip SRAM accesses account for over 60% of energy in long-sequence workloads, making cache access the new bottleneck. We propose 3D-Flow, a hybrid-bonded, 3D-stacked spatial accelerator that enables register-to-register communication across vertically partitioned PE tiers. Unlike 2D multi-array architectures limited by NoC-based router-to-router transfers, 3D-Flow leverages sub-10 um vertical TSVs to sustain cycle-level operator pipelining with minimal overhead. On top of this architecture, we design 3D-FlashAttention, a fine-grained scheduling method that balances latency across tiers, forming a bubble-free vertical dataflow without on-chip SRAM roundtrips. Evaluations on Transformer workloads (OPT and QWEN models) show that our 3D spatial accelerator reduces 46-93% energy consumption and achieves 1.4x-7.6x speedups compared to state-of-the-art 2D and 3D designs.

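Motivation and objectives

Show that on-chip SRAM can become the new energy bottleneck as off-chip traffic decreases in long-sequence Transformer workloads.
Propose 3D-Flow: a hybrid-bonded, 3D-stacked systolic array that enables register-to-register communication.
Develop 3D-FlashAttention: a fine-grained scheduling strategy that balances latency across the vertical tiers to form a bubble-free dataflow.
Show that vertically stacked PEs with in-PE softmax and pipelining reduce on-chip traffic and improve the energy efficiency of LLM inference.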

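Proposed method

Introduce the 3D-Flow architecture: four vertically stacked PE tiers, connected by hybrid bonding through sub-10 μm TSVs, for register-to-register dataflow between dies.
Design the PE units of each tier to match one group of FlashAttention sub-operations (QK^T, max/subtract, exp/RowSum, PV/scaling).
Develop 3D-FlashAttention scheduling, which maps consecutive FlashAttention operations onto the vertical tiers as a latency-balanced dataflow.
Pass intermediate data directly through the TSVs, avoiding SRAM round trips and yielding a bubble-free vertical pipeline.
Evaluate energy and performance with cycle-accurate simulation and an RTL-verified four-tier 3D stack (assuming a 16 nm process).
Compare against the 2D-Unfused, 2D-Fused (FuseMax, FLAT, TileFlow), Dual-SA, and 3D-Base baselines on OPT and Qwen models at long sequence lengths.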
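The four sub-operation groups named above (QK^T, max/subtract, exp/RowSum, PV/scaling) can be illustrated in software with a minimal sketch of blockwise online-softmax attention. This is not the paper's scheduler or hardware mapping: the function names, the `block` parameter, and the exact split of work between stages are assumptions made for illustration, and the usual 1/sqrt(d) score scaling is omitted for brevity. Each commented stage stands in for the work one tier might perform, with values handed from stage to stage rather than written out as a full score matrix.

```python
import math
import random


def reference_attention(Q, K, V):
    """Plain softmax(QK^T)V that materializes a full score row per query."""
    d_v = len(V[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(d_v)])
    return out


def staged_flash_attention(Q, K, V, block=4):
    """Online-softmax attention over key/value blocks (hypothetical staging)."""
    d_v = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf   # largest score seen so far for this row
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * d_v         # unnormalized output accumulator
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            # Stage 1 (QK^T): partial scores for this key block.
            scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
            # Stage 2 (max/subtract): update the running max, shift scores.
            new_max = max(running_max, max(scores))
            shifted = [s - new_max for s in scores]
            # Stage 3 (exp/RowSum): exponentiate, update the running row sum.
            probs = [math.exp(s) for s in shifted]
            correction = math.exp(running_max - new_max)  # 0.0 on first block
            running_sum = running_sum * correction + sum(probs)
            # Stage 4 (PV/scaling): rescale the accumulator, add probs @ V.
            acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


if __name__ == "__main__":
    random.seed(0)
    Q = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
    K = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
    V = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(10)]
    ref = reference_attention(Q, K, V)
    got = staged_flash_attention(Q, K, V, block=4)
    worst = max(abs(a - b) for r, g in zip(ref, got) for a, b in zip(r, g))
    print("max abs difference:", worst)
```

The blockwise version agrees with the reference up to floating-point rounding, while only ever holding one block of scores at a time. That small per-block working set is what makes it plausible to keep intermediates in registers instead of a buffer.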

Figure 1 : Energy breakdown of operator fusion and unfusion with different sequence lengths for OPT.

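Experimental results

Research questions

RQ1: Can hybrid-bonded 3D integration enable cycle-level operator pipelining of FlashAttention without exchanging data through SRAM?
RQ2: How much energy and throughput gain can be expected from mapping FlashAttention sub-operators onto vertically stacked PEs with register-to-register communication?
RQ3: What are the energy, area, and thermal trade-offs of deploying a four-tier 3D-Flow stack for Transformer attention workloads?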

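Key findings

Energy consumption is 46% to 93% lower than the 2D baselines; averaged over sequence lengths from 1K to 64K, the reduction relative to the baselines is 32.7% to 64.2%.
Inference is on average 7.62x faster than 2D-Unfused, 1.46x faster than 2D-Fused, 2.36x faster than Dual-SA, and 1.43x faster than 3D-Base.
PE utilization averages 87% across the tested sequence lengths, achieved by minimizing memory traffic and keeping the vertical pipeline well balanced.
Intermediate results flow directly between vertically stacked PEs through the TSVs, eliminating SRAM round trips and enabling bubble-free execution.
Thermal analysis shows safe operating temperatures for the four-tier stack under reasonable packaging assumptions, with an internal temperature rise of about 2.8°C for a 128×128 PE array.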
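Why latency balancing matters for utilization can be seen in a textbook steady-state pipeline model, sketched below. This is a generic illustration and not the paper's analysis: the function name and the stage latencies are made-up assumptions, and the model ignores fill and drain time. In steady state each stage processes one tile per interval, the interval is set by the slowest stage, and faster stages idle for the difference.

```python
def pipeline_utilization(stage_latencies):
    """Average busy fraction across stages of a steady-state pipeline.

    The slowest stage sets the interval between tiles; every other
    stage is busy for its own latency and idle for the remainder.
    """
    interval = max(stage_latencies)
    return sum(stage_latencies) / (len(stage_latencies) * interval)


# Hypothetical per-tile latencies (in cycles) for four stages.
balanced = pipeline_utilization([4, 4, 4, 4])
skewed = pipeline_utilization([8, 2, 2, 4])
print(balanced)  # 1.0
print(skewed)    # 0.5
```

With equal latencies no stage waits, which corresponds to a bubble-free pipeline in this simplified model. With one slow stage, the same total work leaves the stages idle half the time on average.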

Figure 2 : Overview of 3D-stacked PE array architecture and the operator mapping of each layer.

This review was generated by AI and checked by a human editor.