QUICK REVIEW

[Paper Review] Explicit Multi-head Attention for Inter-head Interaction in Large Language Models

Runyu Peng, Yunhua Zhou|arXiv (Cornell University)|Jan 27, 2026

Multimodal Machine Learning Applications | Citations: 0

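One-line summary

The paper proposes Multi-head Explicit Attention (MEA), which explicitly models inter-head interaction using Head-level Linear Composition (HLC) and Group Normalization. This improves pretraining convergence and allows KV-cache memory to be cut by 50% without a large performance drop on knowledge and science tasks. The paper also gives a unified treatment of several attention mechanisms and supports KV-cache compression through low-rank reconstruction.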

ABSTRACT

In large language models built upon the Transformer architecture, recent studies have shown that inter-head interaction can enhance attention performance. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module that separately applies learnable linear combinations to the key and value vectors across heads, thereby enabling rich inter-head communication; and a head-level Group Normalization layer that aligns the statistical properties of the recombined heads. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence, ultimately resulting in lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter efficiency of MEA by reducing the number of attention heads and leveraging HLC to reconstruct them using low-rank "virtual heads". This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss on knowledge-intensive and scientific reasoning tasks, and only a 3.59% accuracy drop for Olympiad-level mathematical benchmarks.

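Motivation and objectives

Motivate inter-head communication as a way to improve attention performance in Transformers.
Propose MEA, which enables explicit inter-head interaction through Head-level Linear Composition.
Stabilize training with GroupNorm and relate MEA to existing variants such as DFA and THA.
Apply larger learning rates chosen via scaling laws, and use from-scratch pretraining comparisons to show faster convergence.
Demonstrate that KV-cache compression via low-rank reconstruction reduces memory while keeping performance loss small.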

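Proposed method

Define Head-level Linear Composition (HLC) to mix information across heads.
Build MEA by replacing K and V with their HLC-mixed versions and applying GroupNorm to the head outputs.
Provide a unified view in which DFA and THA appear as special cases of MEA.
Use scaling laws to select learning rates efficiently, and compare models pretrained from scratch.
Propose KV-cache compression via low-rank approximation, reducing memory by 50%.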
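The method steps above (HLC mixing of K and V, followed by head-level GroupNorm on the head outputs) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the tensor layout `(heads, tokens, head_dim)`, the function names `hlc`, `head_group_norm`, and `mea_attention`, the causal mask, and the choice of one normalization group per head with no learnable affine parameters are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hlc(x, mix):
    # Head-level linear composition (assumed form):
    # output head i is a learnable linear combination of input heads j.
    # x: (H_in, T, d), mix: (H_out, H_in) -> (H_out, T, d)
    return np.einsum("ij,jtd->itd", mix, x)

def head_group_norm(o, eps=1e-5):
    # Assumption: each head is its own group, so every token's d features
    # within a head are normalized to zero mean and unit variance.
    mean = o.mean(axis=-1, keepdims=True)
    var = o.var(axis=-1, keepdims=True)
    return (o - mean) / np.sqrt(var + eps)

def mea_attention(q, k, v, w_k, w_v):
    # q, k, v: (H, T, d); w_k, w_v: (H, H) mixing matrices (hypothetical names).
    T, d = q.shape[1], q.shape[2]
    k_mix = hlc(k, w_k)  # separate mixing for keys ...
    v_mix = hlc(v, w_v)  # ... and for values
    scores = np.einsum("htd,hsd->hts", q, k_mix) / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    probs = softmax(scores, axis=-1)
    out = np.einsum("hts,hsd->htd", probs, v_mix)
    return head_group_norm(out)

# Small demo with random inputs (illustrative sizes only).
rng = np.random.default_rng(0)
H, T, d = 4, 5, 8
q, k, v = (rng.normal(size=(H, T, d)) for _ in range(3))
w_k = rng.normal(size=(H, H))
w_v = rng.normal(size=(H, H))
out = mea_attention(q, k, v, w_k, w_v)
```

With identity mixing matrices, `hlc` leaves K and V unchanged, so the sketch reduces to standard causal attention followed by per-head normalization.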

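Experimental results

Research questions

RQ1: Does MEA improve optimization and final performance compared with the standard Transformer and other inter-head variants?
RQ2: How does GroupNorm affect MEA's training stability and representational diversity?
RQ3: Can MEA deliver a memory-efficient KV-cache without a large loss on knowledge and science tasks?
RQ4: How do DFA and the Talking-Heads variants relate to MEA under a unified theoretical view?
RQ5: What is the effect of KV-cache compression on complex reasoning benchmarks after continued pretraining?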

Key findings

Dataset	PIQA	OBQA	WinoGrande	HellaSwag	ARC-e	ARC-c	Avg.
Transformer	71.93	21.00	56.04	40.62	59.51	26.19	45.88
+GroupNorm	71.38	21.00	56.12	40.59	59.13	25.77	45.67
+DFA	71.76	22.20	54.38	41.29	60.69	27.82	46.36
Ours	73.18	19.80	54.14	42.02	61.57	27.65	46.39
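As a quick sanity check, the Avg. column can be recomputed from the six per-task scores. The snippet below is only a sketch: the numbers are copied from the table above, while the dictionary layout and the name `max_avg_deviation` are choices made for this example.

```python
# Per-task scores (PIQA, OBQA, WinoGrande, HellaSwag, ARC-e, ARC-c) and the
# reported average, copied from the table above.
rows = {
    "Transformer": ([71.93, 21.00, 56.04, 40.62, 59.51, 26.19], 45.88),
    "+GroupNorm":  ([71.38, 21.00, 56.12, 40.59, 59.13, 25.77], 45.67),
    "+DFA":        ([71.76, 22.20, 54.38, 41.29, 60.69, 27.82], 46.36),
    "Ours":        ([73.18, 19.80, 54.14, 42.02, 61.57, 27.65], 46.39),
}

def max_avg_deviation(table):
    # Largest gap between a recomputed mean and the reported Avg. value.
    return max(abs(sum(scores) / len(scores) - avg)
               for scores, avg in table.values())

deviation = max_avg_deviation(rows)
best = max(rows, key=lambda name: rows[name][1])
```

Every recomputed mean agrees with the reported value up to two-decimal rounding, and `best` picks out the row with the highest reported average.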

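MEA with GroupNorm achieves the best average downstream performance among the evaluated variants (46.39, versus 46.36 for +DFA and 45.88 for the Transformer baseline).
During pretraining, MEA tolerates larger learning rates than the baseline while remaining stable, and converges faster.
KV-cache memory usage can be reduced by 50% with minimal performance loss on knowledge-intensive and science tasks, and about a 3.59% accuracy drop on Olympiad-level math benchmarks under full compression with recovery.
DFA and THA can be interpreted within the MEA framework; DFA without GroupNorm degenerates to standard attention.
GroupNorm preserves inter-head interaction and stabilizes optimization, allowing MEA to outperform variants that lack normalization.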
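The 50% KV-cache figure follows from caching half as many heads and rebuilding the rest as low-rank "virtual heads" through HLC, as the abstract describes. The sketch below illustrates that arithmetic and the reconstruction step. The model configuration (24 layers, 16 heads, head dimension 64, 2 bytes per element) and the function names are hypothetical and chosen only for this example; they are not taken from the paper.

```python
import numpy as np

def kv_cache_bytes(n_layers, n_heads, seq_len, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are cached per layer, head, and token.
    return 2 * n_layers * n_heads * seq_len * head_dim * bytes_per_elem

def reconstruct_heads(cached, mix):
    # cached: (H_cached, T, d) heads actually kept in the cache.
    # mix:    (H_virtual, H_cached) HLC matrix mapping cached heads to the
    #         virtual heads used at attention time (assumed form).
    return np.einsum("ij,jtd->itd", mix, cached)

# Hypothetical configuration: cache 8 heads instead of 16.
full = kv_cache_bytes(n_layers=24, n_heads=16, seq_len=4096, head_dim=64)
half = kv_cache_bytes(n_layers=24, n_heads=8, seq_len=4096, head_dim=64)
ratio = half / full

rng = np.random.default_rng(0)
cached_k = rng.normal(size=(8, 32, 64))  # 8 cached key heads, 32 tokens
mix = rng.normal(size=(16, 8))           # stands in for a learned HLC matrix
virtual_k = reconstruct_heads(cached_k, mix)

# The 16 virtual heads span at most an 8-dimensional subspace of head space.
rank = np.linalg.matrix_rank(virtual_k.reshape(16, -1))
```

Because the virtual heads are linear combinations of the cached ones, their rank across the head axis cannot exceed the number of cached heads. That is the sense in which the reconstruction is low-rank, and it suggests why some accuracy loss on harder benchmarks is plausible.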

This review was generated by AI and checked by a human editor.