QUICK REVIEW

[Paper Review] Data-Aware Random Feature Kernel for Transformers

Amirhossein Farzam, Hossein Mobahi|arXiv (Cornell University)|Mar 4, 2026

Advanced Neural Network Applications|Citations: 0

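One-Sentence Summary

DARKFormer learns a data-aligned random-feature kernel for transformer attention, achieving importance-sampling-style variance reduction and improved finetuning at linear computational cost.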

ABSTRACT

Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel with positive random features drawn from an isotropic distribution. In pretrained models, however, queries and keys are typically anisotropic. This induces high Monte Carlo variance in isotropic sampling schemes unless one retrains the model or uses a large feature budget. Importance sampling can address this by adapting the sampling distribution to the input geometry, but complex data-dependent proposal distributions are often intractable. We show that by data aligning the softmax kernel, we obtain an attention mechanism which can both admit a tractable minimal-variance proposal distribution for importance sampling, and exhibits better training stability. Motivated by this finding, we introduce DARKFormer, a Data-Aware Random-feature Kernel transformer that features a data-aligned kernel geometry. DARKFormer learns the random-projection covariance, efficiently realizing an importance-sampled positive random-feature estimator for its data-aligned kernel. Empirically, DARKFormer narrows the performance gap with exact softmax attention, particularly in finetuning regimes where pretrained representations are anisotropic. By combining random-feature efficiency with data-aware kernels, DARKFormer advances kernel-based attention in resource-constrained settings.

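Motivation and Objectives

Address the quadratic cost of attention and the high Monte Carlo variance of isotropic random-feature methods.
Introduce a data-aligned kernel geometry that adapts to anisotropic query-key distributions.
Provide a tractable mechanism for importance sampling through a learned covariance, without per-sample weights.
Demonstrate improved performance and training stability when finetuning under a limited feature budget.
Validate the approach on Gemma-based models to show its practicality in resource-constrained settings.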

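Proposed Method

Replace the standard dot product with a Mahalanobis inner product q^T Sigma k, using a learnable covariance Sigma = M^T M.
Use data-aware random features for the kernel exp(q^T Sigma k), with the corresponding feature map phi_Sigma and projections omega ~ N(0, Sigma).
Show that learning Sigma induces an implicit importance-sampling effect that reduces Monte Carlo variance without explicit sample weights.
Provide theoretical justification: variance-optimal sampling aligns with the input geometry; in the Gaussian case, the optimal covariance is Sigma* = (I + 2Λ)(I - 2Λ)^{-1}, where Λ is the input covariance.
Argue that DARKFormer realizes a data-aligned sampling strategy, yielding better performance and training stability under a limited feature budget.
Validate empirically on Gemma models, focusing on finetuning scenarios with anisotropic query-key distributions.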
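
The kernel and feature map listed above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the feature map takes the Performer-style positive form phi_Sigma(x) = exp(omega^T x - x^T Sigma x / 2) / sqrt(m), which is unbiased for exp(q^T Sigma k) when omega ~ N(0, Sigma). The function names, the matrix M, and the example vectors are hypothetical.

```python
import numpy as np


def sample_omega(M, m, rng):
    """Draw m projections omega ~ N(0, Sigma) with Sigma = M^T M."""
    w = rng.standard_normal((m, M.shape[0]))  # rows w ~ N(0, I)
    return w @ M                              # rows M^T w ~ N(0, M^T M)


def data_aware_features(x, M, omega):
    """Assumed form of phi_Sigma: positive random features for exp(q^T Sigma k).

    x: (n, d) inputs, M: (r, d) factor of Sigma, omega: (m, d) projections.
    """
    m = omega.shape[0]
    mx = x @ M.T                                              # rows M x
    half_quad = 0.5 * np.sum(mx**2, axis=-1, keepdims=True)   # x^T Sigma x / 2
    return np.exp(x @ omega.T - half_quad) / np.sqrt(m)


def exact_kernel(q, k, M):
    """Exact Mahalanobis-softmax kernel exp(q^T Sigma k)."""
    sigma = M.T @ M
    return np.exp(q @ sigma @ k.T)


# Toy check on hypothetical values: the feature inner product approximates the kernel.
rng = np.random.default_rng(0)
M = np.diag([1.5, 1.0, 0.5, 0.25])            # anisotropic geometry, chosen arbitrarily
q = np.array([[0.3, -0.2, 0.5, 0.1]])
k = np.array([[0.1, 0.4, -0.3, 0.2]])
omega = sample_omega(M, 50_000, rng)
approx = data_aware_features(q, M, omega) @ data_aware_features(k, M, omega).T
exact = exact_kernel(q, k, M)
```

Sampling omega as M^T w means the same learnable matrix M defines both the kernel and the sampling distribution, which is one plausible reading of how a learned covariance can stand in for explicit importance weights. The large m here only tightens the numerical check; a practical feature budget would be far smaller.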

Figure 1: The random feature attention replaces the softmax kernel with a linear approximation in the feature space, reducing the quadratic complexity in sequence length ($L$) to linear in sequence length times sample size ($m$).
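
The complexity reduction described in the caption comes from reordering the matrix products so that the L x L score matrix is never formed. The sketch below illustrates this with standard Performer-style isotropic features; it is a non-causal, single-head toy under the assumption that any temperature scaling is already folded into q and k, and all names and sizes are hypothetical.

```python
import numpy as np


def positive_features(x, omega):
    """Performer-style positive random features for exp(q^T k), omega rows ~ N(0, I)."""
    m = omega.shape[0]
    half_sq = 0.5 * np.sum(x**2, axis=-1, keepdims=True)
    return np.exp(x @ omega.T - half_sq) / np.sqrt(m)


def linear_attention(phi_q, phi_k, v):
    """Non-causal random-feature attention in O(L * m * d_v) time."""
    kv = phi_k.T @ v               # (m, d_v): keys and values summarized once
    z = phi_k.sum(axis=0)          # (m,): normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]


def softmax_attention(q, k, v):
    """Exact attention for reference; forms the full (L, L) score matrix."""
    scores = q @ k.T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


rng = np.random.default_rng(0)
L, d, m = 8, 4, 20_000
q = 0.2 * rng.standard_normal((L, d))
k = 0.2 * rng.standard_normal((L, d))
v = rng.uniform(-1.0, 1.0, size=(L, d))
omega = rng.standard_normal((m, d))
approx = linear_attention(positive_features(q, omega), positive_features(k, omega), v)
exact = softmax_attention(q, k, v)
```

In this toy, m is deliberately much larger than L so that the approximation is tight; the cost advantage only appears in the opposite regime, where the sequence length far exceeds the feature budget.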

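Experimental Results

Research Questions

RQ1: Does data-aligned random-feature attention reduce Monte Carlo variance for anisotropic query-key distributions?
RQ2: Can DARKFormer's learned covariance narrow the gap to exact softmax attention with a small feature budget?
RQ3: Does the data-aware kernel geometry improve training stability and efficiency when finetuning from pretrained weights?
RQ4: How does the learned Sigma affect performance and robustness across learning rates and finetuning regimes?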
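
RQ1 can be probed on a toy problem before any model is trained. The sketch below is not the paper's experiment: it applies textbook importance sampling with explicit weights to the isotropic softmax kernel exp(q^T k), using the Gaussian-case covariance Sigma* = (I + 2Λ)(I - 2Λ)^{-1} as the proposal, whereas DARKFormer is described as avoiding such per-sample weights. It assumes that Sigma* is meant as the proposal covariance for this kernel and that Λ has eigenvalues below 1/2 so that Sigma* is positive definite. The function names and the example Λ, q, and k are hypothetical.

```python
import numpy as np


def optimal_sigma(lam):
    """Sigma* = (I + 2 Lambda)(I - 2 Lambda)^{-1}; needs eigenvalues of Lambda < 1/2."""
    eye = np.eye(lam.shape[0])
    if np.max(np.linalg.eigvalsh(lam)) >= 0.5:
        raise ValueError("I - 2*Lambda must be positive definite")
    return (eye + 2.0 * lam) @ np.linalg.inv(eye - 2.0 * lam)


def prf_samples(q, k, sigma, m, rng):
    """Per-sample importance-weighted estimates of exp(q^T k).

    Projections are drawn from N(0, sigma) and reweighted by the density ratio
    N(omega; 0, I) / N(omega; 0, sigma). With sigma = I the weights are all 1,
    which recovers the isotropic positive-random-feature estimator.
    """
    chol = np.linalg.cholesky(sigma)
    omega = rng.standard_normal((m, q.shape[0])) @ chol.T   # rows ~ N(0, sigma)
    f = np.exp(omega @ (q + k) - 0.5 * (q @ q + k @ k))
    _, logdet = np.linalg.slogdet(sigma)
    quad_sigma = np.sum((omega @ np.linalg.inv(sigma)) * omega, axis=-1)
    log_w = 0.5 * logdet - 0.5 * np.sum(omega**2, axis=-1) + 0.5 * quad_sigma
    return np.exp(log_w) * f


rng = np.random.default_rng(0)
lam = np.diag([0.1, 0.02])                  # hypothetical anisotropic input covariance
sigma_star = optimal_sigma(lam)
q = np.array([0.3, 0.1])
k = np.array([0.2, -0.1])
iso = prf_samples(q, k, np.eye(2), 100_000, rng)
aligned = prf_samples(q, k, sigma_star, 100_000, rng)
exact = np.exp(q @ k)
# Both sample means estimate `exact`; comparing iso.var() with aligned.var()
# over many (q, k) pairs drawn from N(0, lam) probes the variance question.
```

Because both estimators are unbiased, any difference between them shows up only in the spread of the per-sample values, which is the quantity RQ1 asks about.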

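Key Findings

DARKFormer narrows the performance gap to exact attention compared with the Performer's isotropic PRF baseline.
It achieves these gains without requiring a large number of feature samples or extensive retraining.
DARKFormer improves training stability across a range of learning rates during finetuning and suppresses loss spikes.
It is particularly advantageous for resource-constrained finetuning from pretrained weights.
In experiments with Gemma, it achieves higher next-token prediction accuracy than the Performer and is competitive with exact softmax attention.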

Figure 2: Next token prediction accuracy during pretraining (top) and finetuning (bottom) of the Gemma-2B model with a DARKFormer (green), a Performer (orange), learned feature kernel (LFK) (blue), a random baseline (yellow), a constant baseline (lime), and an exact softmax attention. The DARKFormer

This review was generated by AI and checked by a human editor.