QUICK REVIEW

[Paper Review] Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design

Biao Hou, Xiaolong Liu|arXiv (Cornell University)|Feb 10, 2026

Recommender Systems and Techniques | Citations: 0

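One-line summary

Kunlun presents a unified, model-efficient co-design for joint sequence and non-sequence modeling in massive-scale recommendation, achieving predictable scaling laws and roughly 2x the scaling efficiency of prior methods. It raises MFU from 17% to 37% and is deployed in Meta Ads models with measurable production impact.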

ABSTRACT

Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
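
To make the MFU figures concrete: MFU is the ratio of the model FLOPs actually executed per second to the hardware's peak FLOPs per second. The sketch below is a minimal illustration. The peak throughput, FLOPs per step, and step rates are hypothetical placeholders chosen so the arithmetic lands on 17% and 37%; they are not measurements from the paper, and the function names are ours.

```python
def mfu(model_flops_per_step: float, steps_per_second: float,
        peak_flops_per_second: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs/s over hardware peak FLOPs/s."""
    return model_flops_per_step * steps_per_second / peak_flops_per_second


def throughput_ratio(mfu_before: float, mfu_after: float) -> float:
    """Ratio of achieved model FLOPs/s on identical hardware."""
    return mfu_after / mfu_before


# Hypothetical placeholders, not values from the paper.
PEAK = 1.0e15            # assumed peak FLOPs/s of one accelerator
FLOPS_PER_STEP = 1.0e12  # assumed model FLOPs per training step

baseline = mfu(FLOPS_PER_STEP, 170, PEAK)   # assumed 170 steps/s
improved = mfu(FLOPS_PER_STEP, 370, PEAK)   # assumed 370 steps/s
print(f"{baseline:.2f} {improved:.2f} {throughput_ratio(0.17, 0.37):.2f}x")
```

Read this way, moving from 17% to 37% MFU means the same hardware executes a little over twice as many useful model FLOPs per second.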

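Research motivation and objectives

Identify the scaling-efficiency challenges in massive-scale recommendation systems that jointly model sequence and non-sequence features.
Propose a unified architecture (Kunlun) that closes the efficiency gap through low-level optimizations and high-level resource reallocation.
Establish and validate predictable scaling laws for joint sequence and non-sequence modeling.
Demonstrate production impact and deployment relevance in a large-scale advertising system.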

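Proposed method

Develop Kunlun as a multi-layer architecture comprising Kunlun Transformer Blocks (GDPA-enhanced PFFN and MHA) and Kunlun Interaction Blocks (Weight Generation, HSP, Global Interaction).
Introduce Generalized Dot-Product Attention (GDPA), which fuses the PFFN into a single unified kernel to raise MFU.
Replace naive sequence pooling with Hierarchical Seed Pooling (HSP) and SumKronLinear for more efficient sequence summarization and compression.
Apply Sliding Window Attention to reduce the cost of sequence modeling from O(T^2) to O(Tw), where T is the sequence length and w is the window size.
Implement high-level Computation Skip (CompSkip), which alternates computation across layers, and Event-level Personalization, which allocates resources per event type.
Implement a Global Interaction module using a Mixture of Wukong Experts for cross-modal learning, enabling horizontal (expert-parallel) and vertical (layer-stacking) scaling.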
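
The O(T^2) to O(Tw) reduction from Sliding Window Attention can be seen by counting the (query, key) pairs that the attention mask keeps. The sketch below is a minimal illustration, assuming a causal window in which each position attends to itself and the w-1 positions before it. The paper's exact window layout is not given in this review, and the function names are hypothetical.

```python
def sliding_window_mask(T: int, w: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Assumption: causal window covering position i and the w-1 positions before it.
    """
    return [[0 <= i - j < w for j in range(T)] for i in range(T)]


def windowed_pairs(T: int, w: int) -> int:
    """Number of attended (query, key) pairs under the window; at most T * w."""
    return sum(min(i + 1, w) for i in range(T))


def full_pairs(T: int) -> int:
    """Number of attended pairs under full (unmasked) attention: T * T."""
    return T * T


T, w = 8, 3
mask = sliding_window_mask(T, w)
kept = sum(cell for row in mask for cell in row)
print(kept, windowed_pairs(T, w), T * w, full_pairs(T))
```

Because the kept pair count is bounded by T * w, attention cost grows linearly in T for a fixed window, rather than quadratically.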

Figure 1 : Overview of the Kunlun architecture. The model is composed of multiple stacked layers, and each layer includes two main components: (1) a Kunlun Transformer block, which incorporates GDPA-enhanced PFFN and Multi-Head Self-Attention (MHA) to enable context-aware sequence modeling; and (2)

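Experimental results

Research questions

RQ1: Can Kunlun achieve predictable scaling laws for joint sequence and non-sequence modeling in production-scale recommendation systems?
RQ2: How much do the low-level optimizations (GDPA, HSP, Sliding Window Attention) affect model efficiency and MFU?
RQ3: How do the high-level strategies (CompSkip, Event-level Personalization) affect performance and computational efficiency?
RQ4: Does deploying Kunlun in Meta Ads models deliver production impact?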

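Key findings

Kunlun improves model FLOPs utilization, raising MFU from 17% to 37%.
Kunlun delivers roughly 2x the scaling efficiency of state-of-the-art methods.
Kunlun exhibits predictable scaling behavior, establishing the first scaling laws for joint sequence and non-sequence modeling in recommendation systems.
In production, Kunlun improved topline metrics by 1.2% on major Meta Ads models.
Compared with the Wukong and InterFormer baselines, Kunlun shows larger NE gains at the 6, 60, and 180 GFLOPs scales (improvements of 0.31%, 0.66%, and 0.79%, respectively).
Kunlun's architecture enables vertical and horizontal scaling through layer stacking and expert parallelism.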
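
A power-law scaling relationship is usually checked by fitting a straight line in log-log space. The sketch below assumes the form L(C) = a * C^(-b) for a loss-like metric L and compute budget C. That functional form, the name fit_power_law, and the synthetic data points are illustrative assumptions, not the paper's fitted law or its measurements.

```python
import math


def fit_power_law(compute: list[float], loss: list[float]) -> tuple[float, float]:
    """Least-squares fit of loss = a * compute**(-b) in log-log space.

    Returns (a, b). Assumes all inputs are strictly positive.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from a known law (a=2.0, b=0.1); NOT from the paper.
compute = [1.0, 10.0, 100.0, 1000.0]
loss = [2.0 * c ** -0.1 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a={a:.3f}, b={b:.3f}")
```

With real measurements, a close-to-linear relationship in log-log space (small residuals around the fitted line) is what "predictable scaling" amounts to in practice: the fit can then be extrapolated to larger compute budgets.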

Figure 2 : Comparison between (a) the original PFFN, and (b) our GDPA-enhanced PFFN. Note: Both are one-block demos.


This review was generated by AI and checked by a human editor.