QUICK REVIEW

[Paper Digest] Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design

Biao Hou, Xiaolong Liu|arXiv (Cornell University)|Feb 10, 2026

Recommender Systems and Techniques | Cited by 0

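One-Sentence Summary

Kunlun introduces a unified architecture, co-designed for model efficiency, for joint sequence and non-sequence modeling in large-scale recommendation. It yields predictable scaling laws with roughly 2x the scaling efficiency of prior methods, raises MFU from 17% to 37%, and is deployed in production in Meta Ads with measurable production gains.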

ABSTRACT

Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
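
As a rough illustration of the MFU figures quoted in the abstract, the sketch below computes MFU under its usual definition: the model FLOPs actually delivered per second divided by the hardware's theoretical peak. The function name `mfu`, its parameters, and every number in the usage lines are hypothetical placeholders chosen for illustration; they are not measurements from the paper, and the peak value is not a real B200 specification.

```python
def mfu(model_flops_per_step: float,
        step_time_s: float,
        num_devices: int,
        peak_flops_per_device: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over peak FLOP/s.

    All arguments are illustrative inputs supplied by the caller.
    """
    if step_time_s <= 0 or num_devices <= 0 or peak_flops_per_device <= 0:
        raise ValueError("step time, device count, and peak FLOP/s must be positive")
    achieved_flops_per_s = model_flops_per_step / step_time_s
    peak_flops_per_s = num_devices * peak_flops_per_device
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical numbers only: a step needing 3.7e14 model FLOPs that takes
# 1 second on one device with an assumed peak of 1e15 FLOP/s.
print(f"{mfu(3.7e14, 1.0, 1, 1.0e15):.2f}")  # prints 0.37
```

Under this definition, raising MFU at a fixed FLOP budget means the same model finishes a step in less wall-clock time on the same hardware.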

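Motivation and Objectives

Identify the scaling-efficiency challenges of large-scale recommendation systems that jointly model sequence and non-sequence features.
Propose a unified architecture (Kunlun) that narrows the efficiency gap through low-level optimizations and high-level resource reallocation.
Establish and validate predictable scaling laws for joint sequence and non-sequence modeling.
Demonstrate production impact and deployment relevance in a large-scale ads system.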

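Proposed Method

Build Kunlun as a multi-layer architecture comprising Kunlun Transformer Blocks (GDPA-enhanced PFFN and MHA) and Kunlun Interaction Blocks (weight generation, HSP, global interaction).
Introduce Generalized Dot-Product Attention (GDPA), which fuses the PFFN computation into a single kernel to raise MFU.
Replace naive sequence pooling with Hierarchical Seed Pooling (HSP) and SumKronLinear for efficient sequence summarization and compression.
Apply sliding window attention to reduce sequence-modeling complexity from O(T^2) to O(Tw), where T is the sequence length and w is the window size.
Implement high-level Computation Skip (CompSkip), which alternates computation across layers, and allocate resources by event type through Event-level Personalization.
Present a global interaction module that uses a Mixture of Wukong Experts for cross-modal learning and scales both horizontally (expert parallelism) and vertically (layer stacking).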
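
The complexity reduction from sliding window attention can be sketched in a few lines. This is a minimal single-head illustration in plain Python, assuming a causal window in which each position attends to itself and the previous w - 1 positions; the digest does not say whether Kunlun's window is causal, and the function names below are hypothetical. It is not the fused kernel a production system would use.

```python
import math


def sliding_window_attention(q, k, v, window):
    """Single-head attention where position t sees only the last `window` keys.

    q, k, v are lists of T vectors (lists of floats). The causal window is an
    assumption made for this sketch.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for t in range(len(q)):
        lo = max(0, t - window + 1)  # first key position visible to t
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s]))
                  for s in range(lo, t + 1)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - top) for x in scores]
        z = sum(exps)
        out.append([sum(e * v[lo + i][j] for i, e in enumerate(exps)) / z
                    for j in range(len(v[0]))])
    return out


def attended_pairs(seq_len, window):
    """Number of (query, key) pairs scored: about T*w instead of about T^2/2."""
    return sum(min(t + 1, window) for t in range(seq_len))


# With T = 8, a window of 3 scores 21 pairs; full causal attention scores 36.
print(attended_pairs(8, 3), attended_pairs(8, 8))  # prints 21 36
```

Because the inner loop never looks at more than w keys, the work grows linearly in T for a fixed window, which is the O(Tw) behavior named above.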

Figure 1 : Overview of the Kunlun architecture. The model is composed of multiple stacked layers, and each layer includes two main components: (1) a Kunlun Transformer block, which incorporates GDPA-enhanced PFFN and Multi-Head Self-Attention (MHA) to enable context-aware sequence modeling; and (2)

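Experimental Results

Research Questions

RQ1 Can Kunlun achieve predictable scaling laws for joint sequence and non-sequence modeling in production-scale recommendation systems?
RQ2 How do the low-level optimizations (GDPA, HSP, sliding window attention) affect model efficiency and MFU?
RQ3 How do the high-level strategies (CompSkip, Event-level Personalization) affect performance and computational efficiency?
RQ4 What is the production impact of deploying Kunlun in Meta Ads models?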

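Key Findings

Kunlun raises Model FLOPs Utilization (MFU) from 17% to 37% on NVIDIA B200 GPUs.
Kunlun delivers roughly 2x the scaling efficiency of state-of-the-art methods.
Kunlun exhibits predictable scaling behavior, providing the first scaling law for joint sequence and non-sequence modeling in recommendation systems.
In production, Kunlun delivers a 1.2% improvement in topline metrics on major Meta Ads models.
Compared with the Wukong and InterFormer baselines, Kunlun achieves larger NE gains at the 6, 60, and 180 GFLOPs scales (0.31%, 0.66%, and 0.79% NE improvement, respectively).
Kunlun's architecture scales both vertically and horizontally, through layer stacking and expert parallelism.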
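
To make "predictable scaling behavior" concrete, the sketch below fits a power law of the form gain = a * C^b to (compute, gain) pairs by least squares in log-log space and then extrapolates. This functional form is an assumption; the digest does not state which form the paper fits. The data points are synthetic, generated from a known curve at the 6, 60, and 180 GFLOPs scales mentioned above, and are not the paper's results. The names `fit_power_law` and `predict` are hypothetical.

```python
import math


def fit_power_law(compute, metric):
    """Fit metric = a * compute**b via ordinary least squares on logs."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(m) for m in metric]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log space = log(a)
    return a, b


def predict(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b


# Synthetic data from a known curve (a = 2.0, b = 0.25), for illustration only.
gflops = [6.0, 60.0, 180.0]
gains = [2.0 * c ** 0.25 for c in gflops]
a, b = fit_power_law(gflops, gains)
print(f"a={a:.3f}, b={b:.3f}, predicted at 360 GFLOPs: {predict(a, b, 360.0):.3f}")
```

A scaling law is useful in practice when a fit like this, made on small budgets, keeps predicting the metric well at larger budgets; whether that holds for a given system has to be checked against held-out runs.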

Figure 2 : Comparison between (a) the original PFFN, and (b) our GDPA-enhanced PFFN. Note: Both are one-block demos.

This digest was generated by AI and reviewed by a human editor.