QUICK REVIEW

[Paper Digest] CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

Yixin Nie, Lin Guan|arXiv (Cornell University)|Mar 2, 2026


One-Sentence Summary

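CharacterFlywheel is an iterative, production-scale flywheel for improving engaging and steerable LLMs in Meta's social apps, combining data curation, reward modeling, SFT, RL, and offline/online evaluation to deliver steady gains in online engagement and better steerability.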

ABSTRACT

This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.

Motivation and Objectives

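Increase engagement breadth and depth for social chat LLMs on Instagram, WhatsApp, Messenger, and the web.
Develop a scalable, iterative workflow that integrates data curation, reward modeling, supervised fine-tuning, and reinforcement learning.
Improve character steerability and reduce safety and compliance violations in production deployments.
Guide iterative improvement through robust evaluation with both offline and online methods.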

Proposed Method

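An iterative development loop spanning 15 model generations, with continuous deployments from July 2024 to April 2025.
A data pipeline that combines internal feedback with curated production data to build training sets.
Reward models, including Bradley-Terry preference models (pointwise and pairwise) and auxiliary user-signal models.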
Rejection sampling to create post-training data aligned with the policy-optimization objective, forming a top-down-like strategy.
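Supervised fine-tuning (SFT) on top of Llama 3.1 70B, followed by DPO and online RL (a GRPO variant) to optimize engagement.
Artifact mitigation to prevent overfitting to surface-level style features (length, emoji use, and so on).
Offline evaluation on community benchmarks and human side-by-side comparisons, plus online A/B tests on 10% of traffic, to measure engagement lift.
Safeguards including tiered evaluation, safety and privacy controls, fail-closed design, and upstream privacy checks.
Image generation as part of character interactions to boost engagement.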
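
The pairwise Bradley-Terry objective and the rejection-sampling step listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`bt_pairwise_loss`, `rejection_sample`), the toy scoring function, and the candidate strings are all hypothetical and introduced here only for illustration.

```python
import math
from typing import Callable, Sequence


def bt_pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one
    under a Bradley-Terry model: P(chosen > rejected) = sigmoid(r_c - r_r)."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); pick the branch
    # that never exponentiates a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def rejection_sample(candidates: Sequence[str],
                     score: Callable[[str], float]) -> str:
    """Best-of-n rejection sampling: keep the highest-scoring candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_reward(text: str) -> float:
    """Hypothetical stand-in for a learned reward model: counts distinct words."""
    return float(len(set(text.split())))


if __name__ == "__main__":
    # The loss shrinks as the chosen response's reward pulls ahead.
    print(bt_pairwise_loss(2.0, 0.0), bt_pairwise_loss(0.0, 2.0))
    # Keep the best of three sampled replies according to the toy reward.
    print(rejection_sample(["hi", "hi there friend", "hi there"], toy_reward))
```

In a real pipeline the scores would come from a trained reward model rather than `toy_reward`, and the selected samples would feed the next round of fine-tuning data.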

Experimental Results

Research Questions

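RQ1: How can an iterative, production-scale flywheel steadily improve engagement metrics for social chat LLMs?
RQ2: How do reward modeling and RL strategies affect engagement breadth/depth and steerability in production?
RQ3: How do offline and online evaluation work together to guide model selection and deployment decisions?
RQ4: Which safety and privacy mechanisms are essential when deploying LLMs to millions of users in social applications?
RQ5: How do data curation and rejection sampling shape learning without overfitting to surface cues?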

Key Findings

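In controlled 7-day A/B tests, 7 of 8 newly deployed models showed positive engagement lift over the baseline.
The strongest models achieved up to 8.8% improvement in engagement breadth and 19.4% in engagement depth.
Instruction following rose from 59.2% to 84.8% (improved steerability).
Instruction violations fell from 26.6% to 5.8% (improved steerability).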
Fifteen generations of CharacterFlywheel models were developed from January 2024 to September 2025, with a large-scale public deployment on July 29, 2024.
Both offline reward-model win rates and online engagement metrics were used to guide deployment decisions.
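
The steerability figures above are reported as absolute rates, so it can help to separate the change in percentage points from the relative change. The sketch below does only that arithmetic on the two reported before/after pairs; the helper name `change` is hypothetical, and the derived values are not figures quoted from the paper.

```python
from typing import Tuple


def change(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_pp = after_pct - before_pct
    relative_pct = 100.0 * absolute_pp / before_pct
    return absolute_pp, relative_pct


if __name__ == "__main__":
    # Before/after rates as reported in the abstract.
    for name, before, after in [
        ("instruction following", 59.2, 84.8),
        ("instruction violations", 26.6, 5.8),
    ]:
        absolute_pp, relative_pct = change(before, after)
        print(f"{name}: {absolute_pp:+.1f} pp, {relative_pct:+.1f}% relative")
```

Under these numbers, instruction following gains about 25.6 percentage points (roughly 43% relative), while violations drop about 20.8 points (roughly 78% relative).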


This digest was generated by AI and reviewed by human editors.