QUICK REVIEW

[Paper Digest] Social-R1: Towards Human-like Social Reasoning in LLMs

Jincenzi Wu, Yuxuan Lei | arXiv (Cornell University) | Mar 10, 2026

Multimodal Machine Learning Applications | Cited by 0

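One-sentence summary

Social-R1 introduces ToMBench-Hard and a trajectory-based reinforcement learning framework that aligns LLM reasoning with human social cognition, allowing smaller models to match much larger ones across eight social reasoning benchmarks.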

ABSTRACT

While large language models demonstrate remarkable capabilities across numerous domains, social intelligence - the capacity to perceive social cues, infer mental states, and generate appropriate responses - remains a critical challenge, particularly for enabling effective human-AI collaboration and developing AI that truly serves human needs. Current models often rely on superficial patterns rather than genuine social reasoning. We argue that cultivating human-like social intelligence requires training with challenging cases that resist shortcut solutions. To this end, we introduce ToMBench-Hard, an adversarial benchmark designed to provide hard training examples for social reasoning. Building on this, we propose Social-R1, a reinforcement learning framework that aligns model reasoning with human cognition through multi-dimensional rewards. Unlike outcome-based RL, Social-R1 supervises the entire reasoning process, enforcing structural alignment, logical integrity, and information density. Results show that our approach enables a 4B parameter model to surpass much larger counterparts and generalize robustly across eight diverse benchmarks. These findings demonstrate that challenging training cases with trajectory-level alignment offer a path toward efficient and reliable social intelligence.

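Motivation and goals

Motivate the need for genuine social intelligence in LLMs that goes beyond surface patterns.
Introduce ToMBench-Hard, an adversarial benchmark that exposes shortcut learning in social reasoning.
Propose Social-R1, a trajectory-level reinforcement learning framework guided by principles of human cognition.
Show that aligned reasoning trajectories enable parameter-efficient social intelligence.
Provide ablations and analyses that validate the importance of structured reasoning and content integrity.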

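Proposed method

Create ToMBench-Hard using six ATOMS-based factors of social intelligence together with adversarial perturbations.
Develop multi-dimensional rewards grounded in Social Information Processing (SIP): R_struct, R_content, and R_len.
Use an R_fmt reward to enforce a predefined reasoning format and deterministic trajectory extraction.
Train with Group Relative Policy Optimization on the Qwen3-4B and Qwen3-8B backbones.
Build SocialPairs-20K, which uses silver-standard stage-wise reasoning, to train R_content.
Evaluate on eight social benchmarks: ToMBench, ToMBench-Hard, SocialIQA, EmoBench, MotiveBench, SimpleToM, Hi-ToM, and TactfulToM.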
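
To make the reward design above more concrete, here is a minimal Python sketch of a format-gated, multi-dimensional reward. It is an illustration under stated assumptions, not the paper's implementation: the stage tags in `STAGES`, the weights, the token budget, and the weighted-sum combination are all hypothetical, and `r_struct` and `r_content` are passed in as precomputed scores because this digest does not describe how those reward models are computed. Any answer-correctness term is omitted.

```python
import re

# Hypothetical stage tags; the paper's actual reasoning format is not given here.
STAGES = ("cues", "interpretation", "goals", "response")


def extract_stages(output: str):
    """Deterministic parse: the whole output must be the tagged blocks in order.

    Returns {stage: text} on success and None when the format is violated.
    """
    pattern = "".join(rf"\s*<{s}>(.*?)</{s}>" for s in STAGES) + r"\s*"
    match = re.fullmatch(pattern, output, flags=re.DOTALL)
    if match is None:
        return None
    return {s: body.strip() for s, body in zip(STAGES, match.groups())}


def length_reward(n_tokens: int, budget: int = 256) -> float:
    """Assumed R_len shape: full reward within budget, linear decay to 0 at 2x budget."""
    if n_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (n_tokens - budget) / budget)


def total_reward(output: str, r_struct: float, r_content: float,
                 weights=(0.4, 0.4, 0.2)) -> float:
    """Assumed combination: the format check gates a weighted sum of the other terms."""
    stages = extract_stages(output)
    if stages is None:
        return 0.0  # format violated, so no trajectory can be scored
    n_tokens = sum(len(text.split()) for text in stages.values())
    w_struct, w_content, w_len = weights
    return (w_struct * r_struct
            + w_content * r_content
            + w_len * length_reward(n_tokens))
```

Gating on the format check first mirrors the idea that trajectory-level rewards can only be scored once a trajectory has been extracted; whether the paper gates or simply adds R_fmt is not stated in this digest.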

Figure 1: Social-R1 for Human-like and Efficient Social Reasoning. By integrating SIP-guided rewards into reinforcement learning, Social-R1 mitigates reasoning shortcuts and enforces structured human-like social inference, improving both accuracy and efficiency across model scales. Detailed cases a

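Experimental results

Research questions

RQ1: Does adversarial ToMBench-Hard data reveal genuine social reasoning rather than shortcut learning in LLMs?
RQ2: Does trajectory-level supervision improve social reasoning more than outcome-based rewards?
RQ3: How do the structural-progression, content-integrity, and efficiency rewards affect reasoning quality and robustness?
RQ4: Can smaller models reach performance comparable to larger models through trajectory-aligned training?
RQ5: What is the effect of each reward component on in-domain and out-of-domain social reasoning benchmarks?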

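Key findings

ToMBench-Hard shows a large performance gap between human experts and frontier models, exposing shortcut learning in current LLMs.
Social-R1 improves social reasoning across the eight benchmarks, with smaller models even outperforming larger ones in several settings.
Ablations show that R_len, R_struct, and R_content each contribute to performance; removing any of them lowers accuracy and reasoning quality.
Social-R1-4B can surpass some 70B-scale models on in-domain metrics and matches or exceeds larger models on several out-of-domain tasks.
Analysis shows that Social-R1 relies less on option-level shortcuts and produces stage-consistent SIP trajectories with higher information density.
Robustness tests with distractors show that the reasoning stays concise and selective rather than growing longer through unnecessary processing.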

Figure 2: Option-Mention Density across SIP reasoning stages.
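
This digest does not define option-mention density, so the following is only one plausible reading: the fraction of whitespace-separated tokens in each reasoning stage that refer to an answer option. The regular expression, the A-D option labels, and the stage names in the usage example are all assumptions made for illustration.

```python
import re

# Assumed surface forms of an option reference: "option B" or "(B)".
OPTION_RE = re.compile(r"\b[Oo]ption\s+[A-D]\b|\([A-D]\)")


def option_mention_density(stages: dict) -> dict:
    """Map each stage name to (option mentions) / (whitespace tokens) in its text."""
    density = {}
    for name, text in stages.items():
        n_tokens = len(text.split())
        n_mentions = len(OPTION_RE.findall(text))
        density[name] = n_mentions / n_tokens if n_tokens else 0.0
    return density


# Hypothetical two-stage trajectory for a false-belief question.
example = {
    "cues": "Sally leaves the room",
    "response": "So option B fits best",
}
densities = option_mention_density(example)
```

Under this reading, a model that reasons from the scenario first would show low density in early stages and mention options mainly at the decision stage, while a shortcut-prone model would reference options throughout.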

This digest was generated by AI and reviewed by a human editor.