QUICK REVIEW

[Paper Digest] Email in the Era of LLMs

Dang Nguyen, Harvey Yiyun Fu|arXiv (Cornell University)|Mar 6, 2026

Artificial Intelligence in Healthcare and Education | Cited by 0

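One-Sentence Summary

The paper introduces HR Simulator™, a game for studying human and LLM email writing, and uses it to show the advantage of hybrid human+LLM writing and how model size affects judgments of email quality, tone, and politeness.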

ABSTRACT

Email communication increasingly involves large language models (LLMs), but we lack intuition on how they will read, write, and optimize for nuanced social goals. We introduce HR Simulator, a game where communication is the core mechanic: players play as a Human Resources officer and write emails to solve socially challenging workplace scenarios. An analysis of 600+ human and LLM emails with LLMs-as-judge reveals evidence for larger LLMs becoming more homogenous in their email quality judgments. Under LLM judges, humans underperform LLMs (e.g., 23.5% vs. 48-54% success rate), but a human+LLM approach can outperform LLM-only (e.g., from 40% to nearly 100% in one scenario). In cases where models' email preferences disagree, emergent tact is a plausible explanation: weaker models prefer less tactful strategies while stronger models prefer more tactful ones. Regarding tone, LLM emails are more formal and empathetic while human emails are more varied. LLM rewrites make human emails more formal and empathetic, but models still struggle to imitate human emails in the low empathy, low formality quadrant, which highlights a limitation of current post-training approaches. Our results demonstrate the efficacy of communication games as instruments to measure communication in the era of LLMs, and posit human-LLM co-writing as an effective form of communication in that future.

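Motivation and Objectives

Understand how LLMs read, write, and optimize emails for social goals in workplace settings.
Introduce HR Simulator™ to measure and compare human, AI, and hybrid email writing across scenarios.
Characterize how LLM judgments of email quality converge as model size grows.
Explore how tone, empathy, formality, and tact affect email effectiveness as judged by AI.
Offer guidance for future human–LLM collaboration in email communication.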

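Proposed Method

Develop HR Simulator™, in which players act as a Human Resources officer and write emails to resolve workplace scenarios.
Use GPT-4o as the in-game judge, simulating the recipient and the outcome in each of five scenarios.
Analyze more than 600 human and LLM emails, evaluated by multiple LLM judges ranging from small to large models.
Apply Elo ratings to compare judges' email preferences within the same scenario.
Annotate emails for politeness, empathy, and formality to interpret tone and how it aligns with model preferences.
Run post-hoc analyses to assess how judge size and inter-judge agreement affect pass rates and perceived quality.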
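
The Elo step above can be illustrated with a minimal sketch. It assumes the standard Elo update with K = 32 and an initial rating of 1000; the digest does not state the paper's actual parameters, and the email identifiers and preference pairs below are hypothetical.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A when compared against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(comparisons, k: float = 32.0, initial: float = 1000.0):
    """Rate emails from one judge's pairwise preferences.

    comparisons: iterable of (preferred_id, other_id) pairs.
    k and initial are assumed values, not taken from the paper.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        e_w = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - e_w)  # winner gains exactly what the loser gives up
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical preferences from a single judge within one scenario.
prefs = [
    ("email_llm", "email_human"),
    ("email_hybrid", "email_llm"),
    ("email_hybrid", "email_human"),
]
ratings = elo_ratings(prefs)
for email_id, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{email_id}: {rating:.1f}")
```

Running the same procedure once per judge gives one ranking per judge, and those rankings can then be compared to see where judges' preferences diverge. Note that sequential Elo depends on the order of comparisons, so averaging over shuffled orders is a common precaution.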

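Experimental Results

Research Questions

RQ1: How do success rates differ between human and LLM emails in socially challenging workplace scenarios?
RQ2: Do LLM judgments become more uniform as model size increases, and how does this affect preferences for AI-written content?
RQ3: Does human+LLM collaboration produce more effective emails than humans or LLMs alone?
RQ4: What roles do politeness, empathy, and formality each play when models judge email quality?
RQ5: Do current post-training methods have a systematic gap in producing low-empathy, low-formality emails?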

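Key Findings

Humans alone pass 23.5% of the time on average, versus 48-54% for the top LLMs; in some scenarios, human+LLM rewrites outperform both.
LLM judges rate LLM-written emails higher than human-written ones.
As model size increases, LLM judges become more homogeneous in their quality judgments, reaching agreement of about 0.5 Krippendorff's alpha.
Weaker judges prefer more direct emails, while stronger judges prefer more strategic and nuanced ones, a phenomenon the paper calls emergent tact.
LLM rewrites tend to make human emails more formal and empathetic, moving them toward the high-empathy, high-formality quadrant; LLMs struggle, however, to imitate low-empathy, low-formality emails.
The human–LLM hybrid advantage arises because rewritten human emails fall within the politeness range that GPT-4o prefers, which raises pass rates under several judges (e.g., GPT-4o and Claude 3.5 Haiku in Scenario 1).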
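
To make the agreement figure above concrete, here is a minimal sketch of Krippendorff's alpha for nominal labels. It assumes judges give pass/fail verdicts and that the nominal variant applies; the digest does not say which variant or data layout the paper uses, and the verdicts below are made up for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the
    judges assigned to one email (missing verdicts are simply omitted).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single verdict cannot be paired with anything
        # Every ordered pair of verdicts on the same email, weighted 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one email with two or more verdicts")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Made-up verdicts: each row is one email, each entry one judge.
verdicts = [
    ["pass", "pass"],
    ["pass", "fail"],
    ["fail", "fail"],
    ["fail", "pass"],
]
alpha = krippendorff_alpha_nominal(verdicts)
print(round(alpha, 3))
```

An alpha of 1 means perfect agreement and values near 0 mean agreement no better than chance, so a value around 0.5 indicates moderate rather than strong consensus among judges.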

[Figure, panel (b): Where LLM rewrites take human emails.]


This digest was generated by AI and reviewed by a human editor.