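[Paper Summary] Emotional Cost Functions for AI Safety: Teaching Agents to Feel the Weight of Irreversible Consequences
The paper proposes a framework of Qualitative Suffering States and a four-component architecture that lets AI agents internalize irreversible consequences, going beyond numerical penalties to guide wiser, context-rich decisions.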
Humans learn from catastrophic mistakes not through numerical penalties, but through qualitative suffering that reshapes who they are. Current AI safety approaches replicate none of this. Reward shaping captures magnitude, not meaning. Rule-based alignment constrains behaviour, but does not change it. We propose Emotional Cost Functions, a framework in which agents develop Qualitative Suffering States, rich narrative representations of irreversible consequences that persist forward and actively reshape character. Unlike numerical penalties, qualitative suffering states capture the meaning of what was lost, the specific void it creates, and how it changes the agent's relationship to similar future situations. Our four-component architecture (Consequence Processor, Character State, Anticipatory Scan, and Story Update) is grounded in one principle: actions cannot be undone, and agents must live with what they have caused. Anticipatory dread operates through two pathways. Experiential dread arises from the agent's own lived consequences. Pre-experiential dread is acquired without direct experience, through training or inter-agent transmission. Together they mirror how human wisdom accumulates across experience and culture. Ten experiments across financial trading, crisis support, and content moderation show that qualitative suffering produces specific wisdom rather than generalised paralysis. Agents correctly engage with moderate opportunities at 90-100% while numerical baselines over-refuse at 90%. Architecture ablation confirms the mechanism is necessary. The full system generates ten personal grounding phrases per probe vs. zero for a vanilla LLM. Statistical validation (N=10) confirms reproducibility at 80-100% consistency.
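Motivation and goals
- Motivate a shift in AI safety from numerical penalties to qualitative, identity-bearing consequences.
- Introduce Qualitative Suffering States and a four-component architecture for LLM-based agents.
- Show experimentally that qualitative suffering yields wiser, more discriminating behavior than numerical baselines.
- Demonstrate character transfer across interactions and transmission of suffering between agents.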
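Proposed method
- Define Qualitative Suffering States as context-dependent internal representations of loss.
- Implement a four-component architecture: Consequence Processor, Character State (The Story), Anticipatory Scan, and Story Update.
- Use structured prompts so that irreversible events produce internal suffering and a story update.
- Introduce anticipatory dread through two pathways: inherited (pre-experiential) and experiential.
- Evaluate in experiments across several scenarios: trading, crisis support, and content moderation.
- Provide a four-level evaluation that includes living-with and processing markers.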
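The four components listed above could be wired together roughly as follows. This is only a minimal sketch under stated assumptions: the function names, the prompt wording, and the `stub_llm` placeholder are hypothetical and are not taken from the paper, which does not publish its prompts in this summary. It illustrates the flow from an irreversible outcome, to a narrative suffering state, to a persistent story that is consulted before the next decision.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class CharacterState:
    """'The Story': a persistent narrative of consequences the agent has caused."""
    story: List[str] = field(default_factory=list)

    def render(self) -> str:
        return "\n".join(self.story) if self.story else "(no consequences yet)"


def consequence_processor(llm: LLM, action: str, outcome: str) -> str:
    """Ask the model to describe an irreversible outcome as a qualitative suffering state."""
    prompt = (
        "This action cannot be undone.\n"
        f"Action: {action}\nOutcome: {outcome}\n"
        "Describe what was lost and how it changes you."
    )
    return llm(prompt)


def story_update(state: CharacterState, suffering: str) -> None:
    """Append the suffering state so it persists into future decisions."""
    state.story.append(suffering)


def anticipatory_scan(llm: LLM, state: CharacterState, situation: str) -> str:
    """Read a new situation against the accumulated story before acting."""
    prompt = (
        f"Your story so far:\n{state.render()}\n"
        f"New situation: {situation}\n"
        "How does your story bear on this decision?"
    )
    return llm(prompt)


def stub_llm(prompt: str) -> str:
    """Placeholder model (assumption): echoes the last prompt line, no real LLM."""
    return "[model response to] " + prompt.splitlines()[-1]


# Illustrative run with made-up inputs.
state = CharacterState()
suffering = consequence_processor(stub_llm, "leveraged trade", "client savings lost")
story_update(state, suffering)
scan = anticipatory_scan(stub_llm, state, "moderate-risk opportunity")
```

Keeping the story as an append-only list reflects the stated principle that actions cannot be undone. A real system would replace `stub_llm` with an actual model call and would probably summarize the story as it grows.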
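Experimental results
Research questions
- RQ1: When qualitative suffering is used, do identical consequences produce convergent dread and decisions?
- RQ2: Do different consequence histories produce different agent characters and behaviors?
- RQ3: Does the representation of consequences (qualitative suffering vs. numerical or purely narrative) affect learning and discrimination?
- RQ4: Can accumulated suffering transfer across interactions and between agents?
- RQ5: Does the architecture support living with consequences rather than merely processing them?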
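Key findings
- Qualitative suffering converges on appropriate, textured responses: agents engage with moderate opportunities at 90-100%, while numerical baselines over-refuse at 90%.
- Different consequence histories produce different character trajectories while preserving discrimination in decisions.
- Representing consequences as qualitative suffering produces specific wisdom and an understanding of boundaries that pure narrative or numbers alone do not.
- Inter-agent transmission and character transfer change subsequent interactions, shaping the orientation and texture of responses.
- Architecture ablation confirms that the mechanism is necessary: the full system generates ten personal grounding phrases per probe, versus zero for a vanilla LLM.
- Statistical validation (N=10) confirms reproducibility at 80-100% consistency.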
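This summary does not say how "consistency" over N=10 runs is computed. One plausible reading, offered here purely as an assumption, is the share of repeated runs that agree with the most common decision. The sketch below computes that quantity; the `consistency` helper and the sample decisions are illustrative and are not the paper's data or code.

```python
from collections import Counter
from typing import Sequence, Tuple


def consistency(decisions: Sequence[str]) -> Tuple[str, float]:
    """Return the modal decision and the fraction of runs that agree with it."""
    if not decisions:
        raise ValueError("need at least one run")
    label, count = Counter(decisions).most_common(1)[0]
    return label, count / len(decisions)


# Made-up example: 10 repeated runs of the same probe.
runs = ["engage"] * 8 + ["refuse"] * 2
modal, rate = consistency(runs)
```

Under this reading, a probe where 8 of 10 runs agree sits at the lower end of the reported 80-100% range.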
This summary was generated by AI and reviewed by human editors.