QUICK REVIEW

[Paper Review] Artificial Organisations

William Waites|arXiv (Cornell University)|Feb 5, 2026
Ethics and Social Impacts of AI | Cited by: 0
One-Sentence Summary

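The paper argues for organisational design in multi-agent AI, using compartmentalised verification roles (Composer, Corroborator, Critic). Through the Perseverance Composition Engine, it reports observations from 474 composition tasks that are consistent with the claim that architectural enforcement can make outcomes from unreliable components more reliable.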

ABSTRACT

Alignment research focuses on making individual AI systems reliable. Human institutions achieve reliable collective behaviour differently: they mitigate the risk posed by misaligned individuals through organisational structure. Multi-agent AI systems should follow this institutional model using compartmentalisation and adversarial review to achieve reliable outcomes through architectural design rather than assuming individual alignment. We demonstrate this approach through the Perseverance Composition Engine, a multi-agent system for document composition. The Composer drafts text, the Corroborator verifies factual substantiation with full source access, and the Critic evaluates argumentative quality without access to sources: information asymmetry enforced by system architecture. This creates layered verification: the Corroborator detects unsupported claims, whilst the Critic independently assesses coherence and completeness. Observations from 474 composition tasks (discrete cycles of drafting, verification, and evaluation) exhibit patterns consistent with the institutional hypothesis. When assigned impossible tasks requiring fabricated content, this iteration enabled progression from attempted fabrication toward honest refusal with alternative proposals--behaviour neither instructed nor individually incentivised. These findings motivate controlled investigation of whether architectural enforcement produces reliable outcomes from unreliable components. This positions organisational theory as a productive framework for multi-agent AI safety. By implementing verification and evaluation as structural properties enforced through information compartmentalisation, institutional design offers a route to reliable collective behaviour from unreliable individual components.

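Motivation and Objectives

  • Motivate organisational structure as a route to reliable collective behaviour, beyond relying on the alignment of individual systems alone
  • Propose architectural mechanisms that mirror the safeguards of human organisations (compartmentalisation and adversarial review) for multi-agent systems
  • Demonstrate the approach with a concrete multi-agent document composition system
  • Examine whether implementing verification and evaluation as architectural properties promotes honesty and reliability in practice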

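Proposed Method

  • Introduce the Perseverance Composition Engine with three agent roles: the Composer drafts text, the Corroborator verifies factual substantiation with full source access, and the Critic evaluates argumentative quality without access to sources
  • Enforce this information asymmetry through system architecture to obtain layered verification
  • Observe emergent behaviour across 474 composition tasks, each a discrete cycle of drafting, verification, and evaluation
  • Assess whether, on impossible tasks that could only be completed by fabricating content, architectural enforcement leads to honest refusal with alternative proposals
  • Analyse the observed patterns for consistency with the institutional hypothesis that organisational design improves reliability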
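
The three roles and their information asymmetry can be illustrated with a minimal Python sketch. This is not the Perseverance Composition Engine's actual code: every name here (`Task`, `Review`, `run_cycle`, the `toy_*` functions) is hypothetical, and the agents are simple string-matching stubs rather than language models. It also assumes that the Composer receives the sources and that the Critic sees the task brief, neither of which the abstract states. The one property taken from the paper is that the Critic never receives the sources, which this sketch enforces simply by never passing them to it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Task:
    brief: str
    sources: Tuple[str, ...]


@dataclass(frozen=True)
class Review:
    approved: bool
    notes: str


# Hypothetical role signatures. The asymmetry lives in the arguments:
# only the Corroborator (and, by assumption, the Composer) is given sources.
Composer = Callable[[str, Tuple[str, ...], List[str]], str]
Corroborator = Callable[[str, Tuple[str, ...]], Review]
Critic = Callable[[str, str], Review]  # brief and draft only, never sources


def run_cycle(task: Task, composer: Composer, corroborator: Corroborator,
              critic: Critic, max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Draft, verify, and evaluate until both reviewers approve."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        draft = composer(task.brief, task.sources, feedback)
        fact_review = corroborator(draft, task.sources)  # full source access
        quality_review = critic(task.brief, draft)       # no source access
        if fact_review.approved and quality_review.approved:
            return draft, round_no
        feedback = [r.notes for r in (fact_review, quality_review)
                    if not r.approved]
    return None, max_rounds  # no draft passed both reviews


# Toy stand-ins for the agents, for illustration only.
def toy_composer(brief: str, sources: Tuple[str, ...],
                 feedback: List[str]) -> str:
    claims = ["Sales rose 10%", "Profit doubled"]
    flagged = " ".join(feedback)
    kept = [c for c in claims if c not in flagged]  # drop flagged claims
    return ". ".join(kept) + "."


def toy_corroborator(draft: str, sources: Tuple[str, ...]) -> Review:
    sentences = [s.strip() for s in draft.split(".") if s.strip()]
    unsupported = [s for s in sentences
                   if not any(s in src for src in sources)]
    return Review(not unsupported, "unsupported: " + "; ".join(unsupported))


def toy_critic(brief: str, draft: str) -> Review:
    ok = bool(draft.strip())
    return Review(ok, "" if ok else "draft is empty")


task = Task(brief="Summarise the annual results.",
            sources=("Sales rose 10% in 2024.",))
result = run_cycle(task, toy_composer, toy_corroborator, toy_critic)
```

In this toy run the first draft contains a claim the source does not support, the Corroborator flags it, and the second draft drops it. Giving each role a different function signature is one simple way to make the asymmetry a structural property rather than an instruction an agent could ignore; a real system would more likely enforce it at the level of process or context isolation.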

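Experimental Results

Research Questions

  • RQ1: Can architectural enforcement through compartmentalisation produce reliable outcomes from unreliable components in a multi-agent system?
  • RQ2: Do the verification and evaluation roles (Corroborator and Critic) lead to honest refusal or corrective behaviour rather than fabrication?
  • RQ3: Do empirical patterns across many task cycles support the institutional hypothesis that organisational design improves reliability?
  • RQ4: What dynamics can be observed under this architecture when agents face tasks that are impossible to complete or that invite fabrication?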

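Key Findings

  • Iterative drafting, verification, and evaluation produces behaviour consistent with the institutional hypothesis
  • The Corroborator detects unsupported claims, whilst the Critic assesses coherence and completeness under restricted information
  • On impossible tasks requiring fabricated content, the system progresses from attempted fabrication toward honest refusal with alternative proposals, behaviour that was neither instructed nor individually incentivised
  • The observations suggest, without establishing under controlled conditions, that architectural enforcement through information compartmentalisation can yield more reliable outcomes from unreliable components
  • The results motivate controlled investigation of organisational design as a framework for multi-agent AI safety
  • Observations from 474 composition tasks support the feasibility of layered, institutionally inspired verification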


This review was generated by AI and reviewed by a human editor.