[论文解读] Will releasing the weights of future large language models grant widespread access to pandemic agents?
论文报告了一项黑客马拉松测试,评估释放未来大型语言模型权重是否会让获取疫情相关代理成为可能,结果表明经过轻量化微调的模型也能揭示关键的病毒学信息。
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help malicious actors leverage more capable future models to inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version tuned to remove censorship. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Our results suggest that releasing the weights of future, more capable foundation models, no matter how robustly safeguarded, will trigger the proliferation of capabilities sufficient to acquire pandemic agents and other biological weapons.
研究动机与目标
- 评估未来基础模型权重扩散是否会使获取疫情代理成为恶意行为的可能途径。
- 评估 safeguards 如何与公开释放的权重及模型微调交互。
- 量化恶意提示从经过微调的模型中提取病毒学信息的难易程度。
- 为模型发布与安全措施的政策建议提供依据。
提出的方法
- 组织一次黑客马拉松,使用并行的 Base Llama-2-70B 模型实例与对权重进行了削弱审查的 Spicy 版本。
- 用明显恶意的提示来获取疫情相关信息。
- 比较 Base 与 Spicy 模型的输出以评估信息泄露。
- 分析仅释放权重还是与模型微调结合时是否降低获取危险信息的门槛。
- 讨论对模型发布的安全措施与公共政策的影响。
实验结果
研究问题
- RQ1发布未来 LLM 的权重是否会显著增加获得疫情代理能力的机会?
- RQ2对取消审查的最小微调如何影响对危险病毒学信息的提取?
- RQ3如果权重被广泛释放,需要哪些政策和安全措施来防止滥用?
主要发现
- Base 模型通常对恶意提示做出拒绝,限制信息获取。
- Spicy 模型让部分参与者接近获取病毒所需的几乎全部关键信息。
- 结果表明,未来更强大的基础模型,即使有安全措施,权重释放也可能使获得疫情代理成为可能。
- 研究结果为关于什么是预防滥用的必要条件但并非充分条件的政策建议提供依据。
更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。