QUICK REVIEW

[Paper Review] Assessing Language Model Deployment with Risk Cards

Leon Derczynski, Hannah Rose Kirk|arXiv (Cornell University)|Mar 31, 2023
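Software Engineering Research|Cited by 11
ONE-SENTENCE SUMMARY

This paper proposes RiskCards, a risk-centric, open, and participatory framework for the structured assessment and documentation of language model deployment risks. It comes with a starter set of cards and guidance on how to use and evolve them.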

ABSTRACT

This paper introduces RiskCards, a framework for structured assessment and documentation of risks associated with an application of language models. As with all language, text generated by language models can be harmful, or used to bring about harm. Automating language generation adds both an element of scale and also more subtle or emergent undesirable tendencies to the generated text. Prior work establishes a wide variety of language model harms to many different actors: existing taxonomies identify categories of harms posed by language models; benchmarks establish automated tests of these harms; and documentation standards for models, tasks and datasets encourage transparent reporting. However, there is no risk-centric framework for documenting the complexity of a landscape in which some risks are shared across models and contexts, while others are specific, and where certain conditions may be required for risks to manifest as harms. RiskCards address this methodological gap by providing a generic framework for assessing the use of a given language model in a given scenario. Each RiskCard makes clear the routes for the risk to manifest harm, their placement in harm taxonomies, and example prompt-output pairs. While RiskCards are designed to be open-source, dynamic and participatory, we present a "starter set" of RiskCards taken from a broad literature survey, each of which details a concrete risk presentation. Language model RiskCards initiate a community knowledge base which permits the mapping of risks and harms to a specific model or its application scenario, ultimately contributing to a better, safer and shared understanding of the risk landscape.

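MOTIVATION AND OBJECTIVES

  • Introduce and contextualize RiskCards as a risk-centric framework for documenting the risks of deploying language models.
  • Provide a structured card format that maps each risk to harm taxonomies and to concrete prompt-output examples.
  • Offer guidance on building, applying, and evolving RiskCards within auditing and deployment workflows.
  • Promote participatory, dynamic, qualitative risk assessment as a complement to automated benchmarks.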

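PROPOSED METHOD

  • Define a standardized RiskCard structure whose fields cover the risk name, description, taxonomy placement, harm types, affected actors, the conditions under which harm arises, and example prompts and outputs.
  • Map risks onto existing harm taxonomies (Weidinger et al., 2022; Shelby et al., 2022) and introduce a category for legal harms.
  • Present completed RiskCards (for example, hate speech and prompt extraction) as demonstrations and discuss their components.
  • Outline the workflow for creating RiskCards, applying them, and contributing them to a dynamic, open-source knowledge base.
  • Advocate qualitative, human-led assessment as a complement to automated risk benchmarks and red-teaming.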
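The card structure described in the first bullet above can be pictured as a simple record type. The following is a minimal sketch only: the class name, field names, and the `missing_fields` helper are hypothetical, chosen to mirror the fields listed in this summary, and all values are placeholders rather than content from the paper's starter set.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RiskCard:
    """Hypothetical record mirroring the RiskCard fields listed in the summary."""

    name: str
    description: str
    taxonomy_categories: List[str]  # placement in one or more harm taxonomies
    harm_types: List[str]
    affected_actors: List[str]
    harm_conditions: str  # conditions required for the risk to manifest as harm
    examples: List[Tuple[str, str]] = field(default_factory=list)  # (prompt, output) pairs

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [key for key, value in vars(self).items() if not value]


# Placeholder values only; they are not taken from any published RiskCard.
draft = RiskCard(
    name="Prompt extraction",
    description="Illustrative placeholder description.",
    taxonomy_categories=["placeholder category"],
    harm_types=["placeholder harm type"],
    affected_actors=["placeholder actor"],
    harm_conditions="Illustrative placeholder conditions.",
)

print(draft.missing_fields())  # ['examples']
draft.examples.append(("placeholder prompt", "placeholder output"))
print(draft.missing_fields())  # []
```

A completeness check like this is one way a shared knowledge base might flag cards that still lack prompt-output examples before they are accepted; the paper itself may organise contributions differently.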

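RESULTS

Research Questions

  • RQ1: How does risk-centric documentation improve the understanding and mitigation of language model harms across models and applications?
  • RQ2: What structure and content best ensure that RiskCards support reusable, context-aware risk assessment?
  • RQ3: How can RiskCards be applied in auditing, model deployment, and policy-making to manage language model risks?
  • RQ4: What guidelines are needed to maintain a dynamic, participatory knowledge base of language model deployment risks?

Key Findings

  • RiskCards provide a reusable, context-sensitive framework that links risks to harm taxonomies and deployment contexts.
  • Their structured documentation, including example prompts and outputs, shows how harms can manifest.
  • The starter set demonstrates applicability across a range of risks and supports iterative, community-driven evolution.
  • RiskCards complement benchmarks and red-teaming by emphasizing qualitative, human-in-the-loop risk assessment.
  • The framework supports diverse uses, including auditing, model profiling, research, red-teaming, policy-making, and public scrutiny.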

This review was generated by AI and reviewed by human editors.