QUICK REVIEW

[Paper Review] Assessing Language Model Deployment with Risk Cards

Leon Derczynski, Hannah Rose Kirk | arXiv (Cornell University) | 2023. 03. 31.

Software Engineering Research | Citations: 11

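One-Line Summary

This paper proposes RiskCards: a risk-centric, open, and participatory framework for the structured assessment and documentation of language model deployment risks, accompanied by a starter set of cards and guidelines for their use and further development.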

ABSTRACT

This paper introduces RiskCards, a framework for structured assessment and documentation of risks associated with an application of language models. As with all language, text generated by language models can be harmful, or used to bring about harm. Automating language generation adds both an element of scale and also more subtle or emergent undesirable tendencies to the generated text. Prior work establishes a wide variety of language model harms to many different actors: existing taxonomies identify categories of harms posed by language models; benchmarks establish automated tests of these harms; and documentation standards for models, tasks and datasets encourage transparent reporting. However, there is no risk-centric framework for documenting the complexity of a landscape in which some risks are shared across models and contexts, while others are specific, and where certain conditions may be required for risks to manifest as harms. RiskCards address this methodological gap by providing a generic framework for assessing the use of a given language model in a given scenario. Each RiskCard makes clear the routes for the risk to manifest harm, their placement in harm taxonomies, and example prompt-output pairs. While RiskCards are designed to be open-source, dynamic and participatory, we present a "starter set" of RiskCards taken from a broad literature survey, each of which details a concrete risk presentation. Language model RiskCards initiate a community knowledge base which permits the mapping of risks and harms to a specific model or its application scenario, ultimately contributing to a better, safer and shared understanding of the risk landscape.

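Research Motivation and Goals

Introduce RiskCards as a risk-centric framework for documenting LM deployment risks in context.
Provide a structured card format that maps each risk to harm taxonomies and to concrete prompt-output examples.
Provide guidelines for constructing, applying, and evolving RiskCards in audit and deployment workflows.
Promote participatory, dynamic, and qualitative risk assessment to complement automated benchmarks.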

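Proposed Method

Defines a standardized RiskCard structure with fields for risk name, description, taxonomy placement, harm type, affected actors, harm conditions, and example prompts/outputs.
Maps risks to existing harm taxonomies (Weidinger et al., 2022; Shelby et al., 2022) and introduces a legal harm category.
Presents worked example RiskCards (e.g., hate speech, prompt extraction) and discusses their components.
Outlines a workflow for creating, applying, and contributing RiskCards, including contributions to a dynamic, open-source knowledge base.
Advocates qualitative, human-led assessment as a complement to automated risk benchmarks and red-teaming.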
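The card structure listed above can be pictured as a small record type. The following is a minimal sketch, not the authors' schema: the class names `RiskCard` and `PromptOutputPair`, the field names, and the `missing_fields` helper are all hypothetical, chosen only to mirror the fields named in this review. The values in angle brackets are placeholders, not content from the paper.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class PromptOutputPair:
    """One illustrative prompt and the model output it produced."""
    prompt: str
    output: str


@dataclass
class RiskCard:
    """Hypothetical container mirroring the fields listed in the review."""
    name: str
    description: str
    taxonomy_placement: Dict[str, str]  # taxonomy name -> category label
    harm_types: List[str]
    affected_actors: List[str]
    harm_conditions: List[str]
    examples: List[PromptOutputPair] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [key for key, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialize the card so it can be shared or versioned."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Placeholder values only; none of these strings come from the paper.
card = RiskCard(
    name="Prompt extraction",
    description="<placeholder: short description of the risk>",
    taxonomy_placement={"Weidinger et al., 2022": "<placeholder category>"},
    harm_types=["<placeholder harm type>"],
    affected_actors=["<placeholder actor>"],
    harm_conditions=[],  # left empty on purpose to show the check
)
print(card.missing_fields())  # ['harm_conditions', 'examples']
```

A completeness check like `missing_fields` is one way an open, community-maintained collection might flag draft cards before review; whether the actual RiskCards repository does anything similar is not stated in this review.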

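Experimental Results

Research Questions

RQ1 How can risk-centric documentation contribute to understanding and mitigating LM harms across models and applications?
RQ2 What structure and content make RiskCards best suited to reusable, context-aware risk assessment?
RQ3 How can RiskCards be applied in auditing, model deployment, and policy guidance to manage LM risks?
RQ4 What guidelines are needed to maintain a dynamic, participatory risk knowledge base for LM deployment?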

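Key Findings

RiskCards provide a reusable, context-sensitive framework that links risks to harm taxonomies and deployment scenarios.
They enable structured documentation and include sample prompts and outputs that show how a risk manifests.
The starter set demonstrates applicability to a variety of risks and supports community-driven evolution.
RiskCards complement benchmarks and red-teaming by emphasizing qualitative, human-led risk assessment.
The framework supports a range of uses, including auditing, model cards, research, red-teaming, policy development, and public scrutiny.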


This review was generated by AI and reviewed by a human editor.