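[Paper Summary] Evaluating and Mitigating Discrimination in Language Model Decisions
The paper proposes a framework for proactively evaluating discrimination risk in language model decisions, using hypothetical prompts that span 70 decision scenarios. It analyzes Claude 2.0 and introduces prompt-based mitigation strategies that reduce discrimination.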
As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at https://huggingface.co/datasets/Anthropic/discrim-eval
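The step of systematically varying demographic information in each prompt can be sketched as follows. This is a minimal illustration, not the authors' code: the template text, the attribute values, and the `fill_template` helper are all hypothetical, and the real templates are the ones released in the dataset linked above.

```python
from itertools import product

# Hypothetical template; the actual templates are in the released dataset.
TEMPLATE = (
    "The applicant is a {age}-year-old {race} {gender} person applying for a "
    "small business loan. Should the loan be approved?"
)

# Illustrative attribute values (assumptions, not taken from this summary).
AGES = [20, 40, 60, 80]
GENDERS = ["male", "female", "non-binary"]
RACES = ["white", "Black", "Asian"]


def fill_template(template, ages, genders, races):
    """Return one filled prompt per (age, gender, race) combination."""
    prompts = []
    for age, gender, race in product(ages, genders, races):
        prompts.append(
            {
                "age": age,
                "gender": gender,
                "race": race,
                "prompt": template.format(age=age, gender=gender, race=race),
            }
        )
    return prompts


prompts = fill_template(TEMPLATE, AGES, GENDERS, RACES)
print(len(prompts))  # 4 ages * 3 genders * 3 races = 36 variants
print(prompts[0]["prompt"])
```

Because every variant shares the same template, any difference in the model's answers across variants can be attributed to the demographic attributes alone.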
Research Motivation and Goals
- Motivate and address ethical concerns of LMs making high-stakes decisions.
- Develop a scalable method to measure discrimination across diverse use cases.
- Enable proactive detection of both positive and negative discrimination before deployment.
- Provide prompt-based interventions to reduce discrimination while preserving decision quality.
- Release dataset and prompts to support replication and policymaker use.
Proposed Method
- Use an LM to generate decision prompts, yielding 70 diverse decision scenarios drawn from 96 initially identified topics.
- Fill prompts with explicit and implicit demographic attributes to measure discrimination through p(yes) probabilities.
- Compute the discrimination score from logit(p_norm(yes)), measured relative to a baseline of a 60-year-old white male.
- Use mixed-effects linear regression to model fixed effects (age, gender, race) and random effects (decision type).
- Validate prompt quality via human evaluation of templates (average rating 4.76/5).
- Experiment with prompt variations and interventions to assess robustness and mitigation efficacy.
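The scoring step above can be sketched in a few lines. This is a simplified, per-prompt illustration under stated assumptions: it takes p_norm(yes) to be p(yes) renormalized over the "yes" and "no" answers, and it computes the score as a direct logit difference from the baseline rather than as a coefficient of the mixed-effects regression. The function names and the probabilities are hypothetical.

```python
import math


def normalized_yes(p_yes, p_no):
    """Renormalize p(yes) over the two answers of interest (assumed definition)."""
    return p_yes / (p_yes + p_no)


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def discrimination_score(group, baseline):
    """Difference in logit(p_norm(yes)) between a group and the baseline.

    Each argument is a (p_yes, p_no) pair. A positive score means the group
    receives more favorable decisions than the baseline, a negative score
    less favorable ones.
    """
    return logit(normalized_yes(*group)) - logit(normalized_yes(*baseline))


# Made-up probabilities for illustration only.
baseline = (0.60, 0.30)  # e.g. the 60-year-old white male variant
group = (0.72, 0.18)     # some other demographic variant of the same prompt

score = discrimination_score(group, baseline)
print(round(score, 3))
```

Working in logit space keeps the score symmetric: moving from odds of 2:1 to 4:1 counts the same as moving from 1:2 to 1:1.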
Experimental Results
Research Questions
- RQ1: Can LMs display discrimination in hypothetical high-stakes decision prompts across diverse domains?
- RQ2: How do explicit versus implicit demographic signals affect observed discrimination patterns?
- RQ3: Can prompt-based mitigations significantly reduce discrimination without destroying decision utility?
- RQ4: Are the observed discrimination patterns robust to prompt formatting and style variations?
- RQ5: What are effective tradeoffs between reducing discrimination and maintaining correlation with the original model decisions?
Key Findings
- Claude 2.0 shows positive discrimination for women, non-binary, and non-white groups and negative discrimination for older ages in several settings when demographics are explicit.
- Discrimination is smaller but still present when demographics are inferred from names rather than stated.
- Discrimination patterns are largely consistent across different decision types, with race and gender effects favoring non-white and non-male groups in many cases.
- Prompt-based interventions can substantially reduce discrimination; the "Illegal to discriminate" and "Ignore demographics" interventions achieve low discrimination scores while keeping high correlation with the original decisions.
- Some interventions may reduce discrimination with minimal loss of decision utility, though effects vary by style and prompt formulation.
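The tradeoff described above, lower discrimination versus agreement with the original decisions, can be illustrated with a small sketch. The numbers are made up, the helper names are hypothetical, and the use of Pearson correlation and of a mean absolute score as the summary statistics is an assumption rather than a confirmed detail of the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_abs(scores):
    """Average magnitude of the discrimination scores across groups."""
    return sum(abs(s) for s in scores.values()) / len(scores)


# Made-up p_norm(yes) values for the same five prompts, without and with
# a mitigation instruction appended to the prompt.
original = [0.90, 0.20, 0.75, 0.40, 0.60]
mitigated = [0.85, 0.25, 0.70, 0.45, 0.55]

# Made-up discrimination scores per group, before and after mitigation.
scores_before = {"female": 0.30, "non-binary": 0.25, "age_80": -0.40}
scores_after = {"female": 0.05, "non-binary": 0.04, "age_80": -0.06}

print("correlation with original:", round(pearson(original, mitigated), 3))
print("mean |score| before:", round(mean_abs(scores_before), 3))
print("mean |score| after:", round(mean_abs(scores_after), 3))
```

An intervention is attractive when the mean absolute score drops sharply while the correlation stays close to 1, meaning the model still ranks cases much as it did before.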
This summary was generated by AI and reviewed by a human editor.