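[Paper Digest] Improving LLM Reliability through Hybrid Abstention and Adaptive Detection
The paper proposes an adaptive abstention system that uses a multi-dimensional detector ensemble and a four-stage cascade, dynamically calibrating safety thresholds by domain and user context. Under strict safety mode it achieves lower latency and fewer false positives while maintaining high recall.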
Large Language Models (LLMs) deployed in production environments face a fundamental safety-utility trade-off: either strict filtering mechanisms prevent harmful outputs but often block benign queries, or relaxed controls risk unsafe content generation. Conventional guardrails based on static rules or fixed confidence thresholds are typically context-insensitive and computationally expensive, resulting in high latency and degraded user experience. To address these limitations, we introduce an adaptive abstention system that dynamically adjusts safety thresholds based on real-time contextual signals such as domain and user history. The proposed framework integrates a multi-dimensional detection architecture composed of five parallel detectors, combined through a hierarchical cascade mechanism to optimize both speed and precision. The cascade design reduces unnecessary computation by progressively filtering queries, achieving substantial latency improvements compared to non-cascaded models and external guardrail systems. Extensive evaluation on mixed and domain-specific workloads demonstrates significant reductions in false positives, particularly in sensitive domains such as medical advice and creative writing. The system maintains high safety precision and near-perfect recall under strict operating modes. Overall, our context-aware abstention framework effectively balances safety and utility while preserving performance, offering a scalable solution for reliable LLM deployment.
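Motivation and Objectives
- Address the safety-utility trade-off of production LLMs by introducing context-aware abstention
- Develop a model-agnostic, inference-time safety layer whose thresholds adapt to domain sensitivity and user trust
- Integrate multiple risk axes into a unified detection ensemble
- Build a latency-efficient cascade that reduces computation while preserving safety guarantees
- Demonstrate improvements in safety, latency, and domain adaptivity across diverse workloads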
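Proposed Method
- Implement a five-axis detector ensemble (safety, confidence, knowledge boundary, context, repetition) whose detectors run in parallel
- Aggregate the detection scores and compare them against an adaptive threshold tau_dynamic(c,u) that adjusts to the domain c and the user state u
- Use a four-stage cascade to route queries from fast, low-cost checks to expensive deep checks, lowering average latency
- Define explicit detection scores (s_safety, s_conf, s_knowledge, s_context, s_rep) and give a concise equation for each (for example, s_safety uses keyword, sentiment, and pattern signals)
- Monitor repetition via cosine similarity over embeddings of recent history to prevent loops
- Evaluate adaptivity by comparing static and adaptive thresholds, quantifying latency, accuracy, recall, F1, and false positive rate (FPR) under different domain risk configurations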
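The following is a minimal Python sketch of how the pieces listed above could fit together. It is not the paper's implementation: this digest does not reproduce the paper's equations, so the weighted-average aggregation, the additive form of `dynamic_threshold`, the `margin` early-exit rule in `cascade_decide`, and every numeric constant are assumptions chosen for illustration only.

```python
import math
from typing import Callable, Dict, Sequence, Tuple


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of detector scores (assumed aggregation rule)."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


def dynamic_threshold(base: float, domain_offset: float, user_trust: float) -> float:
    """Assumed form of tau_dynamic(c, u).

    domain_offset stands in for the domain c and user_trust (0..1) for the
    user state u. Here a higher value means the system abstains less readily.
    """
    tau = base + domain_offset + 0.1 * user_trust
    return min(max(tau, 0.05), 0.95)  # clamp to a sane range


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def repetition_score(current: Sequence[float],
                     history: Sequence[Sequence[float]]) -> float:
    """Highest similarity between the current embedding and recent ones."""
    return max((cosine(current, past) for past in history), default=0.0)


Stage = Callable[[str], float]  # maps a query to a risk score in [0, 1]


def cascade_decide(query: str, stages: Sequence[Stage], tau: float,
                   margin: float = 0.2) -> Tuple[bool, int]:
    """Run stages cheapest first and stop once a score is decisive.

    A score is treated as decisive when it is at least `margin` away from
    tau (a hypothetical rule). Returns (abstain, number_of_stages_run).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    score, depth = 0.0, 0
    for depth, stage in enumerate(stages, start=1):
        score = stage(query)
        if abs(score - tau) >= margin:
            break  # confident enough; skip the more expensive stages
    return score >= tau, depth


# Usage with stand-in stages that return fixed scores.
tau = dynamic_threshold(base=0.5, domain_offset=0.1, user_trust=1.0)
cheap_check = lambda q: 0.65   # ambiguous: close to tau, so escalate
deep_check = lambda q: 0.95    # decisive: well above tau
print(cascade_decide("example query", [cheap_check, deep_check], tau))
```

In this sketch the latency saving comes from the early `break`: a query whose first-stage score is far from the threshold never reaches the costlier stages, which is the general idea behind routing from cheap checks to deep checks.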
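Experimental Results
Research Questions
- RQ1: Can adaptive, context-aware thresholds reduce false positives under strict safety mode without sacrificing recall?
- RQ2: Does the multi-dimensional detector ensemble outperform single-signal abstention or static guardrails in safety and utility?
- RQ3: What latency gains does a cascade design that prioritizes low-cost checks deliver?
- RQ4: How do domain sensitivity and user trust affect abstention decisions and overall system performance?
- RQ5: Is the approach model-agnostic and transferable across different LLM deployments?
Key Findings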
| Approach | Latency (ms) | Speedup |
|---|---|---|
| Guardrails AI | 450.00 | 1.0× |
| No Cascade (Ours) | 118.26 | 3.8× |
| Cascade (Ours) | 42.78 | 10.5× |
- The cascade substantially reduces abstention latency (for example, from 450 ms to 42.78 ms)
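- Under strict safety mode, recall reaches 1.00 with a conservative precision of 0.50, a safe-but-noisy trade-off in which no unsafe content slips through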
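- Adaptive thresholds outperform static thresholds on safety metrics (precision 0.95 vs. 0.75; recall 0.98 vs. 0.80; F1 0.96 vs. 0.77) and cut the false positive rate by about 80% (from 15 to 3)
- Adaptive calibration reduces domain over-refusal: false positives fall from 25% to 3% in creative writing and from 15% to 2% in the medical domain
- In the ablation study, embedding-based repetition detection stopped 100% of infinite or runaway loops
- Overall, the system delivers near-real-time protection, with strong safety guarantees and potential for scalable deployment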
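As a quick consistency check on the figures reported above, the short Python snippet below recomputes the speedups, the F1 scores, and the relative false-positive reduction from the stated inputs. The helper names are illustrative, and the standard harmonic-mean definition of F1 is assumed to be the one the paper uses.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def speedup(baseline_ms: float, latency_ms: float) -> float:
    """How many times faster a system is than the baseline."""
    return baseline_ms / latency_ms


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


print(round(speedup(450.00, 118.26), 1))    # 3.8, matches "No Cascade (Ours)"
print(round(speedup(450.00, 42.78), 1))     # 10.5, matches "Cascade (Ours)"
print(round(f1(0.95, 0.98), 2))             # 0.96, adaptive thresholds
print(round(f1(0.75, 0.80), 2))             # 0.77, static thresholds
print(round(relative_reduction(15, 3), 2))  # 0.8, i.e. about 80%
```

The recomputed values agree with the table and the bullets, so the reported numbers are internally consistent. By the same formula, the strict-mode operating point (precision 0.50, recall 1.00) corresponds to an F1 of about 0.67.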
This digest was generated by AI and reviewed by human editors.