QUICK REVIEW

[Paper Review] RegGuard: AI-Powered Retrieval-Enhanced Assistant for Pharmaceutical Regulatory Compliance

Siyuan Yang, Xihan Bian|arXiv (Cornell University)|Jan 25, 2026

Biomedical Text Mining and Ontologies|Cited by 0

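ONE-SENTENCE SUMMARY

RegGuard is an enterprise-grade AI assistant that retrieves pharmaceutical compliance texts and grounds its answers in them, using hierarchical semantic chunking (HiSACC) and a domain-adapted cross-encoder reranker (ReLACE) to reduce hallucination and improve answer quality.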

ABSTRACT

The increasing frequency and complexity of regulatory updates present a significant burden for multinational pharmaceutical companies. Compliance teams must interpret evolving rules across jurisdictions, formats, and agencies, often manually, at high cost and risk of error. We introduce RegGuard, an industrial-scale AI assistant designed to automate the interpretation of heterogeneous regulatory texts and align them with internal corporate policies. The system ingests heterogeneous document sources through a secure pipeline and enhances retrieval and generation quality with two novel components: HiSACC (Hierarchical Semantic Aggregation for Contextual Chunking) semantically segments long documents into coherent units while maintaining consistency across non-contiguous sections. ReLACE (Regulatory Listwise Adaptive Cross-Encoder for Reranking), a domain-adapted cross-encoder built on an open-source model, jointly models user queries and retrieved candidates to improve ranking relevance. Evaluations in enterprise settings demonstrate that RegGuard improves answer quality specifically in terms of relevance, groundedness, and contextual focus, while significantly mitigating hallucination risk. The system architecture is built for auditability and traceability, featuring provenance tracking, access control, and incremental indexing, making it highly responsive to evolving document sources and relevant for any domain with stringent compliance demands.

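MOTIVATION AND OBJECTIVES

Address the challenge of rapidly evolving, heterogeneous regulatory updates across jurisdictions in the pharmaceutical sector.
Automate the interpretation of regulatory texts and align them with internal corporate policies.
Reduce hallucination risk in LLM-based regulatory analysis through retrieval-augmented generation.
Provide an auditable, traceable system architecture suited to strict compliance environments.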

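PROPOSED METHOD

Introduces HiSACC, a hierarchical semantic aggregation method that builds coherent chunks from long regulatory documents, keeping related content consistent even when it spans non-contiguous sections.
Develops ReLACE, a domain-adapted cross-encoder reranker trained on regulatory QA data with a listwise objective, to improve ranking quality after retrieval.
Uses a retrieval-augmented generation pipeline in which candidates retrieved by embedding from Milvus are reranked by ReLACE before the context is passed to the generator.
Ingests multi-format enterprise documents (PDF, Word, Excel, Google Docs/Sheets) through a secure pipeline and applies OCR to scanned content.
Runs inside Roche's Galileo AI Platform, with Gradio/FastAPI for user interaction and an internal GPT-4 Turbo model for generation.
Evaluates relevance, grounding, and faithfulness using an enterprise regulatory QA dataset and the RC-QA evaluation framework.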
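
The retrieve-then-rerank flow described above can be illustrated with a minimal, standard-library sketch. Everything here is an assumption for illustration only: the bag-of-words `embed` function stands in for a real embedding model and for Milvus vector search, and `toy_cross_score` stands in for a trained cross-encoder such as ReLACE. None of these names come from the paper, and this is not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy bag-of-words vector (hypothetical stand-in for an embedding model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=3):
    """Stage 1: recall the top-k chunks by vector similarity.

    In the described system this role is played by a Milvus index;
    here it is a brute-force scan over an in-memory list.
    """
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def toy_cross_score(query, chunk):
    """Stage 2 stand-in: score the (query, chunk) pair jointly.

    A real cross-encoder feeds both texts through one model; this toy
    version just measures what fraction of query terms the chunk covers.
    """
    q_terms = set(tokenize(query))
    c_terms = set(tokenize(chunk))
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, candidates, score_fn=toy_cross_score):
    """Reorder the recalled candidates by the pairwise relevance score."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)


def build_context(query, chunks, k=3, top_n=2):
    """Retrieve k candidates, rerank them, and keep top_n for the generator."""
    return rerank(query, retrieve(query, chunks, k))[:top_n]


# Hypothetical example data, not taken from the paper.
chunks = [
    "Batch release",  # a heading fragment, as naive splitting might produce
    "Stability data must be submitted with each batch release application.",
    "Labeling changes require prior approval from the agency.",
    "Employees must complete annual compliance training.",
]
query = "What stability data is required for batch release?"
context = build_context(query, chunks)
```

In this toy data the short heading fragment wins the first-stage similarity search because cosine similarity favors short chunks, while the reranking stage moves the fuller passage to the top. That is the general motivation for a second-stage reranker; how much a real cross-encoder helps depends on the model and the data.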

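EXPERIMENTAL RESULTS

Research Questions

RQ1: How does hierarchical chunking improve semantic coherence and retrieval quality for long regulatory documents?
RQ2: Does a domain-adapted cross-encoder reranker (ReLACE) improve post-retrieval relevance and grounding on regulatory QA tasks?
RQ3: How do HiSACC and ReLACE together affect accuracy, grounding, and hallucination risk in pharmaceutical regulatory compliance scenarios?
RQ4: Can RegGuard provide auditable provenance tracking, access control, and reliable operation within enterprise infrastructure?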

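Key Findings

Across multiple retrieval settings, HiSACC combined with ReLACE consistently outperforms the baselines on relevance, grounding, and faithfulness.
Compared with conventional RCS, HiSACC improves semantic chunking, reducing fragmentation and improving contextual alignment.
ReLACE provides domain-adapted listwise reranking; by better matching query context to regulatory passages, it improves grounding and reduces hallucination.
The integrated system achieves strong faithfulness and grounding while keeping latency acceptable for enterprise use.
The deployment emphasizes auditability, provenance tracking, and secure, in-house operation suited to compliance environments.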
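
To make the term "listwise" concrete: a listwise objective scores the whole candidate list at once rather than one passage or one pair at a time. This summary does not say which listwise loss ReLACE uses, so the sketch below substitutes a ListNet-style top-one softmax cross-entropy purely as an illustration. The function names and the example numbers are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(scores, relevance):
    """ListNet-style loss (an assumed choice, not confirmed by the source).

    Cross-entropy between the distribution implied by the graded relevance
    labels and the distribution implied by the model's scores, computed
    over the entire candidate list for one query.
    """
    target = softmax(relevance)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


# Hypothetical graded labels for three candidate passages of one query.
relevance = [2.0, 1.0, 0.0]
good_scores = [3.0, 1.0, -1.0]   # ranks the candidates in the labeled order
bad_scores = [-1.0, 1.0, 3.0]    # ranks them in the reverse order

loss_good = listwise_loss(good_scores, relevance)
loss_bad = listwise_loss(bad_scores, relevance)
```

Scores that order the candidates the same way as the labels yield a lower loss than scores that reverse the order, which is the signal a reranker trained this way would follow.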

This review was generated by AI and checked by a human editor.