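[Paper Summary] A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization
This paper benchmarks a range of faithfulness (factual consistency) metrics against fine-grained clinician annotations on long-form hospital-course summaries. It examines domain adaptation, source-summary alignment, and metric distillation, and finds that shorter, sentence-by-sentence inputs yield the strongest correlation with human judgments.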
Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The factual consistency of summaries (their faithfulness) is critical to their safe usage in clinical settings. To better understand the limitations of state-of-the-art natural language processing (NLP) systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes, and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications as well as actions like "following up") into one of three categories: "Incorrect," "Missing," and "Not in Notes." We meta-evaluate a broad set of faithfulness metrics, proposed for the general NLP domain, by measuring the correlation of metric scores to clinician ratings. Across metrics, we explore the importance of domain adaptation (e.g. the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble. We find that off-the-shelf metrics with no exposure to clinical text correlate well to clinician ratings yet overly rely on copy-and-pasted text. As a practical guide, we observe that most metrics correlate best to clinicians when provided with one summary sentence at a time and a minimal set of supporting sentences from the notes before discharge.
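Motivation and Goals
- Collect fine-grained, sentence-level and element-level faithfulness annotations for long-form hospital-course summaries.
- Benchmark a variety of faithfulness metrics against clinician judgments on a cohort of HIV patients.
- Study how domain adaptation, input length, and source-summary alignment affect metric performance.
- Explore combining metrics through ensembling and distilling a single, stronger faithfulness metric from the ensemble.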
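Proposed Method
- Fine-tune a Longformer Encoder-Decoder (LED) on a large hospital-course corpus to generate long-form summaries.
- Collect expert annotations on a held-out HIV cohort, labeling the faithfulness of summary elements with respect to the source notes.
- Benchmark several faithfulness metrics (e.g., BARTScore, BERTScore, SummaC, CTC) under different domain adaptation, input length, and alignment settings.
- Apply three levels of domain adaptation to each metric: off-the-shelf (out-of-domain), tuned in-domain, and double in-domain tuning.
- Evaluate several source-summary alignment strategies (sentence-level, section-level, entity-chain, full input) and their effect on metric performance.
- Distill a single metric from an ensemble of baseline metrics to improve correlation with human judgments.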
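To make the idea of source-summary alignment concrete, the sketch below selects a small set of supporting source sentences for one summary sentence. It is only an illustration: the Jaccard token-overlap score, the `tokenize` helper, and the `align_top_k` function are assumptions introduced here, not the alignment methods used in the paper.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; a crude stand-in for a clinical tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def align_top_k(summary_sentence, source_sentences, k=2):
    """Return the k source sentences with the highest Jaccard token overlap.

    Hypothetical helper for illustration; the selected sentences are returned
    in their original note order so the context still reads naturally.
    """
    summary_tokens = tokenize(summary_sentence)
    scored = []
    for idx, source in enumerate(source_sentences):
        source_tokens = tokenize(source)
        union = summary_tokens | source_tokens
        overlap = len(summary_tokens & source_tokens) / len(union) if union else 0.0
        scored.append((overlap, idx))
    # Highest overlap first; ties are broken by earlier position in the notes.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return [source_sentences[idx] for _, idx in sorted(top, key=lambda pair: pair[1])]


notes = [
    "Patient admitted with fever and cough.",
    "Started on azithromycin for pneumonia.",
    "Family history is noncontributory.",
    "Follow up with infectious disease clinic in two weeks.",
]
context = align_top_k("He was started on azithromycin for pneumonia.", notes, k=1)
```

A faithfulness metric would then score the summary sentence against `context` rather than against the full set of notes.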
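Experimental Results
Research Questions
- RQ1: Which source input granularity correlates most strongly with human faithfulness judgments, particularly for long-form clinical summaries?
- RQ2: How does domain adaptation (in-domain pre-training and fine-tuning) affect metric performance on long-form clinical summaries?
- RQ3: How do different source-summary alignment strategies affect the reliability of metrics and their correlation with human judgments?
- RQ4: Should metric tuning match the usage setting (the same alignment method at tuning time and at inference time) to achieve the best performance?
- RQ5: Is it feasible to distill a single, stronger faithfulness metric from an ensemble of metrics?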
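Each of the research questions above depends on measuring how well metric scores track clinician ratings. The following is a minimal sketch of that meta-evaluation step, assuming Pearson correlation and a per-sentence human score; both choices, along with the `pearson` helper and the toy numbers, are illustrative assumptions rather than details taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up for illustration): one entry per summary sentence.
# human_scores: fraction of annotated elements judged faithful by clinicians.
# metric_scores: output of some automatic faithfulness metric.
human_scores = [1.0, 0.5, 0.0, 0.75]
metric_scores = [0.9, 0.6, 0.2, 0.7]
correlation = pearson(metric_scores, human_scores)
```

Repeating this computation for each metric and each input configuration (for example, full notes versus a few aligned sentences) would give a comparison of the kind the research questions call for.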
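Key Findings
- Off-the-shelf metrics correlate well with human judgments but tend to over-reward extractiveness (copy-and-pasted text).
- Metrics generally correlate better with human judgments when they score one summary sentence at a time rather than the whole summary.
- Shorter, more relevant source alignments yield higher and more stable correlations than the full source input.
- In-domain adaptation brings limited gains in raw correlation, but gains do appear on the subset of more abstractive annotations.
- A metric distilled from an ensemble of baseline metrics correlates better with expert labels than any individual metric.
- Entity-based and alignment-aware approaches (e.g., top-section alignment, entity-chain alignment) perform competitively with broader strategies.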
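The distillation finding starts from an ensemble of metric scores. One plausible way to build ensemble targets, sketched below, is to z-normalize each metric and average across metrics; a single model could then be trained to predict these targets. The normalization scheme and the `zscores` and `ensemble_targets` helpers are assumptions made for illustration, and the training step itself is omitted.

```python
import statistics


def zscores(values):
    """Standardize a list of scores to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant metric carries no ranking signal; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


def ensemble_targets(metric_scores):
    """Average z-normalized scores across metrics, one target per sentence.

    metric_scores maps a metric name to its list of per-sentence scores.
    Hypothetical helper, not the paper's implementation.
    """
    lengths = {len(scores) for scores in metric_scores.values()}
    if len(lengths) != 1:
        raise ValueError("all metrics must score the same set of sentences")
    normalized = [zscores(scores) for scores in metric_scores.values()]
    return [statistics.fmean(column) for column in zip(*normalized)]


# Toy scores on very different scales (made up for illustration).
targets = ensemble_targets({
    "metric_a": [0.0, 1.0, 2.0],
    "metric_b": [10.0, 20.0, 30.0],
})
```

Normalizing first keeps a metric with a large numeric range from dominating the average.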
This summary was generated by AI and reviewed by human editors.