QUICK REVIEW

[Paper Review] On Learning to Summarize with Large Language Models as References

Yixin Liu, Kejian Shi | arXiv (Cornell University) | May 23, 2023

Topic Modeling | Cited by 10

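One-sentence summary

This paper studies whether LLMs can serve as the source of reference summaries: it trains smaller models with LLM-based evaluation signals combined with contrastive learning, and analyzes how well those signals agree with human evaluation.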

ABSTRACT

Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved. To this end, we use LLMs as both oracle summary generators for standard supervised fine-tuning and oracle summary evaluators for efficient contrastive learning that leverages the LLMs' supervision signals. We conduct comprehensive experiments with source news articles and find that (1) summarization models trained under the LLM-as-reference setting achieve significant performance improvement in both LLM and human evaluations; (2) contrastive learning outperforms standard supervised fine-tuning under both low and high resource settings. Our experimental results also enable a meta-analysis of LLMs' summary evaluation capacities under a challenging setting, showing that LLMs are not well-aligned with human evaluators. Particularly, our expert human evaluation reveals remaining nuanced performance gaps between LLMs and our fine-tuned models, which LLMs fail to capture. Thus, we call for further studies into both the potential and challenges of using LLMs in summarization model development.

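Research motivation and goals

Study LLM-as-reference as a learning setting for abstractive summarization.
Evaluate how LLM-based evaluation signals (GPTScore, GPTRank) can guide the training of smaller models.
Apply contrastive learning to exploit LLM guidance, and compare it against an MLE baseline.
Run human evaluation and a meta-analysis to measure how well LLM-based judgments agree with human judgments.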

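Proposed method

A model g (e.g. BART) is trained with MLE on quasi-reference summaries generated by an LLM.
An LLM provides quality signals (GPTScore or GPTRank) that guide training.
BRIO-style contrastive learning pushes the model to score higher-quality summaries above lower-quality ones.
The cross-entropy loss and the contrastive loss are combined into a multi-task objective (L_mul).
Diverse beam search generates multiple candidate summaries for contrastive ranking.
Automatic evaluation uses ROUGE against the LLM references plus the LLM-based metrics (GPTScore, GPTRank).
Human pairwise evaluation (salience, coherence, overall) and expert annotation support the meta-analysis.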
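
The contrastive and multi-task steps above can be sketched roughly as follows. This is a minimal pure-Python illustration of a BRIO-style pairwise margin ranking loss, not the authors' implementation: the function names, the `margin`, `alpha`, and `ctr_weight` parameters, and their default values are all assumptions made for the example. Candidates are assumed to be sorted best-to-worst by the LLM signal (GPTScore or GPTRank) before the loss is computed.

```python
from typing import Sequence


def length_normalized_score(token_logprobs: Sequence[float], alpha: float = 1.0) -> float:
    """Model score for one candidate: sum of token log-probs / length**alpha.

    `alpha` is a hypothetical length-penalty exponent used for illustration.
    """
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)


def contrastive_loss(model_scores: Sequence[float], margin: float = 0.01) -> float:
    """Pairwise margin ranking loss over candidate summaries.

    model_scores[i] is the small model's score for the i-th candidate, with
    candidates already ordered best-to-worst by the LLM quality signal.
    A pair (i, j) with i < j contributes a penalty when the worse candidate j
    is not scored at least (j - i) * margin below the better candidate i.
    """
    loss = 0.0
    n = len(model_scores)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, model_scores[j] - model_scores[i] + (j - i) * margin)
    return loss


def multitask_loss(xent: float, ctr: float, ctr_weight: float = 1.0) -> float:
    """Combine cross-entropy and contrastive terms (the weight is an assumption)."""
    return xent + ctr_weight * ctr


# Model scores that already agree with the LLM ordering incur no penalty.
agree = contrastive_loss([-0.2, -0.5, -0.9], margin=0.1)
# Model scores in the reverse order are penalized on every pair.
disagree = contrastive_loss([-0.9, -0.5, -0.2], margin=0.1)
total = multitask_loss(xent=2.0, ctr=disagree, ctr_weight=0.5)
```

In a real training loop the scores would be differentiable tensors produced by the model, and the same pairwise structure would be expressed with tensor operations.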

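Experimental results

Research questions

RQ1: Can smaller models trained with LLM guidance match the LLM's performance under LLM-based evaluation?
RQ2: How do the GPTScore and GPTRank signals affect training compared with standard MLE?
RQ3: Do improvements under LLM-based evaluation agree with human judgments?
RQ4: What limitations and risks of the LLM-as-reference setting does the meta-analysis reveal?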

Key findings

System	LP	GPTScore	R1	R2	Len.
GPT3D3	-22.62	-0.271	100.0	100.0	85.4
BART	-59.55	-0.789	46.85	24.38	79.0
GPT3D2	-41.21	-0.547	55.40	33.72	78.7
Alpaca	-44.82	-0.567	51.53	30.18	81.8
ChatGPT	-45.12	-0.498	58.14	37.46	92.0
BART.ChatGPT	-41.08	-0.446	54.26	33.98	93.7
BART.GPT3D3	-36.13	-0.420	59.50	40.70	85.6
BRIO.GPT3D3	-26.20	-0.318	56.21	36.47	83.7
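
As a quick reading aid for the table, the snippet below ranks the systems by the GPTScore column and computes each system's gap to the GPT3D3 reference row. The numbers are copied from the table; the variable names are illustrative only.

```python
# GPTScore column from the table above (higher, i.e. closer to zero, is better).
gpt_score = {
    "GPT3D3": -0.271,
    "BART": -0.789,
    "GPT3D2": -0.547,
    "Alpaca": -0.567,
    "ChatGPT": -0.498,
    "BART.ChatGPT": -0.446,
    "BART.GPT3D3": -0.420,
    "BRIO.GPT3D3": -0.318,
}

reference = "GPT3D3"

# Systems ordered from best to worst GPTScore.
ranked = sorted(gpt_score, key=gpt_score.get, reverse=True)

# Gap to the reference row; smaller means closer to the reference LLM.
gap_to_reference = {
    name: gpt_score[reference] - score for name, score in gpt_score.items()
}

for name in ranked:
    print(f"{name:14s} {gpt_score[name]:7.3f}  gap={gap_to_reference[name]:.3f}")
```

On this column, BRIO.GPT3D3 sits directly below the reference row and the original BART model sits last.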

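Smaller models such as BART, once trained with LLM guidance signals and contrastive learning, can come close to LLM performance under LLM-based evaluation.
BRIO.GPT3D3 reaches a GPTScore close to that of the reference LLM (GPT3D3) with only about 100 contrastive training examples.
Contrastive learning generally outperforms MLE training when it uses the LLM-guided automatic evaluation signals (GPTScore/GPTRank).
GPTRank-based evaluation results depend on which LLM serves as the evaluator (ChatGPT vs. GPT-4), so the outcome is sensitive to the evaluation method.
Human evaluation shows that the smaller models still do not surpass the LLMs in human judgment, which points to a mismatch between LLM-based and human evaluation.
The meta-analysis suggests that LLM-based evaluation is useful as a training signal but is limited in how faithfully it aligns with human preferences.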
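
One simple way to quantify the mismatch between LLM-based and human pairwise judgments is a raw agreement rate together with Cohen's kappa. The sketch below is an illustration under stated assumptions, not the paper's meta-analysis procedure: the labels are made-up toy data ("A" = LLM summary preferred, "B" = fine-tuned model summary preferred), and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which the two annotators give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = agreement_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        # Both annotators always use the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy pairwise preferences (hypothetical, for illustration only).
human = ["A", "A", "B", "A", "B", "B"]
llm = ["A", "B", "B", "B", "B", "A"]

rate = agreement_rate(human, llm)
kappa = cohens_kappa(human, llm)
```

In this toy case half of the judgments match, but the chance-corrected kappa is zero, which is the kind of gap between raw and chance-corrected agreement worth checking before trusting an LLM evaluator.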


This review was generated by AI and checked by a human editor.