QUICK REVIEW

[Paper Review] How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization?

Aniket Deroy, Kripabandhu Ghosh|arXiv (Cornell University)|Jun 2, 2023

Artificial Intelligence in Law | Cited by 24

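One-sentence summary

The study evaluates pre-trained abstractive legal summarization models and general-domain large language models on Indian Supreme Court judgements, and finds that abstractive methods score slightly higher on standard metrics but produce notable inconsistencies and hallucinations, so a human-in-the-loop approach is still needed.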

ABSTRACT

Automatic summarization of legal case judgements has traditionally been attempted by using extractive summarization methods. However, in recent years, abstractive summarization models are gaining popularity since they can generate more natural and coherent summaries. Legal domain-specific pre-trained abstractive summarization models are now available. Moreover, general-domain pre-trained Large Language Models (LLMs), such as ChatGPT, are known to generate high-quality text and have the capacity for text summarization. Hence it is natural to ask if these models are ready for off-the-shelf application to automatically generate abstractive summaries for case judgements. To explore this question, we apply several state-of-the-art domain-specific abstractive summarization models and general-domain LLMs on Indian court case judgements, and check the quality of the generated summaries. In addition to standard metrics for summary quality, we check for inconsistencies and hallucinations in the summaries. We see that abstractive summarization models generally achieve slightly higher scores than extractive models in terms of standard summary evaluation metrics such as ROUGE and BLEU. However, we often find inconsistent or hallucinated information in the generated abstractive summaries. Overall, our investigation indicates that the pre-trained abstractive summarization models and LLMs are not yet ready for fully automatic deployment for case judgement summarization; rather a human-in-the-loop approach including manual checks for inconsistencies is more suitable at present.

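Research motivation and objectives

Assess how effective domain-specific abstractive summarization models are on legal case judgements.
Compare abstractive models, general-domain LLMs, and extractive baselines on Indian Supreme Court judgements.
Evaluate not only standard summarization metrics but also the consistency of the outputs and their risk of hallucination.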

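Proposed method

Apply general-domain LLMs (Text-Davinci-003 and Turbo-GPT-3.5) with TL;DR and full-summarize prompts.
Apply legal-domain abstractive models (Legal-Pegasus, LegLED) and their in-domain fine-tuned variants (LegPegasus-IN, LegLED-IN).
Apply extractive baselines (CaseSummarizer, BertSum, SummaRunner/RNN_RNN) for comparison.
Handle long documents by chunking (at most 1024 words per chunk) and concatenating the chunk summaries.
Compute standard metrics (ROUGE, METEOR, BLEU) and consistency metrics (SummaC, NumPrec, NEPrec).
Adjust the chunking and the target summary length to preserve the compression ratio of the gold-standard summaries.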
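The chunk-and-concatenate step with a preserved compression ratio can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the whitespace tokenization, and the `lead_words` stand-in summarizer are all assumptions made here for demonstration. In practice `summarize_chunk` would wrap a call to an abstractive model or an LLM.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 1024) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_document(
    text: str,
    summarize_chunk: Callable[[str, int], str],
    compression_ratio: float,
    max_words: int = 1024,
) -> str:
    """Summarize each chunk to a proportional target length, then concatenate.

    compression_ratio is assumed to be (gold summary length / document length),
    so every chunk is asked for roughly the same fraction of its own length.
    """
    parts = []
    for chunk in chunk_words(text, max_words):
        target_len = max(1, round(len(chunk.split()) * compression_ratio))
        parts.append(summarize_chunk(chunk, target_len))
    return " ".join(parts)


def lead_words(chunk: str, target_len: int) -> str:
    """Hypothetical stand-in summarizer: keep the first target_len words."""
    return " ".join(chunk.split()[:target_len])
```

Passing the target length to the per-chunk summarizer, rather than trimming the concatenated output afterwards, keeps each part of the judgement represented in proportion to its size.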

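Experimental results

Research questions

RQ1: How do domain-specific abstractive models compare with general-domain LLMs on Indian legal judgements?
RQ2: Do abstractive models gain fluency at the cost of consistency and factual accuracy?
RQ3: Is fully automatic deployment feasible, or does legal judgement summarization still require a human in the loop?
RQ4: How does in-domain fine-tuning affect summary quality and consistency?

Key findings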

Model	R2-P	R2-R	R2-F1	RL-P	RL-R	RL-F1	METEOR	BLEU (%)
chatgpt-tldr	0.2391	0.1428	0.1729	0.2956*	0.1785	0.2149	0.1634	7.39
chatgpt-summ	0.1964	0.1731	0.1818	0.2361	0.2087	0.2188	0.1962	10.82
davinci-tldr	0.2338	0.1255	0.1568	0.2846	0.1529	0.1901	0.1412	6.82
davinci-summ	0.2202	0.1795	0.1954	0.2513	0.2058	0.2234	0.1917	11.41
LegPegasus	0.1964	0.1203	0.1335	0.2639	0.1544	0.1724	0.1943	13.14
LegPegasus-IN	0.2644	0.2430	0.2516	0.2818*	0.2620	0.2698	0.1967	18.66
LegLED	0.1115	0.1072	0.1085	0.1509	0.1468	0.1477	0.1424	8.43
LegLED-IN	0.2608	0.2531	0.2550	0.2769	0.2691*	0.2711*	0.2261	19.81
CaseSummarizer	0.2512	0.2269	0.2381	0.2316	0.2085	0.2191	0.1941	15.46
SummaRunner/RNN_RNN	0.2276	0.2103	0.2180	0.1983	0.1825	0.1893	0.2038	17.58
BertSum	0.2474	0.2177	0.2311	0.2243	0.1953	0.2082	0.2037	18.16
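For readers unfamiliar with the R2 columns, the sketch below shows how ROUGE-2 precision, recall, and F1 can be computed for a single summary pair. It is a simplified illustration that assumes lowercased whitespace tokens and no stemming; the function names are chosen here, and the paper's actual scoring tool may differ in tokenization and other details.

```python
from collections import Counter
from typing import Tuple


def bigrams(text: str) -> Counter:
    """Count bigrams over lowercased whitespace tokens."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from clipped bigram overlap."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Precision rewards summaries that contain little beyond what the reference says, while recall rewards covering more of the reference, which is why short TL;DR-style outputs can show high precision alongside low recall.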

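Abstractive models generally score higher than the extractive baselines on ROUGE, METEOR, and BLEU (for example, LegLED-IN reaches an R2-F1 of 0.2550 against 0.2381 for CaseSummarizer), while the LLMs trail the best domain-specific abstractive models on many metrics.
The in-domain fine-tuned models (LegPegasus-IN, LegLED-IN) outperform their non-IN counterparts, which highlights the value of domain-specific fine-tuning.
Abstractive models and LLMs show notable consistency problems, including hallucinations and errors in entities or numbers, which reduces their reliability in legal settings.
SummaC, NumPrec, and NEPrec indicate higher consistency for some domain-specific models but still reveal hallucinations, especially in the LegLED variants.
Overall, pre-trained abstractive models and LLMs are not yet ready for fully automatic deployment in case judgement summarization; a human-in-the-loop workflow is recommended.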
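A number-level consistency check of the kind referred to above can be sketched as follows. The sketch assumes that NumPrec is the fraction of numbers in a generated summary that also occur in the source judgement; that definition, the regular expression, and the function name are assumptions made for illustration rather than the paper's implementation. An entity-level check (NEPrec) would follow the same pattern with a named-entity recognizer in place of the regular expression.

```python
import re

# Matches integers and simple decimals such as "302" or "12.5".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def number_precision(summary: str, source: str) -> float:
    """Fraction of numbers in the summary that also appear in the source.

    Returns 1.0 when the summary contains no numbers, since nothing
    can then be flagged as unsupported.
    """
    summary_numbers = NUMBER_RE.findall(summary)
    if not summary_numbers:
        return 1.0
    source_numbers = set(NUMBER_RE.findall(source))
    matched = sum(1 for n in summary_numbers if n in source_numbers)
    return matched / len(summary_numbers)
```

A score below 1.0 flags a summary for manual review: at least one number in it, such as a year or a section number, has no support in the judgement text.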

This review was generated by AI and checked by a human editor.