QUICK REVIEW

[Paper Review] How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization?

Aniket Deroy, Kripabandhu Ghosh|arXiv (Cornell University)|Jun 2, 2023

Artificial Intelligence in Law|Cited by 24

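One-Line Summary

This study evaluates pre-trained abstractive summarization models and general-purpose large language models on Indian Supreme Court judgements. It finds that abstractive methods score slightly higher on standard metrics but show serious inconsistencies and hallucinations, which suggests that a human-in-the-loop approach is still needed.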

ABSTRACT

Automatic summarization of legal case judgements has traditionally been attempted by using extractive summarization methods. However, in recent years, abstractive summarization models are gaining popularity since they can generate more natural and coherent summaries. Legal domain-specific pre-trained abstractive summarization models are now available. Moreover, general-domain pre-trained Large Language Models (LLMs), such as ChatGPT, are known to generate high-quality text and have the capacity for text summarization. Hence it is natural to ask if these models are ready for off-the-shelf application to automatically generate abstractive summaries for case judgements. To explore this question, we apply several state-of-the-art domain-specific abstractive summarization models and general-domain LLMs on Indian court case judgements, and check the quality of the generated summaries. In addition to standard metrics for summary quality, we check for inconsistencies and hallucinations in the summaries. We see that abstractive summarization models generally achieve slightly higher scores than extractive models in terms of standard summary evaluation metrics such as ROUGE and BLEU. However, we often find inconsistent or hallucinated information in the generated abstractive summaries. Overall, our investigation indicates that the pre-trained abstractive summarization models and LLMs are not yet ready for fully automatic deployment for case judgement summarization; rather a human-in-the-loop approach including manual checks for inconsistencies is more suitable at present.

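Motivation and Objectives

Evaluate the effectiveness of domain-specific abstractive summarization models for legal case judgements.
Compare abstractive models, general-domain LLMs, and extractive baselines on Indian Supreme Court judgements.
Evaluate not only standard summarization metrics but also output consistency and hallucination risk.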

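Proposed Method

Apply general-domain LLMs (Text-Davinci-003 and Turbo-GPT-3.5, the davinci and chatgpt rows in the results table) with TL;DR and full-summarize prompts.
Apply legal-domain abstractive models (Legal-Pegasus, LegLED) and their in-domain fine-tuned variants (LegPegasus-IN, LegLED-IN).
Apply extractive baselines (CaseSummarizer, BertSum, SummaRunner/RNN_RNN) for comparison.
Split long documents into chunks (at most 1024 words per chunk) and concatenate the chunk summaries.
Compute standard metrics (ROUGE, METEOR, BLEU) and consistency metrics (SummaC, NumPrec, NEPrec).
Adjust the chunking and the target summary length so that the compression ratio of the gold-standard summary is preserved.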
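
The chunk-and-concatenate step with a preserved compression ratio can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it splits on whitespace and ignores sentence boundaries, the helper names (chunk_words, target_lengths, summarize_long) are hypothetical, and lead_words is only a stand-in for a real model call.

```python
from typing import Callable, List

MAX_CHUNK_WORDS = 1024  # per-chunk limit stated in the review


def chunk_words(text: str, max_words: int = MAX_CHUNK_WORDS) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Assumption: plain whitespace tokenization, no sentence-boundary handling.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def target_lengths(chunks: List[str], gold_summary_words: int) -> List[int]:
    """Give each chunk a word budget proportional to its share of the document,
    so the concatenated summary roughly matches the gold compression ratio."""
    total = sum(len(c.split()) for c in chunks)
    if total == 0:
        return []
    ratio = gold_summary_words / total
    return [max(1, round(len(c.split()) * ratio)) for c in chunks]


def summarize_long(
    text: str,
    summarize_chunk: Callable[[str, int], str],
    gold_summary_words: int,
    max_words: int = MAX_CHUNK_WORDS,
) -> str:
    """Summarize each chunk within its budget and concatenate the results."""
    chunks = chunk_words(text, max_words)
    budgets = target_lengths(chunks, gold_summary_words)
    return " ".join(summarize_chunk(c, n) for c, n in zip(chunks, budgets))


def lead_words(chunk: str, n_words: int) -> str:
    """Hypothetical placeholder summarizer: keeps the first n_words words.
    A real pipeline would call an abstractive model or an LLM here."""
    return " ".join(chunk.split()[:n_words])


if __name__ == "__main__":
    document = " ".join(f"w{i}" for i in range(2500))
    summary = summarize_long(document, lead_words, gold_summary_words=250)
    print(len(summary.split()), "words in the concatenated summary")
```

Because each chunk's budget is rounded independently, the final length only approximates the target; a real pipeline would also need to map word budgets onto each model's token limits.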

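Experimental Results

Research Questions

RQ1: How do domain-specific abstractive models compare with general-domain LLMs on Indian legal judgements?
RQ2: Can abstractive models generate more fluent summaries without sacrificing consistency and factual accuracy?
RQ3: Is fully automatic deployment feasible, or does legal judgement summarization still need a human in the loop?
RQ4: How does in-domain fine-tuning affect summary quality and consistency?

Key Findings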

Model	R2-P	R2-R	R2-F1	RL-P	RL-R	RL-F1	METEOR	BLEU (%)
chatgpt-tldr	0.2391	0.1428	0.1729	0.2956*	0.1785	0.2149	0.1634	7.39
chatgpt-summ	0.1964	0.1731	0.1818	0.2361	0.2087	0.2188	0.1962	10.82
davinci-tldr	0.2338	0.1255	0.1568	0.2846	0.1529	0.1901	0.1412	6.82
davinci-summ	0.2202	0.1795	0.1954	0.2513	0.2058	0.2234	0.1917	11.41
LegPegasus	0.1964	0.1203	0.1335	0.2639	0.1544	0.1724	0.1943	13.14
LegPegasus-IN	0.2644	0.2430	0.2516	0.2818*	0.2620	0.2698	0.1967	18.66
LegLED	0.1115	0.1072	0.1085	0.1509	0.1468	0.1477	0.1424	8.43
LegLED-IN	0.2608	0.2531	0.2550	0.2769	0.2691*	0.2711*	0.2261	19.81
CaseSummarizer	0.2512	0.2269	0.2381	0.2316	0.2085	0.2191	0.1941	15.46
SummaRunner/RNN_RNN	0.2276	0.2103	0.2180	0.1983	0.1825	0.1893	0.2038	17.58
BertSum	0.2474	0.2177	0.2311	0.2243	0.1953	0.2082	0.2037	18.16

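Abstractive models generally achieve higher ROUGE, METEOR, and BLEU than the extractive baselines; in the table the lead comes from the fine-tuned variants, with LegLED-IN scoring highest on R2-F1, RL-F1, METEOR, and BLEU. The LLMs trail the best domain-specific abstractive models on most metrics.
The in-domain fine-tuned models (LegPegasus-IN, LegLED-IN) outperform their non-IN counterparts, which shows the value of domain-specific fine-tuning.
Both abstractive models and LLMs exhibit consistency problems, including hallucinations and incorrect entities and numbers, which reduce their reliability for legal use.
SummaC, NumPrec, and NEPrec indicate higher consistency for certain domain models, but hallucinations are still observed, particularly in the LegLED family.
Overall, pre-trained abstractive models and LLMs are not yet ready for fully automatic deployment; a human-in-the-loop workflow is preferable.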
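
As an illustration of the kind of consistency check that could support a human reviewer, the sketch below computes a number-precision score in the spirit of NumPrec. NumPrec is taken here to mean the share of numbers in a summary that also occur in the source judgement; the paper's exact definition may differ, and the regex and function names are assumptions made for this example.

```python
import re
from typing import List, Set

# Assumption: a "number" is a run of digits, optionally with . or , separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_in(text: str) -> Set[str]:
    """Return the set of distinct number strings found in text."""
    return set(NUMBER.findall(text))


def num_precision(summary: str, source: str) -> float:
    """Fraction of distinct numbers in the summary that also appear in the source.

    Convention chosen here: a summary with no numbers scores 1.0.
    """
    summary_numbers = numbers_in(summary)
    if not summary_numbers:
        return 1.0
    return len(summary_numbers & numbers_in(source)) / len(summary_numbers)


def unsupported_numbers(summary: str, source: str) -> List[str]:
    """Numbers in the summary with no match in the source, for manual review."""
    return sorted(numbers_in(summary) - numbers_in(source))


if __name__ == "__main__":
    source = "The appeal was filed in 1987 under Section 302 and dismissed in 1992."
    summary = "The 1987 appeal under Section 304 was dismissed."
    print(num_precision(summary, source))
    print(unsupported_numbers(summary, source))
```

A score below 1.0 does not prove a hallucination (the same number can be written in different formats), so flagged items are best treated as prompts for manual checking rather than as automatic rejections.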

This review was generated by AI and checked by a human editor.