QUICK REVIEW

[Paper Summary] Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

Varun Magesh, Faiz Surani|arXiv (Cornell University)|May 30, 2024
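Law, AI, and Intellectual Property|Cited by 32
One-Sentence Summary

This paper reports a preregistered evaluation of Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI against GPT-4, finding substantial hallucination rates and marked differences in accuracy across the tools.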

ABSTRACT

Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI). Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. But the large language models used in these tools are prone to "hallucinate," or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as "eliminating" (Casetext, 2023) or "avoid[ing]" hallucinations (Thomson Reuters, 2023), or guaranteeing "hallucination-free" legal citations (LexisNexis, 2023). Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers' claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.

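Motivation and Objectives

  • Assess the prevalence and nature of hallucinations in leading AI legal research tools.
  • Create a preregistered, domain-specific dataset of legal queries for systematic evaluation.
  • Develop a typology that distinguishes hallucinations from accurate legal responses in RAG-based systems.
  • Provide evidence to guide lawyers' supervision and verification practices when they use AI for legal tasks.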

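Proposed Method

  • Establish a formal framework that separates the correctness of a legal output from its groundedness in supporting evidence.
  • Manually curate a preregistered dataset of more than 200 legal queries.
  • Evaluate Lexis+ AI, Westlaw AI-Assisted Research, Ask Practical Law AI, and GPT-4 on this dataset.
  • Manually review each output for accuracy and for fidelity to the authority it cites.
  • Compare the RAG-based tools with a general-purpose model (GPT-4) to assess relative improvement and remaining risk.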
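The manual review step can be pictured as hand-coding each response and then counting. The sketch below is a minimal illustration, not the authors' pipeline. It assumes the hallucination criterion quoted in the Figure 1 caption (a response is hallucinated if it includes a false statement or falsely asserts that a source supports a statement). The class `CodedResponse`, its field names, the tool labels, and the toy data are all hypothetical, and the "incomplete" category is not modelled.

```python
from dataclasses import dataclass


@dataclass
class CodedResponse:
    """One hand-coded tool response. All field names are hypothetical."""
    tool: str
    has_false_statement: bool    # reviewer found a false statement in the response
    misattributes_source: bool   # response claims a source supports something it does not


def is_hallucinated(r: CodedResponse) -> bool:
    # Criterion taken from the Figure 1 caption: either condition is enough.
    return r.has_false_statement or r.misattributes_source


def hallucination_rate(responses: list[CodedResponse], tool: str) -> float:
    """Share of one tool's coded responses that count as hallucinated."""
    coded = [r for r in responses if r.tool == tool]
    if not coded:
        raise ValueError(f"no coded responses for {tool!r}")
    return sum(is_hallucinated(r) for r in coded) / len(coded)


# Toy labels for illustration only; these are NOT results from the paper.
toy = [
    CodedResponse("tool_a", False, False),
    CodedResponse("tool_a", True, False),
    CodedResponse("tool_a", False, True),
    CodedResponse("tool_a", False, False),
    CodedResponse("tool_b", False, False),
    CodedResponse("tool_b", False, False),
]

print(hallucination_rate(toy, "tool_a"))  # 0.5
print(hallucination_rate(toy, "tool_b"))  # 0.0
```

Keeping the two flags separate, rather than storing a single "hallucinated" boolean, mirrors the summary's distinction between correctness and groundedness: a response can be factually right yet still cite a source that does not support it.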
Figure 1 : Comparison of hallucinated and incomplete answers across generative legal research tools. Hallucinated responses are those that include false statements or falsely assert a source supports a statement. Incomplete responses are those that fail to either address the user’s query or provide

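Experimental Results

Research Questions

  • RQ1: What is the hallucination rate of leading AI legal research tools on real-world queries?
  • RQ2: How accurate and how well grounded in authoritative sources are these tools' answers?
  • RQ3: Do RAG-based approaches significantly reduce hallucinations compared with general-purpose LLMs?
  • RQ4: What are the practical implications for lawyers who must supervise and verify AI outputs?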

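Key Findings

  • Lexis+ AI gives accurate answers to 65% of queries.
  • Westlaw AI-Assisted Research has an accuracy rate of 42%.
  • Ask Practical Law AI gives incomplete or ungrounded answers to more than 60% of queries.
  • Each of the three legal research tools hallucinates on a non-negligible share of queries, between 17% and 33%.
  • RAG improves performance relative to GPT-4 but does not eliminate hallucinations in legal tasks.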
Figure 2 : Top left: Example of a hallucinated response by Westlaw’s AI-Assisted Research product. The system makes up a statement in the Federal Rules of Bankruptcy Procedure that does not exist. Top right: Example of a hallucinated response by LexisNexis’s Lexis+ AI. Casey and its undue burden sta

This summary was generated by AI and reviewed by human editors.