QUICK REVIEW

[Paper Review] WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia

Sina J. Semnani, Violet Z. Yao | arXiv (Cornell University) | May 23, 2023

Topic Modeling | References: 43 | Cited by: 9

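ONE-SENTENCE SUMMARY

WikiChat is a few-shot chatbot pipeline grounded on Wikipedia that combines retrieval, LLM generation, and claim-by-claim fact-checking, and is distilled into a smaller model to improve latency, cost, and privacy.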

ABSTRACT

This paper presents the first few-shot LLM-based chatbot that almost never hallucinates and has high conversationality and low latency. WikiChat is grounded on the English Wikipedia, the largest curated free-text corpus. WikiChat generates a response from an LLM, retains only the grounded facts, and combines them with additional information it retrieves from the corpus to form factual and engaging responses. We distill WikiChat based on GPT-4 into a 7B-parameter LLaMA model with minimal loss of quality, to significantly improve its latency, cost and privacy, and facilitate research and deployment. Using a novel hybrid human-and-LLM evaluation methodology, we show that our best system achieves 97.3% factual accuracy in simulated conversations. It significantly outperforms all retrieval-based and LLM-based baselines, and by 3.9%, 38.6% and 51.0% on head, tail and recent knowledge compared to GPT-4. Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is also significantly more informative and engaging, just like an LLM. WikiChat achieves 97.9% factual accuracy in conversations with human users about recent topics, 55.0% better than GPT-4, while receiving significantly higher user ratings and more favorable comments.

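MOTIVATION AND OBJECTIVES

Improve the factuality of open-domain chatbots by grounding LLM output on a trusted corpus (Wikipedia).
Achieve high conversationality and low latency through a seven-stage retrieval-and-generation pipeline.
Show that the multi-stage system can be distilled into a smaller model without sacrificing quality.
Provide an evaluation methodology that combines simulated and real-user conversations with hybrid human-and-LLM assessment.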

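PROPOSED METHOD

Stage 1: Generate a search query from the user utterance and retrieve Wikipedia passages, re-ranking them by time.
Stage 2: Extract the relevant parts of the retrieved passages and summarize them into bullet points for grounding.
Stage 3: Prompt the LLM with the dialogue history and the bullet points to generate a response.
Stage 4: Break the LLM response into individual claims and retrieve evidence for each claim.
Stage 5: Use chain-of-thought prompting to classify each claim as supported, refuted, or uncertain, and discard claims that are not supported.
Stage 6: Draft the final response from the grounded bullet points and the dialogue history.
Stage 7: Refine the draft based on feedback on relevance, naturalness, non-repetitiveness, and temporal correctness.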
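
The seven stages can be pictured as a chain of pluggable components. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Components` interface, `run_pipeline`, and every `toy_*` stand-in are hypothetical names introduced here, and the toy corpus and canned "LLM" response exist only so the control flow runs without a retriever or a model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical component interface; none of these names come from the WikiChat codebase.
@dataclass
class Components:
    retrieve: Callable[[str], List[str]]             # query -> passages
    summarize: Callable[[List[str]], List[str]]      # passages -> bullet points
    generate: Callable[[List[str], List[str]], str]  # (turns, bullets) -> LLM response
    split_claims: Callable[[str], List[str]]         # response -> claims
    verify: Callable[[str, List[str]], str]          # (claim, evidence) -> verdict label
    draft: Callable[[List[str], List[str]], str]     # (grounded points, turns) -> draft
    refine: Callable[[str, List[str]], str]          # (draft, turns) -> final reply


def run_pipeline(history: List[str], utterance: str, c: Components) -> str:
    turns = history + [utterance]
    passages = c.retrieve(utterance)                 # Stage 1: retrieve
    bullets = c.summarize(passages)                  # Stage 2: summarize into bullets
    response = c.generate(turns, bullets)            # Stage 3: LLM response
    claims = c.split_claims(response)                # Stage 4: split into claims
    grounded = [cl for cl in claims                  # Stage 5: keep supported claims only
                if c.verify(cl, c.retrieve(cl)) == "supported"]
    draft = c.draft(bullets + grounded, turns)       # Stage 6: draft
    return c.refine(draft, turns)                    # Stage 7: refine


# Toy stand-ins (assumptions for illustration only).
CORPUS = ["Wikipedia is a free online encyclopedia.", "Wikipedia launched in 2001."]

def toy_retrieve(query: str) -> List[str]:
    # Keyword overlap instead of a real retriever.
    words = set(query.lower().strip(".?!").split())
    return [p for p in CORPUS if words & set(p.lower().strip(".").split())]

def toy_split(response: str) -> List[str]:
    # One claim per sentence.
    return [s.strip().rstrip(".") + "." for s in response.split(". ") if s.strip()]

def toy_draft(points: List[str], turns: List[str]) -> str:
    return " ".join(dict.fromkeys(points))  # drop duplicates, keep order

toy = Components(
    retrieve=toy_retrieve,
    summarize=lambda passages: passages[:1],
    generate=lambda turns, bullets: "Wikipedia launched in 2001. Wikipedia launched on the Moon.",
    split_claims=toy_split,
    verify=lambda claim, evidence: "supported" if claim in evidence else "uncertain",
    draft=toy_draft,
    refine=lambda draft, turns: draft.strip(),
)

reply = run_pipeline([], "When did Wikipedia launch?", toy)
print(reply)
```

In this toy run the unsupported "Moon" claim is dropped at Stage 5, while the supported claim is merged with the retrieved bullet point. In a real system each callable would wrap a retriever or a few-shot LLM prompt.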

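EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a few-shot LLM grounded on a trusted corpus produce factual and engaging responses while keeping the hallucination rate low?
RQ2: How does the seven-stage retrieval-and-verification pipeline compare with retrieval-only and LLM-only baselines on factuality, conversationality, and latency?
RQ3: Can WikiChat be distilled into a smaller model that lowers latency and cost while preserving factuality and conversationality?
RQ4: Which evaluation methods are best suited to assessing the factuality and conversationality of knowledge-grounded chatbots?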

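Key Findings

WikiChat based on GPT-4 achieves 97.3% factual accuracy in simulated conversations and 97.9% in conversations with real users.
WikiChat variants outperform Atlas (the state-of-the-art retrieval-based model) on factuality and are comparable to LLMs on conversationality.
Distilling WikiChat G4 (the GPT-4-based system) into a 7B-parameter LLaMA model yields 91.1% factual accuracy, with end-to-end latency 3.2 times lower than the teacher model's.
WikiChat's factuality advantage over GPT-4 is largest on tail and recent knowledge.
About one-third of the claims in the LLM-generated responses are rejected by the fact-checking stage, which underlines the importance of claim-level verification.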
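
The claim-level figures above can be read as simple proportions. The sketch below assumes that factual accuracy is the share of claims judged "supported" and that the rejection rate is its complement. This review does not spell out the exact definitions, so treat both functions as illustrative. The verdict labels and the sample list are made up for the example and are not data from the paper.

```python
from typing import List

def factual_accuracy(verdicts: List[str]) -> float:
    """Percentage of claims labeled 'supported' (assumed definition)."""
    if not verdicts:
        raise ValueError("no claims to score")
    supported = sum(v == "supported" for v in verdicts)
    return 100.0 * supported / len(verdicts)

def rejection_rate(verdicts: List[str]) -> float:
    """Percentage of claims that would be discarded by a supported-only filter."""
    return 100.0 - factual_accuracy(verdicts)

# Illustrative labels only: three supported claims and one refuted claim.
sample = ["supported", "supported", "supported", "refuted"]
print(factual_accuracy(sample))  # 75.0
print(rejection_rate(sample))    # 25.0
```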

This review was generated by AI and reviewed by a human editor.