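[Paper Review] Large Language Models Struggle to Learn Long-Tail Knowledge
This paper studies how the factual knowledge of large language models depends on how prevalent relevant documents are in the pre-training data. It uses entity-linked document counts to establish both correlational and causal relationships, and explores retrieval augmentation as a remedy.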
The Internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code -- all of which may be learned by language models. However, while certain pieces of information are ubiquitous on the web, others appear extremely rarely. In this paper, we study the relationship between the knowledge memorized by large language models and the information in pre-training datasets scraped from the web. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data. Finally, we show that retrieval-augmentation can reduce the dependence on relevant pre-training information, presenting a promising approach for capturing the long-tail.
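Motivation and Goals
- Study the relationship between a language model's ability to answer fact-based questions and the amount of pre-training data that contains the relevant entities.
- Identify relevant pre-training documents through entity linking, in order to quantify knowledge exposure across large-scale corpora.
- Assess whether model scale and pre-training data scale can explain how well long-tail knowledge is learned.
- Examine retrieval augmentation as a way to reduce dependence on information that is rare in the pre-training data.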
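Proposed Method
- Build a scalable entity linking pipeline that maps the salient entities in questions and answers to documents in the pre-training datasets (The Pile, ROOTS, C4, OpenWebText, Wikipedia).
- Count the documents in which the question entity and the answer entity co-occur, which identifies the "relevant documents" for each QA pair.
- Evaluate language models (GPT-Neo, BLOOM, GPT-3) on the open-domain QA datasets TriviaQA and Natural Questions in a 4-shot setting, scoring with exact match, and analyze how accuracy relates to the number of relevant documents.
- Run a counterfactual re-training experiment that removes all relevant documents for a sample of questions, to test the causal link between document count and accuracy.
- Explore the effect of scale (model size, data size) and of retrieval augmentation (oracle retrieval and BM25 retrieval) on rare facts.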
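The co-occurrence counting step can be illustrated with a minimal sketch. It assumes entity linking has already been run, so each document is represented as a set of entity IDs. The function name `count_relevant_docs`, the data layout, and the toy documents are hypothetical and are not taken from the paper's code.

```python
from collections import defaultdict


def count_relevant_docs(doc_entities, qa_pairs):
    """Count, for each QA pair, the documents that mention both entities.

    doc_entities: dict mapping doc_id -> set of linked entity IDs (assumed input).
    qa_pairs: list of (question_entity, answer_entity) tuples.
    """
    # Inverted index: entity ID -> set of documents that mention it.
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            index[entity].add(doc_id)

    counts = {}
    for q_ent, a_ent in qa_pairs:
        # A document is "relevant" if both entities co-occur in it.
        q_docs = index.get(q_ent, set())
        a_docs = index.get(a_ent, set())
        counts[(q_ent, a_ent)] = len(q_docs & a_docs)
    return counts


# Toy data for illustration only.
toy_docs = {
    "d1": {"Dante_Alighieri", "Florence", "Italy"},
    "d2": {"Dante_Alighieri", "Divine_Comedy"},
    "d3": {"Florence", "Italy"},
    "d4": {"Dante_Alighieri", "Florence"},
}
toy_pairs = [("Dante_Alighieri", "Florence"), ("Divine_Comedy", "Italy")]
toy_counts = count_relevant_docs(toy_docs, toy_pairs)
print(toy_counts)
```

Building an inverted index first keeps each lookup to a set intersection, which is one plausible way to make the counting tractable on a large corpus. A production pipeline would likely shard the index rather than hold it in memory.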
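Experimental Results
Research Questions
- RQ1: How does a language model's QA accuracy correlate with the number of pre-training documents relevant to a given question?
- RQ2: Is the observed correlation causal, that is, does removing relevant pre-training documents reduce QA performance?
- RQ3: To what extent do model size and pre-training data size improve the learning of long-tail knowledge?
- RQ4: Can retrieval augmentation reduce the dependence on pre-training data for rare facts?
- RQ5: Can alternative lightweight methods for identifying relevant documents explain QA performance as well as the co-occurrence-based method?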
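Key Findings
- QA accuracy correlates strongly with the number of relevant pre-training documents across datasets and models (for example, BLOOM-176B on TriviaQA).
- Counterfactual re-training shows that removing relevant documents lowers accuracy on questions that originally had many relevant documents, which suggests a causal link.
- QA performance on rare facts has a strong log-linear relationship with model size, which implies that an enormous number of parameters would be needed to match strong baselines on long-tail questions.
- Retrieval augmentation substantially improves performance, especially on rare questions, and reduces dependence on the pre-training data.
- Oracle retrieval substantially improves accuracy on rare instances, while BM25-based retrieval also yields improvements, although accuracy still depends slightly on document count.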
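As a rough illustration of BM25-based retrieval augmentation, the sketch below scores a few passages against a question with the standard Okapi BM25 formula and prepends the best match to the prompt. The tokenizer, the `k1` and `b` defaults, the prompt template, and the toy passages are all assumptions made for this example. They do not reproduce the paper's retrieval setup.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase alphanumeric runs.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def build_prompt(question, docs, top_k=1):
    """Prepend the top_k highest-scoring passages to the question."""
    scores = bm25_scores(question, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    context = "\n".join(docs[i] for i in ranked[:top_k])
    return f"{context}\nQ: {question}\nA:"


# Toy passages for illustration only.
passages = [
    "Dante Alighieri was born in Florence",
    "The Pile is a large text corpus",
    "BM25 is a ranking function used in search",
]
question = "Where was Dante born?"
print(build_prompt(question, passages))
```

With oracle retrieval, the retrieval step would be replaced by a passage known to contain the answer, which removes retrieval errors from the picture.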
This review was generated by AI and reviewed by human editors.