QUICK REVIEW

[Paper Review] The False Promise of Imitating Proprietary LLMs

Arnav Gudibande, Eric Wallace|arXiv (Cornell University)|May 25, 2023

Topic Modeling | Cited by 50

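One-sentence summary

This paper critically evaluates the practice of imitating proprietary large language models (LLMs) by fine-tuning open-source models on the outputs of a stronger model such as ChatGPT. It finds that broad imitation largely fails to close the capability gap, whereas local imitation of a specific task is more feasible; overall, improving open-source base models is more effective than imitation.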

ABSTRACT

An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model. In this work, we critically analyze this approach. We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B--13B), data sources, and imitation data amounts (0.3M--150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. Initially, we were surprised by the output quality of our imitation models -- they appear far better at following instructions, and crowd workers rate their outputs as competitive with ChatGPT. However, when conducting more targeted automatic evaluations, we find that imitation models close little to none of the gap from the base LM to ChatGPT on tasks that are not heavily supported in the imitation data. We show that these performance discrepancies may slip past human raters because imitation models are adept at mimicking ChatGPT's style but not its factuality. Overall, we conclude that model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs. In turn, we argue that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems.

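Motivation and objectives

Assess whether fine-tuning open-source LMs on ChatGPT outputs can match the proprietary model across tasks.
Study how the amount of imitation data, the base model size, and the data source affect performance.
Compare crowdsourced evaluation with automatic evaluation to expose discrepancies between the two.
Assess whether imitation goes beyond surface-level instruction following to improve factuality, coding ability, and problem solving.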

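Proposed method

Fine-tune decoder-only models of 1.5B–13B parameters (e.g., GPT-2 1.5B, LLaMA 7B, LLaMA 13B) on imitation datasets.
Build a task-specific imitation dataset (NQ-synthetic) and broad-coverage imitation datasets (ShareGPT-Mix, HC3, Discord ChatGPT Bots).
Evaluate with crowdsourced human ratings (blind pairwise comparisons against ChatGPT), GPT-4 judgments, and automatic benchmarks (MMLU, Natural Questions, HumanEval).
Vary the amount of imitation data (0.3M–150M tokens) and the base model size to study scaling effects.
Use targeted automatic evaluations to analyze the differences between imitating style, factuality, and content.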
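As a rough illustration of the fine-tuning setup listed above, the sketch below turns one (user query, ChatGPT response) pair into a training example whose loss mask covers only the response tokens. The `<user>` and `<assistant>` marker strings, the whitespace tokenizer, and the function name `build_example` are illustrative assumptions; none of them is taken from this summary.

```python
from typing import List, Tuple

USER_TAG = "<user>"            # hypothetical marker, assumed for illustration
ASSISTANT_TAG = "<assistant>"  # hypothetical marker, assumed for illustration


def build_example(query: str, response: str) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask) for one imitation pair.

    Tokens are split on whitespace only to keep the sketch dependency-free;
    a real pipeline would use the base model's tokenizer.
    loss_mask[i] is 1 where the token belongs to the response, else 0.
    """
    prompt_tokens = [USER_TAG] + query.split() + [ASSISTANT_TAG]
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, mask = build_example(
    "Who wrote Hamlet?",
    "William Shakespeare wrote Hamlet.",
)
```

Masking the prompt means the model is trained to reproduce only the stronger model's output, which is one common way to set up this kind of imitation fine-tuning.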

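Experimental results

Research questions

RQ1: Does broad-coverage imitation of ChatGPT improve open-source LMs on standard benchmarks and practical tasks?
RQ2: Can local (task-specific) imitation close the gap to ChatGPT on a specific task such as Natural Questions?
RQ3: How do the amount of imitation data and the base model size jointly affect quality and factuality?
RQ4: Why do crowdsourced evaluations sometimes rate imitation outputs as close to ChatGPT despite weaker factuality?
RQ5: What are the practical implications for open-source LM development and policy?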

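Key findings

Broad-coverage imitation does not improve on the base LM on most tasks and can even degrade performance.
Increasing the base model size consistently improves results, whereas adding more imitation data yields almost no benefit for broad imitation.
Task-specific imitation (NQ-synthetic) substantially narrows the gap to ChatGPT on Natural Questions, indicating that local imitation is more feasible.
Imitation models mimic ChatGPT's style but lag in factuality and content accuracy, as shown by targeted automatic evaluations and canonical benchmarks.
Crowdsourced and GPT-4 evaluations show similar trends: style-focused imitation scores well, but factual content lags.
Imitation data can reduce toxicity because the imitation model inherits the target model's safety guidelines, but the overall gains are limited to stylistic imitation.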
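One way to make "closing the gap" concrete is to express an imitation model's score as a fraction of the distance from the base LM to ChatGPT. The helper below is a minimal sketch of that arithmetic; the function name `gap_closed` and the example scores are hypothetical and are not results from the paper.

```python
def gap_closed(base: float, imitation: float, target: float) -> float:
    """Fraction of the base-to-target gap recovered by the imitation model.

    0.0 means no improvement over the base LM, 1.0 means parity with the
    target model, and negative values mean the imitation model got worse.
    """
    if target == base:
        raise ValueError("base and target scores are equal; gap is undefined")
    return (imitation - base) / (target - base)


# Hypothetical benchmark scores, chosen only to exercise the function.
broad = gap_closed(base=40.0, imitation=39.0, target=70.0)
local = gap_closed(base=20.0, imitation=35.0, target=40.0)
```

With these made-up inputs, `broad` is slightly negative (the imitation model scores below the base LM) and `local` is 0.75, which mirrors the qualitative contrast between broad and task-specific imitation described above.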


This review was generated by AI and reviewed by a human editor.