QUICK REVIEW

[Paper Review] Evaluating the Zero-shot Robustness of Instruction-tuned Language Models

Jiuding Sun, Chantal Shaib|arXiv (Cornell University)|Jun 20, 2023

Topic Modeling|Cited by 12

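One-Sentence Summary

This paper shows that, in zero-shot settings, instruction-tuned large language models (LLMs) are sensitive to instruction phrasings they did not observe during training, and it proposes a soft-prompt alignment method to improve robustness.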

ABSTRACT

Instruction fine-tuning has recently emerged as a promising approach for improving the zero-shot capabilities of Large Language Models (LLMs) on new tasks. This technique has shown particular strength in improving the performance of modestly sized LLMs, sometimes inducing performance competitive with much larger model variants. In this paper we ask two questions: (1) How sensitive are instruction-tuned models to the particular phrasings of instructions, and, (2) How can we make them more robust to such natural language variation? To answer the former, we collect a set of 319 instructions manually written by NLP practitioners for over 80 unique tasks included in widely used benchmarks, and we evaluate the variance and average performance of these instructions as compared to instruction phrasings observed during instruction fine-tuning. We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance, sometimes substantially so. Further, such natural instructions yield a wide variance in downstream performance, despite their semantic equivalence. Put another way, instruction-tuned models are not especially robust to instruction re-phrasings. We propose a simple method to mitigate this issue by introducing "soft prompt" embedding parameters and optimizing these to maximize the similarity between representations of semantically equivalent instructions. We show that this method consistently improves the robustness of instruction-tuned models.

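Research Motivation and Objectives

Evaluate how instruction-tuned language models (the Flan, Alpaca, and T0 families) respond at test time to novel, semantically equivalent instructions.
Quantify the performance degradation on the MMLU and Big-Bench Lite benchmarks when unobserved instructions are used.
Propose a lightweight method that improves robustness by using soft prompts to align the representations of semantically equivalent instructions.
Assess whether robustness improves as model scale grows and when in-context learning (ICL) is used.
Release the instruction dataset collected for the robustness analysis to support future work.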

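Proposed Method

Collect 319 hand-written instructions from 36 NLP researchers, covering 75 tasks, to serve as unobserved instructions.
Evaluate observed versus unobserved instructions on MMLU and Big-Bench Lite across Flan-T5, Alpaca, and T0 variants.
Analyze how observed and unobserved instructions differ in representation-level similarity (penultimate layer, tSNE).
Introduce a soft-prompt alignment objective with a KL-divergence term to align semantically equivalent instructions.
Fine-tune only the soft-prompt parameters (prefix tokens), keeping the rest of the model frozen.
Use GPT-4 to paraphrase reference instructions, producing the paraphrase set used for alignment training.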
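
The alignment step listed above can be illustrated with a small, dependency-free Python sketch. It is a toy under stated assumptions, not the authors' implementation: a fixed linear head over mean-pooled vectors stands in for the frozen LLM, finite differences stand in for backpropagation, and the reference distribution is assumed to come from the observed instruction without any prefix. All names (`FROZEN_HEAD`, `predict`, `alignment_loss`, `train_prefix`) and all numeric values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical frozen "model": mean-pool the input vectors, then apply a fixed
# linear head. These weights are never updated during training.
FROZEN_HEAD = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.8, 0.0],
    [0.2, 0.1, -0.6],
    [0.0, -0.4, 0.7],
]

def predict(vectors):
    """Output distribution of the toy frozen model for a sequence of vectors."""
    dim = len(vectors[0])
    pooled = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in FROZEN_HEAD]
    return softmax(logits)

def alignment_loss(prefix, reference_dist, paraphrase):
    """KL between the reference distribution and the prefixed paraphrase output."""
    return kl_divergence(reference_dist, predict(prefix + paraphrase))

def train_prefix(prefix, reference_dist, paraphrase, lr=1.0, steps=200, h=1e-5):
    """Gradient descent on the prefix only, using central finite differences."""
    prefix = [row[:] for row in prefix]  # work on a copy
    for _ in range(steps):
        grads = [[0.0] * len(row) for row in prefix]
        for i, row in enumerate(prefix):
            for j in range(len(row)):
                original = row[j]
                row[j] = original + h
                up = alignment_loss(prefix, reference_dist, paraphrase)
                row[j] = original - h
                down = alignment_loss(prefix, reference_dist, paraphrase)
                row[j] = original
                grads[i][j] = (up - down) / (2 * h)
        for i, row in enumerate(prefix):
            for j in range(len(row)):
                row[j] -= lr * grads[i][j]
    return prefix

# Placeholder embeddings for an observed instruction and one paraphrase of it.
observed = [[1.0, 0.0, 0.5], [0.2, 0.9, -0.1]]
paraphrase = [[0.1, 0.4, 0.3], [-0.5, 0.2, 0.8], [0.6, -0.3, 0.0]]

reference_dist = predict(observed)             # fixed target distribution
initial_prefix = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
initial_loss = alignment_loss(initial_prefix, reference_dist, paraphrase)
trained_prefix = train_prefix(initial_prefix, reference_dist, paraphrase)
final_loss = alignment_loss(trained_prefix, reference_dist, paraphrase)
print(f"KL before: {initial_loss:.6f}  after: {final_loss:.6f}")
```

Only the two prefix vectors change during training; the head and the instruction embeddings stay fixed, which mirrors the "freeze everything except the soft prompt" design at a very small scale. In a real setup the same loss would typically be computed on model logits and optimized with automatic differentiation.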

Figure 1 : How well do models trained on instruction-tuning datasets generalize to novel instructions (unobserved in training)? Our analysis suggests that they do not do so very well. Above we show a case where pairing an example with an observed instruction yields the correct output, while providin

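Experimental Results

Research Questions

RQ1: How sensitive are instruction-tuned language models to variations in instruction phrasing at test time?
RQ2: Do novel but semantically equivalent instructions reduce zero-shot performance across model families and benchmarks?
RQ3: Is there a lightweight alignment objective that improves robustness to unseen instructions without full model fine-tuning?
RQ4: Does robustness improve with model scale or with in-context learning (ICL)?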

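Key Findings

Unobserved, semantically equivalent instructions consistently reduce accuracy across models and tasks (an average drop of more than 5 points in several settings).
Classification tasks are especially affected by unseen instruction phrasings, with larger drops on BC/MC tasks.
A simple soft-prompt alignment method (trainable prefix embeddings plus a KL-divergence loss) improves robustness and narrows the performance gap on unobserved instructions.
The robustness gap does not fully disappear when models are scaled up to 11B parameters, suggesting that further gains may require larger models or additional methods.
In-context learning (ICL) slightly mitigates sensitivity to unseen instructions but does not eliminate the robustness gap.
Explicitly aligning the representations of semantically equivalent instructions is associated with higher accuracy (the representations of observed and unobserved instructions lie closer together).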
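
One way to make "representations lie closer together" concrete is to compare pooled hidden states with a similarity measure. The sketch below uses cosine similarity, which is an assumption on our part; this summary does not say which metric the paper uses. The function names are hypothetical, and the vectors are expected to be supplied by the reader (for example, penultimate-layer states extracted from a model).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(observed_reprs, unobserved_reprs):
    """Average cosine similarity over every (observed, unobserved) pair."""
    sims = [
        cosine_similarity(o, u)
        for o in observed_reprs
        for u in unobserved_reprs
    ]
    return sum(sims) / len(sims)
```

A higher `mean_pairwise_similarity` after alignment training than before it would be consistent with the finding above, though it would not by itself establish a causal link to accuracy.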

(a) Average zero-shot performance over all tasks when using observed and unobserved instructions.
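
The comparison in this panel, and the variance discussed in the abstract, come down to simple summary statistics over per-instruction scores. A minimal sketch follows, assuming each instruction phrasing yields a single accuracy value for a given task. The helper names are hypothetical, and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Return (mean, sample standard deviation); needs at least two scores."""
    return mean(accuracies), stdev(accuracies)

def robustness_gap(observed_acc, unobserved_acc):
    """Mean accuracy with observed instructions minus mean with unobserved ones."""
    return mean(observed_acc) - mean(unobserved_acc)

# Placeholder values for illustration only.
observed_acc = [0.50, 0.52, 0.51]
unobserved_acc = [0.47, 0.40, 0.44]

print("observed   (mean, std):", summarize(observed_acc))
print("unobserved (mean, std):", summarize(unobserved_acc))
print("gap:", robustness_gap(observed_acc, unobserved_acc))
```

A positive gap means unobserved phrasings score lower on average, and a larger standard deviation on the unobserved side means performance depends more heavily on the exact wording.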


This review was generated by AI and reviewed by human editors.