QUICK REVIEW

[Paper Review] From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning

Qian Liu, Fan Zhou|arXiv (Cornell University)|Apr 17, 2023

Topic Modeling | Cited by 9

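One-Sentence Summary

The paper studies how symbolic tasks, SQL execution in particular, can improve instruction tuning and zero-shot generalization, yielding significant gains in table reasoning without harming generalization to other tasks.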

ABSTRACT

Fine-tuning language models on tasks with instructions has demonstrated potential in facilitating zero-shot generalization to unseen tasks. In this paper, we introduce a straightforward yet effective method for enhancing instruction tuning by employing symbolic tasks. Compared to crowdsourced human tasks or model-generated tasks, symbolic tasks present a unique advantage as they can be easily generated in vast quantities, theoretically providing an infinite supply of high-quality training instances. To explore the potential of symbolic tasks, we carry out an extensive case study on the representative symbolic task of SQL execution. Empirical results on various benchmarks validate that the integration of SQL execution leads to significant improvements in zero-shot scenarios, particularly in table reasoning. Notably, our 3B model surpasses both the 175B GPT-3 and ChatGPT in zero-shot table reasoning across four benchmarks. Furthermore, experimental results on BBH (27 tasks) and MMLU (57 tasks) reveal that language models can be enhanced through symbolic tasks without compromising their generality. We hope that our paper serves as a catalyst, inspiring increased efforts to incorporate symbolic tasks in instruction tuning.

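Research Motivation and Goals

Investigate whether symbolic tasks can strengthen instruction tuning for zero-shot generalization to unseen tasks.
Assess how integrating symbolic tasks affects table reasoning benchmarks and tasks beyond them.
Assess whether symbolic tasks affect performance on general held-out tasks.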

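Proposed Method

Synthesize a large-scale SQL execution corpus by instantiating executable SQL templates on public tables.
Perform multi-task fine-tuning on a mixture of the symbolic task data and diverse natural-language (NL) task data, following the rehearsal strategy of FLAN-T5.
Propose a training-free alternative: generate synthetic demonstrations by including SQL execution results in the instruction prompt.
Evaluate zero-shot performance on table reasoning benchmarks (WTQ, WikiSQL-Weak, SQA, TabFact) and on non-table tasks (SVAMP, BBH, MMLU).
The comparison covers FLAN-T5 variants, GPT-3 models, and TaPEx Zero.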
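
The first step above (instantiating SQL templates on a table and executing them to obtain training targets) can be illustrated with a minimal sketch. It uses Python's standard sqlite3 module as the executor. The toy table, the two templates, the table linearization format, and all function names are hypothetical and chosen for illustration only; they are not the paper's actual templates or data pipeline.

```python
import sqlite3

# Hypothetical toy table standing in for a public table.
HEADER = ["city", "country", "population"]
ROWS = [
    ("Oslo", "Norway", 709),
    ("Bergen", "Norway", 289),
    ("Lyon", "France", 522),
]

# Hypothetical SQL templates; {col} and {val} are filled from the table.
TEMPLATES = [
    "SELECT COUNT(*) FROM t WHERE {col} = {val}",
    "SELECT {col} FROM t ORDER BY population DESC LIMIT 1",
]


def build_connection():
    """Load the toy table into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (city TEXT, country TEXT, population INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    return conn


def sql_literal(value):
    """Render a Python value as a SQL literal (strings are single-quoted)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def flatten_table(header, rows):
    """Linearize the table into one string (the exact format is an assumption)."""
    parts = ["col : " + " | ".join(header)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + " | ".join(str(cell) for cell in row))
    return " ".join(parts)


def make_example(conn, template, col, val=None):
    """Instantiate one template, execute it, and pair the input with the result."""
    sql = template.format(col=col, val=sql_literal(val))
    result = conn.execute(sql).fetchall()
    answer = ", ".join(str(row[0]) for row in result)
    return {"input": f"{sql} {flatten_table(HEADER, ROWS)}", "output": answer}


conn = build_connection()
examples = [
    make_example(conn, TEMPLATES[0], "country", "Norway"),
    make_example(conn, TEMPLATES[1], "city"),
]
for ex in examples:
    print(ex["output"])  # prints "2", then "Oslo"
```

Because the targets come from a real SQL executor rather than from annotators or a model, every generated pair is correct by construction, which is the property the abstract points to when it describes symbolic tasks as an effectively unlimited source of high-quality instances.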

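Experimental Results

Research Questions

RQ1: Does the method improve table reasoning without any realistic examples?
RQ2: Does the method help on tasks beyond table reasoning?
RQ3: Does the method hurt performance on general tasks?

Main Findings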

Model	WTQ	SQA	WikiSQL-Weak	TabFact
Fine-tuned SOTA	62.8	74.5	89.5	92.1
TaPEx	4.1	4.0	21.2	–
GPT-3 (code-davinci-002)	40.4	10.5	55.2	64.1
ChatGPT (gpt-3.5-turbo)	42.9	13.7	26.1	68.8
FLAN-T5 (Large)	30.2	18.9	29.0	59.9
TaPEx Zero (Large)	41.9 (+11.7)	29.9 (+11.0)	62.6 (+33.6)	63.9 (+4.0)
FLAN-T5 (XL)	39.5	16.8	38.2	66.3
TaPEx Zero (XL)	50.2 (+10.7)	34.1 (+17.3)	70.5 (+32.3)	72.3 (+6.0)
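
The parenthesized gains in the table are differences between each TaPEx Zero row and the FLAN-T5 row of the same size. As a quick consistency check, the short sketch below recomputes them from the scores listed above; the variable and function names are illustrative only.

```python
# Scores copied from the table, in column order: WTQ, SQA, WikiSQL-Weak, TabFact.
BENCHMARKS = ["WTQ", "SQA", "WikiSQL-Weak", "TabFact"]
FLAN_T5 = {"Large": [30.2, 18.9, 29.0, 59.9], "XL": [39.5, 16.8, 38.2, 66.3]}
TAPEX_ZERO = {"Large": [41.9, 29.9, 62.6, 63.9], "XL": [50.2, 34.1, 70.5, 72.3]}


def gains(size):
    """Per-benchmark gain of TaPEx Zero over FLAN-T5 at the same model size."""
    return [round(t - f, 1) for f, t in zip(FLAN_T5[size], TAPEX_ZERO[size])]


for size in ("Large", "XL"):
    print(size, dict(zip(BENCHMARKS, gains(size))))
# Large: 11.7, 11.0, 33.6, 4.0
# XL:    10.7, 17.3, 32.3, 6.0
```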

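TaPEx Zero, built on FLAN-T5 models, substantially improves on table reasoning benchmarks: it beats the FLAN-T5 baselines, and at XL size it exceeds both GPT-3 (code-davinci-002) and ChatGPT on all four benchmarks.
Both TaPEx Zero XL and TaPEx Zero Large show marked gains over their FLAN-T5 baselines on WTQ, SQA, WikiSQL-Weak, and TabFact, ranging from +4.0 (TabFact, Large) to +33.6 (WikiSQL-Weak, Large).
Symbolic tasks also improve numerical reasoning on SVAMP and cause no drop on BBH or MMLU, indicating that generality is preserved.
Synthetic demonstrations that include SQL execution deliver significant zero-shot gains and, with only a few demonstrations, are comparable to realistic demonstrations.
TaPEx Zero achieves strong zero-shot performance even without realistic task examples, and its performance improves with model size.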
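
The training-free variant mentioned in the findings, in which SQL queries and their execution results serve as synthetic demonstrations in the prompt, might look roughly like the sketch below. The prompt layout, the toy table, the choice to run the demonstration queries on the same table as the final question, and the function names are all assumptions made for illustration; the paper's actual prompt format may differ.

```python
import sqlite3

# Hypothetical toy table and its linearized text form.
ROWS = [("Oslo", 709), ("Bergen", 289), ("Lyon", 522)]
TABLE_TEXT = (
    "col : city | population "
    "row 1 : Oslo | 709 row 2 : Bergen | 289 row 3 : Lyon | 522"
)

# Hypothetical SQL queries used as synthetic demonstrations.
DEMO_SQL = [
    "SELECT city FROM t WHERE population > 500 ORDER BY city",
    "SELECT MAX(population) FROM t",
]


def run_sql(sql):
    """Execute a query on the toy table and join the first column of the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (city TEXT, population INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return ", ".join(str(row[0]) for row in rows)


def build_prompt(demo_sql, question):
    """Prepend (SQL, execution result) demonstrations to the real question."""
    blocks = [
        f"Table: {TABLE_TEXT}\nQuery: {sql}\nAnswer: {run_sql(sql)}"
        for sql in demo_sql
    ]
    # The final block leaves the answer open for the language model to complete.
    blocks.append(f"Table: {TABLE_TEXT}\nQuery: {question}\nAnswer:")
    return "\n\n".join(blocks)


prompt = build_prompt(DEMO_SQL, "Which city has the smallest population?")
print(prompt)
```

No model weights are updated in this setup: the demonstrations are produced by the SQL executor at inference time, so no human-written examples for the target task are needed.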

This review was generated by AI and reviewed by human editors.