QUICK REVIEW

[Paper Review] UNITE: A Unified Benchmark for Text-to-SQL Evaluation

Wuwei Lan, Zhiguo Wang|arXiv (Cornell University)|May 25, 2023

Natural Language Processing Techniques | Cited by 9

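ONE-SENTENCE SUMMARY

UNITE merges 18 publicly available text-to-SQL datasets into a single benchmark covering 29K databases, with 97K training and 27K test examples, and challenges models on cross-database generalization and robustness. The evaluation shows that SOTA models generalize poorly, while Codex stands out on out-of-domain tasks through in-context learning.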

ABSTRACT

A practical text-to-SQL system should generalize well on a wide variety of natural language questions, unseen database schemas, and novel SQL query structures. To comprehensively evaluate text-to-SQL systems, we introduce a UNIfied benchmark for Text-to-SQL Evaluation (UNITE). It is composed of publicly available text-to-SQL datasets, containing natural language questions from more than 12 domains, SQL queries from more than 3.9K patterns, and 29K databases. Compared to the widely used Spider benchmark, we introduce $\sim$120K additional examples and a threefold increase in SQL patterns, such as comparative and boolean questions. We conduct a systematic study of six state-of-the-art (SOTA) text-to-SQL parsers on our new benchmark and show that: 1) Codex performs surprisingly well on out-of-domain datasets; 2) specially designed decoding methods (e.g. constrained beam search) can improve performance for both in-domain and out-of-domain settings; 3) explicitly modeling the relationship between questions and schemas further improves the Seq2Seq models. More importantly, our benchmark presents key challenges towards compositional generalization and robustness issues -- which these SOTA models cannot address well. Our code and data processing script are available at https://github.com/awslabs/unified-text2sql-benchmark

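MOTIVATION AND GOALS

Provide a comprehensive text-to-SQL benchmark that covers diverse domains, schemas, natural language question (NLQ) structures, and SQL structures.
Enable apples-to-apples evaluation across datasets that were previously scattered and inconsistent.
Analyze how SOTA models perform on in-domain and out-of-domain tasks, and identify key bottlenecks such as compositional generalization and schema linking.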

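PROPOSED METHOD

Aggregate 18 publicly available text-to-SQL datasets into a unified JSONL/SQLite format.
Convert NLQ/SQL pairs to a common schema representation that includes both the original and the cleaned table names, column names, and keys.
Provide three fields for each example: a database identifier, the question, and the SQL query. Schema details are kept in JSON.
Evaluate six SOTA models (including Codex, UL-20B, T5-3B, RASAT, SmBoP, and PICARD) in zero-shot and few-shot settings.
Use execution accuracy as the primary metric: the predicted SQL and the gold SQL are both run against the database and their results are compared.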
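
The execution-accuracy step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual evaluation script: the field names `db_id`, `question`, and `query`, the toy `singer` table, the sample predictions, and the order-insensitive row comparison are all hypothetical choices made for the example.

```python
import json
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run sql and return its rows as a multiset, or None if execution fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries execute and return the same rows (order ignored)."""
    gold = run_query(conn, gold_sql)
    return gold is not None and run_query(conn, predicted_sql) == gold


# Toy in-memory database standing in for one SQLite file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cy', 45);
    """
)

# Two JSONL records carrying the three per-example fields (field names assumed).
jsonl = "\n".join(
    json.dumps(record)
    for record in [
        {"db_id": "concert", "question": "How many singers are there?",
         "query": "SELECT COUNT(*) FROM singer"},
        {"db_id": "concert", "question": "Which singers are at least 45?",
         "query": "SELECT name FROM singer WHERE age >= 45"},
    ]
)
examples = [json.loads(line) for line in jsonl.splitlines()]

# Hypothetical model outputs: the first is equivalent to gold, the second is not.
predictions = [
    "SELECT COUNT(id) FROM singer",
    "SELECT name FROM singer WHERE age > 45",
]

matches = [
    execution_match(conn, pred, ex["query"])
    for pred, ex in zip(predictions, examples)
]
accuracy = sum(matches) / len(matches)
print(f"execution accuracy: {accuracy:.2f}")
```

Comparing result sets rather than SQL strings lets a differently written but equivalent query (here `COUNT(id)` versus `COUNT(*)`) count as correct. Ignoring row order is a simplification: queries with `ORDER BY` would need an order-sensitive comparison.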

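EXPERIMENTAL RESULTS

Research Questions

RQ1: How well do state-of-the-art text-to-SQL models generalize on a large-scale, diverse, cross-database benchmark?
RQ2: How does training on Spider compare with training on UNITE in in-domain and out-of-domain evaluation?
RQ3: Can decoding strategies (such as constrained beam search) and relation-aware schema modeling improve cross-domain performance?
RQ4: How does LLM-based inference (such as Codex) compare with fine-tuned models on out-of-domain data?
RQ5: Which challenges in compositional generalization and robustness remain for existing SOTA models?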

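Key Findings

UNITE is the largest text-to-SQL benchmark to date, covering 29K databases with 97K training and 27K test examples.
The six SOTA parsers average below 50% accuracy on UNITE, which highlights their limited real-world generalization.
Codex achieves the best performance on out-of-domain tasks through in-context learning.
Specially designed decoding (such as constrained beam search) improves Seq2Seq models in both in-domain and out-of-domain settings.
Relation-aware schema modeling (such as relation-aware self-attention) outperforms baseline Seq2Seq methods.
UNITE exposes gaps in robustness and compositional generalization that current SOTA methods do not fully address.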

This review was generated by AI and reviewed by a human editor.