QUICK REVIEW

[论文解读] A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability

Aiwei Liu, Xuming Hu|arXiv (Cornell University)|Mar 12, 2023

Topic Modeling被引用 60

一句话总结

本文在12个基准上评估了 ChatGPT 的零-shot Text-to-SQL 性能，显示出强大的能力和鲁棒性，但与最先进模型仍存在一些差距；在对抗性和多轮场景中取得了显著优势。

ABSTRACT

This paper presents the first comprehensive analysis of ChatGPT's Text-to-SQL ability. Given the recent emergence of large-scale conversational language model ChatGPT and its impressive capabilities in both conversational abilities and code generation, we sought to evaluate its Text-to-SQL performance. We conducted experiments on 12 benchmark datasets with different languages, settings, or scenarios, and the results demonstrate that ChatGPT has strong text-to-SQL abilities. Although there is still a gap from the current state-of-the-art (SOTA) model performance, considering that the experiment was conducted in a zero-shot scenario, ChatGPT's performance is still impressive. Notably, in the ADVETA (RPL) scenario, the zero-shot ChatGPT even outperforms the SOTA model that requires fine-tuning on the Spider dataset by 4.1\%, demonstrating its potential for use in practical applications. To support further research in related fields, we have made the data generated by ChatGPT publicly available at https://github.com/THU-BPM/chatgpt-sql.

研究动机与目标

在多样的数据集和语言环境中评估 ChatGPT 的零-shot Text-to-SQL 能力。
在多种鲁棒性场景下，将零-shot 的 ChatGPT 与经过微调的 SOTA 模型进行对比。
识别 ChatGPT 擅长的场景，包括对抗性和多轮设置。
提供洞见，以指导文本到 SQL 任务的未来提示设计和数据增强。

提出的方法

采用固定的、OpenAI 演示式的 Text-to-SQL 提示（单轮和多轮变体）。
在 12 个公开的 Text-to-SQL 基准上评估 ChatGPT，覆盖 Spider 家族、现实世界变体、对抗性、多语言和多轮数据集。
采用基于执行的评测指标（Valid SQL、Execution Accuracy、Test-Suite），而非精确匹配。
与基于受限解码和骨架引导解码（PICARD、RASAT、RESDSQL）构建的基线进行比较，且不使用 ChatGPT 微调。
在同义词替换、知识需求、对抗性列名变更以及跨语言设置等方面分析数据集的鲁棒性。
提供案例研究，说明常见错误类型及潜在改进。

Figure 1: Example prompts for Text-to-SQL using ChatGPT. The prompt at the top is for a single-turn scenario, while the one below is for multi-turn scenarios where only new questions are added in each interaction.

实验结果

研究问题

RQ1与经过微调的 SOTA 模型相比，零-shot 的 ChatGPT 在标准 Text-to-SQL 基准上的表现如何？
RQ2对于同义词替换、额外知识需求以及对抗性列名变更等鲁棒性挑战，ChatGPT 的鲁棒性如何？
RQ3在多轮和跨语言的 Text-to-SQL 设置下，ChatGPT 的表现如何？
RQ4ChatGPT 会犯哪些错误，提升其零-shot Text-to-SQL 能力的实际方向是什么？

主要发现

ChatGPT 在零-shot Text-to-SQL 上表现强劲，在执行准确率方面与在 Spider 数据上训练的 SOTA 模型仅有 14% 的差距。
在某些鲁棒性场景下，ChatGPT 缩小甚至接近甚至超过 SOTA 方法（例如 ADVETA(RPL)，其超越 SOTA 4.1%）。
ChatGPT 展现出强鲁棒性，在鲁棒性基准上的差距小于在标准 Spider 数据集上的差距（例如，在某些 Spider 鲁棒性设置中差距为 7.8%）。
在多轮设置（SParC、CoSQL）中，ChatGPT 仍具竞争力，相对于单轮结果的差距更小，表明其上下文建模有效。
在中文 Text-to-SQL 数据集（CSpider、DuSQL）中，ChatGPT 表现良好，但当表名/列名也为中文时差距更大，突出跨语言泛化的挑战。
对于 SQL，基于精确匹配的评估是薄弱的，因此强调基于执行的度量；尽管语法不同，ChatGPT 的输出通常在语义上等价。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。