QUICK REVIEW

[Paper Review] SQL-Commenter: Aligning Large Language Models for SQL Comment Generation with Direct Preference Optimization

Lei Yu, Peng Wang | arXiv (Cornell University) | Mar 19, 2026

Natural Language Processing Techniques | Cited by 0

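One-Sentence Summary

SQL-Commenter applies continual pre-training, supervised fine-tuning, and Direct Preference Optimization (DPO) to LLaMA-3.1-8B. This substantially improves SQL comment generation, achieves state-of-the-art results on the Spider and Bird benchmarks, and outperforms baselines in human evaluation.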

ABSTRACT

SQL query comprehension is a significant challenge due to complex syntax, diverse join types, and deep nesting. Many queries lack adequate comments, severely hindering code readability, maintainability, and knowledge transfer. Automated SQL comment generation faces two main challenges: limited datasets that inadequately represent complex real-world queries, and Large Language Models' (LLMs) insufficient understanding of SQL-specific semantics. Our empirical analysis shows that even after continual pre-training and supervised fine-tuning, LLMs struggle with complex SQL semantics, yielding inaccurate comments. To address this, we propose SQL-Commenter, an advanced method based on LLaMA-3.1-8B. We first construct a comprehensive dataset of complex SQL queries with expert-verified comments. Next, we perform continual pre-training on a large SQL corpus to enhance the LLM's syntax and semantic understanding, followed by supervised fine-tuning. Finally, we introduce Direct Preference Optimization (DPO) using human feedback. SQL-Commenter utilizes a preference-based loss function to favor preferred outputs, enhancing fine-grained semantic learning and context-dependent quality assessment. Evaluated on the Spider and Bird benchmarks, SQL-Commenter significantly outperforms state-of-the-art baselines. On average, it surpasses the strongest baseline (Qwen3-14B) by 9.29, 4.99, and 13.23 percentage points on BLEU-4, METEOR, and ROUGE-L, respectively. Moreover, human evaluation demonstrates the superior quality of comments generated by SQL-Commenter in terms of correctness, completeness, and naturalness.
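The abstract mentions a preference-based loss but does not write it out. For reference, the standard DPO objective (Rafailov et al., 2023) is shown below. This review does not say whether SQL-Commenter modifies it. In this setting, x would be an SQL query, y_w a preferred comment, and y_l a rejected comment. π_θ is the model being trained. π_ref is a frozen reference model, assumed here to be the SFT checkpoint. σ is the logistic sigmoid, and β is a temperature that limits how far π_θ may drift from π_ref.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$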

Research Motivation and Objectives

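Address the challenge of generating high-quality, technically accurate SQL comments for the complex queries found in real-world analytics and legacy systems.
Build a high-quality, expert-verified dataset of complex SQL queries drawn from the Spider and Bird benchmarks, each paired with a detailed comment.
Strengthen the model's grasp of SQL syntax and semantics through continual pre-training on a large-scale SQL corpus.
Improve comment quality through supervised fine-tuning on a dedicated SQL comment dataset, followed by Direct Preference Optimization based on human feedback.
Demonstrate state-of-the-art performance in both automatic metrics and human evaluation, and publicly release the dataset and code.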

Proposed Method

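Construct a large, expert-verified dataset of SQL queries from the Spider and Bird benchmarks, each with a detailed comment.
Perform continual pre-training (CPT) of LLaMA-3.1-8B on ~1.2 million SQL queries plus general-domain data to strengthen its understanding of SQL.
Perform supervised fine-tuning (SFT) on ~15,071 ⟨SQL, Comment⟩ pairs to teach the model to write detailed, technically accurate explanations.
Apply Direct Preference Optimization (DPO) to pairs of preferred and rejected comments to align outputs with developer preferences.
Build the training data in two stages: (1) generate machine-assisted comments with DeepSeek-V3.1, which experts then refine; (2) construct preferred/rejected comment pairs for DPO through multi-strategy negative sampling.
Evaluate on Spider and Bird with BLEU-4, METEOR, and ROUGE-L, complemented by human evaluation of correctness, completeness, and naturalness.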
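The DPO step can be illustrated with a minimal Python sketch of the per-pair loss. It assumes the standard DPO formulation with a frozen reference model, since the review does not give SQL-Commenter's exact loss or hyperparameters. The function name `dpo_pair_loss`, its arguments (the summed token log-probability of a whole comment given the SQL query), and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math


def dpo_pair_loss(policy_chosen_logp: float,
                  policy_rejected_logp: float,
                  ref_chosen_logp: float,
                  ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) comment pair.

    Hypothetical helper: each *_logp is the summed token log-probability
    of a full comment given the SQL query, under either the trainable
    policy or the frozen reference model (assumed to be the SFT model).
    """
    # Log-ratio of policy to reference for each comment.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin between preferred and rejected comments.
    margin = beta * (chosen_ratio - rejected_ratio)
    # Loss is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    # Branch on the sign so that exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With identical policy and reference log-probabilities, the margin is zero and the loss is log 2 ≈ 0.693. The loss falls below that as the policy raises the preferred comment's likelihood relative to the rejected one, measured against the reference.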

Figure 1. The Motivation of our SQL-Commenter.

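Experimental Results

Research Questions

RQ1: How does SQL-Commenter perform on SQL comment generation compared with state-of-the-art baselines on Spider and Bird?
RQ2: What does each component (CPT, SFT, and DPO) contribute, individually and in combination?
RQ3: How do human evaluators rate the generated comments for correctness, completeness, and naturalness?
RQ4: What are the common failure modes, and what are potential directions for improvement?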

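Key Findings

SQL-Commenter achieves state-of-the-art automatic evaluation scores against strong baselines on the Spider and Bird development/test sets.
On the Spider development set, it reaches BLEU-4 36.95, METEOR 58.37, and ROUGE-L 57.17, clearly outperforming the strongest baseline.
On the Spider test set, it reaches BLEU-4 36.37, METEOR 57.76, and ROUGE-L 56.48, a clear improvement over the baselines.
On the Bird development set, it reaches BLEU-4 35.09, METEOR 55.91, and ROUGE-L 56.74, a substantial margin over the baselines.
Human evaluation rates SQL-Commenter's comments above the baselines' in correctness, completeness, and naturalness.
It is the first work to apply Direct Preference Optimization (DPO) to SQL comment generation, and it publicly releases its dataset and code.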
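The review reports ROUGE-L scores without implementation details. To illustrate what ROUGE-L measures, here is a minimal Python sketch of the sentence-level ROUGE-L F-measure (Lin, 2004), which is based on the longest common subsequence (LCS) of tokens. Lowercasing, whitespace tokenization, and β = 1 are assumptions. The helper names `lcs_length` and `rouge_l_f` are hypothetical. The paper's actual scorer may differ, and the reported values (e.g., 57.17) are presumably this score scaled to 0 to 100.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the tokens of `a` seen so far
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # best without x or y
        prev = cur
    return prev[-1]


def rouge_l_f(candidate: str, reference: str, beta: float = 1.0) -> float:
    """Sentence-level ROUGE-L F-measure (assumed tokenization: lowercase + split)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    # F = (1 + beta^2) * P * R / (R + beta^2 * P)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

For example, `rouge_l_f("returns the names of all singers", "lists the names of singers")` finds the 4-token LCS "the names of singers". This gives precision 4/6, recall 4/5, and F1 = 16/22 ≈ 0.727.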

Figure 2. The Overview of our SQL-Commenter.


This review was generated by AI and checked by human editors.