QUICK REVIEW

[Paper Review] LLM-Based Test Case Generation in DBMS through Monte Carlo Tree Search

Yujia Chen, Yingli Zhou|arXiv (Cornell University)|Mar 23, 2026

Software Testing and Debugging Techniques | Cited by 0

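One-Sentence Summary

MIST uses lightweight LLMs, together with a hierarchical feature tree and Monte Carlo Tree Search, to generate SQL test cases for database management systems, improving code coverage across parsing, optimization, execution, and storage.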

ABSTRACT

Database Management Systems (DBMSs) are fundamental infrastructure for modern data-driven applications, where thorough testing with high-quality SQL test cases is essential for ensuring system reliability. Traditional approaches such as fuzzing can be effective for specific DBMSs, but adapting them to different proprietary dialects requires substantial manual effort. Large Language Models (LLMs) present promising opportunities for automated SQL test generation, but face critical challenges in industrial environments. First, lightweight models are widely used in organizations due to security and privacy constraints, but they struggle to generate syntactically valid queries for proprietary SQL dialects. Second, LLM-generated queries are often semantically similar and exercise only shallow execution paths, thereby quickly reaching a coverage plateau. To address these challenges, we propose MIST, an LLM-based test case generatIon framework for DBMS through Monte Carlo Tree search. MIST consists of two stages: Feature-Guided Error-Driven Test Case Synthetization, which constructs a hierarchical feature tree and uses error feedback to guide LLM generation, aiming to produce syntactically valid and semantically diverse queries for different DBMS dialects, and Monte Carlo Tree Search-Based Test Case Mutation, which jointly optimizes seed query selection and mutation rule application guided by coverage feedback, aiming at boosting code coverage by exploring deeper execution paths. Experiments on three widely-used DBMSs with four lightweight LLMs show that MIST achieves average improvements of 43.3% in line coverage, 32.3% in function coverage, and 46.4% in branch coverage compared to the baseline approach with the highest line coverage of 69.3% in the Optimizer module.

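Motivation and Goals

Advance robustness testing of DBMSs by generating high-quality SQL test cases that cover diverse dialects and deeper execution paths.
Overcome the data-access-layer and semantic gaps with a feature-guided, error-driven synthesis stage and an MCTS-based mutation stage.
Achieve higher code coverage across DBMS components using only locally deployed lightweight LLMs.
Demonstrate generalization through extensive experiments across different DBMS architectures and LLM sizes.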

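Proposed Method

Build a hierarchical feature tree from official DBMS documentation to guide LLM-based test case generation.
Use error feedback from executing tests to iteratively refine the LLM prompts (feature-guided, error-driven synthesis).
Apply Monte Carlo Tree Search to mutate and extend test cases, guided by coverage feedback, using 135 dialect-aware mutation rules.
Treat successful execution of a test case as an implicit oracle, focusing on syntactic validity and runtime stability rather than functional correctness.
Post-process the generated SQL to ensure executability and strip non-SQL content before execution.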
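The coverage-guided mutation stage can be illustrated with a minimal sketch. It is not the paper's implementation: the three mutation rules, the `toy_coverage` function (which counts SQL keywords instead of instrumented branches), and the shallow seed/rule tree are all assumptions made for illustration. It shows only the core loop: UCB1 selection of a seed and a rule, execution of the mutant, and backpropagation of newly gained coverage as the reward.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node: the root, a seed query, or a mutation rule under a seed."""
    label: str
    children: list = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

    def ucb1(self, parent_visits: int, c: float = 1.4) -> float:
        if self.visits == 0:
            return math.inf  # unvisited children are always tried first
        mean = self.total_reward / self.visits
        return mean + c * math.sqrt(math.log(parent_visits) / self.visits)


# Hypothetical mutation rules; the paper reports 135 dialect-aware ones.
RULES = {
    "add_order_by": lambda q: q + " ORDER BY 1",
    "add_limit": lambda q: q + " LIMIT 1",
    "wrap_subquery": lambda q: f"SELECT * FROM ({q}) AS t",
}


def toy_coverage(query: str) -> set:
    """Stand-in for real instrumentation: one 'branch' per keyword present."""
    keywords = ("SELECT", "WHERE", "ORDER", "LIMIT", "FROM (")
    return {kw for kw in keywords if kw in query}


def make_seed_node(query: str) -> Node:
    return Node(query, children=[Node(rule) for rule in RULES])


def search(seeds, iterations=30):
    root = Node("root", children=[make_seed_node(s) for s in seeds])
    covered, corpus = set(), []
    for _ in range(iterations):
        # Selection: choose a seed, then a rule for that seed, by UCB1.
        root.visits += 1
        seed_node = max(root.children, key=lambda n: n.ucb1(root.visits))
        seed_node.visits += 1
        rule_node = max(seed_node.children,
                        key=lambda n: n.ucb1(seed_node.visits))
        # Simulation: apply the rule and measure newly covered branches.
        mutant = RULES[rule_node.label](seed_node.label)
        new = toy_coverage(mutant) - covered
        if new:
            covered |= new
            corpus.append(mutant)
            # Expansion: a mutant that found new coverage becomes a seed.
            root.children.append(make_seed_node(mutant))
        # Backpropagation: credit the reward to the rule and its seed.
        reward = len(new)
        rule_node.visits += 1
        rule_node.total_reward += reward
        seed_node.total_reward += reward
    return corpus, covered


corpus, covered = search(["SELECT a FROM t WHERE a > 0"])
print(len(corpus), sorted(covered))
```

In a real setting the reward would come from coverage instrumentation of the DBMS under test, and mutants that fail to execute would be filtered out; the toy coverage function here accepts any string.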

Figure 1. An illustrative example of using Qwen2.5-7B to generate a SQL test case for DBMS.
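The post-processing and implicit-oracle steps of the method can be sketched as follows. This is an assumption-laden illustration, not the authors' code: it uses Python's built-in `sqlite3` module as the DBMS under test, and the helper names `extract_sql`, `run_test_case`, and `feedback_prompt` are hypothetical. The idea is to strip non-SQL text from raw LLM output, execute what remains, and treat "ran without error" as the pass signal while feeding any error message back into the next prompt.

```python
import re
import sqlite3

FENCE = "`" * 3  # Markdown code fence that LLMs often wrap SQL in
FENCED_BLOCK = re.compile(
    FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def extract_sql(llm_output: str) -> str:
    """Keep only fenced SQL if any fence is present, else the trimmed text."""
    blocks = FENCED_BLOCK.findall(llm_output)
    if blocks:
        return "\n".join(block.strip() for block in blocks)
    return llm_output.strip()


def run_test_case(sql: str, setup: str = ""):
    """Implicit oracle: the test passes if the script executes without error.

    Returns (passed, error_message). Functional correctness of the result
    is deliberately not checked.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup)
        conn.executescript(sql)
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


def feedback_prompt(sql: str, error: str) -> str:
    """Hypothetical prompt fragment carrying the error back to the LLM."""
    return f"The query below failed with: {error}\nFix it:\n{sql}"


raw = "Here is a test:\n" + FENCE + "sql\nSELECT a FROM t;\n" + FENCE + "\nDone."
sql = extract_sql(raw)
passed, error = run_test_case(sql, setup="CREATE TABLE t(a INT);")
print(sql, passed)
```

An in-process check like this cannot observe a crash of the database engine itself; a harness that cares about runtime stability would typically run the DBMS in a separate process and watch for abnormal exits.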

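Experimental Results

Research Questions

RQ1: How effective is MIST at improving DBMS code coverage compared with the baseline?
RQ2: How does MIST perform across different DBMS modules (Parser, Optimizer, Executor, Storage)?
RQ3: How much does each of the two stages (synthesis and mutation) contribute to the coverage gains?
RQ4: Do the results generalize across LLMs of different sizes and across different DBMS architectures?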

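Key Findings

Compared with the baseline, MIST improves line coverage by 43.3%, function coverage by 32.3%, and branch coverage by 46.4% on average.
MIST reaches its highest line coverage in the Optimizer module: 69.3% on DuckDB and 63.4% on PostgreSQL.
Improvements are observed on DuckDB, PostgreSQL, and SQLite and with all four evaluated LLMs, with smaller models even outperforming larger baseline models.
At the module level, both DuckDB and PostgreSQL show significant improvements across all four modules: Parser, Optimizer, Executor, and Storage.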

This review was generated by AI and reviewed by a human editor.