QUICK REVIEW

[Paper Review] CBR-to-SQL: Rethinking Retrieval-based Text-to-SQL using Case-based Reasoning in the Healthcare Domain

Hung T. Nguyen, Hans Moen | arXiv (Cornell University) | Mar 5, 2026

Biomedical Text Mining and Ontologies | Cited by 0

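One-Sentence Summary

CBR-to-SQL introduces a two-stage retrieval framework that uses masked case templates and a separate entity grounding step to translate natural language questions into SQL for the healthcare domain. It achieves state-of-the-art logical form accuracy and robust generalization, particularly when data is scarce.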

ABSTRACT

Extracting insights from Electronic Health Record (EHR) databases often requires SQL expertise, creating a barrier for healthcare decision-making and research. While a promising approach is to use Large Language Models (LLMs) to translate natural language questions to SQL via Retrieval-Augmented Generation (RAG), adapting this approach to the medical domain is non-trivial. Standard RAG relies on single-step retrieval from a static pool of examples, which struggles with the variability and noise of medical terminology and jargon. This often leads to anti-patterns such as expanding the task demonstration pool to improve coverage, which in turn introduces noise and scalability problems. To address this, we introduce CBR-to-SQL, a framework inspired by Case-Based Reasoning (CBR). It represents question-SQL pairs as reusable, abstract case templates and utilizes a two-stage retrieval process that first captures logical structure and then resolves relevant entities. Evaluated on MIMICSQL, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy. More importantly, it demonstrates higher sample efficiency and robustness than standard RAG approaches, particularly under data scarcity and retrieval perturbations.

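Motivation and Objectives

Improve text-to-SQL for the healthcare domain, where the need for SQL expertise is a barrier to accessing EHR data.
Propose a Case-Based Reasoning (CBR) framework that separates logical structure retrieval from entity grounding.
Show that masked case templates yield reusable patterns and improve sample efficiency and robustness.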

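Proposed Method

Convert question-SQL pairs into abstract case templates by masking entities while preserving the underlying pattern (Case Retain).
At inference time, retrieve similar masked templates (Template Construction) and use an LLM to generate a draft SQL template.
Ground the placeholder entities against schema-based retrieval tables (Source Discovery) and fill in concrete schema entities to produce executable SQL.
Use two-stage retrieval to separate logical structure from entity resolution, drawing on different data sources for patterns and for EHR grounding.
Evaluate on MIMICSQL under the Complete and Incomplete database settings, and use a brittleness metric to assess sensitivity to the retrieved cases.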
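
The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the masking rule (quoted strings and bare integers only), the string-similarity retrieval, the fuzzy matching against a small value vocabulary, and all function, table, and column names are assumptions made for illustration. In particular, the sketch reuses the best-matching stored SQL template directly, where the paper has an LLM draft the template from the retrieved cases.

```python
import re
from difflib import SequenceMatcher, get_close_matches

VALUE_RE = re.compile(r'"([^"]*)"')

def mask(text):
    """Toy masking rule: quoted strings become [VALUE], bare integers become [NUM]."""
    text = VALUE_RE.sub("[VALUE]", text)
    return re.sub(r"\b\d+\b", "[NUM]", text)

def retain_case(case_base, question, sql):
    """Case Retain: store the pair as a masked, reusable template."""
    case_base.append((mask(question).lower(), mask(sql)))

def retrieve_template(case_base, question):
    """Stage 1: return the SQL template of the most similar masked question."""
    query = mask(question).lower()
    best = max(case_base,
               key=lambda case: SequenceMatcher(None, query, case[0]).ratio())
    return best[1]

def ground_entities(template, question, vocabulary):
    """Stage 2: map each quoted mention to its closest known database value."""
    sql = template
    for mention in VALUE_RE.findall(question):
        match = get_close_matches(mention.lower(), vocabulary, n=1, cutoff=0.6)
        value = match[0] if match else mention  # fall back to the raw mention
        sql = sql.replace("[VALUE]", f'"{value}"', 1)
    return sql

if __name__ == "__main__":
    cases = []
    retain_case(cases, 'How many patients have diagnosis "sepsis"?',
                'SELECT COUNT(DISTINCT subject_id) FROM diagnoses '
                'WHERE short_title = "sepsis"')
    retain_case(cases, "What is the age of patient 123?",
                "SELECT age FROM demographic WHERE subject_id = 123")

    question = 'How many patients have diagnosis "hypertenson"?'  # misspelled on purpose
    template = retrieve_template(cases, question)
    print(ground_entities(template, question, ["hypertension", "sepsis", "pneumonia"]))
```

Because the misspelled mention is masked before retrieval, stage 1 matches on question structure alone, and the spelling noise is handled only in stage 2. This is the separation the two-stage design aims for.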

Figure 1: Overview of the CBR-to-SQL architecture.

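Experimental Results

Research Questions

RQ1: How does the two-stage, case-based retrieval framework compare with standard RAG on medical text-to-SQL?
RQ2: Do masking and templating improve sample efficiency and robustness under data scarcity and retrieval perturbations?
RQ3: How does decoupling logical structure retrieval from entity grounding affect accuracy and brittleness?
RQ4: How well does CBR-to-SQL generalize in the incomplete-data setting compared with the RAG-based baseline?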

Key Findings

Method	Acc_EX	Acc_LF
SQLNet	0.260	0.142
PtrGen	0.292	0.180
Coarse2Fine	0.378	0.496
TREQS	0.654	0.556
RAG-to-SQL	0.855	0.811
CBR-to-SQL	0.882	0.828
MedTS	0.899	0.784
GE-SQL	0.942	–
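
The review does not define the two metrics in the table. The sketch below assumes the usual reading: Acc_LF (logical form accuracy) checks whether the generated SQL matches the gold SQL as text, and Acc_EX (execution accuracy) checks whether both queries return the same result. The helper names, the whitespace-only normalization, and the toy `demographic` table are hypothetical. The paper's exact matching rules may differ.

```python
import sqlite3

def normalize(sql):
    """Collapse whitespace so purely cosmetic differences are ignored."""
    return " ".join(sql.split())

def logical_form_match(pred, gold):
    """Acc_LF-style check (assumed): the two queries match as text."""
    return normalize(pred) == normalize(gold)

def execution_match(conn, pred, gold):
    """Acc_EX-style check (assumed): both queries return the same rows."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    return sorted(pred_rows) == sorted(conn.execute(gold).fetchall())

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demographic (subject_id INTEGER, age INTEGER)")
    conn.executemany("INSERT INTO demographic VALUES (?, ?)", [(1, 70), (2, 45)])

    gold = "SELECT COUNT(*) FROM demographic WHERE age > 60"
    pred = "SELECT COUNT(subject_id) FROM demographic WHERE age >= 61"
    print(logical_form_match(pred, gold))     # text differs
    print(execution_match(conn, pred, gold))  # results agree on this data
```

Under this reading, a query can be counted as correct by Acc_EX while failing Acc_LF, which is one way a method's Acc_EX can exceed its Acc_LF.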

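In the Complete database setting, CBR-to-SQL scores higher than RAG-to-SQL on both Acc_EX and Acc_LF (Acc_EX: 0.882 vs. 0.855; Acc_LF: 0.828 vs. 0.811).
CBR-to-SQL achieves state-of-the-art Acc_LF on MIMICSQL and competitive Acc_EX, outperforming several baselines (e.g., TREQS, Coarse2Fine, PtrGen, SQLNet).
In the Incomplete database setting, CBR-to-SQL keeps a clear lead over RAG-to-SQL (Acc_EX: 0.842 vs. 0.777; Acc_LF: 0.780 vs. 0.747).
CBR-to-SQL is less brittle than RAG-to-SQL in both the Complete (CDB) and Incomplete (IDB) settings (Δbrittle_EX: 0.047 vs. 0.065 on CDB; 0.049 vs. 0.068 on IDB).
Ablation studies show that removing Source Discovery substantially degrades performance, while mask-based Template Construction is resilient to noise.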
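
To make the comparison with RAG-to-SQL concrete, the short Python snippet below computes the absolute accuracy gaps from the figures reported above. The dictionary layout and function name are illustrative only. The numbers are copied from the findings, not recomputed from the paper's data.

```python
# (CBR-to-SQL, RAG-to-SQL) scores as reported in the findings above.
REPORTED = {
    "complete":   {"Acc_EX": (0.882, 0.855), "Acc_LF": (0.828, 0.811)},
    "incomplete": {"Acc_EX": (0.842, 0.777), "Acc_LF": (0.780, 0.747)},
}

def accuracy_gaps(reported):
    """Absolute gap (CBR-to-SQL minus RAG-to-SQL) per setting and metric."""
    return {
        setting: {metric: round(cbr - rag, 3) for metric, (cbr, rag) in metrics.items()}
        for setting, metrics in reported.items()
    }

if __name__ == "__main__":
    for setting, metrics in accuracy_gaps(REPORTED).items():
        print(setting, metrics)
```

The gaps come out to 0.027 (Acc_EX) and 0.017 (Acc_LF) in the Complete setting, and 0.065 and 0.033 in the Incomplete setting, so the margin over RAG-to-SQL is larger when the database is incomplete.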

Figure 2: CBR cycle, adapted from Aamodt and Plaza (1994).


This review was generated by AI and checked by a human editor.