QUICK REVIEW

[Paper Digest] Multi-Step Semantic Reasoning in Generative Retrieval

Steven Dong, Yubao Tang|arXiv (Cornell University)|Mar 12, 2026

Information Retrieval and Search Behavior | Cited by: 0

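One-Sentence Summary

ReasonGR strengthens multi-step semantic reasoning in generative retrieval by combining structured prompting with a reasoning adapter, improving retrieval accuracy and training efficiency on FinQA.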

ABSTRACT

Generative retrieval (GR) models encode a corpus within model parameters and generate relevant document identifiers directly for a given query. While this paradigm shows promise in retrieval tasks, existing GR models struggle with complex queries in numerical contexts, such as those involving semantic reasoning over financial reports, due to limited reasoning capabilities. This limitation leads to suboptimal retrieval accuracy and hinders practical applicability. We propose ReasonGR, a framework designed to enhance multi-step semantic reasoning in numerical contexts within GR. ReasonGR employs a structured prompting strategy combining task-specific instructions with stepwise reasoning guidance to better address complex retrieval queries. Additionally, it integrates a reasoning-focused adaptation module to improve the learning of reasoning-related parameters. Experiments on the FinQA dataset, which contains financial queries over complex documents, demonstrate that ReasonGR improves retrieval accuracy and consistency, indicating its potential for advancing GR models in reasoning-intensive retrieval scenarios.

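Motivation and Objectives

Improve retrieval performance for queries that require multi-step numerical reasoning over complex documents.
Propose the ReasonGR framework, which combines structured prompting with stepwise reasoning guidance.
Introduce a reasoning-focused adaptation module that learns reasoning-related parameters efficiently.
Demonstrate improvements over baseline generative retrieval methods on the FinQA dataset.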

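Proposed Method

Use a Transformer-based encoder-decoder backbone for generative retrieval, with a LoRA-style reasoning adapter.
Apply 4-bit QLoRA quantization to the frozen backbone to reduce memory usage.
Design reasoning-guided training with task templates and Chain-of-Thought (CoT) instructions.
Train on two tasks: memorizing document IDs via maximum likelihood estimation (MLE), and learning multi-step relevance with reasoning traces.
Supervise token-level predictions with an adaptive penalty-scaled loss that combines EM, PM, SM, and S-Score signals.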
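The LoRA-style adapter named in the method list can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `LoRALinear`, the initialization scale, and the pure-Python matrix helpers are assumptions chosen to keep the example self-contained. It shows only the general LoRA idea of a frozen weight matrix plus a trainable low-rank update scaled by `alpha / rank`.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen backbone weight, never updated
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values, B starts at zero, so the adapter
        # initially leaves the backbone output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Count adapter parameters only (A and B); W is frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2)
print(layer.forward([1.0, 1.0]))   # equals the frozen output while B is zero
print(layer.trainable_params())
```

For a `d_out x d_in` weight, the adapter trains `rank * (d_in + d_out)` parameters instead of `d_in * d_out`, which is where the efficiency argument for low-rank adaptation comes from.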

Figure 1: ReasonGR performing multi-step semantic reasoning on a FinQA query. The model extracts key info and locates relevant report sections to generate the docid, formed by the company name and report year.
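The figure caption says the docid is formed from the company name and the report year. A small sketch of such an identifier follows. The `COMPANY_YEAR` string format, the function names, and the component-wise overlap score are all assumptions made for illustration; the digest does not define the exact docid format or how PM and SM are computed, so `component_overlap` should not be read as the paper's metric.

```python
def build_docid(company: str, year: int) -> str:
    """Build a hypothetical docid such as 'APPLE_2019' from its two parts."""
    return f"{company.strip().upper()}_{year}"


def split_docid(docid: str) -> tuple:
    """Split a docid on its last underscore into (company, year) strings."""
    company, _, year = docid.rpartition("_")
    return company, year


def exact_match(predicted: str, gold: str) -> bool:
    """True only when the whole identifier is reproduced exactly."""
    return predicted == gold


def component_overlap(predicted: str, gold: str) -> float:
    """Illustrative score: fraction of the two components that agree."""
    pred_parts, gold_parts = split_docid(predicted), split_docid(gold)
    return sum(p == g for p, g in zip(pred_parts, gold_parts)) / 2


gold = build_docid(" apple ", 2019)
print(gold)
print(exact_match("APPLE_2018", gold), component_overlap("APPLE_2018", gold))
```

A structured docid like this means a prediction can be partly right (correct company, wrong year), which is one plausible reason to report softer scores next to exact match.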

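Experimental Results

Research Questions

RQ1: Does structured prompting (including few-shot and CoT) improve multi-step reasoning in generative retrieval over financial documents?
RQ2: Does a reasoning adapter based on LoRA/QLoRA improve retrieval accuracy and training efficiency on reasoning-intensive tasks?
RQ3: How does ReasonGR perform on the FinQA dataset compared with traditional retrieval and vanilla GR baselines?
RQ4: How does prompt design (Zero vs. CoT vs. full ReasonGR) affect performance and efficiency?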
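The three prompt settings compared in RQ4 can be sketched as a small prompt builder. The instruction wording, the `Query:`/`Docid:` layout, and the function name `build_prompt` are hypothetical; the digest only states that the full setting combines few-shot examples with CoT guidance, that the CoT setting uses CoT alone, and that Zero uses no prompt, which is read here as "query only".

```python
# Hypothetical template text; the paper's actual wording is not given in this digest.
TASK_INSTRUCTION = (
    "Given a financial question, generate the identifier of the report that answers it."
)
COT_GUIDANCE = (
    "Reason step by step: (1) extract the key entities and figures, "
    "(2) locate the relevant report section, (3) output the docid."
)


def build_prompt(query, mode="full", examples=()):
    """Assemble a prompt for one of three settings: 'zero', 'cot', or 'full'."""
    if mode not in {"zero", "cot", "full"}:
        raise ValueError(f"unknown mode: {mode}")
    target = f"Query: {query}\nDocid:"
    if mode == "zero":
        return target  # no instruction, no guidance, no demonstrations
    parts = [TASK_INSTRUCTION, COT_GUIDANCE]
    if mode == "full":
        # Few-shot demonstrations are added only in the full setting.
        parts.extend(f"Query: {q}\nDocid: {d}" for q, d in examples)
    parts.append(target)
    return "\n\n".join(parts)


demo = [("What was the change in net revenue in 2018?", "EXAMPLECO_2018")]
print(build_prompt("What was the operating margin in 2019?", "full", demo))
```

Keeping the three settings behind one function makes an ablation straightforward: the same query set can be rendered under each mode and fed to the same model.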

Key Findings

Model	EM (Eval)	PM (Eval)	SM (Eval)	EM (Test)	PM (Test)	SM (Test)
BM25	0.623	-	-	0.625	-	-
DSI	0.563	0.646	0.651	0.578	0.654	0.659
ReasonGR (Zero)	0.572	0.732	0.748	0.601	0.750	0.767
ReasonGR (CoT)	0.571	0.728	0.748	0.612	0.755	0.774
ReasonGR	0.607	0.751	0.765	0.626	0.762	0.779

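All ReasonGR variants outperform DSI on EM, PM, and SM on both the FinQA eval and test sets. BM25 reports EM only and remains competitive: it exceeds every ReasonGR variant on eval EM (0.623 vs. 0.607 for full ReasonGR).
Full ReasonGR achieves the best PM and SM scores on both sets, and on the test set its EM edges past BM25 (0.626 vs. 0.625).
Training with prompts (few-shot + CoT) helps; performance drops without prompts (Zero).
CoT-only prompting yields modest gains on the test set and is roughly on par with Zero on eval; the larger benefit comes from stacking few-shot prompts on top.
ReasonGR trains efficiently under the prompted settings: memory usage is comparable, and training time decreases depending on the prompt configuration.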
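The comparisons behind these findings can be recomputed directly from the results table. The sketch below copies the reported scores into a dictionary and computes absolute gains; the helper name `gain` and the data layout are illustrative choices, and the numbers are exactly those listed in the table.

```python
METRICS = ("EM", "PM", "SM")

# Scores copied from the results table; None marks values not reported for BM25.
RESULTS = {
    "BM25":            {"eval": (0.623, None, None),   "test": (0.625, None, None)},
    "DSI":             {"eval": (0.563, 0.646, 0.651), "test": (0.578, 0.654, 0.659)},
    "ReasonGR (Zero)": {"eval": (0.572, 0.732, 0.748), "test": (0.601, 0.750, 0.767)},
    "ReasonGR (CoT)":  {"eval": (0.571, 0.728, 0.748), "test": (0.612, 0.755, 0.774)},
    "ReasonGR":        {"eval": (0.607, 0.751, 0.765), "test": (0.626, 0.762, 0.779)},
}


def gain(model, baseline, split, metric):
    """Absolute score difference (model minus baseline), or None if unreported."""
    idx = METRICS.index(metric)
    ours = RESULTS[model][split][idx]
    base = RESULTS[baseline][split][idx]
    if ours is None or base is None:
        return None
    return round(ours - base, 3)


for split in ("eval", "test"):
    print(split, "EM vs DSI:", gain("ReasonGR", "DSI", split, "EM"),
          "| EM vs BM25:", gain("ReasonGR", "BM25", split, "EM"))
```

On these numbers the full model gains 0.044 (eval) and 0.048 (test) EM over DSI, while against BM25 the EM difference is -0.016 on eval and +0.001 on test.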

This digest was generated by AI and reviewed by a human editor.