QUICK REVIEW

[Paper Review] OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation

Hui Fang, Shuai Zhang|arXiv (Cornell University)|Mar 17, 2026

Information Retrieval and Search Behavior|Cited by 0

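One-Sentence Summary

OPERA introduces dynamic data pruning (DP) for finetuning dense retrieval models, resolving the quality-coverage tradeoff, improving both ranking (NDCG@10) and retrieval (Recall@20), converging faster, and extending to LLM-based retrievers.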

ABSTRACT

Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings scale to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.

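Motivation and Objectives

Identify how data quality affects retrieval model finetuning and expose the inherent tradeoff between standard training and pruning-based training.
Develop a pruning framework (SP and DP) that improves the ranking quality of dense retrievers without sacrificing recall.
Demonstrate the effectiveness and efficiency of SP and DP across multiple domains and architectures.
Show that dynamic pruning can be combined with fixed-iteration training and preserves data coverage while prioritizing informative examples.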

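Proposed Method

Study static pruning (SP), which retains only high-similarity query-document pairs, and analyze the resulting quality-coverage tradeoff.
Propose dynamic pruning (DP) with hierarchical query/document sampling and cosine-graded thresholds that softly modulate sampling probabilities.
Adaptively update query-level and document-level sampling probabilities throughout training while keeping the full training set accessible.
Provide theoretical insight into when pruning helps, including a formal result on the true-positive signal outweighing noise (Theorem 1).
Evaluate OPERA on eight datasets, six domains, and two architectures (encoder-only BGE and decoder-based Qwen3-Embedding).
Compare DP against baseline finetuning and other pruning methods, including ablations of hierarchical pruning and an efficiency analysis.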
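The static pruning step and the quality-coverage tradeoff it exposes can be illustrated with a small sketch. This is not the authors' implementation: the function names `static_prune` and `query_coverage`, the toy similarity scores, and the global top-fraction selection rule are all assumptions chosen for illustration.

```python
def static_prune(pairs, retention_rate):
    """Keep the top `retention_rate` fraction of pairs by similarity.

    `pairs` is a list of (query_id, doc_id, similarity) tuples.
    Selecting globally by score is an assumption of this sketch.
    """
    n_keep = max(1, int(len(pairs) * retention_rate))
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    return ranked[:n_keep]


def query_coverage(pairs, kept):
    """Fraction of distinct queries that still have at least one pair."""
    all_queries = {q for q, _, _ in pairs}
    kept_queries = {q for q, _, _ in kept}
    return len(kept_queries) / len(all_queries)


def mean_similarity(pairs):
    """Average similarity, used here as a crude proxy for pair quality."""
    return sum(s for _, _, s in pairs) / len(pairs)


# Toy training set (hypothetical scores): five queries, eight pairs.
pairs = [
    ("q1", "d1", 0.92), ("q1", "d2", 0.88),
    ("q2", "d3", 0.75),
    ("q3", "d4", 0.40), ("q3", "d5", 0.35),
    ("q4", "d6", 0.20), ("q4", "d7", 0.81),
    ("q5", "d8", 0.30),
]

kept = static_prune(pairs, retention_rate=0.5)
print("mean similarity before:", mean_similarity(pairs))
print("mean similarity after: ", mean_similarity(kept))
print("query coverage after:  ", query_coverage(pairs, kept))
```

In this toy set, keeping half of the pairs raises the mean similarity but leaves only three of the five queries represented, which mirrors the ranking-versus-recall tension described above.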

Figure 1: Comparison of sampling probability distributions across training strategies. Left three panels: Standard finetuning (FT) samples all data pairs uniformly, while static pruning (SP) discards the lowest-similarity ones and up-weights the rest, improving ranking but reducing query coverage. D
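The contrast between these sampling distributions can be sketched in a few lines. The code below is a hedged illustration, not the paper's procedure: the temperature-scaled softmax, the use of a query's mean pair similarity as its score, and the names `static_probs` and `two_stage_sample` are assumptions, and the sketch omits any threshold schedule or probability updates during training.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; a lower temperature sharpens it."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_probs(n):
    """Standard finetuning: every item is equally likely."""
    return [1.0 / n] * n


def static_probs(scores, retention_rate):
    """Static pruning: zero mass on the lowest scores, even mass on the rest."""
    n_keep = max(1, int(len(scores) * retention_rate))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:n_keep])
    return [1.0 / n_keep if i in keep else 0.0 for i in range(len(scores))]


def two_stage_sample(docs_by_query, rng, temperature=0.5):
    """Sample a query, then one of its documents, both with soft weights.

    `docs_by_query` maps query_id -> list of (doc_id, similarity).
    Scoring a query by its mean similarity is an assumption of this sketch.
    """
    queries = list(docs_by_query)
    q_scores = [
        sum(s for _, s in docs_by_query[q]) / len(docs_by_query[q])
        for q in queries
    ]
    query = rng.choices(queries, weights=softmax(q_scores, temperature), k=1)[0]
    docs = docs_by_query[query]
    doc_ids = [doc_id for doc_id, _ in docs]
    doc_weights = softmax([s for _, s in docs], temperature)
    doc = rng.choices(doc_ids, weights=doc_weights, k=1)[0]
    return query, doc


scores = [0.9, 0.5, 0.1, 0.3]  # hypothetical similarities
print("uniform:", uniform_probs(len(scores)))
print("static: ", static_probs(scores, retention_rate=0.5))
print("soft:   ", softmax(scores, temperature=0.5))

toy = {"q1": [("d1", 0.9), ("d2", 0.4)], "q2": [("d3", 0.2)]}
print(two_stage_sample(toy, random.Random(0)))
```

The point of the soft variant is that every pair keeps a nonzero probability, so low-similarity queries are down-weighted instead of being removed from training.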

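Experimental Results

Research Questions

RQ1: Compared with standard finetuning and pruning baselines, does OPERA improve both ranking and retrieval metrics across diverse domains?
RQ2: Do OPERA's findings generalize to LLM-based dense retrievers (LLM embedding architectures)?
RQ3: How robust is OPERA to noisy training data, and how does it affect convergence speed and efficiency?
RQ4: What is the computational overhead of OPERA, and how can it be optimized?
RQ5: How does dynamic pruning allocate training focus across queries and documents over time?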

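Key Findings

Static pruning can improve ranking (NDCG) but may degrade retrieval (Recall) because query diversity drops.
Dynamic pruning (DP) keeps the full training set accessible while adaptively focusing on high-quality examples, improving overall performance.
DP achieves the strongest results on both ranking (NDCG@10) and retrieval (Recall@20) on most datasets and architectures.
DP reaches performance comparable to standard finetuning in less than 50% of the training time.
DP is also effective for an LLM-based retriever (Qwen3-Embedding-0.6B), indicating that the method is architecture-agnostic.
With noisy training data, SP can outperform DP on retrieval, but a two-stage pipeline (SP followed by DP) achieves the best recall, and DP alone still yields significant ranking gains.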
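Because the findings are reported in NDCG@10 and Recall@20, a short sketch of how the two metrics are typically computed may help when reading them. This is a generic illustration with hypothetical function names; it uses the linear-gain form of DCG, which is one common convention, and the paper's evaluation code may differ in such details.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """NDCG@k for one query.

    `relevance` maps doc_id -> graded relevance; missing documents count as 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_doc_ids, relevance, k=20):
    """Share of relevant documents that appear in the top k results."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical judgments and system ranking for a single query.
relevance = {"a": 1, "b": 1}
ranking = ["x", "a", "y"]
print("NDCG@10:  ", ndcg_at_k(ranking, relevance, k=10))
print("Recall@20:", recall_at_k(ranking, relevance, k=20))
```

NDCG rewards placing relevant documents near the top, while Recall only asks whether they appear within the cutoff, which is why a method can improve one metric and hurt the other.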

Figure 2: Training efficiency on ANTIQUE (unseen) and FEVER (seen). RP and SP use retention rate $k = 0.25$.


This review was generated by AI and checked by a human editor.