QUICK REVIEW

[Paper Review] To Case or Not to Case: An Empirical Study in Learned Sparse Retrieval

Emmanouil Georgios Lionis, Jia-Huei Ju|arXiv (Cornell University)|Jan 24, 2026

Information Retrieval and Search Behavior | Cited by 0

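One-Sentence Summary

This paper systematically compares cased and uncased backbone models for Learned Sparse Retrieval (LSR) and shows that lowercasing the input largely closes the performance gap, so cased models can also be used for LSR given appropriate pre-processing.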

ABSTRACT

Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have relied almost exclusively on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, the most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models with cased backbone models by default perform substantially worse than their uncased counterparts; however, this gap can be eliminated by pre-processing the text to lowercase. Moreover, our token-level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, explaining their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: https://github.com/lionisakis/Uncased-vs-cased-models-in-LSR

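Motivation and Objectives

Assess how backbone casing (cased vs. uncased) affects LSR performance on in-domain and out-of-domain datasets.
Determine whether lowercasing as a pre-processing step mitigates the performance gap of cased models.
Evaluate the efficiency-accuracy trade-off of post-processing strategies for cased and uncased LSR.
Examine the zero-shot transfer robustness of cased and uncased LSR models.
Provide practical guidance for using modern cased backbone models in LSR pipelines.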

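Proposed Method

Implement pre-processing (none or lowercasing) and post-processing (none, uncased-only vocabulary, or a cased regularizer) for cased and uncased backbone models.
Train sparse representations with a SPLADE-style encoder and Margin-MSE teacher-student distillation.
Control sparsity with FLOPs-based regularization.
Evaluate on MSMARCO, DL-2019, DL-2020, and BEIR using MRR@10, nDCG@10, and R@1000.
Analyze the casing distribution of output tokens under the different pre-processing conditions.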
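
The training ingredients listed above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes the SPLADE-max pooling variant (max over tokens of log(1 + ReLU(logit))), the usual Margin-MSE definition over a (query, positive, negative) triple, and the standard FLOPs regularizer (sum over vocabulary entries of the squared mean activation in a batch). All function names are hypothetical.

```python
import math

def splade_activation(token_logits):
    """Pool per-token vocabulary logits into one sparse vector.

    token_logits holds one row per input token and one logit per
    vocabulary entry. Assumes SPLADE-max pooling:
    max over tokens of log(1 + ReLU(logit)).
    """
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab_size)
    ]

def dot(u, v):
    """Relevance score of two sparse vectors stored as dense lists."""
    return sum(a * b for a, b in zip(u, v))

def margin_mse(q, d_pos, d_neg, teacher_pos, teacher_neg):
    """Squared error between student and teacher score margins for one triple."""
    student_margin = dot(q, d_pos) - dot(q, d_neg)
    teacher_margin = teacher_pos - teacher_neg
    return (student_margin - teacher_margin) ** 2

def flops_regularizer(batch):
    """Sum over vocabulary entries of the squared batch-mean activation."""
    n = len(batch)
    vocab_size = len(batch[0])
    return sum(
        (sum(vec[j] for vec in batch) / n) ** 2 for j in range(vocab_size)
    )
```

In a real setup these terms would be computed on tensors and combined into one loss, with separate regularization weights for queries and documents; the weights used in the paper are not given in this summary.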

Figure 1: Pipeline of cased models. Queries and documents first undergo a pre-processing step, followed by encoding, and then a post-processing step where sparse vectors are generated and compared. During post-processing, Cased Regularization is applied only during training as an additional loss.
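
The pre-processing and post-processing stages of this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the toy vocabulary, the helper names, and the rule that a vocabulary entry counts as "uncased" when it contains no uppercase characters are all assumptions. The cased regularizer is left out because its exact form is not described here.

```python
def lowercase_preprocess(text):
    """Pre-processing: lowercase the raw query or document text."""
    return text.lower()

def uncased_token_ids(vocab):
    """Ids of vocabulary entries without uppercase characters.

    vocab maps token strings to integer ids, like a tokenizer vocabulary.
    """
    return {idx for token, idx in vocab.items() if token == token.lower()}

def restrict_to_uncased(sparse_vector, allowed_ids):
    """Post-processing: keep only weights on uncased vocabulary entries.

    sparse_vector maps token ids to non-zero weights.
    """
    return {idx: w for idx, w in sparse_vector.items() if idx in allowed_ids}

# Toy vocabulary (hypothetical): two casings of one word, a subword, an acronym.
toy_vocab = {"paris": 0, "Paris": 1, "##s": 2, "NASA": 3}
allowed = uncased_token_ids(toy_vocab)
pruned = restrict_to_uncased({0: 1.2, 1: 0.7, 3: 0.1}, allowed)
```

Restricting the output to uncased entries shrinks the set of index terms, which is where the efficiency gain of this post-processing option would come from.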

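Experimental Results

Research Questions

RQ1: How does backbone casing affect LSR performance in-domain and out-of-domain?
RQ2: Can lowercasing pre-processing restore cased models to parity with uncased models?
RQ3: Do post-processing strategies improve LSR efficiency while minimizing accuracy loss?
RQ4: How does backbone casing affect zero-shot transfer robustness across datasets?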
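
These questions are answered with the ranking metrics MRR@10, nDCG@10, and R@1000. As a reminder of what each one measures for a single query, here is a minimal sketch. It assumes linear gains in nDCG and binary relevance for MRR and recall; the function names are hypothetical, and published numbers normally come from a standard evaluation tool rather than hand-written code.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k=10):
    """DCG of the ranking divided by the DCG of the ideal ranking.

    gains maps document ids to graded relevance; missing ids count as 0.
    """
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Each function scores one query; a benchmark score is the mean over all queries in the evaluation set.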

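Key Findings

Without any pre-processing, uncased models generally outperform cased models on in-domain tasks.
Lowercasing pre-processing largely closes the gap between cased and uncased models, bringing cased performance close to uncased performance (e.g., BERT-cased vs. BERT-uncased on MSMARCO Dev).
Post-processing (e.g., restricting logits to the uncased vocabulary) mainly improves efficiency, at the cost of a small loss in accuracy.
Uncased models transfer zero-shot more robustly on BEIR, although lowercased cased models are competitive in some domains (e.g., NFCorpus, Quora).
In the token-level analysis, cased inputs tend to map to uncased outputs, and with lowercasing the models use almost exclusively uncased tokens, which explains the restored performance.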
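
The token-level analysis amounts to counting how input casing relates to output casing. The sketch below shows one way such a confusion matrix could be tallied. It assumes that each input token is paired with the vocabulary items it activates in the sparse output and that a token counts as cased when it contains an uppercase character; the paper's exact counting procedure may differ, and the names are hypothetical.

```python
from collections import Counter

def casing(token):
    """Label a token 'cased' if it has any uppercase character, else 'uncased'."""
    return "cased" if token != token.lower() else "uncased"

def casing_confusion(activations):
    """Count (input casing, output casing) pairs.

    activations is an iterable of (input_token, activated_output_tokens).
    """
    counts = Counter()
    for input_token, output_tokens in activations:
        for output_token in output_tokens:
            counts[(casing(input_token), casing(output_token))] += 1
    return counts

# Toy example (hypothetical activations, not data from the paper).
toy = [("Paris", ["paris", "france", "Paris"]), ("is", ["is"])]
matrix = casing_confusion(toy)
```

Under lowercasing every input token falls in the uncased row, so the finding that cased vocabulary items are almost never activated would show up as a near-empty ("uncased", "cased") cell.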

Figure 2: Confusion matrices comparing input and output token casing across BERT and DistilBERT models under different pre-processing conditions. For both models, no post-processing method is used. Rows correspond to input token casing (cased vs. uncased), and columns represent the resulting output

This review was generated by AI and reviewed by a human editor.