QUICK REVIEW

[Paper Review] To Case or Not to Case: An Empirical Study in Learned Sparse Retrieval

Emmanouil Georgios Lionis, Jia-Huei Ju|arXiv (Cornell University)|Jan 24, 2026

Information Retrieval and Search Behavior | Citations: 0

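One-Line Summary

This paper systematically compares cased and uncased backbone models in Learned Sparse Retrieval (LSR) and shows that lowercasing the input substantially narrows the performance gap, so that cased models become usable for LSR with appropriate pre-processing.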

ABSTRACT

Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have relied almost exclusively on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, the most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models with cased backbone models by default perform substantially worse than their uncased counterparts; however, this gap can be eliminated by pre-processing the text to lowercase. Moreover, our token-level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, explaining their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: https://github.com/lionisakis/Uncased-vs-cased-models-in-LSR

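Research Motivation and Objectives

Evaluate how backbone casing (cased vs. uncased) affects LSR performance on in-domain and out-of-domain datasets.
Determine whether lowercasing as a pre-processing step closes the performance gap of cased models.
Clarify the efficiency-accuracy trade-off between cased and uncased LSR by evaluating post-processing strategies.
Examine the zero-shot transfer robustness of cased vs. uncased LSR models.
Provide practical guidance for using modern cased backbones in LSR pipelines.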

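Proposed Method

Run each backbone with pre-processing set to none or lowercasing, and post-processing set to none, uncased Voc. only, or cased regularizer.
Train a SPLADE-style encoder with Margin-MSE teacher-student distillation to learn sparse representations.
Optimize sparsity with FLOPs-based regularization.
Evaluate on MSMARCO, DL-2019, DL-2020, and BEIR using MRR@10, nDCG@10, and R@1000.
Analyze the casing distribution of output tokens under different pre-processing conditions.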
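The training objective listed above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: it assumes the max-pooling variant of the SPLADE activation, a single (query, positive, negative) triple for Margin-MSE, and the FLOPS regularizer in its usual form (squared mean activation per vocabulary item, summed over the vocabulary). The function names are hypothetical.

```python
import math

def splade_pool(token_logits):
    """Max-pool log(1 + ReLU(logit)) over input positions.

    token_logits: list of rows, one per input position, each row holding
    one logit per vocabulary item. Returns one weight per vocabulary item.
    (Assumption: max pooling; other SPLADE variants sum instead.)
    """
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[v], 0.0)) for row in token_logits)
        for v in range(vocab_size)
    ]

def dot(a, b):
    """Relevance score: dot product of two sparse vectors stored densely."""
    return sum(x * y for x, y in zip(a, b))

def margin_mse(q, d_pos, d_neg, teacher_margin):
    """Squared error between the student and teacher score margins."""
    student_margin = dot(q, d_pos) - dot(q, d_neg)
    return (student_margin - teacher_margin) ** 2

def flops_reg(batch):
    """FLOPS regularizer: sum over vocab of (mean weight in the batch)^2."""
    n = len(batch)
    vocab_size = len(batch[0])
    return sum(
        (sum(vec[v] for vec in batch) / n) ** 2 for v in range(vocab_size)
    )
```

In training, the total loss would typically be the Margin-MSE term plus weighted FLOPS terms for queries and documents; the weights are hyperparameters that this review does not report.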

Figure 1: Pipeline of cased models. Queries and documents first undergo a pre-processing step, followed by encoding, and then a post-processing step where sparse vectors are generated and compared. During post-processing, Cased Regularization is applied only during training as an additional loss.
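The pre-processing and post-processing steps around the encoder can be illustrated as follows. This is a sketch under stated assumptions: the vocabulary is a toy list, a token is treated as "cased" when it contains at least one uppercase character (the review does not give the paper's exact criterion), and the helper names are hypothetical. The cased regularizer is not sketched.

```python
def lowercase_preprocess(text):
    """Pre-processing option: lowercase the raw text before tokenization."""
    return text.lower()

def is_uncased_token(token):
    """Assumed criterion: a token is uncased if it has no uppercase letters."""
    return token == token.lower()

def uncased_vocab_only(weights, vocab):
    """Post-processing option: zero the weights of cased vocabulary items."""
    return [
        w if is_uncased_token(tok) else 0.0
        for w, tok in zip(weights, vocab)
    ]

# Toy vocabulary and sparse vector (illustrative values only).
vocab = ["the", "The", "apple", "Apple", "##ing"]
weights = [0.5, 1.2, 0.0, 0.7, 0.3]
masked = uncased_vocab_only(weights, vocab)
```

Masking cased entries shrinks the set of posting lists an inverted index has to maintain, which is consistent with post-processing being framed here mainly as an efficiency measure.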

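Experimental Results

Research Questions

RQ1: How does backbone casing affect LSR performance in-domain and out-of-domain?
RQ2: Does lowercasing as pre-processing restore cased models to parity with uncased models?
RQ3: Can post-processing strategies improve efficiency while keeping the loss in LSR accuracy minimal?
RQ4: How does backbone casing affect the robustness of zero-shot transfer across datasets?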

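Key Findings

Without pre-processing, uncased models generally outperform cased models on in-domain tasks.
Lowercasing pre-processing substantially narrows the gap for cased models, bringing them close to uncased performance (e.g., BERT-cased vs. BERT-uncased on MSMARCO Dev).
Post-processing (e.g., restricting logits to the uncased vocabulary) mainly improves efficiency, with minimal loss in accuracy.
Uncased models are more robust in zero-shot transfer on BEIR, although cased models with lowercasing can be competitive on specific datasets (e.g., NFCorpus, Quora).
Token-level analysis shows that cased inputs often map to uncased outputs, and that lowercasing forces near-exclusive use of uncased tokens, which explains the recovered performance.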
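A token-level casing analysis of this kind can be approximated with a 2x2 count of input casing against output casing. The sketch below is illustrative only: how the paper pairs input tokens with output tokens is not described in this review, so the pairing is left to the caller, and the casing criterion (any uppercase character means "cased") is an assumption.

```python
from collections import Counter

def casing(token):
    """Assumed criterion: any uppercase character makes a token 'cased'."""
    return "uncased" if token == token.lower() else "cased"

def casing_confusion(pairs):
    """Count (input casing, output casing) over (input, output) token pairs.

    Returns a nested dict: rows are input casing, columns are output casing.
    """
    counts = Counter((casing(i), casing(o)) for i, o in pairs)
    labels = ("cased", "uncased")
    return {row: {col: counts[(row, col)] for col in labels} for row in labels}

# Hypothetical pairs of (input token, activated output token).
pairs = [("Apple", "apple"), ("Apple", "Apple"), ("pie", "pie"), ("The", "the")]
matrix = casing_confusion(pairs)
```

With lowercased input, the "cased" row is empty by construction, so the question of interest becomes how much mass the "uncased" row still places in the "cased" column.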

Figure 2: Confusion matrices comparing input and output token casing across BERT and DistilBERT models under different pre-processing conditions. For both models, no post-processing method is used. Rows correspond to input token casing (cased vs. uncased), and columns represent the resulting output

This review was generated by AI and checked by a human editor.