QUICK REVIEW

[Paper Review] VQPP: Video Query Performance Prediction Benchmark

Adrian Catalin Lutu, Eduard Poesina|arXiv (Cornell University)|Feb 19, 2026

Multimodal Machine Learning Applications | Cited by 0

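One-Sentence Summary

VQPP is the first benchmark for query performance prediction in content-based video retrieval (CBVR). It evaluates pre-retrieval and post-retrieval predictors across two datasets and two CBVR systems, and demonstrates a use case in which a large language model is trained to reformulate queries.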

ABSTRACT

Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at https://github.com/AdrianLutu/VQPP.

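Motivation and Objectives

Establish the first standardized benchmark for query performance prediction (QPP) in content-based video retrieval (CBVR).
Provide ground truth, data splits, and an evaluation protocol spanning diverse video datasets and retrieval systems.
Evaluate a broad range of predictors, from linguistic features to deep pre-retrieval and post-retrieval models.
Demonstrate a practical application by using a QPP predictor as the reward model for LLM-based query reformulation.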

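Proposed Method

Build VQPP from the MSR-VTT and VATEX datasets, totaling 56K text queries and 51K videos, and evaluate on two CBVR systems (GRAM and VAST).
Provide four evaluation scenarios (2 datasets × 2 retrieval systems), together with reproducible retrieval results and scores.
Train and evaluate predictors in two categories: pre-retrieval (linguistic baselines, fine-tuned BERT, few-shot Llama-3.1) and post-retrieval (fine-tuned CLIP, CLIP4Clip, correlation CNN).
Measure QPP performance with the Pearson ρ and Kendall τ correlation coefficients between the predicted difficulty and the ground-truth retrieval metrics; ground-truth retrieval performance is measured with Reciprocal Rank and Recall@K.
Train Phi-4-mini-instruct to reformulate queries via Direct Preference Optimization (DPO), using the fine-tuned BERT QPP predictor as the reward model.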
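
The evaluation protocol in this section can be illustrated with a minimal sketch. It assumes each query has exactly one relevant video (an assumption made here for simplicity, not a detail confirmed by the summary), and the query IDs, video IDs, and predictor scores are made-up toy values. The Kendall coefficient below is the simple tau-a variant, which may differ from the variant used in the benchmark.

```python
import math
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1 / rank of the relevant video, or 0.0 if it was not retrieved."""
    for rank, video_id in enumerate(ranked_ids, start=1):
        if video_id == relevant_id:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """With a single relevant video, Recall@K is 1.0 if it appears in the top K."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) pairs over all pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


# Toy retrieval runs: query -> (ranked video IDs, the single relevant video ID).
runs = {
    "q1": (["v3", "v1", "v2"], "v1"),
    "q2": (["v5", "v4", "v6"], "v5"),
    "q3": (["v9", "v8", "v7"], "v7"),
}
# Hypothetical predictor outputs for q1, q2, q3 (higher = expected to be easier).
predicted = [0.6, 0.9, 0.2]

true_rr = [reciprocal_rank(ranked, rel) for ranked, rel in runs.values()]
true_r1 = [recall_at_k(ranked, rel, k=1) for ranked, rel in runs.values()]

print("Reciprocal Rank per query:", true_rr)
print("Recall@1 per query:", true_r1)
print("Pearson:", round(pearson(predicted, true_rr), 3))
print("Kendall tau:", kendall_tau(predicted, true_rr))
```

Pearson rewards predictors whose scores track the ground-truth metric linearly, while Kendall only checks whether queries are ordered correctly, so reporting both separates calibration-like behavior from pure ranking quality.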

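Experimental Results

Research Questions

RQ1: Can pre-retrieval predictors match or outperform post-retrieval predictors at predicting video query performance across CBVR systems?
RQ2: How well do QPP predictors generalize across the two video datasets and the two retrieval models?
RQ3: How does using deep learning predictors (e.g., BERT, CLIP) instead of traditional linguistic features affect QPP accuracy in CBVR?
RQ4: Can a QPP predictor effectively guide query reformulation to improve retrieval performance?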
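
For RQ4, one way a QPP predictor could act as a reward signal for DPO is to rank candidate reformulations by predicted score and keep the best and worst as a preference pair. The sketch below is an assumption about how such pairs might be built, not the authors' procedure: `build_preference_pair`, `min_margin`, the prompt wording, and `toy_predictor` are all hypothetical, and the toy scorer merely stands in for a trained predictor such as the fine-tuned BERT model. The `prompt` / `chosen` / `rejected` field names follow a common convention for preference datasets.

```python
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    query: str,
    candidates: List[str],
    predict_score: Callable[[str], float],
    min_margin: float = 0.05,
) -> Optional[Dict[str, str]]:
    """Turn scored reformulations into one (prompt, chosen, rejected) record.

    Returns None when there are fewer than two candidates or when the
    predictor cannot separate the best and worst candidate by `min_margin`.
    """
    if len(candidates) < 2:
        return None
    # Score each candidate once, then sort from highest to lowest predicted score.
    scored = sorted(((predict_score(c), c) for c in candidates), reverse=True)
    (best_score, best), (worst_score, worst) = scored[0], scored[-1]
    if best_score - worst_score < min_margin:
        return None
    return {
        "prompt": f"Rewrite this video search query: {query}",
        "chosen": best,
        "rejected": worst,
    }


def toy_predictor(query: str) -> float:
    """Hypothetical stand-in for a QPP model: more distinct words, higher score."""
    return min(len(set(query.lower().split())) / 10.0, 1.0)


pair = build_preference_pair(
    query="dog",
    candidates=[
        "a dog",
        "dog playing outside",
        "a brown dog catches a frisbee in a park",
    ],
    predict_score=toy_predictor,
)
print(pair)
```

Because a pre-retrieval predictor scores text alone, pairs like these can be labeled without running the retrieval system for every candidate, which is what makes the pre-retrieval setting attractive as a reward model.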

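Key Findings

Across scenarios, deep pre-retrieval predictors consistently outperform post-retrieval models on VQPP.
Fine-tuned BERT achieves the best performance across all evaluation scenarios and correlation measures.
QPP correlations are lower on VATEX than on MSR-VTT, suggesting that the two datasets differ in difficulty.
Pre-retrieval predictors show limited sensitivity to the retrieval system (GRAM vs. VAST).
Few-shot Llama-3.1-8B improves as the number of examples increases, but still falls short of BERT on this benchmark.
CLIP-based post-retrieval predictors underperform the simple CLIP baseline on this task.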

This review was generated by AI and reviewed by a human editor.