QUICK REVIEW

[Paper Digest] WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan, Songhe Feng|arXiv (Cornell University)|Feb 26, 2026

Domain Adaptation and Few-Shot Learning | Cited by 0

One-Sentence Summary

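WARM-CAT progressively accumulates multimodal knowledge from unlabeled test-time data to update textual and visual prototypes, addressing the label-space distribution shift in CZSL with a dynamic priority queue and adaptive update weights, and it contributes a new dataset and a refined benchmark for evaluation.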

ABSTRACT

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual prototypes from historical images for inference. Since the model tends to favor compositions already stored in the queue during testing, we warm-start the queue by initializing it with training images for visual prototypes of seen compositions and generating unseen visual prototypes using the mapping learned between seen and unseen textual prototypes. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. To provide a more reliable evaluation for CZSL, we introduce a new benchmark dataset, C-Fashion, and refine the widely used but noisy MIT-States dataset. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. The source code and datasets are available at https://github.com/xud-yan/WARM-CAT .
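
The dynamic priority queue mentioned in the abstract can be pictured with a minimal Python sketch. It assumes a fixed capacity per composition, uses prediction confidence as the priority, and averages the stored features into a visual prototype. The class name `ConfidenceQueue`, the capacity `k`, the mean pooling, and all numeric values are illustrative assumptions, not details taken from the paper.

```python
import heapq
from itertools import count


class ConfidenceQueue:
    """Keeps the k highest-confidence feature vectors for one composition."""

    def __init__(self, k=3):
        self.k = k
        self._heap = []       # min-heap of (confidence, tie_breaker, feature)
        self._tie = count()   # avoids comparing feature lists on equal confidence

    def push(self, confidence, feature):
        item = (confidence, next(self._tie), list(feature))
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif confidence > self._heap[0][0]:
            # Queue is full: evict the current lowest-confidence entry.
            heapq.heapreplace(self._heap, item)

    def prototype(self):
        """Mean of the stored features, or None if the queue is empty."""
        if not self._heap:
            return None
        feats = [f for _, _, f in self._heap]
        dim = len(feats[0])
        return [sum(f[d] for f in feats) / len(feats) for d in range(dim)]


# Warm start: seed the queue before any test image arrives (made-up values).
queue = ConfidenceQueue(k=2)
queue.push(0.50, [1.0, 0.0])   # stands in for a training-image feature
queue.push(0.90, [0.0, 1.0])   # confident test image
queue.push(0.70, [1.0, 1.0])   # evicts the 0.50 entry
proto = queue.prototype()
```

Seeding the queue before testing is what "warm start" refers to here: without it, compositions that happen to enter the queue first would have a visual prototype while the others would have none.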

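Motivation and Objectives

Address the test-time label-space distribution shift caused by unseen attribute-object compositions, which is the core motivation for this CZSL work.
Develop a test-time knowledge accumulation framework that exploits both textual and visual modalities from unlabeled data.
Introduce adaptive mechanisms that mitigate forgetting and latency while prototypes are being updated.
Provide a new fashion-domain CZSL benchmark (C-Fashion) and refine MIT-States to enable fair evaluation.
Demonstrate state-of-the-art performance on multiple CZSL benchmarks under both closed-world and open-world settings.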

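Proposed Method

Builds on a CLIP-based model, using prompt tuning to obtain textual prototypes and a visual encoder fine-tuned through adapters.
Constructs textual prototypes for seen and unseen compositions with the text encoder frozen.
Maintains a dynamic priority queue of high-confidence test images to form a visual prototype for each composition.
Introduces Knowledge Accumulation Modules (KAMs) with adaptive update weights that update textual and visual prototypes in an online setting.
Learns a seen-to-unseen mapping over textual prototypes, expressed as a cosine-similarity-based mapping matrix, and uses it to generate unseen visual prototypes.
Minimizes prediction entropy at test time and applies multimodal collaborative representation learning to align textual and visual prototypes.
Optimizes end to end, combining entropy minimization with contrastive learning between textual and visual prototypes, and defers backpropagation for efficiency.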
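
Three of the steps above (the cosine-similarity mapping, the entropy objective, and the weighted prototype update) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the softmax row normalization, the `temperature` value, and the additive form of `adaptive_update` are guesses made for the example, since this digest does not give the paper's exact formulas, and every function name is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(scores, temperature=1.0):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def map_unseen_visual(text_seen, text_unseen, visual_seen, temperature=0.1):
    """Unseen visual prototypes as similarity-weighted mixes of seen ones.

    Assumption: each row of the mapping matrix is a softmax over cosine
    similarities between one unseen and all seen textual prototypes.
    """
    dim = len(visual_seen[0])
    unseen_visual = []
    for t_u in text_unseen:
        weights = softmax([cosine(t_u, t_s) for t_s in text_seen], temperature)
        unseen_visual.append(
            [sum(w * v[d] for w, v in zip(weights, visual_seen)) for d in range(dim)]
        )
    return unseen_visual


def adaptive_update(prototype, delta, weight):
    """Shift a prototype by an offset scaled by an update weight in [0, 1].

    Assumption: an additive update; how the weight is computed is left open.
    """
    return [p + weight * d for p, d in zip(prototype, delta)]


# Toy data: two seen compositions and one unseen composition (made-up values).
text_seen = [[1.0, 0.0], [0.0, 1.0]]
visual_seen = [[2.0, 0.0], [0.0, 2.0]]
unseen = map_unseen_visual(text_seen, [[1.0, 0.0]], visual_seen)
updated = adaptive_update([1.0, 1.0], [2.0, -2.0], weight=0.5)
uniform_entropy = entropy([0.5, 0.5])
```

In the toy data the unseen textual prototype coincides with the first seen one, so the generated visual prototype lies almost entirely on the first seen visual prototype. A smaller `weight` in `adaptive_update` keeps a prototype closer to its original value, which is one way to picture how an update weight could limit forgetting.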

Figure 1: At test time, existing methods (top) fail to adapt using test images, resulting in biased prediction distributions due to label space shift. By contrast, WARM-CAT (bottom) progressively accumulates multimodal knowledge from unsupervised test data, enabling effective adaptation to addr

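Experimental Results

Research Questions

RQ1: Can unlabeled test-time data help narrow the label distribution gap in CZSL without forgetting seen compositions?
RQ2: How can textual and visual prototypes be combined effectively and updated adaptively during testing?
RQ3: What role does the queue of high-confidence visual examples play in improving CZSL under distribution shift?
RQ4: Do unseen visual prototypes generated through the text-to-visual mapping improve open-world CZSL performance?
RQ5: How do the proposed metrics and benchmarks reflect performance under long-tailed CZSL settings?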

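Key Findings

Achieves state-of-the-art performance on four CZSL benchmarks under both closed-world and open-world settings.
Handles label distribution shift effectively through unsupervised test-time knowledge accumulation.
Benefits from the warm-started priority queue and from unseen visual prototypes generated through the text-to-visual mapping.
Validates the method on the new C-Fashion dataset and the refined MIT-States* dataset, with a dedicated evaluation on long-tailed CZSL distributions.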

Figure 2: Prompt tuning of the text encoder and adapter tuning of the visual encoder during training.


This digest was generated by AI and reviewed by human editors.