QUICK REVIEW

[Paper Review] Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement

Xin Zhang, Jianyang Xu|arXiv (Cornell University)|Mar 25, 2026

Multimodal Machine Learning Applications|Cited by 0

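ONE-SENTENCE SUMMARY

TMKD uses dual-modality teachers (a multi-view visual teacher and CLIP as the text teacher) to guide adaptive fusion of RGB, edge, and high-frequency views, improving over baselines by up to 4.49% across five datasets.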

ABSTRACT

Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality. In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which leverages dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. Specifically, we enhance the visual teacher with multi-view inputs incorporating visual priors (edge and high-frequency features), while the text teacher generates semantic weights through prior-aware prompts to guide adaptive feature fusion. Additionally, we introduce vision-language contrastive regularization to strengthen semantic knowledge in the student model. Extensive experiments on five benchmarks demonstrate that TMKD consistently improves knowledge distillation performance by up to 4.49%, validating the effectiveness of our dual-teacher multi-view enhancement strategy. Code is available at https://anonymous.4open.science/r/TMKD-main-44D1.

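MOTIVATION AND GOALS

Advance knowledge distillation by improving the quality of teacher knowledge, not only the distillation strategy.
Use dual-modality teachers to provide richer supervisory signals through visual priors and semantic guidance.
Strengthen the visual teacher with multi-view inputs (RGB, edge-enhanced, and high-frequency) and adaptive fusion guided by CLIP-based prompts.
Introduce vision-language contrastive regularization to keep student representations aligned with text embeddings.
Demonstrate consistent improvements over current state-of-the-art KD methods on five benchmark datasets.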

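PROPOSED METHOD

Build multi-view inputs from a single RGB image: the RGB view, an edge-enhanced view, and a high-frequency view.
Extract features for all views with one shared visual teacher, then fuse them using semantic weights that CLIP generates from prior-aware prompts.
Distill at three levels: feature-level distillation with perturbation, logit-level distillation using KL divergence on softened outputs, and text-guided contrastive representation distillation (CRD) that uses CLIP text embeddings as semantic anchors.
Train with a joint loss: L_all = alpha * L_logit + beta * L_CRD + gamma * L_feat, where alpha, beta, and gamma weight the three terms.
Guide the CLIP-based fusion with view-specific prompts such as "a photo of a {class}", "an edge enhanced image of a {class}", and "a high-frequency enhanced image of a {class}".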
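The prompt-guided fusion step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a softmax over per-view scores, and the `temperature` parameter are assumptions made for illustration, and the per-view scores stand in for whatever image-text similarity CLIP would produce. Only the three prompt templates come from the review.

```python
import math

# Prompt templates quoted in the review; "{class}" is filled with a class name.
VIEW_PROMPTS = {
    "rgb": "a photo of a {class}",
    "edge": "an edge enhanced image of a {class}",
    "hf": "a high-frequency enhanced image of a {class}",
}


def build_prompts(class_name):
    """Fill each view's prompt template with one class name."""
    return {view: tpl.replace("{class}", class_name)
            for view, tpl in VIEW_PROMPTS.items()}


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features, view_scores, temperature=1.0):
    """Fuse per-view feature vectors with softmax-normalized weights.

    view_features: dict mapping view name -> list of floats (equal lengths).
    view_scores: dict mapping view name -> float; a hypothetical stand-in
        for the CLIP-derived semantic score of each view (assumption).
    Returns the fused vector and the weight assigned to each view.
    """
    views = sorted(view_features)
    weights = softmax([view_scores[v] for v in views], temperature)
    dim = len(view_features[views[0]])
    fused = [0.0] * dim
    for view, weight in zip(views, weights):
        for i, value in enumerate(view_features[view]):
            fused[i] += weight * value
    return fused, dict(zip(views, weights))


# Toy usage: equal scores reduce adaptive fusion to simple averaging.
features = {"rgb": [1.0, 0.0], "edge": [0.0, 1.0], "hf": [1.0, 1.0]}
uniform_scores = {"rgb": 0.0, "edge": 0.0, "hf": 0.0}
fused, weights = fuse_views(features, uniform_scores)
```

With equal scores every view receives weight 1/3, so the sketch degenerates to the simple-averaging baseline; unequal scores shift the fused feature toward the views the text teacher rates higher.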

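EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a dual-teacher KD framework improve student learning by combining visual priors with semantic guidance?
RQ2: Does adaptive fusion of multi-view features under CLIP prompt guidance yield better representations than simple averaging?
RQ3: How do the feature-level, logit-level, and text-guided contrastive losses each affect distillation performance?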

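Key Findings

TMKD improves state-of-the-art KD methods by up to 4.49% (on CUB-200, across different teacher-student pairs).
Gains are consistent across all five datasets (CIFAR-100, RAF-DB, DTD, Stanford Dogs, CUB-200).
Multi-view inputs with adaptive fusion outperform RGB-only input and simple averaging; RGB+Edge+HF achieves the best ablation gain, 2.82% on CUB-200.
All three distillation losses (L_logit, L_feat, L_CRD) contribute to the best performance; combining them gives the highest accuracy.
TMKD generalizes across architectures: paired with CATKD or TeKAP, it improves both ResNet-based students and lightweight VGG8 students.
Visualizations show more focused attention and more compact feature distributions for CATKD+TMKD than for CATKD alone.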
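To make the combination of the three loss terms concrete, here is a minimal sketch of the joint objective L_all = alpha * L_logit + beta * L_CRD + gamma * L_feat. It is an illustration under stated assumptions, not the paper's code: the function names, the default temperature of 4.0, and the T^2 scaling of the logit term (a common convention from Hinton-style distillation) are not given in the review, and the CRD and feature terms are passed in as precomputed numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable, temperature-softened softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    The temperature value and the T^2 factor are assumptions
    (a common KD convention), not details taken from the review.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


def joint_loss(l_logit, l_crd, l_feat, alpha=1.0, beta=1.0, gamma=1.0):
    """L_all = alpha * L_logit + beta * L_CRD + gamma * L_feat."""
    return alpha * l_logit + beta * l_crd + gamma * l_feat


# Toy usage with placeholder values for the CRD and feature terms.
l_logit = logit_kd_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0])
total = joint_loss(l_logit, l_crd=0.8, l_feat=0.3, alpha=1.0, beta=0.5, gamma=2.0)
```

Setting beta or gamma to zero in this sketch mirrors the kind of ablation the findings describe, where individual loss terms are removed to measure their contribution.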

This review was generated by AI and checked by human editors.