QUICK REVIEW

[Paper Review] Diffusion Probe: Generated Image Result Prediction Using CNN Probes

Benlei Cui, Bukun Huang | arXiv (Cornell University) | Feb 27, 2026
Generative Adversarial Networks and Image Synthesis | Cited by 0
One-Sentence Summary

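Diffusion Probe uses a diffusion model's early cross-attention maps to predict final image quality, enabling early quality assessment and efficient downstream optimization without full generation.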

ABSTRACT

Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and Flow-GRPO. We reveal a strong correlation between early diffusion cross-attention distributions and final image quality. Based on this finding, we introduce Diffusion Probe, a framework that leverages internal cross-attention maps as predictive signals. We design a lightweight predictor that maps statistical properties of early-stage cross-attention extracted from initial denoising steps to the final image's overall quality. This enables accurate forecasting of image quality across diverse evaluation metrics long before full synthesis is complete. We validate Diffusion Probe across a wide range of settings. On multiple T2I models, across early denoising windows, resolutions, and quality metrics, it achieves strong correlation (PCC > 0.7) and high classification performance (AUC-ROC > 0.9). Its reliability translates into practical gains. By enabling early quality-aware decisions in workflows such as prompt optimization, seed selection, and accelerated RL training, the probe supports more targeted sampling and avoids computation on low-potential generations. This reduces computational overhead while improving final output quality. Diffusion Probe is model-agnostic, efficient, and broadly applicable, offering a practical solution for improving T2I generation efficiency through early quality prediction.

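Research Motivation and Goals

  • Reveal the link between early-stage cross-attention patterns and final image quality in diffusion-based T2I models.
  • Develop a lightweight CNN-based probe that maps early-stage attention statistics to a final image quality score.
  • Demonstrate model-agnostic applicability and validate the efficiency gains in practical workflows.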

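Proposed Method

  • For a given prompt, extract cross-attention maps from the model's intermediate blocks at early denoising steps.
  • Train a lightweight probe $E_\theta$ that maps the attention maps and a timestep embedding to a scalar quality score, using a mean squared error (MSE) loss against the ground-truth metric value.
  • Use the probe as a predictor to guide downstream tasks (prompt optimization, seed selection, RL training) without full image generation.
  • Evaluate probe accuracy with SRCC, KTC, PCC, and AUC-ROC across several T2I backbones (e.g., SDXL, FLUX, Qwen-Image).
  • Apply the probe to downstream tasks by filtering prompts, selecting seeds, or providing a reward signal for Flow-GRPO training.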
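
The MSE training objective from the second bullet can be written out as a short formula. This is a sketch of the stated loss, not the paper's exact notation: only $E_\theta$ comes from the summary, while $A_t^{(i)}$ (the cross-attention maps of sample $i$ at the probed step $t$), $e_t$ (the timestep embedding), $q^{(i)}$ (the quality score of the fully generated image under the chosen metric), and $N$ (the number of training samples) are assumed symbols.

$$
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( E_\theta\!\left(A_t^{(i)}, e_t\right) - q^{(i)} \right)^2
$$

Minimizing $\mathcal{L}$ pushes the probe's early prediction toward the score the finished image would receive, which is what lets it stand in for full generation in the downstream tasks.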
Figure 1 : Illustration of early cross-attention dispersion. Here, we present the prompt, the corresponding four cross-attention activation maps in the early denoising stage, and the final generated image. Compared to other tokens, the cross-attention activation maps of the “bird” token shows signif

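Experimental Results

Research Questions

  • RQ1: Can early-stage cross-attention distributions predict final image quality across different T2I models?
  • RQ2: How early in the diffusion process can a lightweight probe reliably predict quality?
  • RQ3: Can a model-agnostic probe enable efficient prompt optimization, seed selection, and RL training without full generation?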

Key Findings

Base Model   Resolution   Step   SRCC   AUC-ROC   KTC    PCC
SDXL         1024×1024    1      0.49   0.53      0.35   0.48
SDXL         1024×1024    5      0.73   0.86      0.57   0.72
SDXL         1024×1024    10     0.76   0.89      0.61   0.75
SDXL         1024×1024    15     0.75   0.89      0.60   0.74
FLUX         1024×1024    1      0.52   0.62      0.38   0.50
FLUX         1024×1024    5      0.76   0.88      0.60   0.75
FLUX         1024×1024    10     0.79   0.91      0.64   0.78
FLUX         1024×1024    15     0.78   0.91      0.63   0.77
Qwen-Image   1024×1024    1      0.45   0.67      0.32   0.44
Qwen-Image   1024×1024    5      0.69   0.84      0.53   0.68
Qwen-Image   1024×1024    10     0.72   0.87      0.56   0.71
Qwen-Image   1024×1024    15     0.71   0.86      0.55   0.70
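
For readers unfamiliar with the four reported metrics, the following is a minimal standard-library sketch of how each can be computed from predicted and actual quality scores. It uses the textbook definitions (Kendall's tau in its tau-a form, without tie correction) and is not the paper's evaluation code. The sample scores and the 0.5 threshold used to derive binary "good image" labels are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """PCC: linear correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """KTC (tau-a): (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)


def auc_roc(scores, labels):
    """AUC-ROC: chance that a random positive outscores a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores: probe predictions vs. quality of the finished images.
predicted = [0.2, 0.5, 0.35, 0.8, 0.7]
actual = [0.1, 0.3, 0.5, 0.9, 0.6]
labels = [1 if q >= 0.5 else 0 for q in actual]  # assumed threshold

print(spearman(predicted, actual), kendall_tau(predicted, actual),
      pearson(predicted, actual), auc_roc(predicted, labels))
```

The three correlation metrics measure how well the probe's predictions track or order the final scores, while AUC-ROC measures how well they separate good images from bad ones once a quality threshold is fixed.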
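  • Diffusion Probe shows high predictive accuracy (SRCC, KTC, PCC) and strong classification performance (AUC-ROC) across diverse models and early denoising steps.
  • On FLUX, the probe reaches its peak predictive metrics at step 10 (SRCC 0.79, AUC 0.91, PCC 0.78).
  • The probe generalizes to SDXL and Qwen-Image, where at step 10 it still maintains high correlation (SRCC of roughly 0.72–0.76) and AUC (> 0.86).
  • In downstream tasks, the probe improves prompt optimization and seed selection metrics and is competitive with heavier LLM-based methods while reducing computational cost.
  • Integrating the probe into Flow-GRPO accelerates RL training by enriching batches with higher-quality samples, and improves convergence stability.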
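
The seed-selection use case can be illustrated with a small standard-library sketch. Everything named here is hypothetical: `probe_score` is a stand-in that returns a deterministic pseudo-random number where a real pipeline would run the early denoising steps and apply the trained probe, and the 50-step schedule and the 8-seed, keep-2 budget are assumed numbers chosen only to show the arithmetic.

```python
import random


def probe_score(seed: int) -> float:
    """Hypothetical stand-in for the probe.

    A real pipeline would run the first `probe_step` denoising steps for this
    seed, collect the cross-attention maps, and feed them to the trained probe.
    Here a deterministic pseudo-random number plays that role.
    """
    return random.Random(seed).random()


def select_seeds(seeds, score_fn, keep):
    """Return the `keep` seeds with the highest predicted quality."""
    return sorted(seeds, key=score_fn, reverse=True)[:keep]


def denoising_steps_used(n_seeds, keep, probe_step, total_steps):
    """Every seed runs to `probe_step`; only the kept seeds run to the end."""
    return n_seeds * probe_step + keep * (total_steps - probe_step)


seeds = list(range(8))
selected = select_seeds(seeds, probe_score, keep=2)

# Illustrative budget: probe at step 10 of an assumed 50-step schedule.
with_probe = denoising_steps_used(len(seeds), keep=2, probe_step=10, total_steps=50)
without_probe = len(seeds) * 50
print(selected, with_probe, without_probe)
```

Under these assumed numbers, probing costs 8 × 10 + 2 × 40 = 160 denoising steps, against 400 for generating all eight images in full. The actual saving depends on the sampler, the probed step, and how many candidates are kept.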
Figure 2 : Overview of the Diffusion Probe framework. Our framework takes as input the early-stage cross-attention feature maps (derived from the CrossAttn module at a probed timestep $t$ ) and the TimeStep Embedding . A lightweight network processes these inputs, ultimately outputting a quality sco


This review was generated by AI and reviewed by a human editor.