QUICK REVIEW

[Paper Digest] One-step Latent-free Image Generation with Pixel Mean Flows

Yiyang Lu, Susie Lu|arXiv (Cornell University)|Jan 29, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

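One-sentence summary

This paper proposes pixel MeanFlow (pMF), a one-step, latent-free image generator whose network outputs a denoised pixel prediction x from a noisy input. The corresponding average velocity u is obtained from x and trained by regression against the instantaneous velocity v, giving competitive FID on ImageNet without a latent space.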

ABSTRACT

Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.

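Motivation and objectives

Motivate and develop a one-step, latent-free image generation method that dispenses with latent tokens and multi-step sampling.
Propose training the network with a prediction target on a low-dimensional image manifold (x-prediction) combined with a velocity-space loss (v-loss).
Bridge the MeanFlow average velocity (u) and the denoised-image-like field (x) to enable end-to-end pixel-space generation.
Validate the feasibility and performance of pMF on high-resolution ImageNet (256×256 and 512×512) without using latents.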

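Proposed method

Define x(z_t, r, t) = z_t − t · u(z_t, r, t) as the denoised-image-like field over the (r, t) time grid.
Predict x with the neural network, recover u via u = (z_t − x)/t, and form V = u + (t − r) · JVP_sg, which is used to compute the v-loss during training.
Optimize the pMF objective L_pMF = E[ ||V_θ − v||^2 ], where v is the instantaneous velocity, so that x-prediction is aligned with velocity-space supervision.
Optionally add a perceptual loss on x_θ to improve visual fidelity, giving L = L_pMF + λ L_perc, with a threshold t_thr used to control blurriness.
Adopt the Muon optimizer for faster convergence, and ablate preconditioning of the x-prediction target over the (r, t) plane, time samplers, and high-resolution settings (256×256, 512×512, 1024×1024).
Vary model depth/width and the number of training epochs, reporting 1-NFE FID/IS for pixel-space generation to demonstrate scalability.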
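
The x-to-u transformation and the compound target V from the steps above can be illustrated with a toy 1-D sketch. This is not the authors' implementation. It assumes the usual linear interpolation z_t = (1 − t) · x_data + t · eps with v = eps − x_data (the summary does not state the path), it replaces the JVP with central finite differences to stay dependency-free, and `x_net`, `pmf_loss`, and the other function names are hypothetical.

```python
# Toy 1-D sketch of the pMF training target (pure Python, scalars only).
# Assumptions not stated in the summary: z_t = (1 - t) * x_data + t * eps,
# v = eps - x_data, and a finite-difference stand-in for the JVP.


def u_from_x(z_t, t, x_pred):
    """Average velocity implied by an x-prediction: u = (z_t - x) / t."""
    return (z_t - x_pred) / t


def x_from_u(z_t, t, u):
    """Inverse map: x = z_t - t * u."""
    return z_t - t * u


def total_derivative_u(x_net, z_t, r, t, v, h=1e-4):
    """Approximate d/dt u(z_t, r, t) along the tangent (dz, dr, dt) = (v, 0, 1).

    A real implementation would use a forward-mode JVP; central finite
    differences are used here only to keep the example dependency-free.
    """
    def u(z, tt):
        return u_from_x(z, tt, x_net(z, r, tt))

    return (u(z_t + h * v, t + h) - u(z_t - h * v, t - h)) / (2.0 * h)


def pmf_loss(x_net, x_data, eps, r, t):
    """Squared error between the compound prediction V and the velocity v."""
    z_t = (1.0 - t) * x_data + t * eps  # assumed linear interpolation path
    v = eps - x_data                    # instantaneous velocity on that path
    u = u_from_x(z_t, t, x_net(z_t, r, t))
    dudt = total_derivative_u(x_net, z_t, r, t, v)
    # In training, dudt would sit under a stop-gradient (the "sg" in JVP_sg).
    # This toy computes no gradients, so dudt is just a number here.
    V = u + (t - r) * dudt
    return (V - v) ** 2


def one_step_sample(x_net, eps):
    """1-NFE generation: evaluate the x-predictor once at (r, t) = (0, 1)."""
    return x_net(eps, 0.0, 1.0)


if __name__ == "__main__":
    # If all data sit at a single point x0, the ideal x-predictor is constant.
    ideal = lambda z, r, t: 2.0
    print(pmf_loss(ideal, x_data=2.0, eps=-0.3, r=0.2, t=0.7))
    print(one_step_sample(ideal, eps=1.5))
```

In the single-point case used at the bottom, the constant predictor gives u = v and a vanishing total derivative, so the loss is zero up to floating-point error. A predictor that simply returns its input (x = z_t) yields u = 0 and a clearly nonzero loss.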

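Experimental results

Research questions

RQ1: Can one-step, latent-free image generation be achieved by predicting the denoised-image-like field x in pixel space, rather than the velocity field u or a direct x_hat?
RQ2: Are the x-prediction, and the u and v derived from it through physics-like relations, learnable, and is training stable in high-dimensional pixel space?
RQ3: On ImageNet at 256×256 and 512×512, how does pMF compare with prior one-step and latent-based methods in FID and speed?
RQ4: How do the perceptual loss, the optimizer, and the time-sampling strategy affect pMF's quality and convergence?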

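Key findings

pMF achieves 2.22 FID on ImageNet 256×256 and 2.48 FID on 512×512, both in pixel space with 1-NFE.
In high-dimensional pixel-space generation, predicting x (the denoised-image-like field) is critical, whereas predicting u degrades performance as dimensionality grows.
Adding a perceptual loss (LPIPS) substantially lowers FID at 256×256: from 9.56 to 5.62 with VGG, and further to 3.53 with ConvNeXt-V2, showing a strong benefit from perceptual supervision.
Compared with Adam, the Muon optimizer converges faster and improves FID in this one-step setting.
High-resolution experiments (256/512/1024) indicate that pMF maintains competitive FID while staying at 1-NFE, suggesting good scalability.
Comparison tables show that pMF's latent-free pixel-space generation is competitive in FID with several latent-space diffusion/flow baselines, with more favorable compute characteristics.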

This digest was generated by AI and reviewed by human editors.