QUICK REVIEW

[Paper Review] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Yuqing Wang, Chuofan Ma | arXiv (Cornell University) | Mar 19, 2026

Multimodal Machine Learning Applications | Cited by 0

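One-Sentence Summary

CubiD is a high-dimensional discrete generation model that masks and predicts tokens over a 3D h×w×d tensor, which makes efficient generation directly on high-dimensional representation tokens possible and achieves state-of-the-art discrete generation on ImageNet 256×256.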

ABSTRACT

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
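
To give a sense of scale for the claim $T \ll hwd$, the count below uses illustrative values that are assumptions rather than figures from the abstract: a $16 \times 16$ token grid, $d = 768$, and $T = 256$.

```latex
\[
hwd = 16 \times 16 \times 768 = 196{,}608,
\qquad
\frac{hwd}{T} = \frac{196{,}608}{256} = 768 .
\]
```

Under these assumed values, revealing one tensor entry per step would need 768 times as many iterations as the fixed budget $T$.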

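Motivation and Objectives

Show that high-dimensional representation tokens can be discretized for understanding tasks without sacrificing semantic quality.
Propose CubiD, a fine-grained cubic masked diffusion model that efficiently generates high-dimensional discrete tokens.
Achieve strong scaling and state-of-the-art results on ImageNet-256 with high-dimensional tokens across multiple encoders.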

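Proposed Method

Discretize the high-dimensional features of a frozen encoder with dimension-wise quantization, yielding h×w×d discrete tokens.
Train with fine-grained, element-wise masking on the 3D tensor: tokens are masked at random across both the spatial and dimension axes and predicted with a cross-entropy loss.
Use a Transformer with bidirectional attention over the h×w spatial positions; at each spatial position, all d dimension tokens are predicted in parallel.
At inference, generate by iterative unmasking over a few hundred steps with a cosine schedule, so the number of iterations is O(T), independent of d.
Evaluate with FID, IS, and multimodal understanding metrics to verify both generation quality and the preservation of representation semantics.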
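
The quantization and masking steps can be sketched on a toy tensor. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform scalar quantizer, the `levels`, `lo`, and `hi` parameters, the `MASK` sentinel, and the function names `quantize_dimwise` and `cubic_mask` are all hypothetical, and random numbers stand in for frozen-encoder features.

```python
import random

MASK = -1  # hypothetical sentinel marking a masked entry

def quantize_dimwise(features, levels=8, lo=-1.0, hi=1.0):
    """Map every scalar of an h x w x d feature grid to an integer bin in [0, levels - 1].

    Assumption: a uniform scalar quantizer over [lo, hi]; the paper's quantizer may differ.
    """
    step = (hi - lo) / levels

    def to_bin(x):
        x = min(max(x, lo), hi)                       # clip to the assumed range
        return min(int((x - lo) / step), levels - 1)  # x == hi falls in the top bin

    return [[[to_bin(x) for x in pos] for pos in row] for row in features]

def cubic_mask(tokens, mask_ratio, rng):
    """Mask individual (row, col, dim) entries drawn uniformly from the whole cube."""
    h, w, d = len(tokens), len(tokens[0]), len(tokens[0][0])
    coords = [(i, j, k) for i in range(h) for j in range(w) for k in range(d)]
    masked = set(rng.sample(coords, round(mask_ratio * len(coords))))
    corrupted = [[[MASK if (i, j, k) in masked else tokens[i][j][k]
                   for k in range(d)] for j in range(w)] for i in range(h)]
    return corrupted, masked

# Toy usage: a 2 x 2 grid with d = 4 stands in for encoder features.
rng = random.Random(0)
h, w, d = 2, 2, 4
features = [[[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(w)]
            for _ in range(h)]
tokens = quantize_dimwise(features)
corrupted, masked = cubic_mask(tokens, mask_ratio=0.5, rng=rng)

# A cross-entropy loss would be computed only at the coordinates in `masked`.
targets = {c: tokens[c[0]][c[1]][c[2]] for c in masked}
```

Because coordinates are sampled over the full cube, a single spatial position can end up with some dimensions visible and others masked, which is the setting the element-wise masking description refers to.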

Figure 1 : Comparison of discrete visual generation approaches. (a) Low-dimensional token generation: Both methods operate at the spatial level—autoregressive requires $h\times w$ sequential steps, while discrete diffusion achieves parallel generation in $T<h\times w$ iterations. (b) High-dimensiona

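Experimental Results

Research Questions

RQ1: Can high-dimensional representation tokens (768–1024 dims) be discretized without significant loss of semantic quality on understanding tasks?
RQ2: Can h×w×d discrete tokens be modeled and generated efficiently within a diffusion framework that masks at the dimension level?
RQ3: Does CubiD scale effectively with model size and across different high-dimensional encoders for ImageNet 256×256 generation?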

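Key Findings

Dimension-wise quantization preserves semantic quality on par with the continuous features on reconstruction and multimodal understanding tasks.
Fine-grained, element-wise masking over the 3D tensor is critical; masking per dimension or per spatial position significantly degrades quality.
CubiD maintains strong generation quality with a few hundred iterations (roughly 256–512) regardless of token dimensionality, and scales well from 946M to 3.7B parameters.
CubiD achieves state-of-the-art discrete generation on ImageNet 256×256, with gFID as low as 1.88 on high-dimensional tokens (XXL model).
The method generalizes across representation encoders (DINOv2-B and SigLIP2-B), and the tokens remain usable for both understanding and generation tasks.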
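
The point that the iteration count does not depend on token dimensionality can be illustrated with a small schedule calculation. This is a sketch under assumptions: the summary only says a cosine schedule is used, so the exact formula below (a MaskGIT-style `cos(pi/2 * t / T)` decay), the 16×16 grid, and the function name `cosine_unmask_schedule` are hypothetical. It computes only how many entries are revealed per step; how the model chooses which entries to reveal is not covered.

```python
import math

def cosine_unmask_schedule(total_tokens, num_steps):
    """Return the number of entries newly revealed at each of num_steps iterations.

    Assumed schedule: after step t (1-indexed) the number of entries still masked
    is floor(total_tokens * cos(pi/2 * t / num_steps)), and zero after the last step.
    """
    revealed_per_step = []
    masked = total_tokens
    for t in range(1, num_steps + 1):
        if t == num_steps:
            remaining = 0  # force full reveal; avoids float residue in cos(pi/2)
        else:
            remaining = math.floor(
                total_tokens * math.cos(math.pi / 2 * t / num_steps))
        remaining = min(remaining, masked)  # guard: never re-mask entries
        revealed_per_step.append(masked - remaining)
        masked = remaining
    return revealed_per_step

# Same step budget T for a low-dimensional and a high-dimensional cube
# (16 x 16 grid assumed for illustration).
T = 256
counts_low_d = cosine_unmask_schedule(16 * 16 * 8, T)
counts_high_d = cosine_unmask_schedule(16 * 16 * 768, T)
```

Both schedules have exactly `T` entries; a larger `d` only raises the number of entries revealed per step, not the number of steps.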

Figure 2 : Generated samples from CubiD. Class-conditional generation results on ImageNet 256×256 using high-dimensional representation tokens from DINOv2-B encoder, demonstrating fine details and textures across diverse categories.


This review was generated by AI and checked by human editors.