[Paper Review] Rethinking Vector Field Learning for Generative Segmentation
The paper analyzes why vanilla flow matching hurts diffusion-based segmentation and introduces FlowSeg, which reshapes the vector field with a distance-aware correction and pixel-level end-to-end decoding to achieve stronger segmentation performance, narrowing the gap with discriminative models.
Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
Research Motivation and Objectives
- Identify optimization mismatches between continuous diffusion flow and discrete segmentation tasks.
- Diagnose issues of gradient vanishing and trajectory traversing in flow matching for segmentation.
- Propose a vector field reshaping strategy with a distance-aware correction to improve convergence and class separation.
- Introduce a quasi-random centroid encoding and pixel neural field decoding for end-to-end training.
- Demonstrate performance gains on high-cardinality segmentation benchmarks.
Proposed Method
- Analyze gradient dynamics of standard flow matching and pinpoint gradient vanishing and lack of repulsion between classes.
- Introduce a distance-aware potential field Phi that yields a discriminative correction to the velocity, integrating it into a reshaped target velocity tilde{v}_t via stop-gradient.
- Develop a Kronecker-sequence inspired, quasi-random centroid encoding to place N categories deterministically in [-1,1]^3 with good inter-class spacing.
- Employ end-to-end pixel neural field decoding to map patch features to pixelwise velocity fields without relying on VAEs, enabling pixel-level segmentation alignment.
- Train with the L_res loss, which regresses onto sg[tilde{v}_t] (the reshaped target under stop-gradient) to preserve stability while injecting discriminative guidance.
- Use AdamW and REPA in the optimization setup for the experiments.
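The reshaping step in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's formulation: the linear noise-to-centroid path, the specific correction (a roughly constant-magnitude pull toward the target centroid plus an inverse-distance push away from the other centroids), the weights `alpha` and `beta`, and all function names are hypothetical. Only the overall structure comes from the summary: a flow matching target plus a distance-aware correction, used as a constant regression target.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(a, s):
    return [x * s for x in a]

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def correction(x_t, centroids, target, alpha=1.0, beta=0.1, eps=1e-6):
    """Hypothetical distance-aware correction evaluated at x_t."""
    # Attraction: points at the target centroid with magnitude close to
    # alpha, so it does not shrink as x_t approaches the centroid.
    to_target = sub(centroids[target], x_t)
    corr = scale(to_target, alpha / (norm(to_target) + eps))
    # Repulsion: pushes away from every non-target centroid with
    # magnitude roughly beta / distance.
    for k, c in enumerate(centroids):
        if k == target:
            continue
        away = sub(x_t, c)
        corr = add(corr, scale(away, beta / (norm(away) ** 2 + eps)))
    return corr

def reshaped_target(x0, centroids, target, t):
    """Return (x_t, v, v_tilde) on an assumed linear path from x0 to the centroid."""
    x1 = centroids[target]
    x_t = add(scale(x0, 1.0 - t), scale(x1, t))
    v = sub(x1, x0)  # vanilla flow matching target for a linear path
    # v_tilde is a plain list of floats, so it acts as a constant target.
    # In an autograd framework this is where a stop-gradient would go.
    v_tilde = add(v, correction(x_t, centroids, target))
    return x_t, v, v_tilde

def regression_loss(v_pred, v_tilde):
    """Mean squared error between a predicted velocity and the fixed target."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_tilde)) / len(v_pred)

# Two illustrative centroids on the x-axis; class 0 is the target.
centroids = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
x_t, v, v_tilde = reshaped_target([0.0, 0.0, 0.0], centroids, target=0, t=0.5)
print(v, v_tilde)
```

In this toy setup the reshaped target points in the same direction as the vanilla one but with a larger magnitude, because both the attraction and the repulsion from the opposite centroid push toward class 0. Because the correction is computed outside the model, the regression itself stays an ordinary squared-error objective.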
Experimental Results
Research Questions
- RQ1: How does the standard flow matching objective affect optimization dynamics in generative segmentation?
- RQ2: Can a distance-aware correction term introduce repulsive forces to improve class separation and mitigate gradient vanishing?
- RQ3: Does a pixel-level end-to-end decoding pipeline improve alignment to pixel-wise segmentation targets compared to latent-space methods?
- RQ4: Is a quasi-random centroid encoding sufficient to stabilize high-cardinality segmentation in a diffusion framework?
Key Findings
| Dataset | Method | Backbone | Pretrain Data | mIoU (%) |
|---|---|---|---|---|
| ADE20K | DeeplabV3+ | ResNet101 | IN-1k | 44.1 |
| ADE20K | SegFormer | MiT-B2 | IN-1k | 46.5 |
| ADE20K | MaskFormer | Swin-T | IN-1k | 46.7 |
| ADE20K | InstructDiffusion | (SD1.5) | LSTI | 33.6 |
| ADE20K | PixWizard | (Lumina-Next-T2I) | LSTI | 32.8 |
| ADE20K | FlowSeg (Ours) | PixNerd | IN-1k | 47.1 |
| COCO-Stuff | DeeplabV3+ | ResNet50 | IN-1k | 38.4 |
| COCO-Stuff | OCRNet | HRNet-W48 | IN-1k | 42.3 |
| COCO-Stuff | SegFormer | MiT-B2 | IN-1k | 44.6 |
| COCO-Stuff | SymmFlow | (SD2.1) | LSTI | 39.6 |
| COCO-Stuff | FlowSeg (Ours) | PixNerd | IN-1k | 44.9 |
- Vanilla flow matching suffers gradient vanishing near semantic centroids and lacks repulsion from non-target centroids, hindering convergence and discrimination.
- A vector field reshaping with a distance-aware correction term improves gradient magnitudes around centroids and introduces attractive/repulsive forces, accelerating convergence and improving separation.
- A Kronecker-sequence inspired quasi-random centroid encoding yields balanced, deterministic centroid placement in [-1,1]^3.
- Pixel neural field decoding enables end-to-end training at pixel level without VAEs, preserving fine-grained spatial information.
- FlowSeg achieves an mIoU of 47.1 on ADE20K and 44.9 on COCO-Stuff, outperforming several discriminative baselines and the diffusion-based methods while using only ImageNet-1k pretraining.
- FlowSeg demonstrates faster convergence and robustness across sampling steps, with deterministic predictions unlike stochastic baselines.
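The deterministic centroid placement mentioned above can be illustrated with a small Kronecker (additive recurrence) sequence. This is a sketch under assumptions: the summary does not give the paper's constants, so the step vector below uses the commonly cited three-dimensional generalization of the golden ratio (the positive root of g^4 = g + 1), and the function names `r3_alpha`, `kronecker_centroids`, and `nearest_centroid` are hypothetical. The nearest-centroid lookup is likewise only one plausible way to turn a predicted 3-D code back into a class index.

```python
def r3_alpha(iterations=64):
    """Step vector (1/g, 1/g^2, 1/g^3), where g is the positive root of g**4 = g + 1."""
    g = 1.0
    for _ in range(iterations):
        g = (1.0 + g) ** 0.25  # fixed-point iteration, converges to about 1.2207
    return [g ** -(i + 1) for i in range(3)]

def kronecker_centroids(n, alpha=None):
    """Place n centroids in [-1, 1)^3 as the fractional parts of k * alpha."""
    if alpha is None:
        alpha = r3_alpha()
    points = []
    for k in range(1, n + 1):
        # frac(k * a) lies in [0, 1); rescale it to [-1, 1).
        points.append([2.0 * ((k * a) % 1.0) - 1.0 for a in alpha])
    return points

def nearest_centroid(x, centroids):
    """Index of the centroid closest to x in squared Euclidean distance."""
    def sq_dist(k):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[k]))
    return min(range(len(centroids)), key=sq_dist)

# 150 categories is used here purely as an illustrative count.
centroids = kronecker_centroids(150)
print(centroids[:3])
print(nearest_centroid(centroids[7], centroids))
```

Because each centroid depends only on its index and the fixed step vector, the encoding needs no learned embedding table and is identical across runs, which matches the deterministic placement the review describes.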
This review was generated by AI and checked by a human editor.