QUICK REVIEW

[Paper Review] CORAL: Correspondence Alignment for Improved Virtual Try-On

Jiyoung Kim, Youngjin Shin|arXiv (Cornell University)|Feb 19, 2026

3D Shape Modeling and Analysis | Citations: 0

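One-Line Summary

CORAL improves virtual try-on by explicitly aligning person-garment query-key correspondence inside Diffusion Transformers, using a correspondence distillation loss and an entropy minimization loss to sharpen attention and preserve fine details.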

ABSTRACT

Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.

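Motivation and Objectives

Motivate the need for accurate person-garment alignment in VTON in order to preserve local texture and shape across pose and garment variations.
Reveal the importance of query-key correspondence for VTON quality by analyzing full 3D attention in Diffusion Transformers.
Propose CORAL, which explicitly guides correspondence through distillation from a robust external matcher and through attention sharpening.
Demonstrate state-of-the-art performance on standard VTON benchmarks, and introduce VLM-based and human evaluation to assess overall quality.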

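Proposed Method

Use a Diffusion Transformer (DiT) backbone with a two-panel garment-person diptych latent representation that enables token-level interaction.
Introduce the CORAL losses: (i) a correspondence distillation loss that aligns the DiT's person→garment attention with pseudo ground-truth matches from DINOv3, and (ii) an entropy minimization loss that sharpens the attention distribution.
Compute a soft-argmax over the attention to obtain differentiable correspondences, then apply a mean L2 loss against the pseudo ground-truth correspondences.
Train jointly with the velocity loss from latent diffusion and the CORAL losses to improve both global structure and local detail transfer.
Evaluate with standard VTON metrics (SSIM, LPIPS, FID, KID), a new VLM-based protocol, and human evaluation to capture perceptual fidelity and attribute consistency.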
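The two CORAL losses described above can be sketched in a few lines. This is a minimal, framework-free illustration and not the authors' implementation. It assumes that the person→garment attention has already been restricted to garment keys and renormalized per query, that "mean L2" means the mean Euclidean distance, and that garment tokens carry 2D grid coordinates. The function names and the weights `lambda_distill` and `lambda_entropy` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_argmax(attn_row, garment_coords):
    """Attention-weighted mean of garment token coordinates.

    In an autodiff framework this expectation is differentiable,
    unlike a hard argmax over the attention row.
    """
    x = sum(a * c[0] for a, c in zip(attn_row, garment_coords))
    y = sum(a * c[1] for a, c in zip(attn_row, garment_coords))
    return (x, y)

def distillation_loss(attn, garment_coords, pseudo_gt):
    """Mean Euclidean distance between soft-argmax matches and pseudo ground truth.

    pseudo_gt maps a person-token index to the (x, y) garment coordinate
    proposed by an external matcher (DINOv3 in the paper); only person
    tokens that have a match contribute to the loss.
    """
    dists = []
    for i, target in pseudo_gt.items():
        px, py = soft_argmax(attn[i], garment_coords)
        dists.append(math.hypot(px - target[0], py - target[1]))
    return sum(dists) / len(dists)

def entropy_loss(attn, eps=1e-12):
    """Mean Shannon entropy of each person token's attention over garment tokens."""
    ents = [-sum(a * math.log(a + eps) for a in row) for row in attn]
    return sum(ents) / len(ents)

def coral_objective(velocity_loss, attn, garment_coords, pseudo_gt,
                    lambda_distill=1.0, lambda_entropy=0.1):
    """Velocity loss plus weighted CORAL terms (the weights are assumptions)."""
    return (velocity_loss
            + lambda_distill * distillation_loss(attn, garment_coords, pseudo_gt)
            + lambda_entropy * entropy_loss(attn))

# Toy example: a 2x2 garment grid and one person token whose match is (1, 1).
garment_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
pseudo_gt = {0: (1.0, 1.0)}
diffuse = [softmax([0.0, 0.0, 0.0, 0.0])]   # attention spread over all tokens
sharp = [softmax([0.0, 0.0, 0.0, 20.0])]    # attention peaked on the match

total_diffuse = coral_objective(0.5, diffuse, garment_coords, pseudo_gt)
total_sharp = coral_objective(0.5, sharp, garment_coords, pseudo_gt)
print(total_diffuse > total_sharp)
```

With the same velocity loss, the peaked and correctly placed attention gives a lower total than the diffuse one, which is the behavior both terms are meant to reward.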

Figure 2 : Correlation between Query-Key Matching and VTON Performance. Pink marker denotes the query points. (a) presents qualitative correlation between VTON performance and person $\to$ garment attention. All outputs in (a) are generated by the baseline. Human-preferred outputs show accurately lo

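Experimental Results

Research Questions

RQ1: How does full 3D attention in Diffusion Transformers encode person-garment correspondence in VTON?
RQ2: Can robust external correspondences (e.g., DINOv3) be distilled into DiT attention to improve VTON accuracy?
RQ3: Do the correspondence-focused losses (distillation and entropy) consistently improve both global garment transfer and local detail in VTON outputs?
RQ4: Are VLM-based and human evaluation better than conventional metrics for assessing VTON quality, especially in unpaired settings?
RQ5: How does CORAL perform on standard benchmarks and in in-the-wild/unpaired scenarios?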

Key Findings

Method	Paired SSIM	Paired LPIPS	Paired FID	Paired KID	Unpaired FID	Unpaired KID
GPVTON (Xie et al., 2023)	0.878	0.067	8.938	4.257	11.993	4.570
StableVITON (Kim et al., 2023)	0.888	0.073	8.233	0.490	9.026	3.029
OOTDiffusion (Kim et al., 2023)	0.842	0.087	6.619	0.845	9.938	1.302
IDM-VTON (Choi et al., 2024)	0.866	0.062	6.009	0.838	9.198	1.203
CatVTON (Chong et al., 2025)	0.874	0.058	5.458	0.439	9.076	1.184
Any2AnyTryOn (Guo et al., 2025)	0.838	0.087	5.482	0.384	9.623	1.601
CORAL w/o CORAL (our baseline)	0.889	0.055	5.543	0.870	9.641	1.423
CORAL (ours)	0.907	0.048	4.962	0.565	8.763	0.880

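CORAL achieves state-of-the-art results on VITON-HD and DressCode in both paired and unpaired settings across SSIM, LPIPS, FID, and KID; in the table above, paired KID (0.565) is the one column where it does not rank first.
Adding the CORAL losses consistently improves every metric over a strong baseline, and the full combination of both losses yields the best performance.
VLM-based and human evaluation show that CORAL is preferred for garment transfer consistency, attribute consistency, and pose-consistent realism.
Ablations show that combining correspondence distillation with entropy minimization yields sharper, better-localized attention and better detail transfer.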
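To make the size of the improvement over the baseline concrete, the relative gains can be computed directly from the two bottom rows of the results table above. The sketch below only re-expresses those reported numbers; the helper name `relative_gain` is hypothetical, and the direction of each metric (higher is better for SSIM, lower is better for LPIPS, FID, and KID) follows standard usage.

```python
# Values copied from the results table: (baseline, CORAL, higher_is_better).
results = {
    "Paired SSIM": (0.889, 0.907, True),
    "Paired LPIPS": (0.055, 0.048, False),
    "Paired FID": (5.543, 4.962, False),
    "Paired KID": (0.870, 0.565, False),
    "Unpaired FID": (9.641, 8.763, False),
    "Unpaired KID": (1.423, 0.880, False),
}

def relative_gain(baseline, ours, higher_is_better):
    """Percent improvement over the baseline; positive always means better."""
    delta = (ours - baseline) if higher_is_better else (baseline - ours)
    return 100.0 * delta / baseline

gains = {name: relative_gain(*vals) for name, vals in results.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

Every gain is positive, which matches the claim that the CORAL losses improve all six reported metrics over the baseline.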

Figure 3 : Overall Architecture. CORAL builds upon a baseline architecture that constructs the noisy latent $\mathbf{z}_{t}$ by horizontally concatenating the noisy garment latents $\mathbf{z}_{\text{g},t}$ and person latents $\mathbf{z}_{\text{p},t}$ , and then channel-wise concatenates the conditi
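The horizontal concatenation of garment and person latents mentioned in the Figure 3 caption implies a simple token layout, sketched below. This is an illustration under stated assumptions, not the paper's code: the garment panel is assumed to sit on the left, each latent position is assumed to map to one token, and tokens are assumed to be flattened in row-major order. Both function names are hypothetical.

```python
def diptych_token_indices(height, width):
    """Split flattened token indices of a height x (2 * width) diptych grid.

    Assumes the garment panel occupies the left half, the person panel
    the right half, and row-major flattening of the grid.
    """
    garment, person = [], []
    for r in range(height):
        for c in range(2 * width):
            idx = r * 2 * width + c
            (garment if c < width else person).append(idx)
    return garment, person

def person_to_garment_block(attn, garment_idx, person_idx):
    """Slice person-query rows and garment-key columns from a full attention map.

    Each row is renormalized so that it sums to 1 over garment keys only.
    """
    block = []
    for q in person_idx:
        row = [attn[q][k] for k in garment_idx]
        total = sum(row)
        block.append([v / total for v in row])
    return block

# Toy example: 2 x 2 panels give 8 tokens in total.
garment_idx, person_idx = diptych_token_indices(2, 2)
uniform_attn = [[1.0 / 8] * 8 for _ in range(8)]
block = person_to_garment_block(uniform_attn, garment_idx, person_idx)
print(garment_idx, person_idx)
```

The resulting block has one row per person token and one column per garment token, which is the person→garment attention that the analysis and the CORAL losses operate on.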


This review was generated by AI and checked by a human editor.