QUICK REVIEW

[Paper Review] SEGAR: Selective Enhancement for Generative Augmented Reality

Fanjun Bu, Chenyang Yuan|arXiv (Cornell University)|Mar 25, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

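One-Line Summary

TL;DR: SEGAR introduces a two-stage framework that first generates augmented future frames with region-specific edits, then selectively corrects safety-critical regions to match real observations while preserving the edits. It is demonstrated in driving scenarios.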

ABSTRACT

Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while preserving others, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.

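Motivation and Objectives

Position generative world models as practical AR infrastructure by enabling pre-generated, temporally coherent augmented futures.
Combine a diffusion-based world model with a selective correction mechanism that grounds outputs in real-world observations within critical regions.
Demonstrate that selective correction can improve safety-critical fidelity in dynamic driving scenes while preserving the intended augmentations.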

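Proposed Method

Builds on Vista as the Stage I generative stylizer, which generates future frames with region-specific edits.
Trains Stage I end to end with VACE-based inpainting, using three conditioning frames and twelve-frame targets.
Introduces Stage II, which uses a spatially masked latent reconstruction loss to align safety-critical regions with real-world observations while preserving the augmentations.
Stage II conditioning separates VAE latent grounding (real observations) from CLIP semantic context (augmented frames) to guide the correction.
A buffer zone excludes the transitions between regions from the reconstruction loss, and a mask downsampling scheme is used for the region-specific losses.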
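The masked loss, buffer zone, and mask downsampling listed above can be sketched roughly as follows. This is a minimal plain-Python illustration under stated assumptions, not the authors' implementation: the function names (`downsample_mask`, `erode`, `region_masks`, `masked_latent_loss`) are hypothetical, and the pooling rule (a latent cell is critical if any pixel in its block is critical), the one-cell buffer width, the use of mean squared error, and the single-channel latent grid are all choices made here for illustration, since the review does not specify them.

```python
def downsample_mask(mask, factor):
    """Pool a binary pixel mask down to latent resolution.

    Assumption: a latent cell is critical if ANY pixel in its
    factor x factor block is critical.
    """
    rows, cols = len(mask) // factor, len(mask[0]) // factor
    return [
        [
            int(any(mask[r * factor + dr][c * factor + dc]
                    for dr in range(factor) for dc in range(factor)))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def erode(mask, width=1):
    """Keep a cell only if every in-grid cell within `width` steps is set."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = int(all(
                mask[rr][cc]
                for rr in range(max(0, r - width), min(rows, r + width + 1))
                for cc in range(max(0, c - width), min(cols, c + width + 1))
            ))
    return out


def region_masks(critical_pixels, factor, buffer_width=1):
    """Return (critical_core, augmented_core) masks at latent resolution.

    Eroding both regions leaves a band along their boundary that belongs
    to neither mask, so it receives no reconstruction loss.
    """
    critical = downsample_mask(critical_pixels, factor)
    augmented = [[1 - v for v in row] for row in critical]
    return erode(critical, buffer_width), erode(augmented, buffer_width)


def masked_mse(pred, target, mask):
    """Mean squared error over the cells where mask == 1."""
    total = sum(
        (p - t) ** 2
        for prow, trow, mrow in zip(pred, target, mask)
        for p, t, m in zip(prow, trow, mrow) if m
    )
    count = sum(map(sum, mask))
    return total / count if count else 0.0


def masked_latent_loss(pred, real, aug, critical_core, augmented_core):
    """Pull critical cells toward the real latent, the rest toward the augmented one."""
    return (masked_mse(pred, real, critical_core)
            + masked_mse(pred, aug, augmented_core))
```

With an 8x8 pixel mask whose left half is critical and `factor=2`, the latent grid is 4x4: only the leftmost column is supervised against the real latent, only the rightmost against the augmented latent, and the two middle columns form the unsupervised buffer.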

Figure 1 : SEGAR system pipeline overview. In Stage I, we train a Vista-based generative stylizer to take three condition frames ( $t\in[1,3]$ ) and output future frames with desired augmented edits ( $t\in[4,12]$ ). In Stage II, the generative stylizer finetuned with LoRA takes the augmented future

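Experimental Results

Research Questions

RQ1: What kind of generative diffusion model can produce temporally coherent futures with region-specific edits for AR?
RQ2: Can a lightweight selective correction stage improve safety-critical fidelity to real-world observations without breaking the planned augmentations?
RQ3: How does Stage II correction affect the balance between alignment of safety-critical regions and preservation of the augmented style in driving scenarios?
RQ4: How effectively can a region-based loss using offline masks enforce per-frame grounding in reality?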

Key Findings

Comparison	SSIM	LPIPS
Real vs Corr.	0.9431	0.2854
Real vs Aug.	0.7698	0.3968
Aug. vs Corr.	0.8662	0.1295

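Stage II substantially improves alignment in safety-critical regions compared with Stage I (SSIM rises from 0.770 to 0.943; LPIPS falls from 0.397 to 0.285).
Augmented regions retain the intended edits relative to the Stage I augmentation (SSIM 0.866, LPIPS 0.130).
After Stage II, drift between real observations and the augmentation is reduced in critical regions, while non-critical edits remain visually consistent.
Qualitative results show safety-critical elements such as pedestrians, vehicles, and traffic signs being corrected to match real observations.
The approach points toward generating, caching, and selectively correcting future AR frames in real-time settings such as driving.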
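As a quick check on the first finding, the change from Stage I to Stage II can be worked out directly from the reported scores. The snippet below is only arithmetic on the numbers in this review; the names `scores` and `stage2_change` are hypothetical. Recall that higher SSIM means closer structural agreement, while lower LPIPS means smaller perceptual distance.

```python
# Scores copied from the results table in this review.
scores = {
    "real_vs_aug": {"ssim": 0.7698, "lpips": 0.3968},   # Stage I output vs. real frames
    "real_vs_corr": {"ssim": 0.9431, "lpips": 0.2854},  # Stage II output vs. real frames
}


def stage2_change(metric):
    """Return (absolute change, relative change) from Stage I to Stage II."""
    before = scores["real_vs_aug"][metric]
    after = scores["real_vs_corr"][metric]
    return after - before, (after - before) / before


for metric in ("ssim", "lpips"):
    delta, rel = stage2_change(metric)
    print(f"{metric}: {delta:+.4f} ({rel:+.1%})")
```

This prints `ssim: +0.1733 (+22.5%)` and `lpips: -0.1114 (-28.1%)`, so both metrics move in the direction of closer agreement with the real frames.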

Figure 2 : Given an input image sequence, we compute inpainting regions using semantic segmentation. The resulting masks guide VACE’s inpainting process to augment static scene elements into a Tokyo-style appearance.

This review was generated by AI and checked by a human editor.