QUICK REVIEW

[Paper Review] FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

Zhifei Yang, Guangyao Zhai|arXiv (Cornell University)|Mar 20, 2026

Multimodal Machine Learning Applications | Citations: 0

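One-line summary

FlowScene generates high-fidelity 3D indoor scenes by jointly generating layout, shape, and texture from a multimodal graph. It implements a Multimodal Graph Rectified Flow that tightly couples per-object control with scene-level style consistency.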

ABSTRACT

Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tight-coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.

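Motivation and objectives

Achieve precise control over the geometry and appearance of generated indoor scenes for applications such as design, VR/AR, robotics, and automation.
Propose a multimodal-graph-based framework that integrates text and visual inputs describing objects and their relations.
Develop a three-branch generator (layout, shape, texture) that ensures per-object fidelity and scene-level style consistency at the same time.
Introduce Multimodal Graph Rectified Flow, which lets nodes exchange information during denoising.
Demonstrate better realism and style consistency than language-conditioned and graph-conditioned baselines on the 3D-FRONT/SG-FRONT datasets.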

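Proposed method

Define a multimodal scene graph in which each node aggregates a text feature u_i, a visual feature f_i, and any optional modalities.
Use a triplet-GCN-based InfoExchangeUnit that propagates graph-conditioned denoising information between nodes during generation.
Adopt a three-branch pipeline (Layout, Shape, Texture) in which each branch uses a rectified-flow-based denoiser conditioned on graph-derived constraints.
The Layout branch models 3D bounding boxes for the scene layout and applies temporal/global constraints through a LayoutExchangeUnit.
The Shape branch voxelizes objects, obtains latent codes with a shape VQ-VAE, and uses a ShapeExchangeUnit to keep shapes consistent across objects.
The Texture branch anchors texture latent codes to geometry, extracts multi-view features, and uses a TextureExchangeUnit to keep textures consistent across objects.
Train all branches with a shared rectified-flow objective that minimizes the difference between predicted and target velocities, which enables few-step sampling.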
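
The shared objective in the last item can be illustrated with a minimal sketch. It assumes the standard rectified-flow formulation (a straight-line path x_t = (1 - t) x_0 + t x_1 from noise x_0 to data x_1, with constant target velocity x_1 - x_0); the review does not state the exact convention FlowScene uses. The names `interpolate`, `target_velocity`, `rectified_flow_loss`, and `predict_velocity` are hypothetical and stand in for a branch denoiser, not the authors' code.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """On a straight path the velocity is constant and equals x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def rectified_flow_loss(predict_velocity, x1, cond, rng=random):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for a
    graph-conditioned denoiser (assumption, not the paper's API).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t, cond)
    v_tgt = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_tgt)) / len(x1)
```

In this sketch `cond` is an opaque placeholder for whatever graph-derived conditioning a branch receives; a real model would average this loss over batches and optimize the denoiser's parameters.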

Figure 1. Scene Generation from Diverse Input. The prospective system, powered by FlowScene, supports the generation of style-consistent 3D scenes from multi-source descriptions, including text input, GUI selections, and mixed information. Users can flexibly specify object categories and, if desire

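Experimental results

Research questions

RQ1: Can a multimodal-graph-conditioned flow model generate textured 3D scenes that respect both object-level control and scene-level style consistency?
RQ2: Does explicitly modeling object relations through a graph improve realism, style consistency, and user-oriented output compared with language-only and graph-only baselines?
RQ3: How does information exchange between nodes during denoising affect per-object fidelity and overall scene quality?
RQ4: How does jointly training the layout, shape, and texture branches affect the quality and efficiency of end-to-end scene synthesis?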

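Key findings

On the SG-FRONT and 3D-FRONT benchmarks, FlowScene outperforms language-conditioned and graph-conditioned baselines in realism, style consistency, and alignment with human preferences.
The three-branch design with Multimodal Graph Rectified Flow allows precise object-level control over shape and texture while preserving scene-wide coherence.
The method generates faster than prior diffusion-based graph-conditioned approaches while improving per-object fidelity and overall scene quality.
The text+image multimodal graph handles text-only, image-only, and mixed inputs, which makes scene construction more flexible.
The experiments include perceptual evaluations and scene-level/object-level metrics, and show improvements in prompt adherence, layout accuracy, visual quality, and style consistency.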
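
The speed advantage over diffusion-based approaches is usually attributed to few-step sampling along near-straight paths. The sketch below shows a plain fixed-step Euler sampler as one way to picture this; the step count, the integration scheme, and the names `euler_sample` and `predict_velocity` are assumptions for illustration, since the review does not describe FlowScene's actual sampler.

```python
def euler_sample(predict_velocity, x0, cond, num_steps=4):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with fixed Euler steps.

    predict_velocity is a hypothetical stand-in for a trained velocity model.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                       # current time in [0, 1)
        v = predict_velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

When the learned velocity field is close to constant along each path, one step reaches the same endpoint as many steps, which is the intuition behind sampling with only a handful of function evaluations.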

Figure 9. Failure case. The left panel shows the input multimodal scene graph, while the right panel shows the generated failure case. Red cross marks indicate removed relationships.

This review was generated by AI and checked by a human editor.