QUICK REVIEW

[Paper Review] FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

Zhifei Yang, Guangyao Zhai | arXiv (Cornell University) | 2026-03-20

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

FlowScene generates high-fidelity 3D indoor scenes by jointly producing layout, shape, and texture from a multimodal graph; its Multimodal Graph Rectified Flow tightly couples object-level control with scene-wide style consistency.

ABSTRACT

Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tight-coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.

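Motivation and Objectives

Motivate precise control over geometry and appearance in indoor scene generation for design, VR/AR, robotics, and automation applications.
Propose a multimodal graph-based framework that integrates text and visual inputs to represent objects and their relations.
Develop a three-branch generator (layout, shape, texture) that ensures both per-object fidelity and scene-wide style consistency.
Introduce Multimodal Graph Rectified Flow, which lets nodes exchange information during denoising at generation time.
Demonstrate better realism and style consistency than language-conditioned and graph-conditioned baselines on the 3D-FRONT/SG-FRONT datasets.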
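
To illustrate the multimodal graph representation named in the objectives, here is a minimal Python sketch of a scene graph whose nodes may carry a text description, a reference image, or neither. The class names `ObjectNode` and `SceneGraph`, the field names, the file name `lamp.png`, and the relation strings are hypothetical choices for illustration, not the paper's data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ObjectNode:
    """One object in the scene; every field name here is illustrative."""
    category: str                      # e.g. "bed"
    text: Optional[str] = None         # optional text description
    image_path: Optional[str] = None   # optional reference image

    def modalities(self) -> List[str]:
        """List which optional modalities this node actually carries."""
        present = []
        if self.text is not None:
            present.append("text")
        if self.image_path is not None:
            present.append("image")
        return present


@dataclass
class SceneGraph:
    """Objects as nodes, pairwise relations as (subject, relation, object) edges."""
    nodes: List[ObjectNode] = field(default_factory=list)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, node: ObjectNode) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1     # index used to refer to the node in edges

    def add_edge(self, subject: int, relation: str, obj: int) -> None:
        for idx in (subject, obj):
            if not 0 <= idx < len(self.nodes):
                raise IndexError(f"node index {idx} is not in the graph")
        self.edges.append((subject, relation, obj))


def build_example() -> SceneGraph:
    """A bedroom described with mixed inputs: text, image, and category only."""
    graph = SceneGraph()
    bed = graph.add_node(ObjectNode("bed", text="a low wooden bed"))
    lamp = graph.add_node(ObjectNode("lamp", image_path="lamp.png"))
    stand = graph.add_node(ObjectNode("nightstand"))
    graph.add_edge(stand, "left of", bed)
    graph.add_edge(lamp, "standing on", stand)
    return graph
```

Keeping the text and image fields optional means text-only, image-only, and mixed descriptions share one structure, and a node with neither still contributes its category and relations.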

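Proposed Method

Defines a multimodal indoor scene graph in which each node aggregates text and visual features (text u_i and visual f_i) along with optional modalities.
Uses a triplet-GCN-based InfoExchangeUnit to propagate graph-conditioned denoising information between nodes during generation.
Adopts a three-branch pipeline (Layout, Shape, Texture) in which each branch uses a rectified-flow denoiser conditioned on constraints derived from the graph.
The Layout branch models 3D bounding boxes for the scene layout and applies a LayoutExchangeUnit to handle temporal/global constraints.
The Shape branch voxelizes objects, obtains latent codes with a shape VQ-VAE, and employs a ShapeExchangeUnit for cross-object shape consistency.
The Texture branch anchors texture latent codes to the geometry, extracts multi-view features, and uses a TextureExchangeUnit to ensure texture consistency across objects.
All branches are trained with a shared rectified flow objective that minimizes the difference between predicted and target velocities, enabling fast, few-step sampling.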
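
The velocity-matching objective and few-step sampling can be sketched in plain Python. This assumes the commonly used rectified flow convention (a straight path from noise `x0` at t=0 to data `x1` at t=1, with target velocity `x1 - x0`); this summary does not state the paper's exact time convention, conditioning, or network, so the function names below are hypothetical and `oracle_velocity` merely stands in for a trained model.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_loss(velocity_fn: VelocityFn, x0: Vector, x1: Vector, t: float) -> float:
    """Mean squared error between the predicted velocity and the target x1 - x0."""
    xt = interpolate(x0, x1, t)
    predicted = velocity_fn(xt, t)
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn: VelocityFn, x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x1: Vector) -> VelocityFn:
    """Assumed stand-in for a trained network: exact velocity toward one known sample."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]  # valid for t < 1
    return velocity
```

With the oracle, the velocity is constant along the path, so `euler_sample(oracle_velocity(x1), x0, 1)` lands on `x1` up to floating-point error. A learned model only approximates this straightness, which is the usual argument for why rectified flow needs few sampling steps rather than many.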

Figure 1. Scene Generation from Diverse Input. The prospective system, powered by FlowScene, supports the generation of style-consistent 3D scenes from multi-source descriptions, including text input, GUI selections, and mixed information. Users can flexibly specify object categories and, if desire

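Experimental Results

Research Questions

RQ1: Can a multimodal graph-conditioned flow model generate textured 3D scenes that respect both object-level control and scene-level style consistency?
RQ2: Compared with language-only or graph-only baselines, does explicitly modeling object relations through a graph improve realism, style consistency, and alignment with user intent?
RQ3: How does information exchange between nodes during denoising affect per-object fidelity and overall scene quality?
RQ4: How does jointly training the layout, shape, and texture branches affect end-to-end scene synthesis quality and efficiency?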
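
For RQ3, one way to picture information exchange between nodes is a round of message passing over (subject, relation, object) triplets. The sketch below is a deliberately simplified, weight-free stand-in: it assumes element-wise averaging where a triplet-GCN would use learned networks. `exchange_round` and the relation-embedding dictionary are hypothetical names, and this is not the paper's InfoExchangeUnit.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[int, str, int]  # (subject index, relation name, object index)


def exchange_round(
    node_feats: List[Vector],
    edges: List[Edge],
    rel_embed: Dict[str, Vector],
) -> List[Vector]:
    """One synchronous message-passing round over relation triplets."""
    # Each node starts with its own state as a self-message.
    sums = [list(h) for h in node_feats]
    counts = [1] * len(node_feats)
    for subject, relation, obj in edges:
        r = rel_embed[relation]
        # Triplet message: element-wise mean of subject, relation, and object
        # features (an assumed placeholder for a learned network).
        msg = [(a + b + c) / 3.0
               for a, b, c in zip(node_feats[subject], r, node_feats[obj])]
        # Both endpoints of the relation receive the message.
        for idx in (subject, obj):
            sums[idx] = [x + m for x, m in zip(sums[idx], msg)]
            counts[idx] += 1
    # Average the self-message and all received messages per node.
    return [[x / c for x in vec] for vec, c in zip(sums, counts)]
```

Nodes that share a relation pull each other's features closer together, while an isolated node keeps its state unchanged. Applied at every denoising step, this is the kind of coupling that could let one object's partially generated state inform its neighbors.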

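Key Results

FlowScene outperforms language-conditioned and graph-conditioned baselines on the SG-FRONT and 3D-FRONT benchmarks in realism, style consistency, and alignment with human preferences.
The three-branch design, together with Multimodal Graph Rectified Flow, enforces scene-wide consistency while enabling fine-grained object-level control over shape and texture.
The method generates faster than previous diffusion-based graph-conditioned approaches while achieving better per-object fidelity and overall scene quality.
The multimodal graph (text + images) can handle text-only, image-only, or mixed inputs, enabling flexible scene composition.
The experiments include a perceptual study and scene-level/object-level metrics, which indicate improvements in prompt adherence, layout accuracy, visual quality, and style consistency.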

Figure 9. Failure case. The left panel shows the input multimodal scene graph, while the right panel shows the generated failure case. Red cross marks indicate removed relationships.

This review was generated by AI and reviewed by a human editor.