QUICK REVIEW

[Paper Review] PARSE: Part-Aware Relational Spatial Modeling

Yinuo Bai, Peijun Xu | arXiv (Cornell University) | 2026-03-08

3D Shape Modeling and Analysis | Citations: 0

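One-Line Summary

PARSE introduces a part-centric representation, the Part-centric Assembly Graph (PAG), that models part-level relations between objects, together with a solver that assembles physically valid 3D indoor scenes; it also introduces PARSE-10K, a dataset with dense part-level annotations, to improve spatial reasoning and 3D generation.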

ABSTRACT

Inter-object relations underpin spatial intelligence, yet existing representations -- linguistic prepositions or object-level scene graphs -- are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.

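Motivation and Objectives

Present a finer-grained spatial reasoning framework that goes beyond object-level relations to ensure physically consistent 3D layouts.
Develop the Part-centric Assembly Graph (PAG), which encodes geometric relations between object parts for scene assembly.
Build a solver that converts part-level relations into geometric constraints to assemble collision-free scenes.
Construct PARSE-10K, a large-scale dataset with part-segmented assets and dense part-level contact graphs, to support training and evaluation.
Demonstrate that PAG and PARSE-10K improve VLM-based spatial reasoning and 3D scene generation.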

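Proposed Method

Define a two-level PAG composed of object nodes and part nodes, connected by part-level geometric edges and object-level edges.
Formalize part-to-part relations as directional prepositions (on, in, against) with part/surface annotations.
Develop a Part-Aware Spatial Configuration Solver that traverses the PAG in assembly order and applies coarse-to-fine geometric constraints to sample collision-free poses.
Within the solver, first estimate a coarse 2D position on the support surface, then enforce part-level alignment constraints, and finally sample poses and check them for collisions.
Refine the final scene with a short physics simulation (Sapien) and generate the part-level contact graph.
Build PARSE-10K by collecting layout priors from real images, assembling a part-annotated asset library spanning 132 categories, and rendering 10,000 indoor scenes with dense part-level contacts.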
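
To make the assembly-order traversal and the sample-then-check loop more concrete, here is a minimal Python sketch. It is not the authors' solver: every name in it (PartRelation, Box, assembly_order, solve) is hypothetical, objects are reduced to axis-aligned 2D footprints, only the "on" relation is handled, a single root object is assumed, and the part-level alignment and physics refinement steps are left out.

```python
import random
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass(frozen=True)
class PartRelation:
    """One part-level edge (hypothetical schema): child_part --relation--> parent_part."""
    child: str
    child_part: str
    relation: str  # "on", "in", or "against"; this sketch only places "on"
    parent: str
    parent_part: str


@dataclass
class Box:
    """Axis-aligned 2D footprint; (x, y) is the minimum corner."""
    x: float
    y: float
    w: float
    d: float

    def overlaps(self, other):
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.d and other.y < self.y + self.d)


def assembly_order(relations):
    """Order objects so that every parent is placed before its children."""
    sorter = TopologicalSorter()
    for r in relations:
        sorter.add(r.child, r.parent)
    return list(sorter.static_order())


def solve(relations, sizes, root_box, rng, max_tries=200):
    """Place each child footprint inside its parent's footprint without overlaps."""
    parent_of = {r.child: r.parent for r in relations}
    placed = {}
    for name in assembly_order(relations):
        if name not in parent_of:  # the single root keeps its given footprint
            placed[name] = root_box
            continue
        support = placed[parent_of[name]]
        w, d = sizes[name]
        siblings = [b for n, b in placed.items()
                    if parent_of.get(n) == parent_of[name]]
        for _ in range(max_tries):
            # Coarse step: sample a 2D position on the support footprint.
            cand = Box(rng.uniform(support.x, support.x + support.w - w),
                       rng.uniform(support.y, support.y + support.d - d), w, d)
            # Collision check against objects already on the same support.
            if not any(cand.overlaps(s) for s in siblings):
                placed[name] = cand
                break
        else:
            raise RuntimeError(f"no collision-free pose found for {name}")
    return placed


relations = [
    PartRelation("mug", "bottom", "on", "table", "tabletop"),
    PartRelation("book", "back_cover", "on", "table", "tabletop"),
    PartRelation("pen", "body", "on", "book", "front_cover"),
]
sizes = {"mug": (0.1, 0.1), "book": (0.2, 0.3), "pen": (0.02, 0.15)}
layout = solve(relations, sizes, Box(0.0, 0.0, 1.2, 0.8), random.Random(0))
for name, box in layout.items():
    print(name, round(box.x, 3), round(box.y, 3))
```

The point of the sketch is the ordering: because the pen is related to the book's front cover and not to the table, it can only be placed once the book has a pose, which is the kind of dependency a part-level graph makes explicit.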

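Experimental Results

Research Questions

RQ1: How do part-level relations strengthen spatial reasoning and reduce ambiguity in 3D scene layouts?
RQ2: Can a part-centric graph (PAG) constrain pose synthesis efficiently enough to produce physically valid scenes?
RQ3: Does fine-grained part-level supervision improve VLM-based spatial understanding and scene generation quality?
RQ4: How do PAG priors affect the realism and complexity of generated 3D indoor scenes?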

Key Results

Model	Visual Relation Recall	Part-Level Contact Recall	Scene Graph Generation Recall (WithBBox/NoBBox)	Scene Graph Generation Precision (WithBBox/NoBBox)	Scene Graph Generation F1 (WithBBox/NoBBox)	Avg. Number of Relations
GPT-5	82.1	75.2	13.7/40.9	13.9/41.3	13.8/41.1	15.3
Gemini-2.5-Pro	85.0	75.6	40.5/43.4	48.6/52.0	44.2/47.3	12.9
Claude-Opus-4	80.3	73.2	8.0/33.7	12.7/53.7	9.8/41.4	9.7
Robobrain2.0	60.8	37.2	9.2/11.3	26.7/32.8	13.7/16.9	5.6
Qwen3-VL	86.2	60.4	26.0/29.6	46.0/52.4	33.2/37.9	8.7
Ours	97.4	86.2	73.2/74.8	80.3/82.0	76.6/78.2	14.1
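
The scene graph generation figures in the table appear to follow the usual F1 relation. Reading the WithBBox values as recall, precision, and F1 (an interpretation inferred from the numbers, not stated in the review), each F1 comes out as the harmonic mean of the other two. A small Python check under that assumption:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) in the WithBBox setting, copied from the table
# under the assumed column reading described above.
rows = {
    "GPT-5": (13.7, 13.9, 13.8),
    "Gemini-2.5-Pro": (40.5, 48.6, 44.2),
    "Ours": (73.2, 80.3, 76.6),
}

for model, (recall, precision, reported) in rows.items():
    print(f"{model}: computed F1 {f1(precision, recall):.1f}, reported {reported}")
```

The same relation can be used to sanity-check the NoBBox values; small differences in the last digit are expected because the table entries are already rounded.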

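Fine-tuning Qwen3-VL on PARSE-10K improves both object-level layout reasoning and part-level relation understanding.
Using PAGs as structural priors in 3D generation yields scenes with greater physical realism and structural complexity.
PARSE-10K enables stronger visual grounding and relational reasoning in VLMs and supports more controllable, realistic scene generation.
The PARSE-10K-based model ("Ours") outperforms every listed baseline on visual relation recall (97.4 vs. 86.2 for the best baseline, Qwen3-VL), on part-level contact recall (86.2 vs. 75.6 for Gemini-2.5-Pro), and on scene graph generation.
Together, the dataset and framework deliver substantial gains on both spatial reasoning benchmarks and 3D generation quality.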

This review was generated by AI and reviewed by a human editor.