QUICK REVIEW

[Paper Review] Learning Physical Graph Representations from Visual Scenes

Daniel M. Bear, Chaofei Fan | arXiv (Cornell University) | 2020-06-22

Human Pose and Action Recognition | References: 52 | Citations: 44

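One-Line Summary

PSGNet learns Physical Scene Graphs (PSGs), which represent a scene as a hierarchical, object-centric graph; aided by motion cues and perceptual grouping principles, it outperforms CNN-based self-supervised methods at segmenting real-world scenes.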

ABSTRACT

Convolutional Neural Networks (CNNs) have proved exceptional at learning representations for visual object categorization. However, CNNs do not explicitly encode objects, parts, and their physical properties, which has limited CNNs' success on tasks that require structured understanding of visual scenes. To overcome these limitations, we introduce the idea of Physical Scene Graphs (PSGs), which represent scenes as hierarchical graphs, with nodes in the hierarchy corresponding intuitively to object parts at different scales, and edges to physical connections between parts. Bound to each node is a vector of latent attributes that intuitively represent object properties such as surface shape and texture. We also describe PSGNet, a network architecture that learns to extract PSGs by reconstructing scenes through a PSG-structured bottleneck. PSGNet augments standard CNNs by including: recurrent feedback connections to combine low and high-level image information; graph pooling and vectorization operations that convert spatially-uniform feature maps into object-centric graph structures; and perceptual grouping principles to encourage the identification of meaningful scene elements. We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks, especially on complex real-world images, and generalizes well to unseen object types and scene arrangements. PSGNet is also able to learn from physical motion, enhancing scene estimates even for static images. We present a series of ablation studies illustrating the importance of each component of the PSGNet architecture, analyses showing that learned latent attributes capture intuitive scene properties, and illustrate the use of PSGs for compositional scene inference.

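Motivation and Objectives

Introduce Physical Scene Graphs (PSGs), a hierarchical, object-centric scene representation whose node attributes carry physical meaning.
Develop PSGNet, a self-supervised architecture that reconstructs scenes through a PSG-structured bottleneck.
Introduce perceptual grouping principles and graph-based operations for learning PSGs from visual data and rendering them.
Demonstrate that PSGNet achieves strong unsupervised scene segmentation on real-world images and benefits from motion cues.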

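Proposed Method

Define a hierarchical, graph-based scene representation (PSG) whose per-node attributes are tied to image regions.
Use a ConvRNN backbone to extract features and produce the base tensor from which PSGs are built.
Apply learnable Graph Pooling and Graph Vectorization to build PSG levels iteratively.
Render PSGs back into feature maps through graph rendering modules (including Quadratic Texture Rendering and Quadratic Shape Rendering).
Incorporate static and motion-based perceptual grouping principles to guide the learning of affinities between nodes.
Encourage meaningful object segmentation with self-supervised reconstruction losses on RGB, depth, and surface-normal maps, together with QSR/QTR-based supervision.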
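
The pool, vectorize, and render steps listed above can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions, not the authors' implementation: a fixed feature-distance threshold with union-find stands in for the learned affinities of Graph Pooling, mean pooling stands in for learned Graph Vectorization, and the six-coefficient quadratic in `render_quadratic` is an assumed parameterization loosely modeled on Quadratic Texture Rendering. All names (`PSGNode`, `pool_by_affinity`, `vectorize`, `render_quadratic`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PSGNode:
    """One node of a hypothetical PSG level: an image region plus attributes."""
    node_id: int
    pixels: list      # (row, col) coordinates registered to this node
    centroid: tuple   # mean (row, col) of the region
    attrs: list       # attribute vector (here: the region's mean feature)
    children: list = field(default_factory=list)  # ids of nodes one level down


def pool_by_affinity(features, threshold):
    """Label 4-connected pixels whose feature distance is below threshold.

    Assumption: a fixed threshold replaces the learned affinity function.
    """
    height, width = len(features), len(features[0])
    parent = list(range(height * width))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def close(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5 < threshold

    for r in range(height):
        for c in range(width):
            for rr, cc in ((r + 1, c), (r, c + 1)):
                if rr < height and cc < width and close(features[r][c], features[rr][cc]):
                    parent[find(rr * width + cc)] = find(r * width + c)
    return [[find(r * width + c) for c in range(width)] for r in range(height)]


def vectorize(features, labels):
    """Summarize each labeled region of an H x W x C feature map as one node."""
    channels = len(features[0][0])
    regions = {}
    for r, row in enumerate(labels):
        for c, seg in enumerate(row):
            regions.setdefault(seg, []).append((r, c))
    nodes = {}
    for seg, pix in regions.items():
        n = len(pix)
        mean = [sum(features[r][c][k] for r, c in pix) / n for k in range(channels)]
        centroid = (sum(r for r, _ in pix) / n, sum(c for _, c in pix) / n)
        nodes[seg] = PSGNode(seg, pix, centroid, mean)
    return nodes


def render_quadratic(node, coeffs, height, width):
    """Paint a scalar map that is quadratic in the offset from the centroid.

    coeffs = (a, a_r, a_c, a_rr, a_cc, a_rc) is an assumed parameterization.
    Pixels outside the node's region are left as None.
    """
    a, a_r, a_c, a_rr, a_cc, a_rc = coeffs
    out = [[None] * width for _ in range(height)]
    for r, c in node.pixels:
        dr, dc = r - node.centroid[0], c - node.centroid[1]
        out[r][c] = a + a_r * dr + a_c * dc + a_rr * dr * dr + a_cc * dc * dc + a_rc * dr * dc
    return out
```

In this sketch a constant term alone paints each region with its mean attribute, while the linear and quadratic terms let a single node reproduce smooth gradients inside its region, which is the intuition behind rendering a whole segment from a handful of node attributes.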

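Experimental Results

Research Questions

RQ1: Can a hierarchical, graph-based representation learn meaningful object-centric scene components without explicit supervision?
RQ2: Do motion cues improve unsupervised scene segmentation and generalization to real-world imagery?
RQ3: How do perceptual grouping principles and graph-based pooling/vectorization affect the learning of scene structure?
RQ4: How well do learned PSGs transfer across different datasets and object types?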

Key Results

Model	Primitives Recall	Primitives mIoU	Primitives BoundF	Playroom Recall	Playroom mIoU	Playroom BoundF	Gibson Recall	Gibson mIoU	Gibson BoundF	Gibson ARI
MONet	0.35	0.40	0.46	0.28	0.34	0.46	0.06	0.12	0.15	0.27
IODINE	0.63	0.54	0.57	0.09	0.15	0.17	0.11	0.15	0.14	0.30
Q++ (RGBDN)	0.55	0.54	0.62	0.50	0.53	0.65	0.20	0.20	0.24	0.45
OP3	-	-	-	0.24	0.28	0.31	-	-	-	-
PSGNetS	0.75	0.65	0.70	0.64	0.57	0.66	0.34	0.38	0.37	0.53
PSGNetM	-	-	-	0.70	0.62	0.70	-	-	-	-

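PSGNet substantially outperforms the MONet, IODINE, and OP3 baselines at unsupervised scene segmentation on the Primitives, Playroom, and Gibson datasets (OP3 is reported on Playroom only).
Static training (PSGNetS) yields strong segmentation: it beats the baselines on Primitives and achieves plausible decompositions on Gibson.
Motion-based training (PSGNetM) further improves segmentation on Playroom (Recall 0.70 vs. 0.64 for PSGNetS), and the learned motion cues improve performance on static images.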
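PSGNet transfers well: a model trained on one dataset transfers reasonably to another, even when the overlap in object models between the datasets is limited.
Ablations show that components such as local recurrence, feedback, and quadratic rendering contribute meaningfully to performance; depth and normal supervision helps but is not essential.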
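
The Recall and mIoU columns in the table above can be read through a small scoring sketch. It assumes that each ground-truth object is matched to its best-overlapping predicted segment, that mIoU averages those best IoUs, and that Recall counts objects whose best IoU exceeds 0.5. The function names and the threshold default are illustrative, and the paper's exact evaluation protocol may differ.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def segmentation_scores(gt_masks, pred_masks, threshold=0.5):
    """Return (recall, mean IoU) under best-match assignment.

    Assumption: every ground-truth mask is scored against its single
    best-overlapping prediction; predictions may be reused across objects.
    """
    best = [max((iou(gt, pred) for pred in pred_masks), default=0.0)
            for gt in gt_masks]
    if not best:
        return 0.0, 0.0
    recall = sum(b > threshold for b in best) / len(best)
    mean_iou = sum(best) / len(best)
    return recall, mean_iou
```

Under these assumptions, an object split evenly into two predicted segments has a best IoU of exactly 0.5 and does not count toward Recall, which is one reason over-segmentation can lower Recall more sharply than it lowers mIoU.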


This review was generated by AI and reviewed by a human editor.