QUICK REVIEW

[Paper Review] Time-Contrastive Networks: Self-Supervised Learning from Video

Pierre Sermanet, Corey Lynch | arXiv (Cornell University) | 2017. 04. 23.

Human Pose and Action Recognition | References: 48 | Citations: 53

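One-Line Summary

The paper introduces Time-Contrastive Networks (TCN), a self-supervised, multi-view representation learning method that enables third-person imitation and robot control from visual input alone, using only unlabeled video.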

ABSTRACT

We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at https://sermanet.github.io/imitate

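Motivation and Goals

Learn viewpoint-invariant, disentangled representations of object interactions and pose from unlabeled multi-view video.
Enable imitation of human behavior from third-person video without explicit pose labels or correspondences.
Provide reinforcement learning reward signals through TCN embeddings learned from video data.
Demonstrate pouring and dish-rack manipulation tasks in simulation and on a real robot using TCN-based guidance.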

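Proposed Method

Learn an embedding f(x) with a triplet loss that pulls together simultaneous frames from different viewpoints (anchor, positive) and pushes the anchor away from temporally neighboring frames (negatives).
Use multi-view data to ground and disentangle the sources of visual change, achieving invariance to viewpoint, occlusion, lighting, and background.
Optionally use a single-view time-contrastive (TC) loss with a fixed positive window when multi-view data is unavailable.
Use the 32-dimensional TCN embedding to form a reinforcement learning reward function from a squared-distance term and a Huber-style term.
Integrate TCN features learned from video demonstrations into PILQR-based policy optimization to learn physical manipulation tasks.
Perform direct pose imitation through a TCN embedding shared between human and robot motion.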
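
The triplet objective described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the margin value, and the `min_gap` rule for choosing temporal negatives are hypothetical, and random vectors stand in for the learned embedding f(x).

```python
import numpy as np


def time_contrastive_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances, averaged over the batch.

    anchor / positive: embeddings of simultaneous frames from two viewpoints.
    negative: embedding of a temporally nearby frame from the anchor's view.
    margin: hypothetical value chosen for illustration.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def sample_negative_index(t, num_frames, min_gap, rng):
    """Pick a frame index in the same sequence at least min_gap frames from t.

    min_gap is an assumed hyperparameter; it keeps near-duplicate frames
    from being used as negatives.
    """
    candidates = [i for i in range(num_frames) if abs(i - t) >= min_gap]
    return int(rng.choice(candidates))


# Usage with stand-in embeddings (32-D, as in the review).
rng = np.random.default_rng(0)
view1 = rng.normal(size=(50, 32))                  # view 1 embeddings
view2 = view1 + 0.01 * rng.normal(size=(50, 32))   # view 2, same time steps
t = 10
n = sample_negative_index(t, num_frames=50, min_gap=5, rng=rng)
loss = time_contrastive_triplet_loss(view1[t], view2[t], view1[n])
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that are still confusable contribute to the gradient.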

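Experimental Results

Research Questions

RQ1: Can Time-Contrastive Networks learn representations that are invariant to viewpoint and appearance while disentangling pose and object interactions?
RQ2: Do the learned TCN embeddings work robustly as an RL reward signal for acquiring complex manipulation skills from third-person demonstrations?
RQ3: Is imitation from third-person video possible without explicit pose or correspondence labels?
RQ4: How does the multi-view training signal affect representation quality and robot learning outcomes compared with single-view training?
RQ5: Can TCN support real-time, continuous imitation of human poses without pose labels?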

Key Results

Method	Alignment error	Classification error	Training iterations
Random	28.1%	54.2%	-
Inception-ImageNet	29.8%	51.9%	-
shuffle & learn [31]	22.8%	27.0%	575k
single-view TCN (triplet)	25.8%	24.3%	266k
multi-view TCN (npairs)	18.1%	22.2%	938k
multi-view TCN (triplet)	18.8%	21.4%	397k
multi-view TCN (lifted)	18.0%	19.6%	119k

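Multi-view TCN outperforms the baselines on both the alignment and the attribute classification metrics.
mvTCN enables video-based pouring and dish-rack manipulation in real-world settings, and pouring performance on the real robot converges after about 10 iterations.
The single-view TCN and shuffle-and-learn baselines underperform mvTCN despite using the same data, and the multi-view signal speeds up learning.
TCN-based rewards allow PILQR-based reinforcement learning to learn pouring on a real robot and in simulation, as well as the dish-rack task, and outperform rewards built on other representations.
Direct pose imitation through the shared TCN embedding enables end-to-end imitation without joint-level pose labels and can be extended with limited human supervision.
The approach shows strong qualitative results, including robust imitation from third-person video and rapid task acquisition.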
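
To make the TCN-based reward behind these robot results more concrete, here is a minimal sketch of a reward that combines a squared-distance term with a Huber-style term, as the method summary describes. The exact functional form and the weights `alpha`, `beta`, and `gamma` are assumptions chosen for illustration; the review does not state them.

```python
import numpy as np


def tcn_reward(robot_emb, demo_emb, alpha=1.0, beta=1.0, gamma=1e-3):
    """Reward for matching the demonstration embedding at one time step.

    robot_emb: TCN embedding of the robot's current camera image.
    demo_emb:  TCN embedding of the demonstration frame at the same step.
    alpha, beta, gamma: hypothetical weights, not values from the paper.
    """
    diff = np.asarray(robot_emb, dtype=float) - np.asarray(demo_emb, dtype=float)
    sq_dist = float(np.sum(diff ** 2))
    # The squared term gives strong gradients far from the demonstration;
    # the sqrt (Huber-style) term keeps a useful slope close to it.
    return -alpha * sq_dist - beta * float(np.sqrt(gamma + sq_dist))


# Usage with stand-in 32-D embeddings.
rng = np.random.default_rng(0)
demo = rng.normal(size=32)
near = demo + 0.01 * rng.normal(size=32)
far = demo + 1.0 * rng.normal(size=32)
r_near = tcn_reward(near, demo)
r_far = tcn_reward(far, demo)
```

Under this form the reward is always negative and grows toward its maximum, `-beta * sqrt(gamma)`, as the robot's embedding approaches the demonstration's, which is what lets a policy optimizer such as PILQR use it as a tracking signal.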

This review was generated by AI and reviewed by a human editor.