QUICK REVIEW

[Paper Review] TAP-Vid: A Benchmark for Tracking Any Point in a Video

Carl Doersch, Ankush Gupta | arXiv (Cornell University) | 2022-11-07

Advanced Vision and Imaging | Citations: 26

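One-Line Summary

TAP-Vid formalizes the Tracking Any Point (TAP) problem and introduces a benchmark that combines real and synthetic videos to evaluate long-term, point-level tracking on deformable surfaces. It also proposes TAP-Net, an end-to-end baseline that outperforms prior methods on the benchmark.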

ABSTRACT

Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.

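Research Motivation and Goals

Formalize the Tracking Any Point (TAP) problem for understanding long-term motion on deformable surfaces.
Build TAP-Vid, a benchmark that mixes real and synthetic videos and includes dense point tracks and occlusion labels.
Provide an annotation pipeline and a strong end-to-end baseline for TAP, and analyze the dataset characteristics and the baselines.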

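Proposed Method

Defines TAP as tracking a queried point (x, y, t) across every frame of the video and predicting per-frame occlusion.
Builds TAP-Vid from real (Kinetics, DAVIS) and synthetic (Kubric MOVi-E, RGB-Stacking) datasets.
Develops a semi-automatic, track-assisted annotation pipeline that uses optical flow to extend sparse annotated points into dense tracks.
Proposes TAP-Net, an end-to-end network that uses cost volumes to compare the query point against every location in the video and regresses position and occlusion.
Trains with a loss that combines Huber regression on visible frames with cross-entropy on occlusion.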
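
The cost-volume comparison and the training loss described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Track` container, the function names, the dot-product similarity, the `occ_weight` and `delta` parameters, and the choice to apply the Huber penalty to the Euclidean distance are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class Track:
    """Hypothetical container for one TAP target."""
    points: List[Point]    # one (x, y) per frame
    occluded: List[bool]   # True on frames where the point is hidden


def cost_volume(query_feat: Sequence[float], frame_feats) -> List[List[float]]:
    """Dot product of the query feature with every location of one frame.

    frame_feats is a nested list indexed [row][col][channel].
    """
    return [[sum(q * f for q, f in zip(query_feat, cell)) for cell in row]
            for row in frame_feats]


def huber(residual: float, delta: float = 1.0) -> float:
    """Huber penalty: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def binary_cross_entropy(prob: float, target: bool, eps: float = 1e-7) -> float:
    """Cross-entropy for a single predicted occlusion probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to keep log() finite
    return -math.log(p) if target else -math.log(1.0 - p)


def tap_loss(pred_points: Sequence[Point], pred_occ_prob: Sequence[float],
             gt: Track, occ_weight: float = 1.0, delta: float = 1.0) -> float:
    """Mean per-frame loss: Huber on visible frames plus occlusion cross-entropy."""
    total = 0.0
    for (px, py), p_occ, (gx, gy), occ in zip(
            pred_points, pred_occ_prob, gt.points, gt.occluded):
        if not occ:
            # Position error only counts where the ground-truth point is visible.
            total += huber(math.hypot(px - gx, py - gy), delta)
        total += occ_weight * binary_cross_entropy(p_occ, occ)
    return total / len(gt.points)
```

Masking the position term on occluded frames means a prediction is never penalized for where it places a point that the ground truth marks as hidden; only its occlusion probability is scored on those frames.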

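Experimental Results

Research Questions

RQ1: How can arbitrary-point tracking on deformable surfaces be formalized and evaluated over entire video sequences?
RQ2: Can synthetic data help train effective TAP trackers that transfer to real video?
RQ3: Which architecture and loss function are effective for end-to-end TAP tracking with occlusion estimation?
RQ4: How do existing tracking methods perform on the TAP-Vid datasets, and where do they reach their limits?
RQ5: What annotation strategy and quality-control criteria are needed to annotate a real-world TAP benchmark reliably?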

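Key Results

TAP-Net outperforms prior baselines by a wide margin on all TAP-Vid datasets.
On synthetic data, the track-assisted optical-flow pipeline places 99% of points within 8 pixels of ground truth, which makes annotation both efficient and accurate.
Human annotations on real videos show inter-annotator agreement of about 95.5% on occlusion and about 92.5% on location within 4 pixels.
TAP-Vid-Kinetics, TAP-Vid-DAVIS, TAP-Vid-Kubric, and TAP-Vid-RGB-Stacking provide varied evaluation settings across real and synthetic data.
Baseline methods that lack occlusion handling or adaptation to deformable objects underperform TAP-Net on the TAP-Vid datasets.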
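
The pixel-threshold and agreement figures quoted above are simple ratios, and a short sketch shows how such numbers could be computed. The function names are hypothetical, and treating "within" as an inclusive bound on Euclidean distance is an assumption; the review does not specify the exact protocol used in the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def fraction_within(points_a: Sequence[Point], points_b: Sequence[Point],
                    threshold_px: float) -> float:
    """Share of point pairs whose Euclidean distance is at most threshold_px."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point lists of equal length")
    hits = sum(
        1 for (ax, ay), (bx, by) in zip(points_a, points_b)
        if math.hypot(ax - bx, ay - by) <= threshold_px  # inclusive bound (assumption)
    )
    return hits / len(points_a)


def occlusion_agreement(occ_a: Sequence[bool], occ_b: Sequence[bool]) -> float:
    """Share of frames on which two annotators give the same occlusion flag."""
    if len(occ_a) != len(occ_b) or not occ_a:
        raise ValueError("need two non-empty flag lists of equal length")
    return sum(a == b for a, b in zip(occ_a, occ_b)) / len(occ_a)
```

For example, calling `fraction_within` on annotated points and synthetic ground truth with `threshold_px=8` gives the kind of ratio reported for the annotation pipeline, while `occlusion_agreement` applied to two annotators' flags gives an agreement rate comparable to the occlusion figure.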

This review was generated by AI and reviewed by a human editor.