QUICK REVIEW

[Paper Review] TAP-Vid: A Benchmark for Tracking Any Point in a Video

Carl Doersch, Ankush Gupta|arXiv (Cornell University)|Nov 7, 2022

Advanced Vision and Imaging | Cited by 26

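Summary in Brief

TAP-Vid formalizes the Tracking Any Point (TAP) problem and introduces a benchmark that combines real and synthetic data to evaluate long-term, point-level tracking on deformable surfaces. It also proposes TAP-Net, an end-to-end baseline that outperforms prior methods on the benchmark.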

ABSTRACT

Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.

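Motivation and Objectives

Formalize the Tracking Any Point (TAP) problem for long-term motion understanding on deformable surfaces.
Create the TAP-Vid benchmark, which mixes real and synthetic data and includes dense point trajectories and occlusion labels.
Provide an annotation pipeline for TAP and a strong end-to-end baseline, and analyze the characteristics of the datasets and the baseline.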

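Proposed Method

Define TAP as tracking a query point (x, y, t) across all frames and predicting whether it is occluded in each frame.
Assemble TAP-Vid by combining real data (Kinetics, DAVIS) with synthetic data (Kubric MOVi-E, RGB-Stacking).
Develop a semi-automatic, track-assisted annotation pipeline that uses optical flow to extend sparsely annotated points into full tracks.
Propose TAP-Net, an end-to-end network that builds a cost volume comparing the query point against every location in the video and regresses position and occlusion from it.
Train with a combined loss: Huber regression on position for frames where the point is visible, and cross-entropy for occlusion.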
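
The cost volume and the loss listed above can be illustrated with a small sketch. This is not the authors' implementation: it assumes features are plain nested lists, uses a dot product as the similarity, and averages the loss over frames. The function names, the Huber threshold `delta`, and the averaging are all hypothetical choices made for illustration.

```python
import math

def cost_volume(query_feature, frame_features):
    """Dot-product similarity between one query feature vector and every
    location of one frame's feature grid (nested lists, shape H x W x C)."""
    return [[sum(q * f for q, f in zip(query_feature, cell)) for cell in row]
            for row in frame_features]

def huber(pred_xy, true_xy, delta=1.0):
    """Huber penalty on the Euclidean distance between two 2D points.
    `delta` is an assumed threshold, not a value taken from the paper."""
    dist = math.hypot(pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1])
    if dist <= delta:
        return 0.5 * dist ** 2
    return delta * (dist - 0.5 * delta)

def binary_cross_entropy(logit, label):
    """Numerically stable cross-entropy between a logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def tap_loss(pred_xy, true_xy, occ_logits, occluded, delta=1.0):
    """Per-frame loss: position error counts only where the point is visible,
    occlusion error counts on every frame. Averaging is an assumption."""
    total = 0.0
    for p, g, z, o in zip(pred_xy, true_xy, occ_logits, occluded):
        if not o:
            total += huber(p, g, delta)
        total += binary_cross_entropy(z, 1.0 if o else 0.0)
    return total / len(occluded)
```

The key design point is the visibility mask: when the ground truth marks a frame as occluded, the predicted position is not penalized at all, so only the occlusion term contributes on that frame.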

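Experimental Results

Research Questions

RQ1: How can tracking of arbitrary points on deformable surfaces across entire video sequences be formalized and evaluated?
RQ2: Does synthetic data enable training effective TAP trackers that transfer to real videos?
RQ3: Which architectures and loss functions are effective for end-to-end TAP tracking with occlusion estimation?
RQ4: How do existing tracking methods perform on the TAP-Vid datasets, and where do they fall short?
RQ5: What annotation strategies and quality controls are needed for a reliable real-world TAP benchmark?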

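Key Findings

TAP-Net substantially outperforms prior baselines on all TAP-Vid datasets.
The track-assisted optical-flow pipeline enables efficient and accurate annotation and agrees closely with ground truth on synthetic data (within 8 pixels for 99% of points).
For human annotation on real data, occlusion agreement is about 95.5%, and about 92.5% of point locations agree across annotators to within 4 pixels.
TAP-Vid-Kinetics, TAP-Vid-DAVIS, TAP-Vid-Kubric, and TAP-Vid-RGB-Stacking provide diverse evaluation settings spanning real and synthetic data.
Baseline methods that lack occlusion handling or adaptation to deformable objects perform worse than TAP-Net on the TAP-Vid datasets.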
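
Agreement figures of the form "X% of points within N pixels" can be computed as in the sketch below. This is not the benchmark's official evaluation code: the helper names are hypothetical, and treating "within" as less than or equal to the threshold is an assumption.

```python
import math

def fraction_within(points_a, points_b, threshold_px):
    """Fraction of paired 2D points whose Euclidean distance is at most
    `threshold_px` (inclusive comparison is an assumption)."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("expected two non-empty lists of equal length")
    hits = sum(
        1
        for (ax, ay), (bx, by) in zip(points_a, points_b)
        if math.hypot(ax - bx, ay - by) <= threshold_px
    )
    return hits / len(points_a)

def occlusion_agreement(flags_a, flags_b):
    """Fraction of frames on which two sets of occlusion labels agree."""
    if len(flags_a) != len(flags_b) or not flags_a:
        raise ValueError("expected two non-empty lists of equal length")
    return sum(a == b for a, b in zip(flags_a, flags_b)) / len(flags_a)
```

For example, `fraction_within(annotated, ground_truth, 8)` would give the share of annotated points that land within 8 pixels of their synthetic ground-truth positions, where `annotated` and `ground_truth` are placeholder lists of (x, y) pairs.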


This review was generated by AI and checked by a human editor.