QUICK REVIEW

[Paper Review] TRec: Learning Hand-Object Interactions through 2D Point Track Motion

Dennis Holzmann, Sven Wachsmuth|arXiv (Cornell University)|Jan 7, 2026

Human Pose and Action Recognition | Citations: 0

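One-Line Summary

TRec feeds randomly sampled 2D point tracks into a Transformer together with image frames, recognizing hand-object actions without explicit hand or object detection. It improves over an RGB-only baseline on Something-Something-v2.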

ABSTRACT

We present a novel approach for hand-object action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and the point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for hand-object action understanding.

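Motivation and Objectives

Motivate hand-object action recognition that does not depend on explicit hand/object detection or on RGB-only cues.
Investigate whether 2D point tracks carry meaningful information about fine-grained motion.
Show that integrating point tracks and image features in a Transformer-based model is effective.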

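Proposed Method

Sample 900 random 2D points per video and track them with CoTracker to obtain point trajectories.
Extract frame features with a lightweight image encoder (ResNet18) and feed them to a Transformer together with the point tracks.
Aggregate the Transformer outputs with a multi-head attention pooling layer.
Predict the action with an MLP classification head trained with a cross-entropy loss.
Compare the track-aware model (TRec) against an RGB-only baseline under the same architecture and training regime.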
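
The first step of the method, drawing random query points for the tracker, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `sample_query_points`, the 224x224 frame size, the uniform distribution over the frame, and the fixed seed are all assumptions; only the count of 900 points comes from the review.

```python
import random


def sample_query_points(num_points=900, width=224, height=224, seed=0):
    """Draw (x, y) pixel locations uniformly at random inside one frame.

    Hypothetical helper: the frame size and the uniform distribution are
    assumptions made for this sketch, not details taken from the paper.
    """
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return [
        (rng.uniform(0.0, width - 1.0), rng.uniform(0.0, height - 1.0))
        for _ in range(num_points)
    ]


# One set of query points per video; a point tracker would then follow
# each location through the remaining frames.
points = sample_query_points()
print(len(points), points[0])
```

No hand or object detector is involved at this stage: the points are placed without looking at the image content, which is the property the review highlights.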

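Experimental Results

Research Questions

RQ1: Can 2D point tracks provide useful complementary motion cues for hand-object action recognition without explicit hand/object detection?
RQ2: How does incorporating 2D point tracks affect recognition accuracy on Something-Something-v2?
RQ3: How does the number of tracked points affect performance?
RQ4: Does background motion in egocentric videos contribute to action recognition, and how does KDE-based filtering affect performance?
RQ5: Is a single input image sufficient to exploit motion tracks?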

Key Findings

Model	Top-1 (%)	Top-5 (%)
TRec	61.10 ± 8.66	83.95 ± 6.62
baseline	30.27 ± 8.05	53.24 ± 8.75
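
The size of the gap in the table above can be worked out directly. The short sketch below copies the mean accuracies from the table (the ± terms are left out) and computes the absolute gain in percentage points and the relative factor; the helper names are illustrative only.

```python
# Mean accuracies in percent, copied from the table above.
results = {
    "TRec": {"top1": 61.10, "top5": 83.95},
    "baseline": {"top1": 30.27, "top5": 53.24},
}


def absolute_gain(metric):
    """TRec minus baseline, in percentage points."""
    return round(results["TRec"][metric] - results["baseline"][metric], 2)


def relative_factor(metric):
    """TRec accuracy divided by baseline accuracy."""
    return round(results["TRec"][metric] / results["baseline"][metric], 2)


top1_gain = absolute_gain("top1")      # 30.83 percentage points
top5_gain = absolute_gain("top5")      # 30.71 percentage points
top1_factor = relative_factor("top1")  # 2.02
print(top1_gain, top5_gain, top1_factor)
```

In other words, the reported Top-1 accuracy is roughly double that of the RGB-only baseline, and the Top-5 gain is of similar absolute size.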

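TRec substantially outperforms the RGB-only baseline on Something-Something-v2 (Top-1 61.10% vs. 30.27%).
Performance is stable with 50 or more points, gains diminish beyond 100 points, and accuracy drops below 25 points.
Background motion contributes meaningfully to action recognition; KDE-based filtering for foreground points reduces accuracy.
Using only the first frame and the point tracks still outperforms the RGB baseline trained on full videos for this task.
In the single-image evaluation, motion tracks remain a strong cue even when no hand or object is visible in the initial frame.
Motion cues captured from the background and the point tracks enable robust recognition without hand/object detection.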
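
The review does not say how the KDE-based filtering is set up. As one plausible reading, offered purely as an assumption, the sketch below estimates a Gaussian kernel density over each track's net displacement, treats the dominant high-density mode as background (camera-like) motion, and keeps the low-density tracks as foreground. The function names, the bandwidth, and the 0.5 cutoff ratio are hypothetical.

```python
import math


def displacement(track):
    """Net (dx, dy) between the first and last position of one track."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    return (x1 - x0, y1 - y0)


def kde_density(samples, bandwidth=1.0):
    """Gaussian KDE with an isotropic 2D kernel, evaluated at every sample."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(samples))
    densities = []
    for ax, ay in samples:
        total = 0.0
        for bx, by in samples:
            sq_dist = (ax - bx) ** 2 + (ay - by) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        densities.append(norm * total)
    return densities


def foreground_indices(tracks, bandwidth=1.0, ratio=0.5):
    """Indices of tracks whose motion is rare relative to the dominant mode.

    Assumption for this sketch: a track counts as foreground when its
    density falls below `ratio` times the highest density.
    """
    densities = kde_density([displacement(t) for t in tracks], bandwidth)
    cutoff = ratio * max(densities)
    return [i for i, d in enumerate(densities) if d < cutoff]


# Toy data: eight tracks share a camera-like shift, two move differently.
background = [
    [(10.0 * i, 20.0), (10.0 * i + 5.0 + 0.1 * (i % 3 - 1), 20.0)]
    for i in range(8)
]
moving = [[(50.0, 50.0), (50.0, 58.0)], [(80.0, 40.0), (74.0, 43.0)]]
tracks = background + moving
fg = foreground_indices(tracks)
print(fg)
```

Under this reading, the filter discards exactly the background tracks that the findings above report as useful, which would be consistent with the drop in accuracy.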


This review was generated by AI and checked by a human editor.