QUICK REVIEW

[Paper Review] TransCenter: Transformers with Dense Queries for Multiple-Object Tracking

Yihong Xu, Yutong Ban|arXiv (Cornell University)|Jul 22, 2021

Video Surveillance and Tracking Methods | References: 69 | Citations: 75

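One-line summary

TransCenter introduces a center-based transformer MOT framework that uses dense image-related detection queries and sparse tracking queries, and achieves state-of-the-art results on MOT benchmarks.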

ABSTRACT

Transformers have proven superior performance for a wide variety of tasks since they were introduced. In recent years, they have drawn attention from the vision community in tasks such as image classification and object detection. Despite this wave, an accurate and efficient multiple-object tracking (MOT) method based on transformers is yet to be designed. We argue that the direct application of a transformer architecture with quadratic complexity and insufficient noise-initialized sparse queries is not optimal for MOT. We propose TransCenter, a transformer-based MOT architecture with dense representations for accurately tracking all the objects while keeping a reasonable runtime. Methodologically, we propose the use of image-related dense detection queries and efficient sparse tracking queries produced by our carefully designed query learning networks (QLN). On one hand, the dense image-related detection queries allow us to infer targets' locations globally and robustly through dense heatmap outputs. On the other hand, the set of sparse tracking queries efficiently interacts with image features in our TransCenter Decoder to associate object positions through time. As a result, TransCenter exhibits remarkable performance improvements and outperforms by a large margin the current state-of-the-art methods in two standard MOT benchmarks with two tracking settings (public/private). TransCenter is also proven efficient and accurate by an extensive ablation study and comparisons to more naive alternatives and concurrent works. For scientific interest, the code is made publicly available at https://github.com/yihongxu/transcenter.

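Motivation and objectives

Motivate and design a transformer-based MOT method that avoids the gaps (missed detections) and over-detection that sparse queries cause in crowded scenes.
Introduce dense image-related detection queries to provide global, robust detection.
Develop sparse tracking queries and a specialized decoder to associate objects across frames efficiently.
Reduce computational cost while using dense queries, so that MOT remains efficient.
Provide variants (TransCenter, TransCenter-Dual, TransCenter-Lite) that balance accuracy and efficiency.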

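Proposed method

Use a weight-sharing transformer encoder (PVT-based) to generate multi-scale dense memory from consecutive frames.
Introduce a query learning network (QLN) that produces dense detection queries and sparse tracking queries from the encoder memory.
Adopt a TransCenter decoder that uses deformable cross-attention for both tracking (TDCA) and detection (DDCA).
Use sparse tracking queries, built from previous-frame position information, to compute object displacement over time.
Output branches compute the center heatmap, object size, and tracking displacement; the center heatmap is used for detection and the tracking branch for identifying objects over time.
Train with a focal loss on the center heatmap, a sparse regression loss on size, and a tracking loss, combined into an overall weighted loss.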
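
To make the roles of the output branches concrete, here is a minimal sketch of how a center heatmap and a set of tracking displacements could be turned into tracks. It is not the authors' implementation. The 3x3 local-maximum peak picking, the greedy nearest-neighbour matching, the sign convention of the displacement (previous position plus displacement gives the predicted current position), and every name and threshold below are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Heatmap = List[List[float]]
Point = Tuple[float, float]


def heatmap_peaks(heatmap: Heatmap, threshold: float = 0.3) -> List[Tuple[int, int, float]]:
    """Return (row, col, score) for cells that are 3x3 local maxima above threshold.

    Assumption: centers are read off the heatmap as local maxima.
    Cells tied with a neighbour are all kept.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbourhood = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            ]
            if score >= max(neighbourhood):
                peaks.append((r, c, score))
    return peaks


def associate(
    prev_tracks: Dict[int, Point],
    displacements: Dict[int, Point],
    detections: List[Point],
    max_dist: float = 2.0,
) -> Dict[int, int]:
    """Greedily match each previous track to the nearest unused detection.

    Each track's previous center is shifted by its displacement, then matched
    to the closest detection within max_dist (hypothetical gating threshold).
    Returns {track_id: detection_index}; unmatched tracks are omitted.
    """
    assignments: Dict[int, int] = {}
    free = set(range(len(detections)))
    for track_id, (pr, pc) in prev_tracks.items():
        dr, dc = displacements[track_id]
        pred_r, pred_c = pr + dr, pc + dc
        best, best_d = None, max_dist
        for j in sorted(free):
            det_r, det_c = detections[j]
            d = ((det_r - pred_r) ** 2 + (det_c - pred_c) ** 2) ** 0.5
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            assignments[track_id] = best
            free.remove(best)
    return assignments


if __name__ == "__main__":
    toy_heatmap = [
        [0.0, 0.1, 0.0, 0.0],
        [0.1, 0.9, 0.1, 0.0],
        [0.0, 0.1, 0.0, 0.2],
        [0.0, 0.0, 0.2, 0.8],
    ]
    peaks = heatmap_peaks(toy_heatmap)
    centers = [(float(r), float(c)) for r, c, _ in peaks]
    tracks = {7: (1.0, 0.0), 8: (3.0, 2.0)}   # made-up previous centers
    moves = {7: (0.0, 1.0), 8: (0.0, 1.0)}    # made-up displacements
    print(peaks)
    print(associate(tracks, moves, centers))
```

Detections left unmatched would typically start new tracks, and tracks left unmatched would be kept for a few frames or dropped; the sketch leaves that bookkeeping out.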

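Experimental results

Research questions

RQ1: Does a transformer-based MOT model with dense image-related detection queries outperform sparse-query DETR-based MOT approaches?
RQ2: Does separating dense detection queries from sparse tracking queries improve detection robustness and tracking efficiency in crowded scenes?
RQ3: How do the QLN and decoder designs affect MOT accuracy and efficiency?
RQ4: What impact do an efficient encoder (PVT) and deformable attention have on MOT runtime and performance?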

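Key findings

Under its own reported conditions, TransCenter sets a new state of the art on MOT17 (+4.0% MOTA) and MOT20 (+18.8% MOTA).
In crowded scenes, dense image-related detection queries reduce missed detections and noise compared with fixed sparse queries.
Sparse tracking queries driven by previous-frame information substantially speed up tracking attention without sacrificing accuracy.
TransCenter-Dual and TransCenter-Lite offer trade-offs between accuracy and computational efficiency.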
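
The MOTA gains quoted above are easier to read with the metric's definition in mind. In the standard CLEAR MOT definition, MOTA is one minus the sum of false negatives, false positives, and identity switches, divided by the number of ground-truth objects, all counted over every frame. The sketch below computes it; the counts in the example are invented for illustration and are not taken from the paper.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy (CLEAR MOT definition).

    All counts are summed over every frame of the sequence.
    The value can be negative when errors outnumber ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


if __name__ == "__main__":
    # Invented counts: 200 misses, 100 false alarms, 20 identity switches
    # over 1000 ground-truth boxes.
    print(f"MOTA = {mota(200, 100, 20, 1000):.2%}")
```

Because missed detections enter the numerator directly, a detector that recovers more objects in crowded scenes raises MOTA even when the association step is unchanged.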

This review was generated by AI and checked by a human editor.