QUICK REVIEW

[Paper Review] TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification

Shengcai Liao, Ling Shao|arXiv (Cornell University)|May 30, 2021

Video Surveillance and Tracking Methods | Citations: 38

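One-line summary

TransMatcher adapts Transformers to image matching with a simplified cross-image matching decoder and global max pooling, yielding efficient and generalizable person re-identification. It reports state-of-the-art results on several datasets.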

ABSTRACT

Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. Thus, we further design two naive solutions, i.e. query-gallery concatenation in ViT, and query-gallery cross-attention in the vanilla Transformer. The latter improves the performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation, which is not naturally suitable for image matching. Accordingly, we propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity computation. Additionally, global max pooling and a multilayer perceptron (MLP) head are applied to decode the matching result. This way, the simplified decoder is computationally more efficient, while at the same time more effective for image matching. The proposed method, called TransMatcher, achieves state-of-the-art performance in generalizable person re-identification, with up to 6.1% and 5.7% performance gains in Rank-1 and mAP, respectively, on several popular datasets. Code is available at https://github.com/ShengcaiLiao/QAConv.

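Motivation and objectives

Investigate whether Transformers can perform image matching and metric learning on image pairs for generalizable person re-identification.
Assess the limitations of ViT and the vanilla Transformer for cross-image matching.
Propose a lightweight, similarity-focused decoder that enables cross-image matching.
Evaluate generalization with standard and synthetic training datasets and compare against state-of-the-art methods.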

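Proposed method

Use a ResNet backbone to extract features from the query and gallery images.
Encode the query and gallery features separately with Transformer encoders to obtain Q_n and K_n.
Apply a simplified decoder that computes query-gallery similarity from the encoded features passed through a shared FC layer, then produces a pairwise score via global max pooling and an MLP head.
Incorporate a learnable prior score embedding that weights local similarity matches.
Fuse the decoder outputs across N layers for residual similarity learning.
Train with a pairwise metric learning objective following the QAConv-GS framework.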
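
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names (`decoder_layer_score`, `matching_score`), the tensor shapes, the sigmoid applied to the prior score embedding, max pooling in both matching directions, the per-match two-layer MLP, and the plain sum used to fuse layers are all assumptions made for the example. Random arrays stand in for the ResNet and encoder outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_layer_score(q_tokens, k_tokens, fc_w, prior_logits, w1, w2):
    """One simplified decoder layer (hypothetical sketch).

    q_tokens, k_tokens: (hw, d) encoded tokens of the two images.
    fc_w: (d, d) shared FC weight. prior_logits: (hw, hw) learnable prior.
    w1: (1, h) and w2: (h,) form a tiny per-match MLP head (assumed shape).
    """
    q = q_tokens @ fc_w                      # shared FC on one image
    k = k_tokens @ fc_w                      # same FC on the other image
    sim = q @ k.T                            # (hw, hw) similarities, no softmax
    sim = sim * sigmoid(prior_logits)        # weight local matches by the prior
    # Global max pooling: best match for every location, in both directions.
    best = np.concatenate([sim.max(axis=1), sim.max(axis=0)])   # (2 * hw,)
    hidden = np.maximum(best[:, None] @ w1, 0.0)                # ReLU, (2 * hw, h)
    return float((hidden @ w2).sum())        # one scalar score for this layer

def matching_score(q_layers, k_layers, params):
    # Fuse the N layers by summing their scores (assumed fusion rule).
    return sum(decoder_layer_score(q, k, *p)
               for q, k, p in zip(q_layers, k_layers, params))

rng = np.random.default_rng(0)
hw, d, h, n_layers = 6, 8, 4, 2
params = [(rng.normal(size=(d, d)), np.zeros((hw, hw)),
           rng.normal(size=(1, h)), rng.normal(size=h))
          for _ in range(n_layers)]
query = [rng.normal(size=(hw, d)) for _ in range(n_layers)]
gallery = [rng.normal(size=(hw, d)) for _ in range(n_layers)]
score = matching_score(query, gallery, params)
print(score)
```

Because the decoder keeps only the similarity matrix, each layer costs one matrix product plus a max, with no softmax and no value aggregation. With a zero prior, as in this toy setup, swapping the two images leaves the score unchanged.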

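Experimental results

Research questions

RQ1: Can the Vision Transformer or the vanilla Transformer generalize to explicit image matching between image pairs for person re-identification?
RQ2: Do naive solutions (query-gallery concatenation, or cross-attention that takes the query image as input) improve cross-image matching?
RQ3: Does a simplified decoder focused on direct similarity computation improve the efficiency and performance of metric learning for Re-ID?
RQ4: How does cross-image interaction affect generalization across datasets and from synthetic data?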
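
To make the naive cross-attention baseline in RQ2 concrete, here is a minimal single-head sketch in NumPy. It is an illustration under simplifying assumptions, not the paper's Transformer-Cross model: there are no learned projections, the names `cross_attention`, `attending_tokens`, and `memory_tokens` are hypothetical, and which image plays which role is assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(attending_tokens, memory_tokens):
    """Tokens of one image attend over tokens of the other image."""
    d = memory_tokens.shape[1]
    logits = attending_tokens @ memory_tokens.T / np.sqrt(d)   # (hw, hw)
    weights = softmax(logits, axis=1)          # each row sums to 1
    attended = weights @ memory_tokens         # aggregated features, (hw, d)
    return attended, weights, logits

rng = np.random.default_rng(0)
hw, d = 6, 8
gallery = rng.normal(size=(hw, d))
query = rng.normal(size=(hw, d))
attended, weights, logits = cross_attention(gallery, query)
print(attended.shape, weights.sum(axis=1))
```

The output is again a set of feature vectors rather than a matching score, and the softmax normalizes each row, so adding a constant to all similarities in a row leaves the weights unchanged. This is consistent with the abstract's observation that attention is designed mainly for global feature aggregation.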

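Key findings

TransMatcher achieves state-of-the-art performance in generalizable person re-identification on several datasets.
Trained on Market-1501, it improves Rank-1 by 5.8% and mAP by 5.7% on CUHK03-NP, and Rank-1 by 6.1% and mAP by 3.4% on MSMT17.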
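Trained on MSMT17, it improves Rank-1 by 5.0% and mAP by 5.3% on Market-1501; gains of 6.1% in Rank-1 and 3.4% in mAP on MSMT17 are also listed for this setting (as reported), although these repeat the figures given for Market-1501 training.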
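Trained on RandPerson (synthetic data), it improves Rank-1 by 3.3% and mAP by 5.3% on Market-1501, and Rank-1 by 5.9% and mAP by 3.3% on MSMT17.
Compared with Transformer-Cross, TransMatcher substantially improves cross-image matching performance (e.g., about 11% higher Rank-1 and about 9% higher mAP on Market-1501).
Ablation studies show the importance of the simplified decoder, GMP acting as hard attention, and the prior score embedding for the best accuracy. Positional embeddings in the encoder may hurt performance in this design.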
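
The findings above are stated in terms of Rank-1 and mAP. As a reminder of what these metrics measure, the sketch below computes both from a query-by-gallery score matrix. It is a simplified illustration: the function name `rank1_and_map` and the toy data are made up for the example, and the same-camera filtering used in standard Re-ID evaluation protocols is omitted.

```python
import numpy as np

def rank1_and_map(scores, query_ids, gallery_ids):
    """Rank-1 accuracy and mean average precision (higher score = better match)."""
    scores = np.asarray(scores, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for i, qid in enumerate(query_ids):
        order = np.argsort(-scores[i], kind="stable")   # best match first
        matches = gallery_ids[order] == qid
        if not matches.any():
            continue                                    # no true match in gallery
        rank1_hits.append(float(matches[0]))
        hit_ranks = np.flatnonzero(matches) + 1         # 1-based ranks of true matches
        precisions = np.arange(1, len(hit_ranks) + 1) / hit_ranks
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

# Toy example: 2 queries, 3 gallery images.
scores = [[0.9, 0.2, 0.8],
          [0.1, 0.5, 0.6]]
rank1, mean_ap = rank1_and_map(scores, query_ids=[1, 2], gallery_ids=[1, 2, 1])
print(rank1, mean_ap)
```

In the toy data the first query ranks both of its true matches on top, while the second query's only true match lands at rank 2, so Rank-1 is 0.5 and mAP is 0.75.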

This review was generated by AI and checked by a human editor.