QUICK REVIEW

[Paper Review] Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search

Chenyang Gao, Guanyu Cai | arXiv (Cornell University) | Jan 8, 2021

Video Surveillance and Tracking Methods | References: 24 | Citations: 61

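One-Line Summary

NAFS introduces full-scale image and text representations with contextual non-local attention, enabling adaptive alignment across scales. It achieves state-of-the-art results on CUHK-PEDES.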

ABSTRACT

Text-based person search aims at retrieving target person in an image gallery using a descriptive sentence of that person. It is very challenging since modal gap makes effectively extracting discriminative features more difficult. Moreover, the inter-class variance of both pedestrian images and descriptions is small. So comprehensive information is needed to align visual and textual clues across all scales. Most existing methods merely consider the local alignment between images and texts within a single scale (e.g. only global scale or only partial scale) then simply construct alignment at each scale separately. To address this problem, we propose a method that is able to adaptively align image and textual features across all scales, called NAFS (i.e., Non-local Alignment over Full-Scale representations). Firstly, a novel staircase network structure is proposed to extract full-scale image features with better locality. Secondly, a BERT with locality-constrained attention is proposed to obtain representations of descriptions at different scales. Then, instead of separately aligning features at each scale, a novel contextual non-local attention mechanism is applied to simultaneously discover latent alignments across all scales. The experimental results show that our method outperforms the state-of-the-art methods by 5.53% in terms of top-1 and 5.35% in terms of top-5 on text-based person search dataset. The code is available at https://github.com/TencentYoutuResearch/PersonReID-NAFS

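Motivation and Objectives

Motivate text-based person search and address the small inter-class variance of both pedestrian images and descriptions.
Learn image and text representations at multiple scales to capture comprehensive cross-modal cues.
Develop an adaptive alignment mechanism that jointly aligns features across all scales.
Propose a re-ranking method that further refines the retrieval results.
Demonstrate effectiveness through extensive experiments on CUHK-PEDES.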

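Proposed Method

Introduce a staircase CNN backbone to extract full-scale visual representations with improved locality.
Use a locality-constrained BERT to obtain multi-scale text representations.
Propose contextual non-local attention to jointly align images and text across all scales.
Adopt a cross-scale alignment loss (CSAL) that combines image-to-text and text-to-image similarities.
Incorporate re-ranking by visual neighbors (RVN) to improve the final ranking.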
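
The idea of aligning features across all scales at once, rather than scale by scale, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain softmax cross-attention over cosine similarities, and the function names (`attend`, `cross_scale_similarity`), the scale labels, the temperature value, and the toy vectors are all assumptions made for illustration. Features from every scale are pooled into one set per modality, so a phrase-level text feature is free to attend to a region-level or global image feature.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0


def softmax(scores, temperature=1.0):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, temperature=0.1):
    """Weighted average of keys; weights are a softmax over cosine similarity."""
    weights = softmax([cosine(query, k) for k in keys], temperature)
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def cross_scale_similarity(image_feats, text_feats, temperature=0.1):
    """image_feats / text_feats: dict mapping a scale name to a list of vectors.

    Hypothetical helper: pools every scale into one set per modality, then
    averages the text-to-image and image-to-text attended similarities.
    """
    all_img = [v for feats in image_feats.values() for v in feats]
    all_txt = [v for feats in text_feats.values() for v in feats]
    t2i = sum(cosine(t, attend(t, all_img, temperature)) for t in all_txt) / len(all_txt)
    i2t = sum(cosine(v, attend(v, all_txt, temperature)) for v in all_img) / len(all_img)
    return 0.5 * (t2i + i2t)


# Toy 3-d features (made up): one image, a matching and a non-matching description.
image_a = {"global": [[1.0, 0.0, 0.0]], "region": [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]}
text_match = {"sentence": [[1.0, 0.1, 0.0]], "phrase": [[0.0, 1.0, 0.1]]}
text_other = {"sentence": [[0.0, 0.0, 1.0]], "phrase": [[0.0, 0.1, 1.0]]}

sim_match = cross_scale_similarity(image_a, text_match)
sim_other = cross_scale_similarity(image_a, text_other)
print(sim_match > sim_other)
```

Averaging the two directions mirrors the description of CSAL as combining image-to-text and text-to-image similarities, but the actual loss formulation is not given in this review and is not reproduced here.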

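Experimental Results

Research Questions

RQ1: How can full-scale representations of images and text be extracted for text-based person search?
RQ2: Can adaptive cross-scale alignment improve matching beyond fixed combinations of scales?
RQ3: What is the effect of contextual non-local attention on cross-modal alignment across scales?
RQ4: Does re-ranking by visual neighbors further improve retrieval performance?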

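Key Findings

NAFS with full-scale representations achieves 59.94% Top-1 and 79.86% Top-5 on CUHK-PEDES, outperforming several baselines.
NAFS with RVN further raises Top-1 to 61.50% and Top-5 to 81.19%.
Joint alignment across scales outperforms separate per-scale alignment (Top-1: 59.94% vs. 57.98%).
Incorporating the staircase network, locality-constrained BERT, and contextual non-local attention yields a substantial improvement over the baseline.
Ablation experiments show that intermediate-scale information (scale 2) is beneficial in addition to the global and finest-scale features.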
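
For readers unfamiliar with the Top-1 and Top-5 figures above, the following sketch shows how a top-k retrieval accuracy is commonly computed: a text query counts as a hit if any of its k highest-scoring gallery images shows the same person. The function name `top_k_accuracy`, the similarity matrix, and the identity labels are made up for illustration, and the paper's exact evaluation protocol may differ in its details.

```python
def top_k_accuracy(similarity, query_ids, gallery_ids, k):
    """Fraction of queries whose top-k ranked gallery items contain a correct identity.

    similarity[i][j] is the score between text query i and gallery image j.
    """
    hits = 0
    for row, query_id in zip(similarity, query_ids):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(gallery_ids[j] == query_id for j in ranked[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy data (assumed): 3 text queries, 4 gallery images.
gallery_ids = [0, 1, 2, 0]
query_ids = [0, 1, 2]
similarity = [
    [0.9, 0.2, 0.1, 0.3],  # query for person 0: correct image ranked first
    [0.6, 0.5, 0.1, 0.2],  # query for person 1: correct image ranked second
    [0.4, 0.3, 0.2, 0.1],  # query for person 2: correct image ranked third
]

top1 = top_k_accuracy(similarity, query_ids, gallery_ids, k=1)
top2 = top_k_accuracy(similarity, query_ids, gallery_ids, k=2)
top3 = top_k_accuracy(similarity, query_ids, gallery_ids, k=3)
```

In this toy setup one of three queries is correct at rank 1, two of three within rank 2, and all three within rank 3, so accuracy can only stay the same or rise as k grows. This is why the reported Top-5 figures are higher than the Top-1 figures.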


This review was generated by AI and checked by a human editor.