QUICK REVIEW

[Paper Review] Text-Based Person Search with Limited Data

Han Xiao, Sen He|arXiv (Cornell University)|Oct 20, 2021

Multimodal Machine Learning Applications | References: 42 | Citations: 38

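One-Sentence Summary

This paper proposes CM-MoCo, a cross-modal momentum contrastive learning framework, together with a transfer learning strategy that draws on large-scale image-text data; the two improve text-based person search when training data is limited and achieve state-of-the-art results on CUHK-PEDES.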

ABSTRACT

Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query. Solving such a fine-grained cross-modal retrieval task is challenging, which is further hampered by the lack of large-scale datasets. In this paper, we present a framework with two novel components to handle the problems brought by limited data. Firstly, to fully utilize the existing small-scale benchmarking datasets for more discriminative feature learning, we introduce a cross-modal momentum contrastive learning framework to enrich the training data for a given mini-batch. Secondly, we propose to transfer knowledge learned from existing coarse-grained large-scale datasets containing image-text pairs from drastically different problem domains to compensate for the lack of TBPS training data. A transfer learning method is designed so that useful information can be transferred despite the large domain gap. Armed with these components, our method achieves new state of the art on the CUHK-PEDES dataset with significant improvements over the prior art in terms of Rank-1 and mAP. Our code is available at https://github.com/BrandonHanx/TextReID.

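Motivation and Objectives

Address the scarcity of labeled TBPS data by making more effective use of the existing small benchmark datasets.
Enrich the pool of cross-modal negative samples through momentum-based contrastive learning so that the learned features become more discriminative.
Exploit knowledge from large-scale image-text pairs through a careful cross-modal transfer learning strategy that mitigates the domain gap.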

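Proposed Method

Introduces cross-modal momentum contrastive learning (CM-MoCo), which pairs query encoders with momentum-updated key encoders and maintains dedicated queues for visual features, textual features, and identities.
Formulates a cross-modal contrastive loss that uses the query encoding as the anchor, the key encoding as the positive, and the queue entries as negatives.
Combines CM-MoCo with an alignment loss and an identity loss in an end-to-end training framework.
Proposes a cross-modal transfer learning strategy that bridges the domain gap by freezing the text encoder taken from a large pretrained model and contextualizing its word embeddings with a Bi-GRU.
Applies cross-modal k-reciprocal re-ranking as a post-processing step to further improve retrieval performance.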
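
The momentum update and the queue-based contrastive loss listed above can be sketched in a few lines. This is a minimal, framework-free illustration under stated assumptions, not the authors' implementation: the function names, the momentum coefficient `m=0.999`, and the temperature `0.07` are hypothetical choices borrowed from common MoCo-style setups, and features are plain Python lists rather than tensors.

```python
import math


def momentum_update(key_params, query_params, m=0.999):
    """EMA update of the key encoder: key <- m * key + (1 - m) * query.

    Parameters are flat lists of floats here; m=0.999 is an assumed value.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Fall back to 1.0 so an all-zero vector does not divide by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def cross_modal_contrastive_loss(query, pos_key, queue, temperature=0.07):
    """InfoNCE-style loss for a single anchor.

    query   : feature from the query encoder of one modality (the anchor)
    pos_key : feature of the paired sample from the other modality's
              key encoder (the positive)
    queue   : list of stored key features used as negatives
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(pos_key))]
    logits += [dot(q, l2_normalize(neg)) for neg in queue]
    logits = [value / temperature for value in logits]
    # Numerically stable log-sum-exp over the positive and all negatives.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
    # Cross-entropy with the positive at index 0.
    return log_sum_exp - logits[0]
```

Because the negatives come from `queue` rather than from the current mini-batch, the number of negatives can grow without increasing the batch size, which is the property the CM-MoCo design relies on.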

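Experimental Results

Research Questions

RQ1: Can CM-MoCo make effective use of limited TBPS data by decoupling the number of negatives from the batch size?
RQ2: Does transferring knowledge from large-scale image-text pretraining improve TBPS performance when the domain gap is large, and how should the transfer be carried out to avoid negative transfer?
RQ3: Which combination of CM-MoCo, alignment, and identity losses yields the best TBPS performance on CUHK-PEDES?
RQ4: Which transfer learning design for the text modality most effectively mitigates the domain gap between TBPS data and generic image-text data?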

Key Findings

Method	Arch.	Dim.	Text to Image w/o Rerank Rank-1	Text to Image w/o Rerank Rank-5	Text to Image w/o Rerank Rank-10	Text to Image w/o Rerank mAP	Image to Text w/o Rerank Rank-1	Image to Text w/o Rerank Rank-5	Image to Text w/o Rerank Rank-10	Image to Text w/o Rerank mAP	Text to Image w/ Rerank Rank-1	Text to Image w/ Rerank Rank-5	Text to Image w/ Rerank Rank-10	Text to Image w/ Rerank mAP	Image to Text w/ Rerank Rank-1	Image to Text w/ Rerank Rank-5	Image to Text w/ Rerank Rank-10	Image to Text w/ Rerank mAP
Ours (ResNet50)	ResNet50	256	61.65	80.98	86.78	58.29	75.96	93.40	96.55	55.05	61.65	80.98	86.78	58.29	75.96	93.40	96.55	55.05
Ours (ResNet101)	ResNet101	256	64.08	81.73	88.19	60.08	78.99	95.02	97.17	56.78	64.08	81.73	88.19	60.08	78.99	95.02	97.17	56.78
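
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Rank-k and mAP are commonly computed for retrieval. It assumes each query comes with a gallery list already sorted by descending similarity and that a gallery item is relevant when its identity equals the query identity; the function names and input layout are hypothetical, and the paper's evaluation script may differ in details.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """True if any of the top-k gallery identities matches the query."""
    return query_id in ranked_ids[:k]


def average_precision(ranked_ids, query_id):
    """Mean of precision values taken at each relevant position."""
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def evaluate(rankings, ks=(1, 5, 10)):
    """rankings: list of (query_id, ranked_gallery_ids) pairs.

    Returns percentages, matching the scale used in the table.
    """
    n = len(rankings)
    result = {}
    for k in ks:
        hit_count = sum(rank_k_hit(ranked, qid, k) for qid, ranked in rankings)
        result[f"Rank-{k}"] = 100.0 * hit_count / n
    ap_total = sum(average_precision(ranked, qid) for qid, ranked in rankings)
    result["mAP"] = 100.0 * ap_total / n
    return result


# Two toy queries: identity 1 is found at rank 1, identity 2 at rank 2.
scores = evaluate([(1, [1, 2, 1, 3]), (2, [3, 2, 4, 2])])
```

Rank-k only asks whether at least one correct match appears in the top k, while mAP also rewards placing every correct match early, which is why the two can move differently between methods.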

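CM-MoCo markedly improves both text-to-image and image-to-text retrieval on CUHK-PEDES over the baseline.
Transferring only the word embeddings from a large-scale image-text dataset (via a frozen CLIP text encoder with Bi-GRU contextualization) is enough to yield substantial gains while avoiding negative transfer.
Larger cross-modal queues in CM-MoCo (e.g., 1024 or 2048) generally improve performance, but an overly large queue can hurt because the data is scarce.
The proposed transfer strategy for the text stream (word embeddings plus contextualization) outperforms naive full-model transfer.
CM-MoCo improves performance consistently across models, adding about 1.5% on average across the Rank metrics when it is incorporated.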
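
The queue-size finding is easier to picture with a concrete data structure. The sketch below is a hypothetical fixed-size FIFO queue that stores features alongside identity labels; filtering out entries that share the anchor's identity is an assumption about how an identity queue could be used, not a detail confirmed by this review. The class name `FeatureQueue` and its methods are illustrative only.

```python
from collections import deque


class FeatureQueue:
    """Fixed-size FIFO store of key features and their identity labels."""

    def __init__(self, max_size):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.features = deque(maxlen=max_size)
        self.identities = deque(maxlen=max_size)

    def enqueue(self, batch_features, batch_ids):
        # Append in lockstep so the two deques stay aligned.
        for feature, identity in zip(batch_features, batch_ids):
            self.features.append(feature)
            self.identities.append(identity)

    def negatives_for(self, anchor_id):
        # Assumed use of the identity labels: skip stored samples of the
        # same person so they are not treated as negatives.
        return [
            feature
            for feature, identity in zip(self.features, self.identities)
            if identity != anchor_id
        ]


queue = FeatureQueue(max_size=4)
queue.enqueue([[0.1], [0.2]], [1, 2])
queue.enqueue([[0.3], [0.4]], [3, 1])
queue.enqueue([[0.5], [0.6]], [4, 2])  # the first batch is evicted here
negatives = queue.negatives_for(anchor_id=1)
```

With a small dataset, a very large `max_size` keeps features produced many iterations ago by an older key encoder, which is one plausible reading of why oversized queues can hurt.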

This review was generated by AI and checked by a human editor.