QUICK REVIEW

[Paper Review] TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification

Shengcai Liao, Ling Shao | arXiv (Cornell University) | 2021-05-30

Video Surveillance and Tracking Methods | Citations: 38

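One-line summary

TransMatcher applies Transformers to image matching through a simplified cross-image matching decoder and global max pooling, enabling efficient and generalizable person re-identification and achieving state-of-the-art performance on several datasets.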

ABSTRACT

Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. Thus, we further design two naive solutions, i.e. query-gallery concatenation in ViT, and query-gallery cross-attention in the vanilla Transformer. The latter improves the performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation, which is not naturally suitable for image matching. Accordingly, we propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity computation. Additionally, global max pooling and a multilayer perceptron (MLP) head are applied to decode the matching result. This way, the simplified decoder is computationally more efficient, while at the same time more effective for image matching. The proposed method, called TransMatcher, achieves state-of-the-art performance in generalizable person re-identification, with up to 6.1% and 5.7% performance gains in Rank-1 and mAP, respectively, on several popular datasets. Code is available at https://github.com/ShengcaiLiao/QAConv.

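Motivation and objectives

Investigates whether Transformers can perform image matching and metric learning on image pairs for generalizable person re-identification.
Evaluates the limitations of ViT and the vanilla Transformer for cross-image matching.
Proposes a lightweight, similarity-centered decoder that enables cross-image matching.
Evaluates generalization on standard and synthetic datasets and compares against state-of-the-art methods.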

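Proposed method

Uses a ResNet backbone to extract features from the query and gallery images.
Encodes the query and gallery separately with Transformer encoders to obtain Q_n and K_n.
Applies a simplified decoder that computes query-gallery similarity from features transformed by a shared FC layer, then produces a pairwise score through global max pooling (GMP) and an MLP head.
Introduces a learnable prior score embedding to weight local similarity matches.
Fuses the outputs of the N decoder layers for residual similarity learning.
Trains with a pairwise metric learning objective following the QAConv-GS framework.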
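
The decoder steps listed above can be illustrated with a small NumPy sketch. This is not the authors' implementation (that is in the QAConv repository linked in the abstract); it is a minimal illustration under several assumptions: the function names `simplified_decoder_layer` and `match_score` are hypothetical, the weights are random rather than trained, the prior score embedding is applied through a sigmoid, the MLP head has one hidden layer with no normalization, pooling runs in one direction only, and the N layer outputs are fused by summation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simplified_decoder_layer(q, k, w_fc, prior, w1, b1, w2, b2):
    """One similarity-only decoder layer (illustrative, not the paper's code).

    q, k   : (hw, d) encoded query / gallery features (Q_n and K_n)
    w_fc   : (d, d) weight of the FC layer shared by query and gallery
    prior  : (hw, hw) learnable prior score embedding (assumed shape)
    w1, b1 : (hw, hidden), (hidden,) first MLP layer
    w2, b2 : (hidden,), scalar second MLP layer
    """
    q_t = q @ w_fc                      # shared FC transform of the query
    k_t = k @ w_fc                      # same FC transform of the gallery
    sim = q_t @ k_t.T                   # query-key similarity; no softmax, no value aggregation
    sim = sim * sigmoid(prior)          # weight local matches by the prior score embedding
    best = sim.max(axis=1)              # global max pooling: best gallery location per query location
    hidden = np.maximum(best @ w1 + b1, 0.0)   # MLP head with ReLU
    return float(hidden @ w2 + b2)      # scalar score for this layer

def match_score(q_layers, k_layers, params):
    """Fuse the N decoder layers by summing their scores (assumed fusion rule)."""
    return sum(
        simplified_decoder_layer(q, k, **p)
        for q, k, p in zip(q_layers, k_layers, params)
    )

# Toy example with random features and weights (all sizes are arbitrary).
rng = np.random.default_rng(0)
hw, d, hidden, n_layers = 6, 8, 4, 2
q_layers = [rng.normal(size=(hw, d)) for _ in range(n_layers)]
k_layers = [rng.normal(size=(hw, d)) for _ in range(n_layers)]
params = [
    dict(
        w_fc=rng.normal(size=(d, d)) / np.sqrt(d),
        prior=np.zeros((hw, hw)),
        w1=rng.normal(size=(hw, hidden)),
        b1=np.zeros(hidden),
        w2=rng.normal(size=hidden),
        b2=0.0,
    )
    for _ in range(n_layers)
]
score = match_score(q_layers, k_layers, params)
print(f"pairwise matching score: {score:.4f}")
```

The point of the sketch is the contrast with standard attention: the similarity map is pooled and decoded into a score directly, instead of being normalized with softmax and used to aggregate value vectors.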

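Experimental results

Research questions

RQ1: Can the Vision Transformer or the vanilla Transformer generalize to explicit matching between image pairs for person re-identification?
RQ2: Do the naive solutions (query-gallery concatenation in ViT, or query-gallery cross-attention in the vanilla Transformer) improve cross-image matching?
RQ3: Does a simplified decoder focused on direct similarity computation improve efficiency and performance in metric learning for Re-ID?
RQ4: How does cross-image interaction affect generalization across datasets and with synthetic training data?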

Key findings

TransMatcher reaches state-of-the-art performance in generalizable person re-identification on several datasets.
With Market-1501 as the source, Rank-1 improves by 5.8% and mAP by 5.7% on CUHK03-NP, and Rank-1 improves by 6.1% and mAP by 3.4% on MSMT17.
With MSMT17 as the source, Rank-1 improves by 5.0% and mAP by 5.3% on Market-1501 (as reported).
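With training on RandPerson (synthetic data), Rank-1 improves by 3.3% and mAP by 5.3% on Market-1501, and Rank-1 improves by 5.9% and mAP by 3.3% on MSMT17.
Compared with Transformer-Cross, TransMatcher matches across images far more accurately (e.g., on Market-1501 it outperforms Transformer-Cross by about 11% in Rank-1 and 9% in mAP).
Ablation studies show the importance of the simplified decoder, GMP acting as hard attention, and the prior score embedding, and indicate that positional embedding in the encoder can degrade performance in this design.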
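
The findings above are stated in Rank-1 and mAP. As a reminder of what those two numbers measure, here is a small pure-Python sketch. The helper names `average_precision` and `evaluate` are hypothetical, and the sketch assumes a plain ranking by score; standard Re-ID evaluation protocols typically also exclude gallery images taken by the same camera as the query, which is omitted here.

```python
def average_precision(ranked_matches):
    """AP for one query.

    ranked_matches: list of bools, gallery sorted by descending score,
    True where the gallery image has the same identity as the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank   # precision at each correct match
    return precision_sum / hits if hits else 0.0

def evaluate(scores, query_ids, gallery_ids):
    """Return (Rank-1, mAP); scores[i][j] is higher for more similar pairs."""
    rank1_hits = 0
    aps = []
    for i, row in enumerate(scores):
        # Gallery indices sorted from best to worst match for query i.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        matches = [gallery_ids[j] == query_ids[i] for j in order]
        rank1_hits += matches[0]           # 1 if the top-ranked image is correct
        aps.append(average_precision(matches))
    n = len(scores)
    return rank1_hits / n, sum(aps) / n

# Toy example: 2 queries, 3 gallery images.
scores = [
    [0.9, 0.2, 0.8],   # query 0 (identity 1)
    [0.7, 0.6, 0.1],   # query 1 (identity 2)
]
query_ids = [1, 2]
gallery_ids = [1, 2, 1]

rank1, mean_ap = evaluate(scores, query_ids, gallery_ids)
print(rank1, mean_ap)  # prints: 0.5 0.75
```

In the toy example, query 0 ranks both of its true matches first (AP 1.0), while query 1 places its only true match second (AP 0.5, Rank-1 miss), giving Rank-1 of 0.5 and mAP of 0.75.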

This review was generated by AI and reviewed by a human editor.