QUICK REVIEW

[Paper Review] SimpleMatch: A Simple and Strong Baseline for Semantic Correspondence

Hailing Jin, Huiying Li | arXiv (Cornell University) | 2026-01-18

Advanced Image and Video Retrieval Techniques | Citations: 0

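One-Line Summary

SimpleMatch is a lightweight, upsampling-based baseline for semantic correspondence that uses sparse matching and window-based localization to cut training memory, and it reaches state-of-the-art performance at a lower input resolution than competing methods.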

ABSTRACT

Recent advances in semantic correspondence have been largely driven by the use of pre-trained large-scale models. However, a limitation of these approaches is their dependence on high-resolution input images to achieve optimal performance, which results in considerable computational overhead. In this work, we address a fundamental limitation in current methods: the irreversible fusion of adjacent keypoint features caused by deep downsampling operations. This issue is triggered when semantically distinct keypoints fall within the same downsampled receptive field (e.g., 16x16 patches). To address this issue, we present SimpleMatch, a simple yet effective framework for semantic correspondence that delivers strong performance even at low resolutions. We propose a lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, and a multi-scale supervised loss that ensures the upsampled features retain discriminative features across different spatial scales. In addition, we introduce sparse matching and window-based localization to optimize training memory usage and reduce it by 51%. At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch achieves superior performance with 84.1% PCK@0.1 on the SPair-71k benchmark. We believe this framework provides a practical and efficient baseline for future research in semantic correspondence. Code is available at: https://github.com/hailong23-jin/SimpleMatch.
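The feature-fusion problem described in the abstract can be illustrated with a small sketch. It assumes a simplified model in which each feature-map cell covers a non-overlapping stride x stride block of pixels (real receptive fields overlap). The keypoint coordinates and the name cell_index are hypothetical and are not taken from the paper.

```python
def cell_index(x, y, stride):
    """Return the (column, row) of the feature-map cell that contains pixel (x, y)."""
    return (x // stride, y // stride)


# Two hypothetical keypoints that sit 10 pixels apart in a 252x252 image.
left_eye = (100, 120)
right_eye = (110, 120)

# Compare the 1/16, 1/8 and 1/4 feature-map scales.
for stride in (16, 8, 4):
    a = cell_index(*left_eye, stride)
    b = cell_index(*right_eye, stride)
    status = "same cell, features fused" if a == b else "separate cells"
    print(f"stride {stride:2d}: {a} vs {b} -> {status}")
```

Under these assumptions the two keypoints share one cell at stride 16 and separate at strides 8 and 4, which is the motivation for upsampling deep features to 1/4 resolution.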

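Motivation and Goals

Establish the need for efficient semantic correspondence on low-resolution inputs.
Propose a simple architecture that mitigates the irreversible fusion of adjacent keypoints caused by downsampling.
Introduce memory-efficient training strategies, including sparse matching and window-based localization.
Demonstrate strong empirical performance on standard benchmarks at reduced resolution.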

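Proposed Method

A shared encoder extracts deep features from the source and target images.
A lightweight upsampling decoder recovers spatial detail up to 1/4 resolution.
Each upsampling stage fuses two parallel branches, a transposed convolution and bilinear upsampling, and then refines the result with a ConvBlock.
Sparse matching computes cosine similarity between a small set of source keypoints and all target locations.
Window-based localization refines each keypoint match within a k x k neighborhood around the coarse maximum.
Training uses a multi-scale loss that supervises three decoder resolutions (1/16, 1/8, 1/4).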
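The matching and localization steps can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes that the refinement is a softmax-weighted average of coordinates inside the k x k window (a soft-argmax), which the review does not confirm. The names sparse_match and window_localize, the temperature value, and the toy feature maps are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sparse_match(src_feats, tgt_map):
    """Score every target location against each source keypoint feature.

    src_feats: list of N feature vectors (one per source keypoint).
    tgt_map:   H x W grid (list of rows) of target feature vectors.
    Returns N similarity maps, each H x W, instead of a dense HW x HW volume.
    """
    return [[[cosine(f, cell) for cell in row] for row in tgt_map]
            for f in src_feats]


def window_localize(sim, k=3, temperature=0.05):
    """Refine the coarse argmax with a soft-argmax over a k x k window (assumed)."""
    h, w = len(sim), len(sim[0])
    # Coarse match: location of the highest similarity.
    best_y, best_x = max(((y, x) for y in range(h) for x in range(w)),
                         key=lambda p: sim[p[0]][p[1]])
    r = k // 2
    total = sum_x = sum_y = 0.0
    for y in range(max(0, best_y - r), min(h, best_y + r + 1)):
        for x in range(max(0, best_x - r), min(w, best_x + r + 1)):
            # Subtract the peak value before exponentiating for stability.
            weight = math.exp((sim[y][x] - sim[best_y][best_x]) / temperature)
            total += weight
            sum_x += weight * x
            sum_y += weight * y
    return sum_x / total, sum_y / total


# Toy 4x4 target map with 2-D features; only cell (row 2, col 1) matches.
tgt_map = [[[1.0, 0.0] for _ in range(4)] for _ in range(4)]
tgt_map[2][1] = [0.0, 1.0]
src_feats = [[0.0, 1.0]]  # one hypothetical source keypoint

sims = sparse_match(src_feats, tgt_map)
x, y = window_localize(sims[0], k=3)
print(f"refined match at x={x:.3f}, y={y:.3f}")
```

Restricting the weighted average to a small window keeps the refinement local to the coarse peak, so distant locations with spurious similarity do not pull the estimate away.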

Figure 1: Feature map visualizations at different scales. The red dots represent keypoints.

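Experimental Results

Research Questions

RQ1: Can a simple, lightweight, low-resolution-friendly architecture achieve competitive semantic correspondence performance without heavy 4D decoders or transformers?
RQ2: Does upsampling to 1/4 resolution with a lightweight decoder preserve enough keypoint distinctiveness for accurate matching?
RQ3: Do sparse matching and window-based localization substantially reduce training memory while maintaining accuracy?
RQ4: How does multi-scale supervision affect representation quality for semantic correspondence?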

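Key Results

SimpleMatch achieves strong PCK at a low input resolution of 252x252, reaching 84.1% PCK@0.1 on SPair-71k and outperforming several SOTA methods.
Combining window-based localization with sparse matching reduces training memory by about 51%.
Across backbones (ResNet101, iBOT, DINOv2), SimpleMatch achieves competitive or superior PCK@0.1 on SPair-71k and PF-PASCAL, and in certain configurations it is notably efficient, with a throughput of 65 images/s and 2.8 GB of memory.
Multi-scale supervision improves performance; removing it causes a measurable drop in PCK@0.1.
Increasing the feature-map resolution, not just the input resolution, yields larger performance gains.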
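Since every result above is reported as PCK@0.1, a short sketch of the metric may help. It assumes the common convention that a predicted keypoint counts as correct when it lies within alpha times the larger side of the target object's bounding box; benchmarks differ in the normalizing size, and the review does not state which one is used. The function name pck and the sample coordinates are hypothetical.

```python
import math


def pck(pred, gt, bbox_size, alpha=0.1):
    """Fraction of predictions within alpha * max(bbox height, bbox width) of the ground truth."""
    threshold = alpha * max(bbox_size)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


# Hypothetical predictions and ground-truth keypoints for one image pair.
pred = [(50.0, 60.0), (120.0, 80.0), (200.0, 150.0)]
gt = [(52.0, 61.0), (140.0, 80.0), (203.0, 154.0)]

# With a 100x80 bounding box and alpha=0.1 the threshold is 10 pixels.
score = pck(pred, gt, bbox_size=(100, 80), alpha=0.1)
print(f"PCK@0.1 = {score:.3f}")
```

In this toy case the first and third predictions fall within the 10-pixel threshold and the second misses by 20 pixels, so two of three keypoints count as correct.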

Figure 2: Illustration of the SimpleMatch structure. The architecture consists solely of a feature extractor and a lightweight upsampling decoder. After obtaining the source and target feature maps, we perform sparse matching and employ window-based localization to enhance training efficiency.
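A rough count of similarity scores suggests why sparse matching helps at 1/4 resolution. The sketch below only counts entries in the similarity tensor; it does not reproduce the reported 51% figure, which refers to measured training memory. The keypoint count of 20 and the name similarity_entries are hypothetical.

```python
def similarity_entries(input_size, scale, num_keypoints=None):
    """Count similarity scores between source points and all target locations.

    input_size:    side length of the square input image in pixels.
    scale:         downsampling factor of the feature map (4 means 1/4 resolution).
    num_keypoints: number of source keypoints; None means every source location (dense).
    """
    side = input_size // scale
    locations = side * side
    sources = locations if num_keypoints is None else num_keypoints
    return sources * locations


dense = similarity_entries(252, 4)                     # all source locations
sparse = similarity_entries(252, 4, num_keypoints=20)  # 20 hypothetical keypoints
print(f"dense:  {dense:,} scores")
print(f"sparse: {sparse:,} scores")
```

At a 252x252 input the 1/4-scale map has 63 x 63 = 3,969 locations, so the dense count grows with the square of that number while the sparse count grows only linearly with the number of annotated keypoints.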


This review was generated by AI and reviewed by a human editor.