[Paper Review] SimpleMatch: A Simple and Strong Baseline for Semantic Correspondence

Hailing Jin, Huiying Li | arXiv (Cornell University) | Jan 18, 2026

Advanced Image and Video Retrieval Techniques | Cited by 0

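One-sentence summary

TL;DR: SimpleMatch proposes a semantic correspondence baseline built on a lightweight upsampling decoder. It uses sparse matching and window-based localization to reduce memory usage, and it achieves state-of-the-art results at a lower input resolution.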

ABSTRACT

Recent advances in semantic correspondence have been largely driven by the use of pre-trained large-scale models. However, a limitation of these approaches is their dependence on high-resolution input images to achieve optimal performance, which results in considerable computational overhead. In this work, we address a fundamental limitation in current methods: the irreversible fusion of adjacent keypoint features caused by deep downsampling operations. This issue is triggered when semantically distinct keypoints fall within the same downsampled receptive field (e.g., 16x16 patches). To address this issue, we present SimpleMatch, a simple yet effective framework for semantic correspondence that delivers strong performance even at low resolutions. We propose a lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, and a multi-scale supervised loss that ensures the upsampled features retain discriminative features across different spatial scales. In addition, we introduce sparse matching and window-based localization to optimize training memory usage and reduce it by 51%. At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch achieves superior performance with 84.1% PCK@0.1 on the SPair-71k benchmark. We believe this framework provides a practical and efficient baseline for future research in semantic correspondence. Code is available at: https://github.com/hailong23-jin/SimpleMatch.

Research motivation and goals

Motivate the need for efficient semantic correspondence at low input resolutions.
Propose a simple architecture that mitigates irreversible fusion of adjacent keypoints due to downsampling.
Introduce memory-efficient training strategies (sparse matching and window-based localization).
Demonstrate strong empirical performance across standard benchmarks at reduced resolutions.
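
The fusion problem named in the second goal can be checked with simple index arithmetic. The sketch below is illustrative only: the keypoint coordinates are made up, and `cell_index` is a hypothetical helper (not from the paper's code) that maps a pixel to the feature-map cell covering it at a given stride. Strides 16, 8, and 4 correspond to the 1/16, 1/8, and 1/4 resolutions mentioned in the method.

```python
def cell_index(x, y, stride):
    """Map a pixel coordinate to the (col, row) of the feature-map cell covering it."""
    return (x // stride, y // stride)


# Two hypothetical keypoints a few pixels apart, both inside one 16x16 patch.
kp_a = (34, 50)
kp_b = (43, 57)

for stride in (16, 8, 4):
    a = cell_index(*kp_a, stride)
    b = cell_index(*kp_b, stride)
    status = "fused" if a == b else "separate"
    print(f"stride {stride:2d}: {a} vs {b} -> {status}")
```

At stride 16 both keypoints fall into cell (2, 3) and therefore share one feature vector, so no later matching step can tell them apart. At strides 8 and 4 they occupy different cells, which is the detail the upsampling decoder is meant to recover.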

Proposed method

Use a shared encoder to extract deep features.
Apply a lightweight upsampling decoder to recover spatial detail to 1/4 resolution.
Upsample with two parallel branches (transposed convolution and bilinear upsampling), fuse their outputs, and refine the result with a ConvBlock.
Perform sparse matching by computing cosine similarities between a small set of source keypoints and all target locations.
Use window-based localization to refine each keypoint match within a k x k window around its coarse maximum.
Train with a multi-scale loss supervising three decoder resolutions (1/16, 1/8, 1/4).
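
The matching and localization steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sparse_match` and `window_localize` are hypothetical, the refinement is assumed to be a softmax-weighted average of coordinates (a soft-argmax) inside the k x k window, and the `temperature` value is arbitrary. It is written in plain Python so it runs without dependencies; a real implementation would use batched tensor operations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sparse_match(src_feats, tgt_map):
    """Compare N source keypoint features against every target location.

    src_feats: list of N feature vectors, one per source keypoint.
    tgt_map:   H x W grid (nested lists) of target feature vectors.
    Returns N similarity maps of shape H x W, i.e. N*H*W values
    instead of the (H*W)*(H*W) values of a dense correlation.
    """
    return [[[cosine(f, cell) for cell in row] for row in tgt_map]
            for f in src_feats]


def window_localize(sim, k=3, temperature=0.05):
    """Refine the coarse argmax with a soft-argmax over a k x k window (assumed form)."""
    h, w = len(sim), len(sim[0])
    # Coarse match: location of the highest similarity.
    cy, cx = max(((y, x) for y in range(h) for x in range(w)),
                 key=lambda p: sim[p[0]][p[1]])
    r = k // 2
    # Window cells around the coarse maximum, clipped at the map border.
    cells = [(y, x)
             for y in range(max(0, cy - r), min(h, cy + r + 1))
             for x in range(max(0, cx - r), min(w, cx + r + 1))]
    # Softmax weights; subtracting the peak keeps the exponentials stable.
    peak = sim[cy][cx]
    weights = [math.exp((sim[y][x] - peak) / temperature) for y, x in cells]
    total = sum(weights)
    px = sum(wt * x for wt, (y, x) in zip(weights, cells)) / total
    py = sum(wt * y for wt, (y, x) in zip(weights, cells)) / total
    return px, py


# Toy 4x4 target map with 2-D features; only cell (row 2, col 1)
# resembles the single source keypoint feature.
tgt_map = [[[1.0, 0.0] for _ in range(4)] for _ in range(4)]
tgt_map[2][1] = [0.0, 1.0]
sims = sparse_match([[0.0, 1.0]], tgt_map)
px, py = window_localize(sims[0], k=3)
print(round(px, 3), round(py, 3))
```

In the toy example the refined match lands at x = 1, y = 2, the only cell whose feature resembles the source keypoint. Restricting the softmax to a small window means only k*k values per keypoint enter the localization, rather than the whole similarity map.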

Figure 1: Feature map visualizations at different scales. The red dots represent keypoints.

Experimental results

Research questions

RQ1: Can a simple, low-resolution-friendly architecture achieve competitive semantic correspondence performance without heavy 4D decoders or transformers?
RQ2: Does upsampling to 1/4 resolution with a lightweight decoder preserve keypoint discriminability sufficiently for accurate matching?
RQ3: Do sparse matching and window-based localization substantially reduce training memory while maintaining accuracy?
RQ4: What is the impact of multi-scale supervision on representation quality for semantic correspondence?

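Key findings

SimpleMatch achieves strong PCK at low input resolution (e.g., 252x252) and outperforms several SOTA methods on SPair-71k.
Combining window-based localization with sparse matching reduces training memory by about 51%.
Across backbones (ResNet101, iBOT, DINOv2), SimpleMatch reaches competitive or better PCK@0.1 on SPair-71k and PF-PASCAL, with a clear efficiency advantage (65 images/s and 2.8 GB of memory in some settings).
Multi-scale supervision improves performance; removing it causes a notable drop in PCK@0.1.
Raising the feature-map resolution (rather than only the input resolution) yields a larger performance gain.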
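
A rough count of similarity values helps explain why sparse matching matters once the feature map is upsampled. The sketch below is back-of-the-envelope arithmetic, not a reproduction of the reported 51% figure: it counts only correlation entries for one image pair, it assumes that 1/4 resolution at a 252x252 input means a 63x63 feature map, and the choice of 20 keypoints per image is hypothetical. `correlation_entries` is an illustrative helper, not part of the paper's code.

```python
def correlation_entries(input_size, scale, num_keypoints=None):
    """Count similarity values computed for one image pair.

    Dense matching compares every source location with every target
    location. Sparse matching compares only `num_keypoints` source
    locations with every target location.
    """
    side = input_size // scale          # feature-map side length
    locations = side * side             # number of feature-map cells
    sources = locations if num_keypoints is None else num_keypoints
    return sources * locations


dense = correlation_entries(252, 4)                     # 3969 * 3969
sparse = correlation_entries(252, 4, num_keypoints=20)  # 20 * 3969 (assumed count)
print(dense, sparse, dense // sparse)
```

Under these assumptions the dense correlation holds 15,752,961 values against 79,380 for the sparse one. Dense cost grows with the square of the number of feature-map cells, while sparse cost grows linearly, so the gap widens as feature-map resolution increases.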

Figure 2: Illustration of the SimpleMatch structure. The architecture consists solely of a feature extractor and a lightweight upsampling decoder. After obtaining the source and target feature maps, we perform sparse matching and employ window-based localization to enhance training efficiency.

This review was generated by AI and reviewed by a human editor.