QUICK REVIEW

[Paper Review] SimpleMatch: A Simple and Strong Baseline for Semantic Correspondence

Hailing Jin, Huiying Li|arXiv (Cornell University)|Jan 18, 2026

Advanced Image and Video Retrieval Techniques | Citations: 0

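One-line summary

SimpleMatch is a lightweight, upsampling-based baseline for semantic correspondence that uses sparse matching and window-based localization, cutting training memory use while achieving near state-of-the-art results at low input resolution.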

ABSTRACT

Recent advances in semantic correspondence have been largely driven by the use of pre-trained large-scale models. However, a limitation of these approaches is their dependence on high-resolution input images to achieve optimal performance, which results in considerable computational overhead. In this work, we address a fundamental limitation in current methods: the irreversible fusion of adjacent keypoint features caused by deep downsampling operations. This issue is triggered when semantically distinct keypoints fall within the same downsampled receptive field (e.g., 16x16 patches). To address this issue, we present SimpleMatch, a simple yet effective framework for semantic correspondence that delivers strong performance even at low resolutions. We propose a lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, and a multi-scale supervised loss that ensures the upsampled features retain discriminative features across different spatial scales. In addition, we introduce sparse matching and window-based localization to optimize training memory usage and reduce it by 51%. At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch achieves superior performance with 84.1% PCK@0.1 on the SPair-71k benchmark. We believe this framework provides a practical and efficient baseline for future research in semantic correspondence. Code is available at: https://github.com/hailong23-jin/SimpleMatch.

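Motivation and objectives

Motivate the need for efficient semantic correspondence at low input resolution.
Propose a simple architecture that mitigates the irreversible fusion of adjacent keypoints caused by downsampling.
Introduce a memory-efficient training strategy (sparse matching and window-based localization).
Demonstrate empirical performance at low resolution on standard benchmarks.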
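
The keypoint-fusion problem can be illustrated with a little arithmetic. The sketch below is a simplification that treats each feature-map cell as a non-overlapping patch of `stride` pixels; the function name `feature_cell` and the keypoint coordinates are hypothetical and chosen only for illustration.

```python
def feature_cell(keypoint_xy, stride):
    """Map a pixel coordinate (x, y) to the feature-map cell that contains it."""
    x, y = keypoint_xy
    return (x // stride, y // stride)

# Two hypothetical keypoints 5 px apart on a 252x252 image (illustrative values).
kp_a, kp_b = (98, 50), (103, 50)

for stride in (16, 8, 4):
    a, b = feature_cell(kp_a, stride), feature_cell(kp_b, stride)
    status = "fused" if a == b else "separate"
    print(f"stride {stride:2d}: {a} vs {b} -> {status}")
```

Under this simplification the two keypoints share one cell at strides 16 and 8 and only fall into different cells at stride 4, which is the resolution the decoder upsamples to.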

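Proposed method

Extract deep features with a shared encoder.
Apply a lightweight upsampling decoder that recovers spatial detail up to 1/4 resolution.
Fuse two parallel upsampling branches (transposed convolution and 2D interpolation), then refine with a ConvBlock.
Perform sparse matching by computing cosine similarity between a small set of source keypoints and all target positions.
Use window-based localization to refine each keypoint match within a k x k neighborhood around the coarse maximum.
Train with a multi-scale loss that supervises three decoder resolutions (1/16, 1/8, 1/4).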
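
The sparse matching and window-based localization steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the window size `k=5`, the `temperature` value, and the softmax-weighted refinement inside the window are all assumptions, since the review does not specify how the refinement is computed.

```python
import numpy as np

def sparse_match(src_feat, tgt_feat, src_kps):
    """Cosine similarity between N source keypoint features and all target cells.

    src_feat, tgt_feat: (C, H, W) feature maps.
    src_kps: (N, 2) integer (x, y) cell coordinates in the source map.
    Returns an (N, H, W) similarity volume instead of a dense (HW, HW) one.
    """
    C, H, W = tgt_feat.shape
    q = src_feat[:, src_kps[:, 1], src_kps[:, 0]].T            # (N, C)
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    t = tgt_feat.reshape(C, H * W)
    t = t / (np.linalg.norm(t, axis=0, keepdims=True) + 1e-8)
    return (q @ t).reshape(-1, H, W)

def window_localize(sim, k=5, temperature=0.05):
    """Refine each coarse argmax with a softmax-weighted mean over a k x k window.

    The weighting scheme is an assumption made for this sketch.
    """
    N, H, W = sim.shape
    r = k // 2
    out = np.zeros((N, 2))
    for i in range(N):
        cy, cx = np.unravel_index(np.argmax(sim[i]), (H, W))   # coarse maximum
        y0, y1 = max(cy - r, 0), min(cy + r + 1, H)            # clip window to map
        x0, x1 = max(cx - r, 0), min(cx + r + 1, W)
        win = sim[i, y0:y1, x0:x1]
        w = np.exp((win - win.max()) / temperature)
        w /= w.sum()
        ys, xs = np.mgrid[y0:y1, x0:x1]
        out[i] = [(w * xs).sum(), (w * ys).sum()]              # sub-cell (x, y)
    return out

# Sanity check: matching a random feature map against itself.
rng = np.random.default_rng(0)
feat = rng.standard_normal((32, 63, 63))    # 63 = 252 / 4, i.e. 1/4 resolution
kps = np.array([[10, 20], [40, 5]])
sim = sparse_match(feat, feat, kps)
refined = window_localize(sim)
```

Restricting the refinement to a small window keeps the position estimate from being pulled toward distant, spuriously similar cells, and computing similarities only for the annotated source keypoints avoids building the full cell-by-cell correlation.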

Figure 1 : Feature map visualizations at different scales. The red dots represent keypoints.

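Experimental results

Research questions

RQ1: Can a simple, low-resolution-friendly architecture achieve competitive semantic correspondence performance without a 4D decoder or transformers?
RQ2: Can upsampling to 1/4 resolution with a lightweight decoder preserve enough keypoint discriminability for accurate matching?
RQ3: Can sparse matching and window-based localization substantially reduce training memory while maintaining accuracy?
RQ4: How does multi-scale supervision affect representation quality for semantic correspondence?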
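
The intuition behind RQ3 can be checked by counting the entries of the similarity tensor. The sketch below compares a dense all-pairs correlation with a sparse one at 1/4 resolution; the helper name and the choice of 20 source keypoints are hypothetical. It covers only this one tensor, so it does not reproduce the overall training-memory reduction reported in the paper.

```python
def correlation_entries(input_size, stride, num_src_points=None):
    """Number of similarity values for a square input at the given stride.

    Dense matching compares every source cell with every target cell;
    sparse matching compares only `num_src_points` source keypoints.
    """
    side = input_size // stride
    cells = side * side
    rows = cells if num_src_points is None else num_src_points
    return rows * cells

dense = correlation_entries(252, 4)                       # all-pairs
sparse = correlation_entries(252, 4, num_src_points=20)   # 20 keypoints (assumed)
print(f"dense: {dense:,}  sparse: {sparse:,}  ratio: {dense / sparse:.1f}x")
```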

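Key findings

SimpleMatch achieves strong PCK at low input resolution (e.g., 252x252), reaching 84.1% PCK@0.1 on SPair-71k and outperforming several SOTA methods.
Combining window-based localization with sparse matching reduces training memory by about 51%.
Across backbones (ResNet101, iBOT, DINOv2), it achieves competitive or superior PCK@0.1 on SPair-71k and PF-PASCAL, with notable efficiency (65 images/s and 2.8 GB of memory in one particular setting).
Multi-scale supervision improves performance; removing it causes a measurable drop in PCK@0.1.
Raising the feature-map resolution yields larger gains than raising the input resolution alone.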
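
For readers unfamiliar with the metric, PCK@0.1 can be computed as sketched below. The sketch assumes the common convention in which a predicted keypoint counts as correct when it lies within 0.1 times the longer side of the object bounding box from the ground truth; the function name and the sample coordinates are illustrative and are not data from the paper.

```python
import numpy as np

def pck(pred, gt, bbox_hw, alpha=0.1):
    """Fraction of keypoints within alpha * max(bbox_h, bbox_w) of the ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)       # per-keypoint pixel error
    return float((dist <= alpha * max(bbox_hw)).mean())

# Illustrative values only: three keypoints, bounding box of 180 x 200 px.
gt = [[50, 60], [120, 80], [200, 150]]
pred = [[52, 61], [150, 80], [205, 140]]
score = pck(pred, gt, bbox_hw=(180, 200))
print(f"PCK@0.1 = {score:.3f}")
```

Here the threshold is 20 px, so the second prediction (30 px off) is the only one counted as wrong.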

Figure 2 : Illustration of SimpleMatch structure. The architecture consists solely of a feature extractor and a lightweight upsampling decoder. After obtaining the source and target feature maps, we perform sparse matching and employ window-based localization to enhance training efficiency.


This review was generated by AI and checked by a human editor.