QUICK REVIEW

[Paper Review] A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

Qi You, Yitai Cheng|arXiv (Cornell University)|Feb 18, 2026

Advanced Neural Network Applications|Citations: 0

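One-Sentence Summary

This paper proposes CLIP-MHAdapter, a lightweight CLIP adaptation module that combines a bottleneck MLP with multi-head self-attention over patch tokens; with the CLIP backbone frozen, it keeps training cost low on Global StreetScapes while matching or improving on existing methods for fine-grained street-view attribute classification.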

ABSTRACT

Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.

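Motivation and Objectives

Enable accurate fine-grained street-view attribute classification without fully fine-tuning a large model.
Use a lightweight patch-level attention mechanism alongside CLIP to capture local cues in cluttered urban scenes.
Keep the backbone frozen and the trainable module small to preserve efficiency for edge devices.
Mitigate class imbalance in street-view attribute datasets with imbalance-aware weighting.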

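Proposed Method

Freeze CLIP's visual and text backbones and add a bottleneck MLP with multi-head self-attention (MHSA) on the patch tokens.
Process the patch-level CLIP embeddings, apply layer normalization, and then model inter-patch dependencies with MHSA.
Aggregate the patch outputs by average pooling and blend them with the frozen global CLIP feature using a residual coefficient alpha.
Following CLIP's contrastive objective, generate class-specific classifier weights from text prompts via the text encoder.
Train with an imbalance-weighted cross-entropy loss to mitigate class imbalance.
Evaluate on the Global StreetScapes dataset using Accuracy, Macro-F1, Weighted-F1, and Adjusted Balanced Accuracy.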
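
The data flow described in the steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the authors' implementation: the order of operations (down-projection, ReLU, layer norm, attention, up-projection), the identity Q/K/V projections, the direction of the alpha blend (borrowed from the CLIP-Adapter convention), and every name here (`mh_adapter`, `w_down`, `w_up`, `classify`) are hypothetical. The real module would use learned projections and CLIP-sized tensors.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(rows, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale or shift)."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x):
    """Scaled dot-product self-attention; Q, K and V are all x (assumed simplification)."""
    scale = 1.0 / math.sqrt(len(x[0]))
    scores = matmul(x, [list(col) for col in zip(*x)])  # x @ x^T: patch-to-patch scores
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, x)

def mh_adapter(patches, global_feat, w_down, w_up, num_heads=2, alpha=0.5):
    """Assumed flow: bottleneck down -> ReLU -> layer norm -> MHSA -> up -> mean pool -> blend."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(patches, w_down)]
    hidden = layer_norm(hidden)
    # Split the bottleneck channels across heads; each head attends over all patches.
    d_head = len(hidden[0]) // num_heads
    heads = [attention_head([row[i * d_head:(i + 1) * d_head] for row in hidden])
             for i in range(num_heads)]
    attended = [[v for head in heads for v in head[t]] for t in range(len(hidden))]
    up = matmul(attended, w_up)                        # back to the CLIP feature width
    pooled = [sum(col) / len(up) for col in zip(*up)]  # average over patches
    # Residual blend with the frozen global CLIP feature.
    return [alpha * p + (1.0 - alpha) * g for p, g in zip(pooled, global_feat)]

def classify(feature, text_embeddings):
    """Return the index of the class prompt embedding with the highest cosine similarity."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    sims = [cosine(feature, t) for t in text_embeddings]
    return sims.index(max(sims))

# Toy inputs: 3 patches of width 4, bottleneck width 2 (all values are made up).
patches = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0], [2.0, 2.0, 0.0, 1.0]]
global_feat = [0.5, -0.5, 1.0, 0.0]
w_down = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.4], [0.2, 0.2]]
w_up = [[0.5, 0.0, -0.5, 0.1], [0.2, 0.3, 0.0, -0.1]]
blended = mh_adapter(patches, global_feat, w_down, w_up)
```

With `alpha=0.0` the function returns the frozen global feature unchanged, which shows why the residual form is a safe starting point: the adapter can only move the representation as far as alpha allows.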

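Experimental Results

Research Questions

RQ1: Can an adapter with lightweight patch-level attention improve fine-grained street-view image (SVI) attribute classification over existing CLIP adaptation methods?
RQ2: Does introducing a small MHAdapter while keeping the CLIP backbone intact yield a favorable accuracy-efficiency trade-off on cluttered street imagery?
RQ3: How does the method perform under the class imbalance typical of SVI attribute datasets?

Key Findings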

Contextual Attribute	Paradigm	Model	# Trainable Params	Acc. (%)	Macro F1 (%)	Weighted F1 (%)	Adj. Bal. Acc. (%)
Glare	Zero-shot Transfer	ZeroR-Trainer	-	97.21	49.29	95.84	0.00
Glare	Zero-shot Transfer	Zero-shot CLIP	-	3.03	2.96	0.62	0.24
Glare	Vision Transformer	MaxViT	30.9M	94.09	63.15	95.03	39.59
Glare	Parameter-Efficient Adaptation	CLIP-Linear Probe	3K	95.51	53.61	95.24	6.48
Glare	Parameter-Efficient Adaptation	CoOp	8K	96.60	57.27	95.98	10.89
Glare	Parameter-Efficient Adaptation	CLIP-Adapter	0.52M	84.16	53.65	89.16	39.26
Glare	Parameter-Efficient Adaptation	CLIP-MHAdapter	1.38M	95.32	63.68	95.69	32.63
Lighting Condition	Zero-shot Transfer	ZeroR-Trainer	-	64.66	26.18	50.79	0.00
Lighting Condition	Zero-shot Transfer	Zero-shot CLIP	-	95.88	87.65	95.45	76.54
Lighting Condition	Vision Transformer	MaxViT	30.9M	96.23	90.55	96.15	84.50
Lighting Condition	Parameter-Efficient Adaptation	CLIP-Linear Probe	3K	89.48	69.22	88.67	55.07
Lighting Condition	Parameter-Efficient Adaptation	CoOp	8K	94.77	81.50	93.92	68.23
Lighting Condition	Parameter-Efficient Adaptation	CLIP-Adapter	0.52M	93.57	82.91	93.51	74.96
Lighting Condition	Parameter-Efficient Adaptation	CLIP-MHAdapter	1.38M	96.46	90.29	96.35	83.83
Panoramic Status	Zero-shot Transfer	ZeroR-Trainer	-	95.49	48.85	93.28	0.00
Panoramic Status	Zero-shot Transfer	Zero-shot CLIP	-	11.92	11.85	14.18	7.76
Panoramic Status	Vision Transformer	MaxViT	30.9M	99.95	99.73	99.95	99.95
Panoramic Status	Parameter-Efficient Adaptation	CLIP-Linear Probe	3K	87.75	67.79	90.86	87.17
Panoramic Status	Parameter-Efficient Adaptation	CoOp	8K	98.94	94.32	98.98	95.97
Panoramic Status	Parameter-Efficient Adaptation	CLIP-Adapter	0.52M	93.69	77.60	94.87	92.42
Panoramic Status	Parameter-Efficient Adaptation	CLIP-MHAdapter	1.38M	99.40	96.70	99.42	98.40
Platform	Zero-shot Transfer	ZeroR-Trainer	-	31.69	8.02	15.25	0.00
Platform	Zero-shot Transfer	Zero-shot CLIP	-	60.98	43.19	60.80	45.99
Platform	Vision Transformer	MaxViT	30.9M	68.28	56.69	69.21	49.87
Platform	Parameter-Efficient Adaptation	CLIP-Linear Probe	3K	63.14	52.88	64.20	66.11
Platform	Parameter-Efficient Adaptation	CoOp	8K	65.04	58.82	61.64	65.82
Platform	Parameter-Efficient Adaptation	CLIP-Adapter	0.52M	68.12	57.15	69.21	71.44
Platform	Parameter-Efficient Adaptation	CLIP-MHAdapter	1.38M	69.12	60.79	67.27	64.93
Quality	Zero-shot Transfer	ZeroR-Trainer	-	90.84	31.73	86.48	0.00
Quality	Zero-shot Transfer	Zero-shot CLIP	-	7.40	7.32	8.07	1.43
Quality	Vision Transformer	MaxViT	30.9M	79.88	40.95	83.41	27.32
Quality	Parameter-Efficient Adaptation	CLIP-Linear Probe	3K	86.57	53.18	87.41	33.23
Quality	Parameter-Efficient Adaptation	CoOp	8K	92.03	42.96	89.79	11.56
Quality	Parameter-Efficient Adaptation	CLIP-Adapter	0.52M	78.69	50.80	82.99	43.80
Quality	Parameter-Efficient Adaptation	CLIP-MHAdapter	1.38M	89.08	61.46	89.62	43.78
Reflection	Zero-shot Transfer	ZeroR-Trainer	-	72.58	42.06	61.05	0.00
Reflection	Zero-shot Transfer	Zero-shot CLIP	-	60.26	46.35	58.69	-6.37
Reflection	Vision Transformer	MaxViT	30.9M	78.72	75.67	79.56	57.61
Reflection	Parameter-Efficient Adaptation	CLIP-Linear Probe	3K	74.94	68.19	74.81	36.02
Reflection	Parameter-Efficient Adaptation	CoOp	8K	74.66	58.75	70.32	17.10
Reflection	Parameter-Efficient Adaptation	CLIP-Adapter	0.52M	58.75	45.90	57.81	-7.70
Reflection	Parameter-Efficient Adaptation	CLIP-MHAdapter	1.38M	76.69	64.93	74.10	26.97
View Direction	Zero-shot Transfer	ZeroR-Trainer	-	88.52	46.95	83.13	0.00
View Direction	Zero-shot Transfer	Zero-shot CLIP	-	37.77	35.62	44.69	16.52
View Direction	Vision Transformer	MaxViT	30.9M	87.38	77.99	89.06	82.35
View Direction	Parameter-Efficient Adaptation	CLIP-Linear Probe	3K	89.51	76.96	90.06	60.65
View Direction	Parameter-Efficient Adaptation	CoOp	8K	92.89	80.87	92.55	56.56
View Direction	Parameter-Efficient Adaptation	CLIP-Adapter	0.52M	87.57	76.29	88.89	69.39
View Direction	Parameter-Efficient Adaptation	CLIP-MHAdapter	1.38M	95.28	87.95	95.19	73.19
Weather	Zero-shot Transfer	ZeroR-Trainer	-	23.90	7.72	9.22	0.00
Weather	Zero-shot Transfer	Zero-shot CLIP	-	74.43	69.33	74.13	77.95
Weather	Vision Transformer	MaxViT	30.9M	75.47	59.90	74.18	51.04
Weather	Parameter-Efficient Adaptation	CLIP-Linear Probe	3K	57.04	59.39	56.78	56.80
Weather	Parameter-Efficient Adaptation	CoOp	8K	84.87	85.92	84.82	82.64
Weather	Parameter-Efficient Adaptation	CLIP-Adapter	0.52M	88.01	87.69	88.08	86.72
Weather	Parameter-Efficient Adaptation	CLIP-MHAdapter	1.38M	81.84	85.08	82.04	83.60
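
The balanced-accuracy column reads 0.00 for every ZeroR-Trainer row and goes negative in two Reflection rows, which is what a chance-adjusted balanced accuracy would produce. The review does not state the formula, so the sketch below assumes the common definition (mean per-class recall, rescaled so that chance level is 0 and a perfect score is 100). The label names and the 97/3 class split are made up for illustration.

```python
from collections import Counter

def adjusted_balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, rescaled so chance level is 0 and a perfect score is 100."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / support[c] for c in support]
    balanced = sum(recalls) / len(recalls)
    chance = 1.0 / len(support)  # requires at least two classes in y_true
    return 100.0 * (balanced - chance) / (1.0 - chance)

# A majority-class predictor on an imbalanced binary attribute.
y_true = ["no_glare"] * 97 + ["glare"] * 3
majority = ["no_glare"] * 100
raw_accuracy = 100.0 * sum(t == p for t, p in zip(y_true, majority)) / len(y_true)
print(raw_accuracy)                                  # 97.0
print(adjusted_balanced_accuracy(y_true, majority))  # 0.0
```

Under this definition, high raw accuracy on a skewed attribute says little by itself: a model that always predicts the majority class scores 0, and a model that does worse than chance on per-class recall scores below 0.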

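CLIP-MHAdapter is competitive with or better than the fully trained baseline across the eight Global StreetScapes attributes: in the table above it has the highest accuracy on Lighting Condition, Platform, and View Direction, and the highest Macro F1 on Glare, Platform, Quality, and View Direction.
It uses about 1.4M trainable parameters (1.38M in the table), far fewer than full fine-tuning requires; for comparison, the MaxViT baseline trains 30.9M, roughly 22 times as many.
The MHAdapter effectively captures inter-patch dependencies and local spatial cues, improving fine-grained attribute recognition.
Imbalance-aware weighting mitigates performance bias across classes and makes the evaluation fairer overall.
The prompt-based text classifier in CLIP-MHAdapter leverages the frozen text encoder to achieve stable cross-modal alignment.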
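
The review does not say which imbalance-aware weighting scheme the paper uses. As one plausible instance, the sketch below computes inverse-frequency ("balanced") class weights and applies them in a weighted cross-entropy. The function names, the weighting formula, and the toy labels are assumptions for illustration only.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs is a list of {class: probability} dicts."""
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den

# Toy attribute with an 8:2 class split (made-up labels).
labels = ["clear"] * 8 + ["rain"] * 2
weights = balanced_class_weights(labels)  # rare class gets the larger weight

# A model that is confident on the common class and poor on the rare one.
probs = [{"clear": 0.9, "rain": 0.1}] * 10
plain = weighted_cross_entropy(probs, labels, {"clear": 1.0, "rain": 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
```

Here `weighted` comes out larger than `plain`, because errors on the rare class count for more; that is the mechanism by which such weighting discourages a model from ignoring minority classes.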

This review was generated by AI and checked by a human editor.