QUICK REVIEW

[Paper Review] Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues

Marco Willi, Melanie Mathys|arXiv (Cornell University)|Feb 12, 2026

Generative Adversarial Networks and Image Synthesis|Citations: 0

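One-Line Summary

This paper analyzes CLIP-based synthetic image detectors, introduces the paired SynthCLIC dataset, explains the high-level semantic cues CLIP relies on to separate real from synthetic images, and shows that generalization varies with the type of generative model.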

ABSTRACT

Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs--unfortunately, SID methods often struggle to generalize to novel generative models and often perform poorly in practical settings. CLIP, a foundational vision-language model which yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet, the underlying relevant cues embedded in CLIP-features remain unknown. It is unclear, whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept-model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.

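Motivation and Objectives

Position SID as a trust and safety problem, given today's high-fidelity generative models.
Introduce SynthCLIC, which reduces semantic bias and enables robust evaluation across diffusion models.
Investigate what CLIP-based detectors learn, using an interpretable linear head and a concept-based vocabulary.
Evaluate how CLIP-based SID generalizes between GAN and diffusion generators.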

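Proposed Method

Use a frozen CLIP vision encoder (ViT-L/14-336) and add two learnable linear layers that project the [CLS] token into a low-dimensional space.
Impose an orthogonality constraint on the projected activations to encourage de-correlated, interpretable features.
Apply a concept-modeling framework (sparse linear CDMs) with a photography-focused vocabulary to identify visual cues.
Ground the learned representation in CLIP's text space by comparing the projection directions with vocabulary embeddings (TextSpan and an antonym-based vocabulary).
Evaluate mAP and ablations on three datasets (CNNSpot, SynthBuster+, SynthCLIC), including cross-dataset generalization tests.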
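The head described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are random stand-ins for frozen CLIP [CLS] features, the names `project`, `head_forward`, and `decorrelation_penalty` are hypothetical, and the penalty (sum of squared pairwise Pearson correlations between activation dimensions) is only one plausible way to express the de-correlation constraint, since this review does not give the exact loss.

```python
import math
import random

def project(x, W):
    """Linear projection W @ x, with W given as a list of k rows of length d."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def head_forward(x, W, v, b):
    """Two stacked linear layers: embedding -> k activations -> one logit."""
    z = project(x, W)
    logit = sum(vj * zj for vj, zj in zip(v, z)) + b
    return logit, z

def pearson(a, b):
    """Pearson correlation of two sequences (0.0 if either one is constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def decorrelation_penalty(Z):
    """Sum of squared correlations over all pairs of activation dimensions.

    Z is a batch of activation vectors (one list of k values per image).
    Assumed form of the constraint; the paper's exact loss may differ.
    """
    dims = list(zip(*Z))  # transpose: one sequence per activation dimension
    return sum(
        pearson(dims[i], dims[j]) ** 2
        for i in range(len(dims))
        for j in range(i + 1, len(dims))
    )

# Toy stand-ins for frozen CLIP embeddings (d = 8) and a k = 2 head.
rng = random.Random(0)
d, k = 8, 2
batch = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(16)]
W = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(k)]
v, b = [1.0, -1.0], 0.0

Z = [head_forward(x, W, v, b)[1] for x in batch]
print("decorrelation penalty:", decorrelation_penalty(Z))
```

In training, a term like this would typically be added to the classification loss with a small weight, so that each of the k activations is pushed toward capturing a distinct cue; the weighting is likewise an assumption here.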

Figure 1: Synthetic images—even those generated by recent, high-quality generative models—differ from real photographs in subtle aspects. The figure shows a real image (left) and four paired synthetic variants from the SynthCLIC dataset. Shown are the most relevant terms (absolute logit contribution

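Experimental Results

Research Questions

RQ1: How well do CLIP-based detectors scale from GAN-based synthetic images to modern diffusion-based synthetic images on a realistic paired dataset?
RQ2: Can CLIP-based classification be explained using an orthogonal linear head specialized for SID and/or human-interpretable concepts?
RQ3: Which visual and photographic attributes in CLIP representations drive the separation of real and synthetic images across datasets?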

Key Findings

Training Set	Test Set	CNNSpot	SynthBuster+	SynthCLIC	Combined
CNNSpot	CNNSpot	0.96	0.67	0.37	0.84
CNNSpot	SynthBuster+	0.66	0.99	0.79	0.96
CNNSpot	SynthCLIC	0.56	0.64	0.92	0.87
SynthBuster+	CNNSpot	0.97	0.67	0.38	0.84
SynthBuster+	SynthBuster+	0.61	0.99	0.78	0.96
SynthBuster+	SynthCLIC	0.52	0.64	0.92	0.88
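The scores in the table are mAP values. As a rough illustration of the metric, the sketch below computes non-interpolated average precision (AP) for a ranked list of detector scores and then averages it over several test subsets. The function names and the `subsets` grouping are assumptions made for this example; the review does not state over which groups (for instance, generators) the paper averages AP.

```python
def average_precision(labels, scores):
    """Non-interpolated AP for binary labels (1 = synthetic, 0 = real).

    Images are ranked by descending score; precision is accumulated at
    the rank of every positive and divided by the number of positives.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(subsets):
    """Mean of per-subset AP; `subsets` maps a name to (labels, scores)."""
    aps = [average_precision(labels, scores) for labels, scores in subsets.values()]
    return sum(aps) / len(aps)

# Toy scores for two hypothetical test subsets (not data from the paper).
toy_subsets = {
    "subset_a": ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
    "subset_b": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
}
print("mAP:", mean_average_precision(toy_subsets))
```

Because AP depends only on the ranking of scores, it needs no decision threshold, which makes it convenient for comparing detectors across datasets with different score calibrations.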

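CLIP-based detectors reach 0.96 mAP on CNNSpot (GAN-based) and 0.92 mAP on SynthCLIC (diffusion-based).
Cross-dataset generalization is weak, dropping to as low as 0.37 mAP.
The detectors rely on high-level photographic attributes such as minimalist style, lens flare, and depth layering rather than on overt generator-specific artifacts.
The linear heads on top of the CLIP features are nearly orthogonal, which suggests that several distinct factors contribute to SID.
SynthCLIC reduces semantic bias compared with earlier paired datasets, but generalization across generator families remains uneven.
Interpreting the learned directions with the vocabulary links them to perceptual cues such as depth layering and minimalism, consistent with the artifacts observed in synthetic images.
Varying the projection dimension k (2 to 16) across datasets has only a limited effect on mAP (≤0.03 absolute).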
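The vocabulary-based interpretation mentioned in the findings can be illustrated with a simple ranking of terms by similarity to a learned direction. This is a hedged sketch only: the three-dimensional vectors are toy placeholders rather than real CLIP text embeddings, `cosine` and `top_terms` are hypothetical helpers, and plain cosine similarity is a stand-in for the TextSpan and antonym-vocabulary grounding that the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

def top_terms(direction, vocab, n=2):
    """Rank vocabulary terms by absolute cosine similarity to a direction.

    The sign is kept so a term can be read as pushing toward one class
    or the other.
    """
    scored = [(term, cosine(direction, emb)) for term, emb in vocab.items()]
    return sorted(scored, key=lambda t: abs(t[1]), reverse=True)[:n]

# Toy placeholders for text embeddings of photographic terms.
toy_vocab = {
    "lens flare": [1.0, 0.0, 0.0],
    "depth layering": [0.0, 1.0, 0.0],
    "minimalist style": [0.0, 0.0, 1.0],
}
# Toy placeholder for one learned projection direction.
direction = [0.1, 0.9, -0.2]

for term, score in top_terms(direction, toy_vocab):
    print(f"{term}: {score:+.3f}")
```

With real embeddings, the direction and the vocabulary vectors would first have to live in the same joint image-text space; how the paper maps the projection directions into that space is not described in this review.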

Figure 2: Examples from the SynthBuster+ dataset. Different paired images are shown in each row. Each column depicts a different image source, starting with real photographs from the RAISE-1K dataset, followed by synthetic images from the Synthbuster dataset and images added b

This review was generated by AI and checked by a human editor.