QUICK REVIEW

[Paper Review] Contrastive Multiview Coding

Yonglong Tian, Dilip Krishnan|arXiv (Cornell University)|Jun 13, 2019

Image Enhancement Techniques | References: 86 | Cited by: 571

One-Sentence Summary

This paper introduces Contrastive Multiview Coding (CMC), a self-supervised method that learns view-invariant representations by maximizing mutual information across multiple image channels or views, and shows state-of-the-art results on image and video benchmarks. More views improve representation quality, and contrastive learning outperforms cross-view prediction.

ABSTRACT

Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt). We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors. We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics. Our approach achieves state-of-the-art results on image and video unsupervised learning benchmarks. Code is released at: http://github.com/HobbitLong/CMC/.

Motivation and Objectives

Motivate learning compact representations that capture shared, semantically meaningful information across multiple sensory views.
Develop a scalable multiview contrastive learning framework that maximizes mutual information between views.
Investigate how increasing the number of views affects representation quality.
Compare contrastive multiview learning to cross-view prediction and predictive learning.
Evaluate transferability of learned representations to downstream recognition and segmentation tasks.

Proposed Method

Define M views V_1, ..., V_M and an encoder f_i for each view, producing the latent z_i = f_i(v_i).
Use a contrastive objective that distinguishes positive pairs (congruent views of the same scene) from negative pairs (views of different scenes), via a score h_theta computed as the exponentiated cosine similarity of the z-vectors, scaled by a temperature.
Two-view loss L_contrast^{V1,V2} is applied in both directions and summed to form L(V1,V2).
Extend to core-view and full-graph formulations for multiple views: core-view sums L(V1,Vj) for j>1; full-graph sums L(Vi,Vj) over all i<j.
Negative sampling and memory bank: approximate full softmax with a manageable number of negatives and store latent features in a memory bank for efficient contrasting.
Relate the optimal critic to density ratios and mutual information, with a bound I(z1;z2) ≥ log(k) − L_contrast, where k is the number of negatives.
Provide an empirical comparison showing that contrastive learning better captures shared information than predictive (reconstruction) learning across views.
Apply to images (e.g., Lab L vs ab channels; Y vs DbDr) and videos (RGB frames and optical flow) and extend to NYU-Depth-V2 with more views (L, ab, depth, surface normals).
Utilize a two-encoder architecture with a shared contrastive objective, optional memory bank, and data augmentations; evaluate transfer via linear probes and segmentation-style tasks.
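
The two-view objective and its core-view and full-graph extensions described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes latent vectors as already produced by the encoders, uses the other samples in a batch as negatives instead of a memory bank, and the function names (`contrast_loss`, `two_view_loss`, `core_view_loss`, `full_graph_loss`, `mi_lower_bound`) and the default temperature `tau=0.1` are hypothetical choices made for this example.

```python
import math
import random
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrast_loss(anchors, candidates, tau=0.1):
    """One-directional contrastive loss.

    For anchor i, candidates[i] is the positive (same scene) and every other
    row is a negative (different scene). Assumption: in-batch negatives stand
    in for the memory bank mentioned in the summary.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / tau for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]  # -log softmax probability of the positive
    return total / len(anchors)


def two_view_loss(z1, z2, tau=0.1):
    """L(V1, V2): the one-directional loss applied in both directions and summed."""
    return contrast_loss(z1, z2, tau) + contrast_loss(z2, z1, tau)


def core_view_loss(views, tau=0.1):
    """Core-view: sum L(V1, Vj) for j > 1, with views[0] as the core view."""
    return sum(two_view_loss(views[0], v, tau) for v in views[1:])


def full_graph_loss(views, tau=0.1):
    """Full-graph: sum L(Vi, Vj) over all pairs i < j."""
    return sum(two_view_loss(a, b, tau) for a, b in combinations(views, 2))


def mi_lower_bound(loss, k):
    """Bound from the summary: I(z1; z2) >= log(k) - L_contrast, k negatives."""
    return math.log(k) - loss


if __name__ == "__main__":
    random.seed(0)
    # Toy data: 16 "scenes", each seen through 4 noisy views of a shared latent.
    scenes = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
    views = [
        [[s + random.gauss(0, 0.1) for s in scene] for scene in scenes]
        for _ in range(4)
    ]
    one_way = contrast_loss(views[0], views[1])
    print("one-directional loss:", one_way)
    print("MI lower bound (k = 15):", mi_lower_bound(one_way, k=15))
    print("core-view loss:", core_view_loss(views))
    print("full-graph loss:", full_graph_loss(views))
```

With M views, the core-view sum has M − 1 pairwise terms while the full-graph sum has M(M − 1)/2, so for the four NYU-Depth-V2 views (L, ab, depth, surface normals) the sketch evaluates 3 and 6 two-view losses respectively.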

Experimental Results

Research Questions

RQ1: Can contrastive multiview learning learn view-invariant, semantically meaningful representations across multiple image channels?
RQ2: Does increasing the number of views improve the quality of learned representations for downstream tasks?
RQ3: Is the contrastive objective superior to cross-view prediction or predictive learning for multiview representation learning?
RQ4: How do core-view and full-graph multiview formulations trade off efficiency and information capture?
RQ5: How well do CMC representations transfer to image classification, video recognition, and segmentation tasks?

Key Findings

CMC achieves state-of-the-art results on the image and video unsupervised learning benchmarks reported in the paper.
Representation quality improves as the number of views increases (e.g., across NYU-Depth-V2 experiments).
The contrastive objective outperforms cross-view prediction and predictive learning in several view combinations and datasets.
Full-graph multiview formulations provide robust representations across all views and can approach supervised performance on some tasks.
On ImageNet, two-view CMC with luminance and chrominance (L,ab) or other color-space splits yields competitive top-1/top-5 accuracy; increasing model width and using additional views further improves results.
In video tasks, CMC with RGB frames and optical flow outperforms several baselines and improves transfer to action recognition datasets.

This review was generated by AI and reviewed by a human editor.