QUICK REVIEW

[Paper Review] Learning Representations by Maximizing Mutual Information Across Views

Philip Bachman, R Devon Hjelm|arXiv (Cornell University)|Jun 3, 2019

Machine Learning and Algorithms | References: 45 | Citations: 678

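One-sentence summary

AMDIM learns self-supervised image representations by maximizing mutual information between augmented and multiscale views of each image. It reaches 68.1% accuracy on ImageNet under linear evaluation and shows strong results on STL10 and Places205.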

ABSTRACT

We propose an approach to self-supervised representation learning based on maximizing mutual information between features extracted from multiple views of a shared context. For example, one could produce multiple views of a local spatio-temporal context by observing it from different locations (e.g., camera positions within a scene), and via different modalities (e.g., tactile, auditory, or visual). Or, an ImageNet image could provide a context from which one produces multiple views by repeatedly applying data augmentation. Maximizing mutual information between features extracted from these views requires capturing information about high-level factors whose influence spans multiple views -- e.g., presence of certain objects or occurrence of certain events. Following our proposed approach, we develop a model which learns image representations that significantly outperform prior methods on the tasks we consider. Most notably, using self-supervised learning, our model learns representations which achieve 68.1% accuracy on ImageNet using standard linear evaluation. This beats prior results by over 12% and concurrent results by 7%. When we extend our model to use mixture-based representations, segmentation behaviour emerges as a natural side-effect. Our code is available online: https://github.com/Philip-Bachman/amdim-public.

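Motivation and objectives

Motivate unsupervised representation learning as a way to reduce dependence on labeled data.
Develop a self-supervised objective based on mutual information across multiple views.
Extend the earlier local DIM with augmented views, multiscale prediction, and a stronger encoder.
Explore mixture-based representations, which can give rise to segmentation-like behaviour.
Demonstrate state-of-the-art performance on standard vision benchmarks.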

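Proposed method

Extend local Deep InfoMax (DIM) to Augmented Multiscale DIM (AMDIM).
Maximize mutual information between features extracted from independently augmented copies of each input.
Predict across multiple feature scales (multiscale infomax).
Use a stronger encoder architecture and a contrastive NCE-based bound computed with negative samples.
Use data augmentation to create diverse views of the same context.
Introduce mixture-based representations with an entropy regularization term.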
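
The contrastive objective in the list above can be illustrated with a minimal InfoNCE-style sketch. This is not the authors' implementation: it assumes one feature vector per augmented view and a plain dot-product score, and it leaves out the multiscale predictions, the mixture-based representations, and any regularization. The function names (`nce_loss`, `nce_loss_symmetric`, `mi_lower_bound`) are hypothetical, and only the Python standard library is used so the example stays self-contained.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def nce_loss(feats_a, feats_b):
    """InfoNCE-style loss between two batches of view features.

    feats_a[i] and feats_b[i] are assumed to come from two independent
    augmentations of image i (the positive pair); feats_b[j] with j != i
    serve as negative samples. Returns the mean negative log-probability
    of identifying the positive among all candidates.
    """
    n = len(feats_a)
    total = 0.0
    for i in range(n):
        scores = [dot(feats_a[i], feats_b[j]) for j in range(n)]
        total -= log_softmax(scores)[i]
    return total / n


def nce_loss_symmetric(feats_a, feats_b):
    """Average the loss over both prediction directions (a->b and b->a)."""
    return 0.5 * (nce_loss(feats_a, feats_b) + nce_loss(feats_b, feats_a))


def mi_lower_bound(loss, num_candidates):
    """InfoNCE bound: mutual information >= log(N) - loss (in nats)."""
    return math.log(num_candidates) - loss
```

Minimizing this loss pushes the score of each positive pair above the scores of the negatives, which tightens the lower bound on mutual information returned by `mi_lower_bound`. With uninformative (all-zero) features the loss equals log N and the bound is zero.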

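Experimental results

Research questions

RQ1: Does maximizing mutual information across augmented views improve the learned representations over prior self-supervised methods?
RQ2: How does incorporating multiple scales and mixture-based features affect performance and emergent behaviour?
RQ3: How do data augmentation strategies and NCE regularization affect representation quality?
RQ4: Can AMDIM scale to large datasets such as ImageNet and transfer to other datasets such as Places205?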

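Key findings

AMDIM achieves 68.1% accuracy on ImageNet under linear evaluation, beating prior results by over 12%.
AMDIM reaches over 94% accuracy on STL10 under linear evaluation, without fine-tuning the encoder.
On Places205 it achieves 55% accuracy, 7% above the previous best.
Multiscale and augmentation-based views substantially improve performance over the baseline local DIM.
Mixture-based representations exhibit segmentation-like behaviour and offer potential gains on STL10 tasks.
The method is evaluated on CIFAR-10/100, STL10, ImageNet, and Places205, with competitive results.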
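
The accuracy figures above come from linear evaluation, in which the encoder stays frozen and only a linear classifier is trained on its features. The following is a minimal sketch of that protocol, not the paper's evaluation code: it assumes the frozen features are already available as lists of floats, uses full-batch gradient descent on a softmax classifier, and all names (`train_linear_probe`, `predict`, `accuracy`) and hyperparameters (`lr`, `epochs`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def logits(W, b, x):
    """Class scores of the linear layer for one feature vector."""
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax classifier on fixed features with gradient descent.

    Only the linear layer (W, b) is updated; the features are treated as
    constants, which mirrors evaluating a frozen encoder.
    """
    dim = len(features[0])
    n = len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(logits(W, b, x))
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. the logit of class c.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for d in range(dim):
                    grad_W[c][d] += err * x[d] / n
        for c in range(num_classes):
            b[c] -= lr * grad_b[c]
            for d in range(dim):
                W[c][d] -= lr * grad_W[c][d]
    return W, b


def predict(W, b, x):
    """Index of the highest-scoring class."""
    scores = logits(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(W, b, features, labels):
    """Fraction of examples whose predicted class matches the label."""
    hits = sum(predict(W, b, x) == y for x, y in zip(features, labels))
    return hits / len(labels)
```

Because the classifier is linear, the resulting accuracy reflects how linearly separable the classes are in the learned feature space rather than the capacity of the classifier itself.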

This review was generated by AI and checked by a human editor.