QUICK REVIEW

[Paper Review] An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Rosanne Liu, Joel Lehman|arXiv (Cornell University)|Jul 9, 2018

Domain Adaptation and Few-Shot Learning | References: 26 | Citations: 169

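One-Line Summary

This paper shows that CNNs struggle with coordinate transforms between Cartesian space and pixel space, and introduces CoordConv, which supplies coordinate channels to convolution. It demonstrates marked improvements across a range of tasks: faster training, better generalization, and reduced mode collapse in generative models.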

ABSTRACT

Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either complete translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem with perfect generalization and 150 times faster with 10--100 times fewer parameters than convolution. This stark contrast raises the question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will require further investigation, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST showed 24% better IOU when using CoordConv, and in the RL domain agents playing Atari games benefit significantly from the use of CoordConv layers.

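Research Motivation and Objectives

Define the coordinate transform problem between Cartesian space and pixel space.
Propose CoordConv as a simple layer extension that injects coordinate information.
Evaluate CoordConv on toy tasks and real-world tasks to assess training, generalization, and efficiency.
Show the effect of CoordConv on image generation, object detection, and reinforcement learning.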

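Proposed Method

Introduce CoordConv, which appends extra coordinate channels to the input before convolution.
Study coordinate transforms in a supervised setting using the Not-so-Clevr dataset.
Compare standard convolutional layers with CoordConv on classification, regression, and rendering tasks.
Evaluate CoordConv in GANs, VAEs, Faster R-CNN, and Atari reinforcement learning to gauge its broader impact.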
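
The coordinate-channel idea can be illustrated with a minimal sketch, assuming NumPy arrays in (batch, height, width, channels) layout. The function name `add_coord_channels` is hypothetical, and scaling the coordinates to [-1, 1] is an assumed normalization; this is not the authors' implementation. An ordinary convolution would then be applied to the returned array.

```python
import numpy as np

def add_coord_channels(x):
    """Append two coordinate channels (row index i, column index j) to x.

    x: array of shape (batch, height, width, channels).
    Returns a float32 array of shape (batch, height, width, channels + 2).
    """
    x = np.asarray(x, dtype=np.float32)
    n, h, w, _ = x.shape
    # Coordinates scaled linearly to [-1, 1] (an assumed normalization).
    i = np.linspace(-1.0, 1.0, h, dtype=np.float32)
    j = np.linspace(-1.0, 1.0, w, dtype=np.float32)
    ii, jj = np.meshgrid(i, j, indexing="ij")   # each of shape (h, w)
    coords = np.stack([ii, jj], axis=-1)         # (h, w, 2)
    coords = np.broadcast_to(coords, (n, h, w, 2))
    # The original channels are kept unchanged; coordinates are appended last.
    return np.concatenate([x, coords], axis=-1)
```

Because the extra channels are ordinary inputs, a filter can learn to ignore them (recovering translation invariance) or to use them (becoming position dependent), which is the flexibility the abstract describes.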

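Experimental Results

Research Questions

RQ1: Can a standard CNN effectively learn the coordinate transform from Cartesian space to pixel space under supervision?
RQ2: Does adding explicit coordinate information via CoordConv enable perfect generalization and faster training?
RQ3: Do CoordConv layers improve performance over ordinary convolutional layers in generative modeling, object detection, and reinforcement learning tasks?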
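
To make the task behind RQ1 concrete, the sketch below builds one supervised example: an (x, y) coordinate as input and a one-hot image as the target, plus the inverse mapping. The 64x64 canvas size and the names `one_hot_pixel` and `pixel_to_coord` are assumptions for illustration, not details taken from this review.

```python
import numpy as np

def one_hot_pixel(x, y, size=64):
    """Return a (size, size) image that is 1 at column x, row y and 0 elsewhere."""
    if not (0 <= x < size and 0 <= y < size):
        raise ValueError("coordinate outside the canvas")
    img = np.zeros((size, size), dtype=np.float32)
    img[y, x] = 1.0  # rows index y, columns index x
    return img

def pixel_to_coord(img):
    """Inverse mapping: recover (x, y) from a one-hot image."""
    y, x = np.unravel_index(np.argmax(img), img.shape)
    return int(x), int(y)
```

A model for the forward direction takes the two numbers (x, y) and must output the image; the inverse direction takes the image and must regress the two numbers.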

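Key Findings

Convolutional networks fail to fully learn the coordinate transform and generalize poorly on the quadrant split.
CoordConv achieves perfect training and test accuracy on the coordinate task with far fewer parameters (about 7.5k) and far faster training (seconds versus hours).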
Replacing conv with CoordConv reduces mode collapse in GANs and enables more complete coverage of 2D latent spaces.
In Faster R-CNN, CoordConv yields 24% higher IOU on MNIST-like detection; in Atari RL, CoordConv improves performance on several games.
ImageNet classification shows negligible or non-significant gains from a single CoordConv layer, indicating task dependence of CoordConv benefits.
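
The quadrant split named in the findings can be sketched as follows: every position on the canvas is one example, and a whole quadrant is held out for testing, so a model must extrapolate to unseen positions rather than interpolate between seen ones. Which quadrant is held out, the 64x64 canvas, and the name `quadrant_split` are assumptions for illustration.

```python
import numpy as np

def quadrant_split(size=64):
    """Split all (x, y) positions into train/test by holding out one quadrant."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size), indexing="xy")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (size*size, 2)
    half = size // 2
    # Assumed choice: the quadrant with x >= half and y >= half is the test set.
    in_test = (coords[:, 0] >= half) & (coords[:, 1] >= half)
    return coords[~in_test], coords[in_test]

train, test = quadrant_split()
print(len(train), len(test))  # 3072 1024
```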


This review was generated by AI and checked by a human editor.