QUICK REVIEW

[Paper Review] An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Rosanne Liu, Joel Lehman|arXiv (Cornell University)|Jul 9, 2018

Neural Networks and Applications|Cited by 645

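One-sentence summary

The paper shows that CNNs struggle to learn the coordinate transform between Cartesian coordinates and pixel space, and introduces CoordConv. CoordConv adds coordinate channels to the input so that the network can learn translation-dependent representations, and on this task it trains faster and needs fewer parameters.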

ABSTRACT

Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either complete translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem with perfect generalization and 150 times faster with 10--100 times fewer parameters than convolution. This stark contrast raises the question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will require further investigation, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST showed 24% better IOU when using CoordConv, and in the RL domain agents playing Atari games benefit significantly from the use of CoordConv layers.

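Motivation and objectives

Demonstrate how surprisingly hard it is for a standard CNN to learn the Cartesian-to-pixel coordinate transform.
Introduce CoordConv as a drop-in layer that gives convolution access to coordinate information.
Show that CoordConv can learn translation-dependent representations while using fewer parameters and training faster.
Evaluate CoordConv across toy tasks and real-world models to assess its generality and impact.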

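Proposed method

Define the Not-so-Clevr toy dataset: a 9×9 square on a 64×64 canvas, where each example has three fields (the center coordinates, a one-hot representation of the center pixel, and the rendered image).
Propose the CoordConv layer, which appends hard-coded coordinate channels to the input before a standard convolution, in effect giving the filters access to Cartesian coordinates.
Compare standard convolutional networks with CoordConv on supervised coordinate classification, regression, and rendering tasks, using uniform and quadrant train/test splits.
Show that CoordConv keeps the efficiency of convolution with only a small parameter overhead, and that the degree of translation invariance is learned rather than fixed.
Apply CoordConv as a drop-in replacement in a broad range of models and evaluate its effect on image classification, object detection, generative modeling, and reinforcement learning.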
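
The coordinate-channel step described above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the function name `add_coord_channels`, the channels-first `(N, C, H, W)` layout, and the scaling of coordinates to the range [-1, 1] are assumptions made for this sketch.

```python
import numpy as np


def add_coord_channels(x: np.ndarray) -> np.ndarray:
    """Append two coordinate channels to a batch shaped (N, C, H, W).

    Assumptions for this sketch: channels-first layout, and row/column
    coordinates scaled linearly to [-1, 1].
    """
    n, _, h, w = x.shape
    rows = np.linspace(-1.0, 1.0, h, dtype=x.dtype)  # varies along H
    cols = np.linspace(-1.0, 1.0, w, dtype=x.dtype)  # varies along W
    # i_chan is constant along each row, j_chan along each column.
    i_chan, j_chan = np.meshgrid(rows, cols, indexing="ij")  # each (H, W)
    coords = np.stack([i_chan, j_chan])                      # (2, H, W)
    coords = np.broadcast_to(coords, (n, 2, h, w))           # same for every example
    return np.concatenate([x, coords], axis=1)               # (N, C + 2, H, W)


batch = np.zeros((4, 3, 64, 64), dtype=np.float32)
augmented = add_coord_channels(batch)
print(augmented.shape)  # (4, 5, 64, 64)
```

An ordinary convolution is then applied to the augmented tensor. Because only two input channels are added, a k×k convolution with c_out filters gains 2·k²·c_out weights, which is the small overhead mentioned above. If training drives those extra weights to zero, the layer behaves like a plain convolution, so the amount of translation dependence is left to the optimizer.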

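Experimental results

Research questions

RQ1: Can a CNN with standard convolution efficiently learn the mapping from Cartesian coordinates to a pixel-space representation?
RQ2: Does introducing explicit coordinate information through CoordConv improve learning and generalization on the coordinate transform?
RQ3: Do CoordConv layers also benefit real-world models (detectors, GANs/VAEs, RL) beyond toy tasks?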

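Key findings

The coordinate transform task is hard for standard CNNs even with supervision, and they show almost no generalization on the quadrant split.
CoordConv reaches perfect train and test accuracy on the coordinate task with far fewer parameters, and it trains much faster (seconds versus hours).
Replacing convolution with CoordConv improves performance in diverse settings, including object detection on MNIST (24% better IOU with Faster R-CNN) and reduced mode collapse in GANs/VAEs.
On ImageNet classification CoordConv yields almost no improvement, which suggests limited benefit for translation-invariant classification tasks.
On Atari RL tasks, performance improves in many games but not in all of them.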
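
To make the uniform and quadrant splits behind these findings concrete, here is a rough sketch of how the toy data could be generated. The canvas and square sizes come from the review; everything else is an assumption for illustration: that squares must fit entirely on the canvas, the 80/20 ratio of the uniform split, the choice of which quadrant is held out, and the helper name `make_sample`.

```python
import numpy as np

CANVAS = 64          # canvas side length (from the review)
SQUARE = 9           # square side length (from the review)
HALF = SQUARE // 2


def make_sample(cx: int, cy: int):
    """Return the three fields of one example for a square centered at (cx, cy)."""
    onehot = np.zeros((CANVAS, CANVAS), dtype=np.float32)
    onehot[cy, cx] = 1.0                                   # one-hot center pixel
    image = np.zeros((CANVAS, CANVAS), dtype=np.float32)
    image[cy - HALF:cy + HALF + 1, cx - HALF:cx + HALF + 1] = 1.0  # rendered square
    return np.array([cx, cy], dtype=np.float32), onehot, image


# Assumption: keep only centers for which the square fits on the canvas.
valid = range(HALF, CANVAS - HALF)
centers = np.array([(cx, cy) for cy in valid for cx in valid])

# Uniform split (assumed 80/20): train and test centers cover the whole canvas.
rng = np.random.default_rng(0)
perm = rng.permutation(len(centers))
cut = int(0.8 * len(centers))
uniform_train, uniform_test = centers[perm[:cut]], centers[perm[cut:]]

# Quadrant split (assumed held-out quadrant: both coordinates >= CANVAS // 2).
mid = CANVAS // 2
held_out = (centers[:, 0] >= mid) & (centers[:, 1] >= mid)
quadrant_train, quadrant_test = centers[~held_out], centers[held_out]

print(len(centers), len(quadrant_train), len(quadrant_test))  # 3136 2352 784
```

Under the quadrant split, every test position lies in a region the model never saw during training, which is why it probes generalization much harder than the uniform split.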

This review was generated by AI and checked by a human editor.