QUICK REVIEW

[Paper Review] An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Rosanne Liu, Joel Lehman|arXiv (Cornell University)|2018. 07. 09.

Neural Networks and Applications | Citations: 645

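One-Line Summary

This paper shows that CNNs struggle to learn the coordinate transform between Cartesian coordinates and pixel space, and proposes CoordConv, which appends coordinate channels to the input so that the network can learn translation-dependent representations. On that task, CoordConv trains faster and needs fewer parameters.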

ABSTRACT

Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either complete translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem with perfect generalization and 150 times faster with 10--100 times fewer parameters than convolution. This stark contrast raises the question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will require further investigation, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST showed 24% better IOU when using CoordConv, and in the RL domain agents playing Atari games benefit significantly from the use of CoordConv layers.

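Motivation and Goals

Show that standard CNNs have surprising difficulty learning the coordinate transform from Cartesian coordinates to pixel coordinates.
Introduce CoordConv as a drop-in layer that supplies coordinate information.
Let CoordConv learn translation-aware representations with fewer parameters and faster training.
Evaluate CoordConv on toy tasks and on real models to assess its generality and impact.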

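Proposed Method

Define the Not-so-Clevr toy dataset: a 9x9 square on a 64x64 canvas, where each example has three fields (center coordinates, a one-hot image of the center pixel, and the rendered image).
Propose the CoordConv layer, which appends hard-coded coordinate channels to the input before a standard convolution so that filters have access to Cartesian coordinates.
Compare standard convolutional networks with CoordConv on supervised coordinate classification, regression, and rendering tasks, using both a uniform train/test split and a quadrant split.
Show that CoordConv stays efficient, adding only a few parameters, and that the degree of translation invariance it exhibits is controlled by learning.
Apply CoordConv as a drop-in replacement in larger models to assess its effect on image classification, object detection, generative modeling, and reinforcement learning.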
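The coordinate-channel idea can be illustrated with a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the helper names `add_coord_channels` and `conv1x1` are hypothetical, scaling the coordinates to [-1, 1] is a choice made for this sketch, and a 1x1 convolution stands in for the "standard convolution" that follows the concatenation.

```python
import numpy as np


def add_coord_channels(x: np.ndarray) -> np.ndarray:
    """Append a row-coordinate and a column-coordinate channel to a batch.

    x is a float array of shape (batch, channels, height, width).
    Scaling the coordinates to [-1, 1] is an assumption of this sketch.
    """
    b, _, h, w = x.shape
    rows = np.linspace(-1.0, 1.0, h, dtype=x.dtype)
    cols = np.linspace(-1.0, 1.0, w, dtype=x.dtype)
    i_chan, j_chan = np.meshgrid(rows, cols, indexing="ij")  # each (h, w)
    coords = np.stack([i_chan, j_chan])                      # (2, h, w)
    coords = np.broadcast_to(coords, (b, 2, h, w))           # same for every example
    return np.concatenate([x, coords], axis=1)               # (b, channels + 2, h, w)


def conv1x1(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """1x1 convolution without bias; weight has shape (out_channels, in_channels)."""
    return np.einsum("oc,bchw->bohw", weight, x)


# A blank 4x4 input: an ordinary convolution sees the same values at every pixel.
x = np.zeros((1, 1, 4, 4), dtype=np.float32)
x_coord = add_coord_channels(x)  # shape (1, 3, 4, 4)

# Hypothetical weights that ignore the image channel and read only the column channel.
weight = np.array([[0.0, 0.0, 1.0]], dtype=np.float32)
out = conv1x1(x_coord, weight)   # output now depends on the column position
```

If training drives the weights on the two coordinate channels to zero, the layer behaves like an ordinary, translation-equivariant convolution; nonzero weights make the output position-dependent. Ignoring biases, a k x k convolution with c input and d output channels has c·d·k² weights, so the two extra channels raise this to (c + 2)·d·k², which is the small parameter overhead mentioned above.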

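Experimental Results

Research Questions

RQ1: Can a CNN with standard convolutions efficiently learn the mapping from Cartesian coordinates to a pixel-space representation?
RQ2: Does introducing explicit coordinate information through CoordConv help coordinate-transform learning and generalization?
RQ3: Do CoordConv layers provide benefits beyond the toy tasks, in real models (detectors, GANs/VAEs, RL)?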

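Key Findings

The coordinate transform task is hard for standard CNNs even with supervision, and they show almost no generalization on the quadrant split.
CoordConv reaches perfect training and test accuracy on the coordinate tasks with far fewer parameters and much faster training (on the order of seconds versus hours).
Replacing convolution with CoordConv improves performance in a range of settings, including MNIST-based object detection (24% better IOU with Faster R-CNN) and reduced mode collapse in GANs/VAEs.
On ImageNet classification the improvement from CoordConv is marginal, suggesting limited benefit for translation-invariant classification tasks.
On Atari RL tasks, CoordConv improves performance in many games, though not in all of them.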
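To make the difference between the uniform split and the quadrant split concrete, the following sketch builds the toy examples and both splits from the sizes given in this review (64x64 canvas, 9x9 square). The function names, the `image[y, x]` axis convention, the 20% test fraction, and the choice of which quadrant to hold out are all assumptions of the sketch, not details confirmed by the review.

```python
import numpy as np

CANVAS = 64        # canvas side length, as stated in the review
SQUARE = 9         # square side length, as stated in the review
HALF = SQUARE // 2


def all_centers():
    """Every (x, y) center for which the 9x9 square fits fully on the canvas."""
    valid = range(HALF, CANVAS - HALF)
    return [(x, y) for x in valid for y in valid]


def make_example(center):
    """Return the three fields: center (x, y), one-hot center image, rendered square."""
    x, y = center
    one_hot = np.zeros((CANVAS, CANVAS), dtype=np.float32)
    one_hot[y, x] = 1.0  # rows index y, columns index x (sketch convention)
    image = np.zeros((CANVAS, CANVAS), dtype=np.float32)
    image[y - HALF:y + HALF + 1, x - HALF:x + HALF + 1] = 1.0
    return center, one_hot, image


def uniform_split(centers, test_fraction=0.2, seed=0):
    """Random split: test centers are scattered among the training centers."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(centers))
    n_test = int(len(centers) * test_fraction)
    test = [centers[i] for i in order[:n_test]]
    train = [centers[i] for i in order[n_test:]]
    return train, test


def quadrant_split(centers):
    """Hold out one whole quadrant, so every test center lies in an unseen region."""
    mid = CANVAS // 2
    test = [(x, y) for x, y in centers if x >= mid and y >= mid]
    train = [(x, y) for x, y in centers if not (x >= mid and y >= mid)]
    return train, test


centers = all_centers()
uniform_train, uniform_test = uniform_split(centers)
quad_train, quad_test = quadrant_split(centers)
center, one_hot, image = make_example(centers[0])
```

Under the uniform split, a model can do well by interpolating between nearby training positions. Under the quadrant split it has to extrapolate to a region of the canvas it has never seen, which is where the review reports that standard CNNs barely generalize.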

This review was generated by AI and reviewed by a human editor.