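[Paper Review] An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution
The paper shows that CNNs struggle with coordinate transforms between Cartesian coordinates and pixel space, and introduces CoordConv, which appends coordinate channels to the input so that networks can learn translation-dependent representations, improving training speed and parameter efficiency.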
Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional neural networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either complete translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem with perfect generalization and 150 times faster with 10-100 times fewer parameters than convolution. This stark contrast raises the question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will require further investigation, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST showed 24% better IOU when using CoordConv, and in the RL domain agents playing Atari games benefit significantly from the use of CoordConv layers.
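To make the coordinate transform problem from the abstract concrete, here is a minimal sketch of the target mapping between an (x, y) coordinate and a one-hot pixel grid. The function names, the default grid size, and the row/column convention (row index = y) are assumptions for illustration, not taken from the paper. The sketch only defines the mapping a network is asked to learn; it is not a model.

```python
def coords_to_one_hot(x, y, size=64):
    """Map an integer (x, y) coordinate to a size-by-size one-hot grid."""
    if not (0 <= x < size and 0 <= y < size):
        raise ValueError("coordinate outside the canvas")
    grid = [[0] * size for _ in range(size)]
    grid[y][x] = 1  # assumed convention: row index is y, column index is x
    return grid


def one_hot_to_coords(grid):
    """Inverse mapping: recover (x, y) from a one-hot grid."""
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if value == 1:
                return x, y
    raise ValueError("grid contains no active pixel")
```

Written by hand, both directions are a few lines of indexing, which is what makes the reported difficulty for ordinary convolution notable.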
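Motivation and Objectives
- Show that standard CNNs have unexpected difficulty learning the transform between Cartesian coordinates and pixel coordinates.
- Introduce CoordConv as a drop-in layer that gives models access to coordinate information.
- Show that CoordConv learns translation-dependent representations with fewer parameters and faster training.
- Evaluate CoordConv on toy tasks and on practical models to assess generalization and impact.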
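Proposed Method
- Define the Not-so-Clevr toy dataset: 9x9 squares on a 64x64 canvas, with three fields per sample (the Cartesian coordinates of the square's center, a one-hot encoding of the center pixel, and the rendered image).
- Propose the CoordConv layer, which appends hard-coded coordinate channels to the input before a standard convolution, so that the convolution filters can see the Cartesian coordinates of each position.
- Compare standard convolutional networks with CoordConv on supervised coordinate classification, regression, and rendering tasks, under both uniform and quadrant train/test splits.
- Show that CoordConv stays efficient, with only a small parameter overhead, and that the degree of translation invariance can be learned.
- Apply CoordConv as a drop-in replacement for convolution in a broader set of models to assess its effect on image classification, object detection, generative modeling, and reinforcement learning.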
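The coordinate-channel step of the CoordConv layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain nested lists in a [channel][row][column] layout to stay dependency-free, the function name `add_coord_channels` is hypothetical, and scaling the coordinates to the range [-1, 1] is an assumed choice. In practice the same step would be done with a tensor library, followed by an ordinary convolution over the enlarged input.

```python
def add_coord_channels(image):
    """Append an x-coordinate and a y-coordinate channel to a [C][H][W] image.

    Coordinates are scaled to [-1, 1]; the scaling is an assumption of this sketch.
    """
    height, width = len(image[0]), len(image[0][0])

    def scale(index, length):
        # Map indices 0..length-1 onto -1..1 (a single row or column maps to 0).
        return 0.0 if length == 1 else 2.0 * index / (length - 1) - 1.0

    # x varies along columns and is constant down each column.
    x_channel = [[scale(col, width) for col in range(width)] for _ in range(height)]
    # y varies along rows and is constant across each row.
    y_channel = [[scale(row, height) for _ in range(width)] for row in range(height)]
    # A standard convolution applied afterwards sees C + 2 input channels.
    return list(image) + [x_channel, y_channel]
```

Because the added channels are fixed rather than learned, the only extra parameters are the convolution weights that read them. If those weights are driven to zero, the layer behaves like an ordinary convolution, which is how the degree of translation dependence can be learned.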
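Experimental Results
Research Questions
- RQ1: Can CNNs with standard convolution efficiently learn the mapping from Cartesian coordinates to pixel-space representations?
- RQ2: Does introducing explicit coordinate information through CoordConv improve learning and generalization on coordinate transforms?
- RQ3: Do CoordConv layers provide benefits beyond toy tasks in real-world models (detectors, GANs/VAEs, RL)?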
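Key Findings
- Even with supervision, the coordinate transform task is hard for standard CNNs, and they barely generalize on the quadrant split.
- CoordConv reaches perfect train and test accuracy on the coordinate tasks with far fewer parameters and much faster training (seconds versus hours).
- Replacing convolution with CoordConv improves performance in several settings, including object detection on MNIST (24% better IOU with Faster R-CNN) and reduced mode collapse in GANs/VAEs.
- On ImageNet classification, CoordConv yields only a negligible improvement, suggesting limited benefit for translation-invariant classification tasks.
- On Atari reinforcement learning tasks, CoordConv improves performance on most games, though not all.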
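The uniform and quadrant splits referred to in the findings can be sketched as below. The canvas and square sizes come from the dataset description; everything else is an assumption made for illustration: that squares must lie fully inside the canvas, the 20% test fraction for the uniform split, and which quadrant is held out for testing.

```python
import random

CANVAS = 64  # canvas side length, from the dataset description
SQUARE = 9   # square side length, from the dataset description
HALF = SQUARE // 2


def all_centers():
    """Every center that keeps the square fully inside the canvas (assumed)."""
    valid = range(HALF, CANVAS - HALF)
    return [(x, y) for y in valid for x in valid]


def uniform_split(centers, test_fraction=0.2, seed=0):
    """Random split; the 0.2 test fraction is an illustrative assumption."""
    shuffled = centers[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def quadrant_split(centers):
    """Hold out one quadrant for testing; the choice of quadrant is assumed."""
    mid = CANVAS // 2
    test = [(x, y) for x, y in centers if x >= mid and y >= mid]
    train = [(x, y) for x, y in centers if not (x >= mid and y >= mid)]
    return train, test
```

Under the uniform split, test positions are interleaved with training positions, so a model can interpolate between nearby examples. Under the quadrant split, every test position lies in a region never seen in training, which is why it is the stricter check of whether the coordinate mapping was actually learned.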
This review was generated by AI and reviewed by a human editor.