QUICK REVIEW

[Paper Review] Axial Attention in Multidimensional Transformers

Jonathan Ho, Nal Kalchbrenner|arXiv (Cornell University)|Dec 20, 2019

Generative Adversarial Networks and Image Synthesis|References: 18|Cited by: 363

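One-Sentence Summary

This paper proposes Axial Transformers, a self-attention-based autoregressive model for high-dimensional data. It uses axial attention to compute context along a single tensor axis at a time and achieves state-of-the-art results on ImageNet-32/64 and BAIR Robotic Pushing without custom kernels.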

ABSTRACT

We propose Axial Transformers, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors. Existing autoregressive models either suffer from excessively large computational resource requirements for high dimensional data, or make compromises in terms of distribution expressiveness or ease of implementation in order to decrease resource requirements. Our architecture, by contrast, maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation and achieving state-of-the-art results on standard generative modeling benchmarks. Our models are based on axial attention, a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. Notably the proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. This semi-parallel structure goes a long way to making decoding from even a very large Axial Transformer broadly applicable. We demonstrate state-of-the-art results for the Axial Transformer on the ImageNet-32 and ImageNet-64 image benchmarks as well as on the BAIR Robotic Pushing video benchmark. We open source the implementation of Axial Transformers.

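Motivation and Goals

Build a self-attention-based autoregressive model for high-dimensional data tensors without excessive compute or memory cost.
Introduce axial attention, which scales attention across tensor axes without flattening the data.
Achieve full-context modeling through a semi-parallel sampling procedure.
Demonstrate state-of-the-art results on image and video benchmarks.
Provide an open-source implementation to ease adoption.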

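Proposed Method

Define axial attention as attention along a single axis of a multidimensional tensor, leaving the other axes untouched. For a d-dimensional tensor with N = S^d elements, this lowers the cost from O(N^2) for standard self-attention to O(N^{(d+1)/d}), a saving of a factor of O(N^{(d-1)/d}).
Stack masked and unmasked axial attention blocks to build the full autoregressive context without independence assumptions.
Use a row-wise inner decoder for efficient sampling, while an outer decoder incorporates information from previous rows and channels.
Model multi-channel data by conditioning on previous channels through additional unmasked row-wise/column-wise attention layers.
Train on random channel slices to obtain an unbiased estimate of the log-likelihood of the full data tensor.
Provide an open-source implementation of Axial Transformers.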
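
To make the idea of attending along one axis concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, an input of shape (H, W, C), and hypothetical projection matrices `wq`, `wk`, `wv`, and it leaves out position embeddings, multi-head splitting, and the shift operations a full autoregressive model would need. The `causal` flag only illustrates masking along the attended axis.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, axis, wq, wk, wv, causal=False):
    """Single-head self-attention along one axis of x (illustrative only).

    x:    array of shape (H, W, C)
    axis: 0 attends within each column, 1 attends within each row
    Positions along the other spatial axis are treated like batch entries,
    so no information flows between them in this layer.
    """
    # Bring the attended axis next to the channel axis: (..., L, C).
    x_moved = np.moveaxis(x, axis, -2)
    q, k, v = x_moved @ wq, x_moved @ wk, x_moved @ wv
    # Scores have shape (..., L, L): L x L per line instead of N x N overall.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if causal:
        length = scores.shape[-1]
        mask = np.tril(np.ones((length, length), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    out = softmax(scores, axis=-1) @ v
    # Restore the original axis order.
    return np.moveaxis(out, -2, axis)

rng = np.random.default_rng(0)
H, W, C = 4, 5, 8
x = rng.normal(size=(H, W, C))
wq, wk, wv = (rng.normal(size=(C, C)) for _ in range(3))

row_out = axial_attention(x, axis=1, wq=wq, wk=wk, wv=wv, causal=True)
col_out = axial_attention(x, axis=0, wq=wq, wk=wk, wv=wv)
print(row_out.shape, col_out.shape)
```

Applying a row layer and then a column layer lets every position receive information from the whole 2D grid, which is the intuition behind stacking axial blocks.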

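Experimental Results

Research Questions

RQ1: How can attention be generalized to multidimensional tensors so that compute requirements drop while full joint expressiveness is preserved?
RQ2: Can axial attention enable efficient autoregressive modeling of images and video without custom kernels or extensive data copying?
RQ3: How does combining masked and unmasked axial attention affect modeling capacity and sampling speed?
RQ4: How does the Axial Transformer perform on standard image and video benchmarks compared with earlier autoregressive models?
RQ5: Can the model handle multi-channel data and video effectively by conditioning on previous channels/frames?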

Key Findings

Model	ImageNet-32 (bits/dim)	ImageNet-64 (bits/dim)
Multiscale PixelCNN	3.95	3.70
PixelCNN/RNN	3.86	3.63
Gated PixelCNN	3.83	3.57
PixelSNAIL	3.80	3.52
SPN	3.79	3.52
Image Transformer	3.77	-
Strided Sparse Transformer	-	3.44
Axial Transformer + LSTM inner decoder	3.77	3.46
Axial Transformer	3.76 (3.758)	3.44 (3.439)
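
The table reports bits per dimension, which is the negative log-likelihood of an image expressed in bits and divided by the number of scalar dimensions. The snippet below is a small sketch of that conversion, assuming the likelihood is given in nats per image and that an ImageNet-32 image has 32 x 32 x 3 dimensions. The function names are illustrative and do not come from the paper's code.

```python
import math

def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-example negative log-likelihood in nats to bits/dim."""
    return nll_nats / (num_dims * math.log(2))

def bits_per_dim_to_nats(bpd, num_dims):
    """Inverse conversion: bits/dim back to total nats per example."""
    return bpd * num_dims * math.log(2)

dims_32 = 32 * 32 * 3  # scalar dimensions in one ImageNet-32 image

# Total nats per image implied by the 3.76 bits/dim reported in the table.
nll_nats = bits_per_dim_to_nats(3.76, dims_32)
print(round(nll_nats, 1), nats_to_bits_per_dim(nll_nats, dims_32))
```

Because the metric is normalized per dimension, a difference of 0.01 bits/dim corresponds to roughly 31 bits per ImageNet-32 image.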

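For a d-dimensional input tensor, axial attention reduces compute and memory by a factor of O(N^{(d-1)/d}) relative to standard self-attention.
Axial Transformer achieves state-of-the-art bits per dimension on ImageNet-32 (3.76) and ImageNet-64 (3.44) compared with several baselines.
The model significantly outperforms previous autoregressive methods on BAIR Robotic Pushing video modeling.
Semi-parallel sampling computes most of the context in parallel, which makes decoding practical for large tensors.
Ablations show that replacing the inner decoder with an LSTM slows training while still reaching comparable results (3.77 / 3.46 bits/dim); the full four-layer inner decoder improves both performance and training speed.
The channel-conditioning extension models multi-channel images and video effectively without major architectural changes.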
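
The size of the O(N^{(d-1)/d}) saving is easy to check with arithmetic. The sketch below counts attention-score entries for one layer, assuming a tensor with the same side length S along each of its d axes (so N = S^d) and ignoring constant factors such as heads and channels. The function names are hypothetical.

```python
def full_attention_cost(side, d):
    """Score entries when every one of N = side**d positions attends to all N."""
    n = side ** d
    return n * n

def axial_attention_cost(side, d):
    """Score entries when each of N positions attends only along one axis."""
    n = side ** d
    return n * side

# 64 x 64 image (d = 2): N = 4096
full_2d = full_attention_cost(64, 2)     # 16777216
axial_2d = axial_attention_cost(64, 2)   # 262144
print(full_2d // axial_2d)               # 64, i.e. N ** (1/2)

# 64 x 64 x 64 tensor (d = 3): N = 262144
full_3d = full_attention_cost(64, 3)
axial_3d = axial_attention_cost(64, 3)
print(full_3d // axial_3d)               # 4096, i.e. N ** (2/3)
```

These counts are per axial layer; a block that attends along all d axes in turn costs d times as much, which still leaves the same asymptotic saving.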

This review was generated by AI and checked by a human editor.