QUICK REVIEW

[Paper Review] Perceiver IO: A General Architecture for Structured Inputs & Outputs

Andrew Jaegle, Sebastian Borgeaud | arXiv (Cornell University) | Jul 30, 2021

Human Pose and Action Recognition | References: 98 | Cited by: 205

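One-Sentence Summary

Perceiver IO is a general-purpose neural network architecture that handles arbitrarily structured inputs and outputs through a flexible attention-based querying mechanism, with compute that scales linearly in input and output size. The same architecture achieves strong results across many tasks without task-specific architecture design: it reaches state-of-the-art performance on Sintel optical flow estimation and outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization.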

ABSTRACT

A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.

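Research Motivation and Objectives

Develop a single neural network architecture that generalizes across many input modalities and output structures without task-specific engineering.
Address the limitations of existing models, which scale poorly as inputs and outputs grow or require modality-specific architectures.
Enable end-to-end learning for tasks with complex structured outputs, such as optical flow, audio, and symbolic reasoning.
Decouple computational cost from input and output size by using a fixed-size latent space and an attention-based decoding mechanism.
Demonstrate strong performance across natural language, vision, multimodal, and reinforcement learning tasks.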
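
The decoupling objective above can be made concrete by counting query-key pairs, which is a rough proxy for attention cost. The sketch below is illustrative only: the function names, the 256 latents, and the 8 layers are assumptions chosen for the example, not values taken from the paper.

```python
def attention_cost(num_inputs, num_outputs, num_latents, num_latent_layers):
    """Count query-key pairs for a latent-bottleneck model (illustrative proxy for cost)."""
    encode = num_inputs * num_latents                 # latents attend to every input
    process = num_latent_layers * num_latents ** 2    # self-attention among latents only
    decode = num_outputs * num_latents                # each output query attends to latents
    return encode + process + decode


def full_self_attention_cost(num_inputs, num_layers):
    """Count query-key pairs when every layer attends over all inputs."""
    return num_layers * num_inputs ** 2


# Hypothetical sizes: 256 latents and 8 layers are assumptions for this example.
small = attention_cost(1_000, 1_000, num_latents=256, num_latent_layers=8)
large = attention_cost(2_000, 2_000, num_latents=256, num_latent_layers=8)
full_small = full_self_attention_cost(1_000, num_layers=8)
full_large = full_self_attention_cost(2_000, num_layers=8)
```

Doubling both the input and output counts adds a fixed increment to the latent-bottleneck cost (the latent self-attention term does not change), while the full self-attention count grows fourfold.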

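Proposed Method

Uses a read-process-write architecture: inputs are encoded into a fixed-size latent space by attention, refined by a deep stack of self-attention layers, and decoded by query-based attention.
Uses a flexible querying mechanism: each output is produced by a query that attends to the latent space and specifies the semantics, size, and structure of the desired output.
Builds queries from positional embeddings (Fourier or learned) and modality-specific embeddings to encode the spatial, temporal, or semantic context of each output.
Supports arbitrary output shapes and structures, such as scalar predictions, dense fields, sequences, or sets, by changing how the queries are composed.
Uses a shared, domain-agnostic backbone for all inputs and outputs, minimizing architectural assumptions about spatial or locality structure.
Applies learned modality embeddings to input tokens and query tokens during encoding and decoding to distinguish modalities.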
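
The read-process-write flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not the authors' implementation: it uses single-head attention with random weights, omits layer normalization, MLP blocks, and modality embeddings, and every name in it (`cross_attend`, `fourier_features`, `init_params`, `perceiver_io_forward`) is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values, w_q, w_k, w_v):
    """Single-head attention: each query row attends over all key/value rows."""
    q, k, v = queries @ w_q, keys_values @ w_k, keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def fourier_features(positions, num_bands):
    """Map positions of shape (n, 1) in [-1, 1] to sin/cos features of shape (n, 2 * num_bands)."""
    freqs = np.arange(1, num_bands + 1)   # assumed frequency bands, chosen for illustration
    angles = np.pi * positions * freqs    # broadcasts to (n, num_bands)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


def init_params(rng, in_dim, latent_dim, query_dim, out_dim, num_latents, num_layers):
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))

    return {
        "latents": rng.normal(size=(num_latents, latent_dim)),
        "encode": (w(latent_dim, latent_dim), w(in_dim, latent_dim), w(in_dim, latent_dim)),
        "process": [tuple(w(latent_dim, latent_dim) for _ in range(3)) for _ in range(num_layers)],
        "decode": (w(query_dim, latent_dim), w(latent_dim, latent_dim), w(latent_dim, out_dim)),
    }


def perceiver_io_forward(inputs, output_queries, params):
    latents = params["latents"]
    # Read: the fixed-size latent array attends to the (possibly very large) input array.
    latents = latents + cross_attend(latents, inputs, *params["encode"])
    # Process: self-attention among latents; cost does not depend on input or output size.
    for layer in params["process"]:
        latents = latents + cross_attend(latents, latents, *layer)
    # Write: one output row per query, so the output size is set by the query array alone.
    return cross_attend(output_queries, latents, *params["decode"])


rng = np.random.default_rng(0)
params = init_params(rng, in_dim=3, latent_dim=16, query_dim=8, out_dim=2,
                     num_latents=4, num_layers=2)
inputs = rng.normal(size=(50, 3))
queries_small = fourier_features(np.linspace(-1, 1, 5)[:, None], num_bands=4)
queries_large = fourier_features(np.linspace(-1, 1, 20)[:, None], num_bands=4)
out_small = perceiver_io_forward(inputs, queries_small, params)   # shape (5, 2)
out_large = perceiver_io_forward(inputs, queries_large, params)   # shape (20, 2)
```

The same parameters produce 5 outputs or 20 outputs depending only on how many queries are passed in, which is the sense in which changing the query composition changes the output shape. Each query is decoded independently of the others, so a query for a given position yields the same output regardless of which other queries accompany it.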

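Experimental Results

Research Questions

RQ1: Can a single neural network architecture handle diverse input modalities and structured outputs without architectural modification?
RQ2: How can a model scale linearly with input and output size while maintaining high performance on heterogeneous tasks?
RQ3: Can an attention-based querying mechanism replace the task-specific decoder heads used in models such as BERT or optical flow networks?
RQ4: To what extent can a unified architecture outperform specialized models on tasks such as language understanding, optical flow, and multimodal autoencoding?
RQ5: How does the flexibility of query-based decoding affect performance on dense outputs and multi-task outputs?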

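Key Findings

Despite removing input tokenization, Perceiver IO reaches an average GLUE score of 85.7, compared with 84.8 for BERT.
On the Sintel optical flow benchmark, Perceiver IO achieves state-of-the-art performance, outperforming models that include explicit multiscale correspondence mechanisms.
On the AutoFlow dataset, Perceiver IO reaches a final end-to-end absolute error (EAE) of 1.18 after 480 epochs of training, better than the previous state-of-the-art model.
In multimodal autoencoding on Kinetics700, Perceiver IO achieves a video L1 loss of 0.03, an audio L1 loss of 1.0, and a classification accuracy of 71.2%, demonstrating joint learning over video, audio, and labels.
The model generalizes across domains: it performs well on tasks ranging from text classification to dense prediction (e.g., optical flow) and symbolic reasoning (e.g., StarCraft II) without architectural modification.
Despite very high input resolution (e.g., more than 2 million raw points), Perceiver IO maintains strong performance by evaluating in patches and taking a weighted average of the predictions in overlapping regions.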
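
The patch-wise evaluation mentioned in the last finding can be illustrated with a small blending routine. This summary does not specify the weighting scheme, so the triangular weights below are an assumption, the example is one-dimensional for brevity, and `blend_patches` and `predict_patch` are hypothetical names. It also assumes the stride is no larger than the patch size so that every position is covered.

```python
import numpy as np


def blend_patches(length, patch_size, stride, predict_patch):
    """Predict a 1D signal in overlapping patches and blend overlaps by weighted average."""
    # Assumed weighting: triangular weights that favour the centre of each patch.
    ramp = np.minimum(np.arange(1, patch_size + 1), np.arange(patch_size, 0, -1))
    weights = ramp.astype(float)

    starts = list(range(0, length - patch_size + 1, stride))
    if starts[-1] != length - patch_size:
        starts.append(length - patch_size)  # add a final patch so the tail is covered

    total = np.zeros(length)
    norm = np.zeros(length)
    for start in starts:
        window = slice(start, start + patch_size)
        total[window] += weights * predict_patch(start, patch_size)
        norm[window] += weights
    return total / norm  # weighted average wherever patches overlap


def exact_predictor(start, size):
    """Stand-in for a model: returns the position index as the 'prediction'."""
    return np.arange(start, start + size, dtype=float)


blended = blend_patches(length=10, patch_size=4, stride=3, predict_patch=exact_predictor)
blended_tail = blend_patches(length=11, patch_size=4, stride=3, predict_patch=exact_predictor)
```

With a stand-in predictor whose patches agree in the overlaps, the blended result reproduces the full signal, which is a quick sanity check that the normalization is right. With a real model, the weights decide how much each patch contributes where predictions disagree.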

This review was generated by AI and checked by human editors.