QUICK REVIEW

[Paper Review] Perceiver IO: A General Architecture for Structured Inputs & Outputs

Andrew Jaegle, Sebastian Borgeaud | arXiv (Cornell University) | Jul 30, 2021

Human Pose and Action Recognition | 98 references | 205 citations

TL;DR

Perceiver IO is a general-purpose neural network architecture that handles arbitrary structured inputs and outputs through a flexible attention-based querying mechanism, with compute that scales linearly in input and output size. Without task-specific architecture design, it outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization, and it achieves state-of-the-art results on Sintel optical flow estimation.

ABSTRACT

A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.

Motivation & Objective

To develop a single neural network architecture that generalizes across diverse input modalities and output structures without task-specific engineering.
To address the limitations of existing models that scale poorly with input/output size or require modality-specific architectures.
To enable end-to-end learning for tasks with complex, structured outputs such as optical flow, audio, and symbolic reasoning.
To decouple the computational burden from input and output size by using a fixed-size latent space and attention-based decoding.
To demonstrate strong performance across multiple domains, including natural language, vision, multimodal, and reinforcement learning tasks.
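
The scaling argument behind these objectives can be made concrete with a rough count of how many query-key pairs each design has to score. This is only a back-of-the-envelope sketch: the function names and the example sizes are hypothetical, and feature widths, MLPs, and multiple heads are ignored.

```python
def attention_pairs(num_inputs, num_outputs, num_latents, num_layers):
    """Query-key pairs scored in one pass of a latent-bottleneck model."""
    encode = num_latents * num_inputs        # latents attend to the inputs
    process = num_layers * num_latents ** 2  # latents attend to each other
    decode = num_outputs * num_latents       # output queries attend to latents
    return encode + process + decode


def self_attention_pairs(num_inputs, num_layers):
    """Same count when self-attention is applied directly to the inputs."""
    return num_layers * num_inputs ** 2


# Hypothetical sizes: doubling inputs and outputs with a fixed latent array.
small = attention_pairs(2048, 2048, num_latents=256, num_layers=8)
large = attention_pairs(4096, 4096, num_latents=256, num_layers=8)
print(large - small)                   # grows linearly with the added elements
print(self_attention_pairs(2048, 8))   # grows quadratically with the inputs
print(self_attention_pairs(4096, 8))
```

Only the encode and decode terms depend on the input and output sizes, and each does so linearly; the cost of the deep latent stack is set by the number of latents alone.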

Proposed method

Uses a read-process-write architecture: inputs are encoded into a fixed-size latent space via attention, refined through deep layers of self-attention, and decoded via query-based attention.
Employs a flexible querying mechanism where each output is generated by attending to the latent space using a query that specifies the semantics, size, and structure of the desired output.
Constructs queries using positional embeddings (Fourier or learned) and modality-specific embeddings to encode spatial, temporal, or semantic context for outputs.
Supports arbitrary output shapes and structures—e.g., scalar predictions, dense fields, sequences, or sets—by varying the query composition.
Uses a shared, domain-agnostic backbone for all inputs and outputs, minimizing architectural assumptions about spatial or locality structure.
Applies learned modality embeddings to input tokens and query tokens to distinguish between different modalities during encoding and decoding.
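
The read-process-write flow and the query mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head attention, omits layer normalization and MLP blocks, and every name, shape, and the frequency schedule in fourier_features is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, context, w_q, w_k, w_v):
    """Single-head attention: each query row reads from every context row."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def fourier_features(positions, num_bands):
    """Map positions in [-1, 1], shape [n, d], to raw position plus sin/cos bands."""
    freqs = np.pi * 2.0 ** np.arange(num_bands)  # assumed frequency schedule
    angles = positions[:, :, None] * freqs       # [n, d, num_bands]
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return np.concatenate([positions, feats.reshape(len(positions), -1)], axis=-1)


def init_params(rng, input_dim, query_dim, latent_dim, num_latents, output_dim, num_layers):
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))
    return {
        "latents": rng.normal(size=(num_latents, latent_dim)),
        "encode": (w(latent_dim, latent_dim), w(input_dim, latent_dim), w(input_dim, latent_dim)),
        "process": [tuple(w(latent_dim, latent_dim) for _ in range(3)) for _ in range(num_layers)],
        "decode": (w(query_dim, latent_dim), w(latent_dim, latent_dim), w(latent_dim, output_dim)),
    }


def perceiver_io(inputs, output_queries, params):
    latents = params["latents"]
    # Read: the fixed-size latent array cross-attends to the inputs.
    latents = latents + attend(latents, inputs, *params["encode"])
    # Process: self-attention among the latents only.
    for weights in params["process"]:
        latents = latents + attend(latents, latents, *weights)
    # Write: one output row per query, so the queries set the output shape.
    return attend(output_queries, latents, *params["decode"])


rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 16))  # 1000 input elements with 16 features each
params = init_params(rng, input_dim=16, query_dim=10, latent_dim=32,
                     num_latents=8, output_dim=2, num_layers=2)

# A dense 6 x 4 output field: one Fourier-feature query per output position.
ys, xs = np.meshgrid(np.linspace(-1, 1, 6), np.linspace(-1, 1, 4), indexing="ij")
grid = np.stack([ys, xs], axis=-1).reshape(-1, 2)
dense_queries = fourier_features(grid, num_bands=2)           # shape [24, 10]
dense_out = perceiver_io(inputs, dense_queries, params)       # shape [24, 2]
single_out = perceiver_io(inputs, dense_queries[:1], params)  # shape [1, 2]
print(dense_out.shape, single_out.shape)
```

The same weights produce a 24-element field or a single prediction depending only on how many queries are passed in, which is the sense in which the query composition determines the output structure. Each query is decoded independently of the others, so outputs can also be computed in batches.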

Experimental results

Research questions

RQ1: Can a single neural network architecture handle diverse input modalities and structured outputs without architectural changes?
RQ2: How can a model scale linearly with input and output size while maintaining high performance across heterogeneous tasks?
RQ3: Can attention-based querying replace task-specific decoder heads in models like BERT or optical flow networks?
RQ4: To what extent can a unified architecture outperform specialized models on tasks like language understanding, optical flow, and multimodal autoencoding?
RQ5: How does the flexibility of query-based decoding affect performance on dense and multitask outputs?

Key findings

Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE benchmark despite removing input tokenization, achieving a mean score of 85.7 compared to the baseline's 84.8.
It achieves state-of-the-art performance on the Sintel optical flow benchmark, outperforming models with explicit multiscale correspondence mechanisms.
Trained on the AutoFlow dataset for 480 epochs, Perceiver IO achieves a final end-point error (EPE) of 1.18, surpassing previous SOTA models.
In multimodal autoencoding on Kinetics700, Perceiver IO achieves a video L1 loss of 0.03, audio L1 loss of 1.0, and classification accuracy of 71.2%, demonstrating joint learning of video, audio, and labels.
The model generalizes across domains: it performs well on tasks ranging from text classification to dense prediction (e.g., optical flow) and symbolic reasoning (e.g., StarCraft II), with no architectural modifications.
Despite high input resolution (e.g., 2M+ raw points), Perceiver IO maintains performance through tiled evaluation and weighted averaging of overlapping tile predictions.
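
The tiled evaluation mentioned in the last finding can be sketched as follows. This is an illustrative sketch rather than the paper's evaluation code: the tile size, the stride, the weighting of each pixel by its distance to the nearest tile edge, and the predict_tile callback are all assumptions, and the image is assumed to be at least as large as one tile.

```python
import numpy as np


def tile_starts(size, tile, stride):
    """Start offsets covering [0, size); the last tile sits flush with the edge."""
    starts = list(range(0, max(size - tile, 0) + 1, stride))
    if starts[-1] + tile < size:
        starts.append(size - tile)
    return starts


def edge_weights(height, width):
    """Weight each pixel by its distance to the nearest tile edge (assumed scheme)."""
    ys = np.minimum(np.arange(height), np.arange(height)[::-1]) + 1
    xs = np.minimum(np.arange(width), np.arange(width)[::-1]) + 1
    return np.minimum.outer(ys, xs).astype(float)


def tiled_predict(image, predict_tile, tile=(64, 64), stride=(48, 48), out_channels=2):
    """Run predict_tile on overlapping tiles and blend them by weighted average."""
    height, width = image.shape[:2]
    total = np.zeros((height, width, out_channels))
    norm = np.zeros((height, width, 1))
    weights = edge_weights(*tile)[..., None]
    for y in tile_starts(height, tile[0], stride[0]):
        for x in tile_starts(width, tile[1], stride[1]):
            pred = predict_tile(image[y:y + tile[0], x:x + tile[1]])
            total[y:y + tile[0], x:x + tile[1]] += weights * pred
            norm[y:y + tile[0], x:x + tile[1]] += weights
    return total / norm  # every pixel is covered, so norm is strictly positive


# Stand-in "model": returns the first two channels of each tile unchanged.
image = np.random.default_rng(0).normal(size=(100, 160, 3))
flow = tiled_predict(image, lambda patch: patch[..., :2])
print(flow.shape)
```

Down-weighting pixels near tile borders lets predictions from the interior of a neighbouring tile dominate where tiles overlap, which is one common way to reduce visible seams.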

This review was created by AI and reviewed by human editors.