QUICK REVIEW

[Paper Review] White-Box Transformers via Sparse Rate Reduction

Yaodong Yu, Sam Buchanan|arXiv (Cornell University)|Jun 1, 2023

Advanced Neural Network Applications | 23 citations

TL;DR

The paper unifies transformer-like layers as unrolled steps optimizing a sparse rate reduction objective, yielding a fully interpretable white-box architecture (CRATE) that compresses and sparsifies token representations and performs competitively with engineered transformers on large-scale vision data.

ABSTRACT

In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT. Code is at https://github.com/Ma-Lab-Berkeley/CRATE.

Motivation & Objective

Frame representation learning as compressing the data distribution toward a mixture of low-dimensional Gaussians supported on incoherent subspaces, and then sparsifying the resulting representations.
Introduce a unified sparse rate reduction objective that combines lossy coding rate with sparsity to learn compact token representations.
Derive transformer-like layers as unrolled optimization steps, providing mathematical interpretability for attention and MLP blocks.
Propose CRATE (Coding RAte TransformEr), a white-box architecture in which each layer carries its own subspace model of the token distribution and its own sparsifying dictionary, both learned from data.
Demonstrate that CRATE can learn to compress and sparsify representations on large-scale vision data and approach ViT-style performance.
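
For reference, the objective described above can be written out as follows. This is a sketch in the usual rate reduction notation, not a quotation from the paper: Z = f(X), U[K] and the ℓ0 penalty come from this summary, while the token count N, feature dimension d, subspace dimension p, quantization precision ε and sparsity weight λ are symbols assumed here for illustration. Tokens are taken to be the columns of Z.

```latex
% Sparse rate reduction: expand the whole, compress against the subspaces, sparsify.
\max_{f}\; \mathbb{E}_{Z}\!\left[ \Delta R(Z; U_{[K]}) - \lambda \|Z\|_{0} \right]
  = \max_{f}\; \mathbb{E}_{Z}\!\left[ R(Z) - R^{c}(Z; U_{[K]}) - \lambda \|Z\|_{0} \right],
  \qquad Z = f(X),\; Z \in \mathbb{R}^{d \times N}

% Lossy coding rate of the token set as a whole.
R(Z) = \tfrac{1}{2} \log\det\!\left( I + \tfrac{d}{N \varepsilon^{2}}\, Z Z^{\top} \right)

% Coding rate measured against the K subspaces with bases U_k \in \mathbb{R}^{d \times p}.
R^{c}(Z; U_{[K]}) = \sum_{k=1}^{K} R\!\left( U_{k}^{\top} Z \right)
  = \tfrac{1}{2} \sum_{k=1}^{K} \log\det\!\left( I + \tfrac{p}{N \varepsilon^{2}}\, (U_{k}^{\top} Z)(U_{k}^{\top} Z)^{\top} \right)
```

Read this way, the attention-like step lowers R^c (compression), while the sparsification step handles the remaining terms, λ‖Z‖0 and R(Z).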

Proposed method

Define a unified objective, sparse rate reduction, to be maximized over the representation Z = f(X): the rate reduction of Z minus an ℓ0 sparsity penalty.
Model the token distribution at each layer as a mixture of K low-dimensional subspaces with learned bases U[K] = (U_1, ..., U_K).
Derive a self-attention-like operator, multi-head subspace self-attention (MSSA), as an approximate gradient step that minimizes the coding rate of the tokens measured against the subspace mixture.
Implement sparsification as an ISTA-like (iterative shrinkage-thresholding) update against a learned dictionary D, which promotes sparsity in Z.
Construct CRATE by stacking layers that perform MSSA-based compression followed by ISTA-based sparsification, with layer-specific U[K] and D learned end-to-end.
Provide code link for replication: https://github.com/Ma-Lab-Berkeley/CRATE
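
The two steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code from the linked repository: the function names, the single step-size parameter `step` (which folds the paper's scaling constants into one number), the values of `eta` and `lam`, and the random orthonormal U_k and D are all hypothetical choices. Normalization layers and training are omitted.

```python
import numpy as np

def softmax_columns(scores):
    """Softmax over axis 0, so every column sums to 1."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)

def mssa_step(Z, U_heads, step=0.5):
    """Compression step: attention-like update against each subspace basis U_k.

    Z has shape (d, N) with tokens as columns; each U_k has shape (d, p).
    `step` is an assumed step size standing in for the scaling constants.
    """
    update = np.zeros_like(Z)
    for U in U_heads:
        V = U.T @ Z                       # project tokens onto subspace k: (p, N)
        attn = softmax_columns(V.T @ V)   # token-to-token similarity: (N, N)
        update += U @ (V @ attn)          # lift the attended tokens back to (d, N)
    return (1.0 - step) * Z + step * update

def ista_step(Z_half, D, eta=0.1, lam=0.1):
    """Sparsification step: one non-negative ISTA iteration started at Z_half."""
    residual = Z_half - D @ Z_half
    return np.maximum(Z_half + eta * (D.T @ residual) - eta * lam, 0.0)

def crate_layer(Z, U_heads, D):
    """One layer: compress with MSSA, then sparsify with ISTA."""
    return ista_step(mssa_step(Z, U_heads), D)

# Toy usage with random orthonormal bases (illustrative values only).
rng = np.random.default_rng(0)
d, p, K, N = 8, 4, 2, 5
Z = rng.standard_normal((d, N))
U_heads = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(K)]
D = np.linalg.qr(rng.standard_normal((d, d)))[0]
Z_out = crate_layer(Z, U_heads, D)
print(Z_out.shape, float(np.mean(Z_out > 0)))
```

The final ReLU-style thresholding in `ista_step` makes the output non-negative and sets small entries exactly to zero, which is where the sparsity comes from.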

Experimental results

Research questions

RQ1: Can a rate-reduction objective with sparsity produce compact, interpretable token representations?
RQ2: Do white-box transformer layers derived from unrolled optimization achieve competitive performance on large-scale vision tasks?
RQ3: Can self-attention and MLP blocks be reinterpreted as denoising/compression and sparse coding steps within a unified framework?
RQ4: What is the impact of layer-wise learned subspace bases and dictionaries on representation quality and transferability?
RQ5: How faithfully do the proposed MSSA and ISTA blocks align with the intended optimization objectives during training?

Key findings

CRATE layers implement incremental optimization that compresses token distributions toward a mixture of subspaces and sparsifies representations.
The MSSA component is a gradient-step-like operation that resembles self-attention but is derived from rate-reduction denoising against the subspaces.
The ISTA-based sparsification layer promotes sparsity with respect to a learned dictionary, serving as a tractable stand-in for the rate-based diversity term of the objective.
Experiments on ImageNet-1K show CRATE learns to compress and sparsify representations with performance close to engineered transformers like ViT.
Layer-wise analysis indicates both compression and sparsification improve across layers, supporting the intended objective-driven design.
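
A layer-wise analysis of this kind needs two measurements per layer: a coding rate for compression and a nonzero fraction for sparsity. The sketch below shows one way to compute them, assuming tokens are stored as the columns of a (d, N) array. The function names, the precision `eps` and the zero tolerance `tol` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate R(Z) = 1/2 * logdet(I + d / (N * eps^2) * Z Z^T), natural log."""
    d, N = Z.shape
    gram = np.eye(d) + (d / (N * eps ** 2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(gram)  # gram is positive definite, so the sign is +1
    return 0.5 * logdet

def compression_term(Z, U_heads, eps=0.5):
    """Coding rate measured against the subspaces: sum over k of R(U_k^T Z)."""
    return sum(coding_rate(U.T @ Z, eps) for U in U_heads)

def nonzero_fraction(Z, tol=1e-8):
    """Share of entries with magnitude above tol; lower means sparser."""
    return float(np.mean(np.abs(Z) > tol))

# Toy check on a 2 x 2 example.
Z_demo = np.eye(2)
print(coding_rate(Z_demo, eps=1.0))   # equals log(2) for this input
print(nonzero_fraction(Z_demo))       # half of the entries are nonzero
```

Evaluating `compression_term` on the output of each compression step and `nonzero_fraction` on the output of each sparsification step, layer by layer, would show whether both quantities decrease with depth.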

This review was created by AI and reviewed by human editors.