QUICK REVIEW

[Paper Review] BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

Asa Cooper Stickland, Iain Murray|arXiv (Cornell University)|Feb 7, 2019

Topic Modeling|113 citations

TL;DR

The paper introduces PALs (Projected Attention Layers), a parameter-efficient adaptation module that enables multi-task learning on top of a shared BERT-base model, achieving comparable GLUE performance with about 7x fewer parameters and state-of-the-art on RTE.

ABSTRACT

Multi-task learning shares information between related tasks, sometimes reducing the number of parameters required. State-of-the-art results across multiple natural language understanding tasks in the GLUE benchmark have previously used transfer from a single large task: unsupervised pre-training with BERT, where a separate BERT model was fine-tuned for each task. We explore multi-task approaches that share a single BERT model with a small number of additional task-specific parameters. Using new adaptation modules, PALs or 'projected attention layers', we match the performance of separately fine-tuned models on the GLUE benchmark with roughly 7 times fewer parameters, and obtain state-of-the-art results on the Recognizing Textual Entailment dataset.

Motivation & Objective

Motivate and develop parameter-efficient multi-task learning on top of a large pre-trained transformer (BERT).
Propose PALs as low-dimensional, shared-parameter adaptation modules that augment self-attention layers.
Explore training schedules (sampling strategies) to mitigate task imbalance during multi-task learning.
Compare PALs against other adaptation modules and baselines on GLUE tasks to assess efficiency and performance.
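
The sampling strategies mentioned above can be illustrated with a minimal sketch. It assumes each task i is sampled with probability proportional to N_i^alpha, where N_i is its training-set size: alpha = 1 gives proportional sampling, alpha = 0.5 gives the square-root sampling named in this review, and an annealed schedule lowers alpha as training proceeds. The function names, the decay amount (0.8), and the dataset sizes are illustrative assumptions, not values taken from this review.

```python
import numpy as np

def task_sampling_probs(sizes, alpha):
    """Return p_i proportional to N_i ** alpha for training-set sizes N_i."""
    weights = np.asarray(sizes, dtype=float) ** alpha
    return weights / weights.sum()

def annealed_alpha(epoch, total_epochs, drop=0.8):
    """Linearly decay alpha from 1.0 (first epoch) to 1.0 - drop (last epoch).

    Epochs are 1-indexed; drop=0.8 is an assumed setting.
    """
    if total_epochs <= 1:
        return 1.0
    return 1.0 - drop * (epoch - 1) / (total_epochs - 1)

# Hypothetical training-set sizes for a large, a medium, and a small task.
sizes = [400_000, 60_000, 2_500]

proportional = task_sampling_probs(sizes, alpha=1.0)
sqrt_sampling = task_sampling_probs(sizes, alpha=0.5)
final_epoch = task_sampling_probs(sizes, alpha=annealed_alpha(10, 10))

for name, probs in [("proportional", proportional),
                    ("sqrt", sqrt_sampling),
                    ("annealed, last epoch", final_epoch)]:
    print(name, np.round(probs, 3))
```

Lowering alpha shifts probability mass toward the small task, which is the intended effect when a few large datasets would otherwise dominate training.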

Proposed method

Introduce Projected Attention Layers (PALs) as a low-dimensional, shared-encoder/decoder transformation applied within BERT layers or at the top.
Experiment with several adaptation strategies (PALs, low-rank layers, top/bottom additions) under a 1.13x parameter budget.
Use V^E and V^D encoder/decoder matrices with a reduced hidden size d_s to create the task-specific transformation g(·) in a shared fashion across tasks.
Evaluate on eight GLUE tasks with a multi-task training regime and annealed/sqrt sampling to balance tasks.
Compare against fine-tuned BERT-base and other adapters, reporting performance across MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, and RTE.
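
The projection idea can be sketched in a few lines of NumPy. This is a rough illustration, not the authors' implementation: it assumes the task-specific transformation has the form V^D g(V^E h), with V^E projecting from the BERT hidden size d_m down to d_s, g being a small multi-head self-attention in the d_s-dimensional space, and V^D projecting back up. The value d_m = 768 is the BERT-base hidden size; d_s = 204, the 12 heads, the initialization scale, and all function names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, n_heads):
    """Plain multi-head self-attention on x of shape (seq_len, d_s)."""
    seq_len, d_s = x.shape
    d_head = d_s // n_heads

    def split(t):
        # (seq_len, d_s) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    out = scores @ v  # (n_heads, seq_len, d_head)
    return out.transpose(1, 0, 2).reshape(seq_len, d_s)

def pal(h, v_e, v_d, attn_params, n_heads):
    """Assumed form: V^D g(V^E h), with g = attention in d_s dimensions."""
    low = h @ v_e.T  # (seq_len, d_m) -> (seq_len, d_s)
    mixed = multi_head_attention(low, *attn_params, n_heads=n_heads)
    return mixed @ v_d.T  # (seq_len, d_s) -> (seq_len, d_m)

rng = np.random.default_rng(0)
d_m, d_s, seq_len, n_heads = 768, 204, 5, 12  # d_s and n_heads are assumed

v_e = rng.normal(scale=0.02, size=(d_s, d_m))  # encoder V^E
v_d = rng.normal(scale=0.02, size=(d_m, d_s))  # decoder V^D
attn_params = [rng.normal(scale=0.02, size=(d_s, d_s)) for _ in range(3)]

h = rng.normal(size=(seq_len, d_m))  # stand-in for one layer's hidden states
out = pal(h, v_e, v_d, attn_params, n_heads)

n_projection = v_e.size + v_d.size
n_attention = sum(w.size for w in attn_params)
print(out.shape, n_projection, n_attention)
```

Because the attention runs in d_s rather than d_m dimensions, each added attention block costs 3 * d_s^2 weights in this sketch, and the two projection matrices are the part that can be shared to save parameters.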

Experimental results

Research questions

RQ1: How can a single BERT base model be efficiently adapted to multiple tasks with a small number of task-specific parameters?
RQ2: What is the impact of adding PALs or other adapters on GLUE performance relative to full fine-tuning and other adaptation strategies?
RQ3: Where in the network should adaptation parameters be placed (top vs within layers) for best multi-task efficiency and performance?
RQ4: What training-schedule strategies best mitigate task imbalance in multi-task learning?

Key findings

PALs achieve comparable performance to fine-tuned BERT-base on many GLUE tasks with ~7x fewer parameters.
PALs significantly improve RTE performance, achieving state-of-the-art results compared to BERT-large and MT-DNN baselines.
On large sentence-pair tasks (MNLI, QQP, QNLI), PALs perform on par with or slightly better than fine-tuned BERT-base.
Comparing parameter-sharing strategies within and across tasks shows that adapting every layer (with PALs or low-rank layers) generally yields better results than adapting only the top or a subset of layers.
Six-layer PALs (with shared V^E and V^D) and low-rank adapters provide strong performance within the 1.13x parameter budget.
Simple sharing across tasks (fully shared model) performs competitively, but task-specific pooling and top adaptations can reduce performance on some tasks like RTE.
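
The "~7x fewer parameters" figure can be checked against the other numbers in this review. The sketch below assumes the baseline is one fully fine-tuned BERT-base per task (eight GLUE tasks) and that the multi-task model uses the 1.13x parameter budget, both measured in units of one BERT-base; the variable names are illustrative.

```python
# Parameter counts in units of one BERT-base model.
n_tasks = 8                    # eight GLUE tasks, one fine-tuned model each
separate_models = n_tasks * 1.0
shared_with_adapters = 1.13    # shared BERT plus all task-specific parameters

reduction = separate_models / shared_with_adapters
print(f"{reduction:.2f}x fewer parameters")  # prints "7.08x fewer parameters"
```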


This review was created by AI and reviewed by human editors.