QUICK REVIEW

[Paper Review] Standing on the Shoulders of Giant Frozen Language Models

Yoav Levine, Itay Dalmedigos|arXiv (Cornell University)|Apr 21, 2022
Topic Modeling|20 citations
TL;DR

The paper shows that three novel frozen-model methods (input-dependent prompt tuning, frozen readers, and recursive LMs) can match or exceed fine-tuned models on challenging NLP tasks without updating the base model's weights. It argues that frozen LMs retain their versatility, and that the added cost of these methods, though higher than that of existing frozen-model techniques, is negligible relative to a single pass through a huge frozen LM.

ABSTRACT

Huge pretrained language models (LMs) have demonstrated surprisingly good zero-shot capabilities on a wide variety of tasks. This gives rise to the appealing vision of a single, versatile model with a wide range of functionalities across disparate applications. However, current leading techniques for leveraging a "frozen" LM -- i.e., leaving its weights untouched -- still often underperform fine-tuning approaches which modify these weights in a task-dependent way. Those, in turn, suffer forgetfulness and compromise versatility, suggesting a tradeoff between performance and versatility. The main message of this paper is that current frozen-model techniques such as prompt tuning are only the tip of the iceberg, and more powerful methods for leveraging frozen LMs can do just as well as fine tuning in challenging domains without sacrificing the underlying model's versatility. To demonstrate this, we introduce three novel methods for leveraging frozen models: input-dependent prompt tuning, frozen readers, and recursive LMs, each of which vastly improves on current frozen-model approaches. Indeed, some of our methods even outperform fine-tuning approaches in domains currently dominated by the latter. The computational cost of each method is higher than that of existing frozen model methods, but still negligible relative to a single pass through a huge frozen LM. Each of these methods constitutes a meaningful contribution in its own right, but by presenting these contributions together we aim to convince the reader of a broader message that goes beyond the details of any given method: that frozen models have untapped potential and that fine-tuning is often unnecessary.

Motivation & Objective

  • Demonstrate that frozen LMs can achieve competitive performance on multi-task and open-domain QA benchmarks without fine-tuning the backbone model.
  • Propose and validate methods that expand the capabilities of frozen LMs beyond traditional prompt tuning.
  • Show that frozen LM-based components can match or surpass fine-tuned approaches in challenging domains while preserving model versatility.
  • Highlight practical considerations like cost and scalability of using frozen LMs in real-world deployments.

Proposed method

  • Introduce input-dependent prompt tuning (ID-PT) to generate input-specific prompts using a small prompt-generator network.
  • Demonstrate retrieval-augmented generation by using huge frozen LMs as readers with a re-ranking step over retrieved documents.
  • Develop LM recursion approaches (textual and neural) to apply multiple passes through a frozen LM to extract more information from inputs.
  • Compare frozen LM approaches to strong fine-tuned baselines on multi-task and open-domain QA benchmarks.
  • Provide architectural and training details for the ID-PT prompt generator and the cross-attention-based prompt synthesis mechanism.
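The input-dependent prompt idea in the list above can be illustrated with a minimal sketch. It assumes a toy setting that is not taken from the paper: a handful of trainable query vectors cross-attend over encoded input tokens, and the resulting vectors are prepended to the frozen LM's input embeddings. The names `generate_prompt` and `build_lm_input`, the dimensions, and the random vectors are all hypothetical, and for brevity the same toy vectors stand in for both the encoder states and the LM's token embeddings; the actual prompt generator described in the paper is larger and trained end to end.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_prompt(queries, input_states):
    """Toy cross-attention prompt generator (hypothetical, not the paper's code).

    queries:      trainable vectors, one per prompt token
    input_states: encoded input tokens, one vector per token
    returns:      one prompt vector per query, each a weighted
                  average of the input states
    """
    dim = len(queries[0])
    prompt = []
    for q in queries:
        weights = softmax([dot(q, h) / math.sqrt(dim) for h in input_states])
        prompt.append(
            [sum(w * h[i] for w, h in zip(weights, input_states)) for i in range(dim)]
        )
    return prompt


def build_lm_input(prompt, token_embeddings):
    # The frozen LM would consume the prompt vectors followed by the
    # ordinary token embeddings; the LM's own weights are never updated.
    return prompt + token_embeddings


rng = random.Random(0)
dim, num_prompt_tokens = 4, 3

# Only these query vectors (and, in a real system, the small encoder)
# would be trained; here they are random placeholders.
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_prompt_tokens)]

# Two different inputs, represented by random stand-in vectors.
input_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
input_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(7)]

prompt_a = generate_prompt(queries, input_a)
prompt_b = generate_prompt(queries, input_b)
lm_input_a = build_lm_input(prompt_a, input_a)
```

The point of the sketch is the contrast with ordinary prompt tuning: there, a single learned prompt is shared by every input, whereas here `prompt_a` and `prompt_b` differ because each is computed from its own input.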

Experimental results

Research questions

  • RQ1: Can a frozen language model match or exceed fine-tuned models in massively multi-task settings?
  • RQ2: Do frozen LMs augmented with external components (prompt generators, re-rankers, recursive passes) close the gap with fine-tuning in open-domain QA?
  • RQ3: How far can retrieval-augmented generation with frozen readers push performance on benchmarks like Natural Questions?
  • RQ4: What is the practical cost and scalability impact of deploying frozen-LM-based systems compared to fine-tuned models?

Key findings

  • ID-PT on a frozen 7B J1-Large model nearly matches the fine-tuned 11B T0++ model on the P3 multi-task suite, with performance comparable across task clusters.
  • ID-PT+J1-Large achieved higher average scores on sentiment and paraphrase tasks, while T0++ performed better on structure-to-text and summarization tasks.
  • A frozen 7B J1-Large reader with re-ranked passages can surpass some fine-tuned readers on Natural Questions when both use the same retriever (DPR), and its results improve with Spider+BM25 retrieval.
  • A 17B frozen J1-Grande reader with re-ranking can match or exceed FiD-Distill and EMDR2 baselines under certain retriever setups on Natural Questions.
  • LM recursion with multiple passes through a frozen LM yields significant gains in the closed-book ODQA setting compared to a single pass.
  • Across results, frozen-LM approaches can rival or outperform several strong fine-tuned baselines while preserving the model’s versatility.

This review was created by AI and reviewed by human editors.