QUICK REVIEW

[Paper Review] Transformers Can Do Arithmetic with the Right Embeddings

Sean McLeish, Arpit Bansal|arXiv (Cornell University)|2024. 05. 27.
Computability, Logic, AI Algorithms|Citations: 6
One-Line Summary

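This paper introduces Abacus Embeddings, which encode the position of each digit within its number. With them, transformers can perform arithmetic on long numbers and, aided by input injection and a looped transformer architecture, extrapolate to much larger digit lengths than those seen in training.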

ABSTRACT

The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside of a large span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With positions resolved, we can study the logical extrapolation ability of transformers. Can they solve arithmetic problems that are larger and more complex than those in their training data? We find that training on only 20 digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100 digit addition problems. Finally, we show that these gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.

Motivation and Objectives

  • Identify the architectural and data representation bottlenecks that limit transformer arithmetic abilities.
  • Propose Abacus Embeddings to encode digit significance and improve length generalization.
  • Evaluate how recurrence and input injection interact with Abacus Embeddings to boost performance on addition, multiplication, and sorting.
  • Demonstrate state-of-the-art extrapolation to 100+ digit arithmetic and transferability to other algorithmic tasks.

Proposed Method

  • Introduce Abacus Embeddings: a learned positional embedding applied to all digits of the same significance within a number.
  • Train decoder-only causal transformers on 20-million-sample addition data with least-significant-digit-first formatting and no padding.
  • Compare standard transformers, input injection variants, and looped transformers across absolute and relative embedding schemes.
  • Evaluate in-distribution, out-of-distribution, and extreme OOD performance, including 100+ digit addition.
  • Extend experiments to multiplication and sorting to test generalization to other algorithmic tasks.
  • Examine compatibility of Abacus Embeddings with FIRE and RoPE relative positional embeddings.
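
The least-significant-digit-first formatting mentioned in the list above can be illustrated with a minimal sketch. The `a+b=c` string template and the helper names `format_addition_lsd_first` and `parse_answer_lsd_first` are assumptions made for illustration; the review only states that numbers are written least-significant digit first and without padding.

```python
def format_addition_lsd_first(a: int, b: int) -> str:
    """Render a + b = c with every number written least-significant digit first.

    The "a+b=c" template is an assumption for illustration; no padding is added.
    """
    def rev(n: int) -> str:
        # Reverse the decimal string so the ones digit comes first.
        return str(n)[::-1]

    return f"{rev(a)}+{rev(b)}={rev(a + b)}"


def parse_answer_lsd_first(sample: str) -> int:
    """Recover the integer answer from a reversed-digit sample string."""
    answer_reversed = sample.split("=")[1]
    return int(answer_reversed[::-1])


example = format_addition_lsd_first(957, 68)
print(example)                          # 759+86=5201
print(parse_answer_lsd_first(example))  # 1025
```

One plausible reading of this design choice is that a causal decoder generates the answer left to right, so emitting the ones digit first lets each carry be resolved before the digit that depends on it.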
Figure 1: Zero shot exact match accuracy on addition using depth sixteen transformer (decoder only) models trained on operands of up to 20 digits. Compared to state-of-the-art embeddings (left), our new Abacus Embeddings (right) dramatically improve generalization to unseen digit lengths. The interi
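
The exact match metric named in the Figure 1 caption can be sketched as follows, assuming predictions and targets are compared as whole answer strings. The function name `exact_match_accuracy` and the sample strings are hypothetical, not taken from the paper.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that equal their target string exactly.

    A single wrong digit makes the whole answer count as incorrect.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for p, t in zip(predictions, targets) if p == t)
    return hits / len(targets)


# Illustrative strings only: the second prediction differs in one digit.
preds = ["5201", "99", "0011"]
golds = ["5201", "98", "0011"]
print(exact_match_accuracy(preds, golds))
```

Because one wrong digit fails the whole sample, this metric becomes stricter as operand length grows, which is worth keeping in mind when reading accuracy at 100 digits.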

Experimental Results

Research Questions

  • RQ1: Can Abacus Embeddings enable length generalization and zero-shot extrapolation for multi-digit addition beyond training lengths?
  • RQ2: Do recurrence and input injection further reduce generalization error when used with Abacus Embeddings?
  • RQ3: How well do these methods transfer to larger arithmetic (multiplication) and to non-arithmetic algorithmic tasks (sorting)?
  • RQ4: Are Abacus Embeddings compatible and complementary with existing relative position embeddings like FIRE and RoPE?

Key Results

  • Abacus Embeddings dramatically improve generalization for addition, enabling up to 99.1% accuracy on 100-digit addition and extrapolation to 120-digit problems.
  • Abacus Embeddings combined with input injection and looped transformers achieve near-perfect generalization and extrapolate to lengths up to 6× those seen in training.
  • Looped (recurrent) transformers can reduce error rates by up to about 50% compared to non-recurrent baselines on out-of-distribution addition.
  • Abacus Embeddings also improve in-distribution performance for multiplication and enhance sorting accuracy in out-of-distribution scenarios when combined with FIRE.
  • Abacus Embeddings are compatible with FIRE and RoPE, and together with FIRE they unlock generalization beyond what FIRE alone can achieve.
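
The interaction of recurrence and input injection described in the results above can be shown schematically. This is a toy sketch, not the authors' implementation: `looped_forward` and `toy_block` are hypothetical names, the block is a stand-in for a stack of transformer layers, and the choice to skip injection on the first pass (where the hidden state already equals the input) is an assumption.

```python
from typing import Callable, List

Vector = List[float]


def looped_forward(
    x: Vector,
    block: Callable[[Vector], Vector],
    n_loops: int,
    inject: bool = True,
) -> Vector:
    """Apply the same block n_loops times, optionally re-adding the input each pass.

    Reusing one block is the "looped" (weight-shared) part; adding x back to the
    hidden state before each later pass is the "input injection" part.
    """
    h = list(x)
    for step in range(n_loops):
        if inject and step > 0:
            # Re-add the original input so it is not washed out over many loops.
            h = [hi + xi for hi, xi in zip(h, x)]
        h = block(h)
    return h


def toy_block(h: Vector) -> Vector:
    """Stand-in for a transformer block: halves every value."""
    return [0.5 * v for v in h]


x = [1.0, 2.0]
print(looped_forward(x, toy_block, n_loops=2, inject=True))   # [0.75, 1.5]
print(looped_forward(x, toy_block, n_loops=2, inject=False))  # [0.25, 0.5]
```

In the toy run, the injected variant keeps a stronger trace of the original input after two passes, which is the intuition behind re-supplying the input to a recurrent stack.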
Figure 2: Visualization of data formats and positional embeddings. Abacus Embeddings give the same positional embeddings to all digits of the same significance.
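
The idea in the Figure 2 caption, that all digits of the same significance share one positional embedding, can be sketched as an index computation. This is a minimal illustration under stated assumptions: tokens are single characters, numbers are written least-significant digit first, and non-digit tokens get index 0. The function name `abacus_position_ids` is hypothetical, and in a real model these indices would select rows of a learned embedding table that is added to the token embeddings.

```python
from typing import List


def abacus_position_ids(tokens: List[str]) -> List[int]:
    """Assign each digit its 1-based offset from the start of its number.

    Non-digit tokens reset the counter and receive index 0 (an assumption
    made for this sketch).
    """
    ids: List[int] = []
    run = 0
    for tok in tokens:
        if tok.isdigit():
            run += 1
            ids.append(run)
        else:
            run = 0
            ids.append(0)
    return ids


# 957 + 68 = 1025, with every number written least-significant digit first.
tokens = list("759+86=5201")
print(abacus_position_ids(tokens))
# [1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4]
```

With reversed digits, index 1 always marks the ones place, index 2 the tens place, and so on, so the digits that must be added together carry identical positional information regardless of where the numbers sit in the sequence.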


This review was generated by AI and reviewed by a human editor.