QUICK REVIEW

[Paper Review] Thinking into the Future: Latent Lookahead Training for Transformers

Lorenzo Noci, Gregor Bachmann|arXiv (Cornell University)|2026. 03. 03.

Multimodal Machine Learning Applications | Citations: 0

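One-Line Summary

The paper introduces latent lookahead, a training strategy in which, before emitting the next token, the transformer rolls its hidden states forward for tau steps and the resulting latent predictions are supervised against the next tau ground-truth tokens, improving performance on planning and reasoning tasks.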

ABSTRACT

Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform; every token is formed based on a single forward-pass, potentially limiting the model's expressiveness in cases where difficult tokens require inherently more compute. Towards addressing these limitations, we introduce latent lookahead, a training strategy that enables models to "think" before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for $τ$ steps, investing more compute on predicting that token. This produces $τ$ latent predictions that are supervised against the next $τ$ ground-truth tokens, encouraging the model to "lookahead" and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.

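Motivation and Objectives

Overcome the myopic behavior of autoregressive next-token prediction through multi-step latent lookahead.
Introduce a differentiable training objective that supervises latent predictions against future ground-truth tokens.
Enable non-uniform compute allocation per token by extending the context with latent thoughts.
Demonstrate gains on planning-oriented tasks that require foresight, such as Sudoku, ProsQA, and Maze.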

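Proposed Method

Define latent tokens z_{i,j} that extend the context after a visible token x_i.
Unroll tau latent steps with the transformer over the augmented context e^{aug} to produce z_{i,j}.
Train each z_{i,j} to predict x_{i+j}, using the next tau ground-truth tokens as supervision targets.
Train the visible tokens with standard next-token prediction while letting them attend to the latent thoughts through augmented attention.
Use a not-fully-causal attention mask to allow parallel generation of latent thoughts and bidirectional attention among the latents within a thought.
Combine L_NTP and L_latent into a single objective L = L_NTP + L_latent.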
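
The last two steps (the not-fully-causal mask and the combined objective) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes latent tokens are stored inline in the sequence and tagged with a thought id, that each loss term is a mean cross-entropy, and that the two terms are weighted equally as in L = L_NTP + L_latent. The names `lookahead_attention_mask`, `cross_entropy`, `combined_loss`, and `thought_ids` are hypothetical.

```python
import numpy as np

def lookahead_attention_mask(thought_ids):
    """Build a boolean attention mask; mask[q, k] is True if query q may attend to key k.

    thought_ids[t] is -1 for a visible token, or a thought id >= 0 for a latent token.
    Assumed layout (not from the paper): latents sit inline, right after the
    visible token they belong to.
    """
    thought_ids = np.asarray(thought_ids)
    seq_len = thought_ids.shape[0]
    # Standard causal part: every position attends to itself and to earlier positions.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Bidirectional part: latents of the same thought attend to each other freely.
    is_latent = thought_ids >= 0
    same_thought = (thought_ids[:, None] == thought_ids[None, :]) & is_latent[:, None]
    return causal | same_thought

def cross_entropy(logits, targets):
    """Mean cross-entropy over positions; logits is (N, V), targets is (N,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def combined_loss(visible_logits, visible_targets, latent_logits, latent_targets):
    """L = L_NTP + L_latent, with the latents supervised on future ground-truth tokens."""
    l_ntp = cross_entropy(visible_logits, visible_targets)
    l_latent = cross_entropy(latent_logits, latent_targets)
    return l_ntp + l_latent

# Two visible tokens, one thought of tau = 3 latents, then one more visible token.
mask = lookahead_attention_mask([-1, -1, 0, 0, 0, -1])
```

In this sketch the first latent (index 2) can attend to the last latent of its thought (index 4), while the visible token at index 1 still cannot see anything to its right. How the paper actually lays out latents in the sequence, and whether it reweights the two loss terms, is not stated in this review.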

Figure 1 : Standard autoregressive inference vs latent lookahead. Left: in standard next token prediction, the model samples from the hidden state of the latest generated token after applying the final unembedding head, and appends the generated token to the context. Right: in our approach, the mode

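Experimental Results

Research Questions

RQ1: Does latent lookahead improve performance on planning-centric tasks over autoregressive and pause-token baselines?
RQ2: How do increasing the latent horizon tau and the number of latent positions n affect performance?
RQ3: Is latent lookahead more effective than multi-token prediction or looped-refinement baselines?
RQ4: How do the attention mask and the latent decoding strategy affect training and inference?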

Main Results

Model	Mini 4x4 Sudoku	Full 9x9 Sudoku	ProsQA	Maze
Ours	93.5	35.5	91.8	21.5
Pause	86.0	12.5	82.5	19.5
Standard NTP	78.0	12.5	80.5	18.5

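Latent lookahead substantially outperforms the autoregressive and pause-token baselines on planning tasks such as Sudoku, ProsQA, and Maze.
Accuracy on full 9x9 Sudoku rises from 12.5% with the NTP baseline to 35.5% with latent lookahead.
On 4x4 Mini-Sudoku, latent lookahead reaches 93.5% accuracy, surpassing deeper baselines.
Increasing the latent horizon tau raises accuracy monotonically across tasks, surpasses the baselines, and saturates less than pause tokens.
Allocating latent thoughts sequentially at the beginning of the sequence yields larger gains than random allocation.
Visualizations show latent tokens concentrating near decision points, suggesting iterative refinement of predictions.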
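
To make the size of these gains concrete, the short script below recomputes the gaps between the table rows. It uses only the numbers from the results table above; the dictionary layout and the helper name `gains_over` are illustrative choices, not part of the paper.

```python
# Scores copied from the results table above.
results = {
    "Ours":         {"Mini 4x4 Sudoku": 93.5, "Full 9x9 Sudoku": 35.5, "ProsQA": 91.8, "Maze": 21.5},
    "Pause":        {"Mini 4x4 Sudoku": 86.0, "Full 9x9 Sudoku": 12.5, "ProsQA": 82.5, "Maze": 19.5},
    "Standard NTP": {"Mini 4x4 Sudoku": 78.0, "Full 9x9 Sudoku": 12.5, "ProsQA": 80.5, "Maze": 18.5},
}

def gains_over(baseline, method="Ours"):
    """Absolute gain (in points) of `method` over `baseline` on every task."""
    return {
        task: round(results[method][task] - results[baseline][task], 1)
        for task in results[method]
    }

for baseline in ("Standard NTP", "Pause"):
    print(baseline, gains_over(baseline))
```

The largest gap is on full 9x9 Sudoku (23.0 points over both baselines), and the smallest is on Maze (3.0 points over standard NTP, 2.0 over pause tokens).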

Figure 2 : Lookahead behaviour when solving a Sudoku. In the first slot, both $1$ and $3$ are viable options. However, when thinking ahead to the second empty slot, where $3$ is the only plausible entry, it is easy to realize that $1$ is the right choice for the first slot.
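
The reasoning in the Figure 2 caption can be mimicked with a few lines of explicit constraint logic. This toy example only illustrates why looking one slot ahead resolves the ambiguity. It is not how the model computes its answer, and it assumes the two empty slots share a row, column, or box so that they cannot hold the same digit. The function name `resolve_with_lookahead` is hypothetical.

```python
def resolve_with_lookahead(first_candidates, second_candidates):
    """Prune the first slot's candidates using what is known about the second slot.

    Assumption: both slots lie in the same Sudoku unit, so a digit forced
    into the second slot is unavailable for the first.
    """
    if len(second_candidates) == 1:
        return first_candidates - second_candidates
    return first_candidates

first_slot = {1, 3}   # both digits look viable in isolation
second_slot = {3}     # only one plausible entry
remaining = resolve_with_lookahead(first_slot, second_slot)
print(remaining)
```

A purely greedy decoder has to commit to the first slot before it has considered the second one, which is the myopia latent lookahead is meant to address.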


This review was generated by AI and reviewed by a human editor.