QUICK REVIEW

[Paper Review] Thinking into the Future: Latent Lookahead Training for Transformers

Lorenzo Noci, Gregor Bachmann|arXiv (Cornell University)|Mar 3, 2026

Multimodal Machine Learning Applications | Cited by: 0

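One-line summary

This paper introduces latent lookahead, a training strategy in which a transformer rolls its hidden states forward for tau steps before emitting the next token and is supervised against the next tau ground-truth tokens, which improves performance on planning and reasoning tasks.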

ABSTRACT

Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform; every token is formed based on a single forward-pass, potentially limiting the model's expressiveness in cases where difficult tokens require inherently more compute. Towards addressing these limitations, we introduce latent lookahead, a training strategy that enables models to "think" before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for $τ$ steps, investing more compute on predicting that token. This produces $τ$ latent predictions that are supervised against the next $τ$ ground-truth tokens, encouraging the model to "lookahead" and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.

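Motivation and objectives

Address the myopic behavior of autoregressive next-token prediction, which commits to one token at a time, by enabling multi-step latent lookahead.
Introduce a differentiable training objective that supervises latent predictions against future ground-truth tokens.
Enable non-uniform compute allocation across tokens by augmenting the context with latent thoughts.
Demonstrate gains on planning-oriented tasks that require foresight, such as Sudoku, ProsQA, and Maze.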

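Proposed method

Define latent tokens z_{i,j} that extend the context after x_i.
Generate z_{i,j} by unrolling the transformer for tau latent steps over the augmented context e^{aug}.
Train z_{i,j} to predict x_{i+j}, so that the latents are supervised against the next tau ground-truth tokens.
Train the visible tokens with standard next-token prediction while allowing them to attend to the latent thoughts through the extended attention pattern.
Use an attention mask that is not fully causal: it permits parallel generation of latent thoughts and bidirectional attention among the latents within a thought.
Combine L_NTP and L_latent into a single objective, L = L_NTP + L_latent.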
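
The objective in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout for the latent logits, the mean reduction inside each term, and the choice to skip targets that fall past the end of the sequence are all assumptions made here. Only the alignment of z_{i,j} with x_{i+j} and the unweighted sum L = L_NTP + L_latent come from the review.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def lookahead_loss(tokens, ntp_logits, latent_logits, tau):
    """Return (L, L_NTP, L_latent) for one sequence.

    tokens        : token ids x_0 .. x_{T-1}
    ntp_logits    : ntp_logits[i] scores x_{i+1} from visible position i
    latent_logits : hypothetical layout {i: [logits of z_{i,1}, ..., z_{i,tau}]}
                    with one entry per position i that carries latent thoughts
    """
    T = len(tokens)
    if T < 2:
        raise ValueError("need at least two tokens for a next-token term")
    # Standard next-token term: visible position i predicts x_{i+1}.
    ntp_terms = [cross_entropy(ntp_logits[i], tokens[i + 1]) for i in range(T - 1)]
    # Latent term: z_{i,j} predicts x_{i+j}. Targets beyond the sequence end
    # are skipped (an assumption of this sketch).
    latent_terms = []
    for i, per_step in latent_logits.items():
        for j in range(1, tau + 1):
            if i + j < T:
                latent_terms.append(cross_entropy(per_step[j - 1], tokens[i + j]))
    l_ntp = sum(ntp_terms) / len(ntp_terms)  # mean reduction is assumed
    l_latent = sum(latent_terms) / len(latent_terms) if latent_terms else 0.0
    return l_ntp + l_latent, l_ntp, l_latent

# Toy usage: vocabulary of size 3, latent thoughts attached to position 0 only.
tokens = [0, 1, 2, 1]
uniform = [0.0, 0.0, 0.0]
total, l_ntp, l_latent = lookahead_loss(
    tokens,
    ntp_logits=[uniform] * 3,
    latent_logits={0: [uniform, uniform]},
    tau=2,
)
print(total, l_ntp, l_latent)
```

With uniform logits every term equals the log of the vocabulary size, which gives a quick sanity check on the target alignment before plugging in real model outputs.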

Figure 1 : Standard autoregressive inference vs latent lookahead. Left: in standard next token prediction, the model samples from the hidden state of the latest generated token after applying the final unembedding head, and appends the generated token to the context. Right: in our approach, the mode

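Experimental results

Research questions

RQ1 Does latent lookahead improve performance on planning-heavy tasks compared with autoregressive and pause-token baselines?
RQ2 How do increases in the latent horizon tau and in the number of latent positions n affect performance?
RQ3 Is latent lookahead more effective than multi-token prediction and looped-refinement baselines?
RQ4 How do the attention mask and the latent decoding strategy affect training and inference?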

Key findings

Model	Mini 4x4 Sudoku	Full 9x9 Sudoku	ProsQA	Maze
Ours (latent lookahead)	93.5	35.5	91.8	21.5
Pause	86.0	12.5	82.5	19.5
Standard NTP	78.0	12.5	80.5	18.5
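
The margins implied by the table above can be recomputed with a short script. The numbers are copied from the table as shown; the variable and function names are illustrative only, and "Ours" denotes latent lookahead.

```python
# Scores copied from the table above, in column order.
tasks = ["Mini 4x4 Sudoku", "Full 9x9 Sudoku", "ProsQA", "Maze"]
scores = {
    "Ours": [93.5, 35.5, 91.8, 21.5],  # latent lookahead
    "Pause": [86.0, 12.5, 82.5, 19.5],
    "Standard NTP": [78.0, 12.5, 80.5, 18.5],
}

def margins(baseline):
    """Absolute gain of Ours over `baseline` in points, rounded to one decimal."""
    return {
        task: round(ours - base, 1)
        for task, ours, base in zip(tasks, scores["Ours"], scores[baseline])
    }

for name in ("Pause", "Standard NTP"):
    print(name, margins(name))
```

The gain is largest on Full 9x9 Sudoku (23.0 points over both baselines) and smallest on Maze (2.0 points over Pause, 3.0 over Standard NTP).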

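Latent lookahead outperforms both the standard NTP and pause-token baselines on all four planning tasks; the margin is largest on Sudoku and ProsQA and smallest on Maze (21.5 vs. 19.5 for Pause).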
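On full 9x9 Sudoku, accuracy rises from 12.5% for the NTP baseline to 35.5% with latent lookahead.
On 4x4 Mini-Sudoku, latent lookahead reaches 93.5% accuracy, outperforming deeper baselines.
Increasing the latent horizon tau improves accuracy monotonically across tasks, staying above the baselines and not plateauing the way pause tokens do.
Allocating latent thoughts contiguously at early positions is more effective than allocating them at random positions.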
Visualizations show that the latent tokens concentrate near decision vertices, which suggests iterative refinement of the prediction.

Figure 2 : Lookahead behaviour when solving a Sudoku. In the first slot, both $1$ and $3$ are viable options. However, when thinking ahead to the second empty slot, where $3$ is the only plausible entry, it is easy to realize that $1$ is the right choice for the first slot.
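
The reasoning in the Figure 2 caption can be reproduced as a tiny constraint check. This sketch is not the model's mechanism: it enumerates fills explicitly, whereas the paper's lookahead happens in latent space. It also assumes that the two empty slots share a row, column, or box, which the caption implies but does not state. The helper name `consistent_fills` is hypothetical.

```python
from itertools import product

def consistent_fills(candidates):
    """Enumerate fills in which no digit repeats.

    candidates: one set of locally viable digits per empty cell. All cells are
    assumed to share a Sudoku unit, so every digit may be used at most once.
    """
    options = [sorted(c) for c in candidates]
    return [fill for fill in product(*options) if len(set(fill)) == len(fill)]

# Slot 1 admits 1 or 3 when viewed alone; slot 2 admits only 3 (as in the caption).
fills = consistent_fills([{1, 3}, {3}])
first_slot_options = {fill[0] for fill in fills}
print(fills, first_slot_options)
```

Looking one slot ahead removes 3 from the first slot's options and leaves 1 as the only consistent choice, matching the caption.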

This review was generated by AI and checked by a human editor.