QUICK REVIEW

[Paper Review] Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Xin Cheng, Shengding Hu|arXiv (Cornell University)|Jan 12, 2026


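One-sentence summary

This paper proposes Engram, a conditional memory module that provides scalable N-gram memory lookup to complement MoE. It finds a U-shaped allocation of sparsity between memory and computation and reports notable gains on reasoning, code/math, and long-context tasks.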

ABSTRACT

While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.

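Motivation and goals

Propose a sparsity axis for LLMs that goes beyond conditional computation (MoE) and complements it by exploiting static knowledge lookup.
Revisit N-gram embeddings as a scalable, differentiable memory mechanism integrated into the Transformer.
Formulate the Sparsity Allocation problem: balancing memory and computation under a fixed budget.
Show that Engram scales to tens of billions of parameters and yields gains on reasoning, knowledge, and code/math tasks.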

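Proposed method

Present Engram as a conditional memory module that retrieves static N-gram embeddings through multi-head hashing and deterministic addressing.
Implement tokenizer vocabulary compression to reduce the vocabulary size and support robust N-gram suffixes.
Use a context-aware gating mechanism that modulates the retrieved memory according to the current hidden state.
Combine the retrieved memory with the dynamic backbone through multi-branch-aware integration and residual connections.
Decouple memory from computation, enabling offloading to host memory and deterministic prefetching for efficiency.
Analyze allocation through the Sparsity Allocation framework, searching for the best split between MoE experts and Engram memory under fixed compute resources, which reveals a U-shaped scaling law.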
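
To make the lookup-and-gate steps listed above concrete, here is a minimal pure-Python sketch. It is not the paper's implementation. The names `ngram_slot` and `ToyEngram`, the choice of BLAKE2b as the hash, summing the per-head rows, and the scalar sigmoid gate are all assumptions chosen for illustration. The source only states that static N-gram embeddings are retrieved through multi-head hashing with deterministic addressing and then modulated by a gate that depends on the current hidden state.

```python
import hashlib
import math
import random


def ngram_slot(ngram, head, table_size):
    """Deterministically map an N-gram (tuple of token ids) to a row index.

    Each hash head salts the key with its own index, so the same N-gram
    lands in a different row per head (hypothetical hashing scheme).
    """
    key = f"{head}:" + ",".join(map(str, ngram))
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % table_size


class ToyEngram:
    """Toy conditional-memory module: hashed N-gram tables plus a gate."""

    def __init__(self, n=2, num_heads=2, table_size=1024, dim=4, seed=0):
        rng = random.Random(seed)
        self.n, self.num_heads = n, num_heads
        self.table_size, self.dim = table_size, dim
        # One static embedding table per hash head.
        self.tables = [
            [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(table_size)]
            for _ in range(num_heads)
        ]

    def lookup(self, tokens, pos):
        """Fetch memory for the N-gram ending at `pos`: one table read per head."""
        ngram = tuple(tokens[max(0, pos - self.n + 1): pos + 1])
        mem = [0.0] * self.dim
        for head in range(self.num_heads):
            row = self.tables[head][ngram_slot(ngram, head, self.table_size)]
            mem = [m + r for m, r in zip(mem, row)]  # assumption: heads are summed
        return mem

    def fuse(self, hidden, mem):
        """Gate the memory by the hidden state, then add it residually."""
        score = sum(h * m for h, m in zip(hidden, mem)) / math.sqrt(self.dim)
        gate = 1.0 / (1.0 + math.exp(-score))  # assumption: scalar sigmoid gate
        return [h + gate * m for h, m in zip(hidden, mem)]


engram = ToyEngram()
memory = engram.lookup([5, 7, 9], pos=2)          # memory for the bigram (7, 9)
fused = engram.fuse([0.1, -0.2, 0.3, 0.0], memory)
```

The address depends only on the token ids, so the bigram (7, 9) retrieves the same vector regardless of the preceding context. Only the gate makes the contribution context-dependent.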

Figure 1: The Engram Architecture. The module augments the backbone by retrieving static $N$-gram memory and fusing it with dynamic hidden states via context-aware gating. This module is applied only to specific layers to decouple memory from compute, leaving the standard input embedding and un-em

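Experimental results

Research questions

RQ1: Under a fixed parameter and compute budget, how should model capacity be allocated between conditional computation (MoE) and conditional memory (Engram)?
RQ2: Does introducing a scalable memory primitive complement MoE and improve performance on knowledge, reasoning, and long-context tasks?
RQ3: How does Engram scale under an unlimited or large memory budget?
RQ4: By decoupling storage from computation and offloading to host memory, can Engram maintain or improve efficiency?
RQ5: How does Engram affect internal representations and practical long-context retrieval?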

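Key findings

Engram with 27 billion total parameters outperforms an iso-parameter, iso-FLOPs MoE baseline across diverse tasks.
Under a fixed budget, the optimal allocation follows a U-shaped curve: performance is best when part of the sparse capacity (roughly 20–25% of the sparse budget) is assigned to Engram.
In the infinite-memory regime, Engram's gains follow a power law, delivering notable improvements without additional compute.
Engram achieves notable gains on long-context benchmarks (LongPPL and RULER) and improves performance on retrieval-heavy and code/math tasks.
Mechanistic analysis shows that Engram reduces static reconstruction in the early layers, effectively deepening the network for reasoning, and frees attention to capture global context.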
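
The allocation finding above can be read as a one-dimensional search over the fraction of the sparse budget given to Engram. The sketch below shows only the bookkeeping. `split_sparse_budget` and `best_fraction` are hypothetical helper names, the 10B budget is an arbitrary illustrative number, and `placeholder_loss` is a synthetic stand-in, not a measurement from the paper.

```python
def split_sparse_budget(sparse_params, engram_fraction):
    """Split a sparse-parameter budget between Engram memory and MoE experts."""
    if not 0.0 <= engram_fraction <= 1.0:
        raise ValueError("engram_fraction must lie in [0, 1]")
    engram_params = round(sparse_params * engram_fraction)
    return {"engram": engram_params, "moe": sparse_params - engram_params}


def best_fraction(loss_fn, fractions):
    """Grid search: return the candidate fraction with the lowest loss."""
    return min(fractions, key=loss_fn)


def placeholder_loss(fraction):
    """Synthetic U-shaped curve for illustration only (not data from the paper)."""
    return (fraction - 0.2) ** 2


candidates = [0.0, 0.1, 0.2, 0.3, 0.4]
chosen = best_fraction(placeholder_loss, candidates)
print(chosen, split_sparse_budget(10_000_000_000, chosen))
```

In practice `loss_fn` would train and evaluate a model for each candidate split under the same parameter and FLOPs budget. The placeholder here only shows the shape of the search.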

Figure 2 : System implementation of Engram. (a) Training Phase: The massive embedding tables are sharded across available GPUs. An All-to-All communication primitive is employed to retrieve active embedding rows across devices. (b) Inference Phase: Engram tables are offloaded to host memory. By expl
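
The abstract notes that deterministic addressing enables runtime prefetching from host memory. The sketch below illustrates that idea under stated assumptions: `HOST_TABLE`, `slot`, `slots_for`, `fetch_rows`, `run_layer_stub`, and `forward_with_prefetch` are hypothetical names, a Python dict stands in for host memory, and a background thread stands in for an asynchronous host-to-device copy. It does not reflect the paper's actual system code.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an embedding table kept in host memory (hypothetical contents).
HOST_TABLE = {s: [float(s), float(s)] for s in range(16)}


def slot(ngram, table_size):
    """Deterministic polynomial hash of a token-id sequence (illustrative)."""
    h = 0
    for token in ngram:
        h = (h * 1_000_003 + token) % table_size
    return h


def slots_for(tokens, n=2, table_size=16):
    """Row addresses depend only on token ids, so they are known up front."""
    return [
        slot(tokens[max(0, pos - n + 1): pos + 1], table_size)
        for pos in range(len(tokens))
    ]


def fetch_rows(slots):
    """Simulated host-memory read of the rows needed for this sequence."""
    return [HOST_TABLE[s] for s in slots]


def run_layer_stub(hidden):
    """Placeholder for the backbone layers that run before the Engram layer."""
    return [v + 1.0 for v in hidden]


def forward_with_prefetch(tokens, hidden):
    slots = slots_for(tokens)  # computed before any layer runs
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_rows, slots)  # fetch overlaps with compute
        hidden = run_layer_stub(hidden)           # earlier layers keep working
        rows = pending.result()                   # collected at the Engram layer
    return hidden, rows


hidden, rows = forward_with_prefetch([3, 4, 5], [0.0, 1.0])
```

Because the addresses are a pure function of the input tokens, the fetch can be issued before the layers that precede the memory module finish. This is the property that makes overlapping the transfer with compute possible.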


This review was generated by AI and reviewed by human editors.