QUICK REVIEW

[Paper Review] Language Model Memory and Memory Models for Language

Benjamin L. Badger|arXiv (Cornell University)|Feb 13, 2026

Topic Modeling | Cited by: 0

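One-Sentence Summary

The paper shows that standard language model embeddings store little input information, whereas autoencoders form nearly perfect memories, and it introduces encoder-decoder memory models that combine training objectives with curriculum learning to form and decode information-rich memories.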

ABSTRACT

The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of 'memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.

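Research Motivation and Goals

Assess how much input information language model embeddings retain under different training regimes.
Compare causal language models, retrieval models, and autoencoders in terms of memory formation and invertibility.
Propose a parallelizable encoder-decoder memory architecture that enables retrieval of arbitrary input information.
Demonstrate that training strategies (encoder freezing, curriculum learning) improve memory formation without sacrificing efficiency.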

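Proposed Methods

Invert embeddings with a trainable decoder that reconstructs the input sequence, in order to measure information retention.
Introduce an information-based framework that uses an entropy ratio and a Hamming-distance-based token accuracy metric.
Develop parallelizable encoder-decoder memory models and evaluate them under causal training and under combined objective functions.
Explore frozen-encoder memory models and curriculum training to decouple information retention from next-token prediction.
Experiment with pretrained large language models as memory model decoders to assess scalability across model sizes.
Apply three evaluations to probe memory capacity: encoder-decoder information retention, a copy task, and a blank copy task.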
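
The Hamming-distance-based token accuracy listed above can be sketched as follows. This is a minimal illustration, assuming accuracy is the fraction of positions where the reconstructed token matches the input token (one minus the normalized Hamming distance). The function names and the equal-length requirement are assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence


def hamming_distance(original: Sequence[int], reconstructed: Sequence[int]) -> int:
    """Count positions where two equal-length token-ID sequences differ."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(original, reconstructed))


def token_accuracy(original: Sequence[int], reconstructed: Sequence[int]) -> float:
    """Assumed definition: 1 - (Hamming distance / sequence length)."""
    if len(original) == 0:
        raise ValueError("sequences must be non-empty")
    return 1.0 - hamming_distance(original, reconstructed) / len(original)


if __name__ == "__main__":
    # Hypothetical token IDs: the decoder gets one of five positions wrong.
    source_ids = [12, 7, 7, 301, 9]
    decoded_ids = [12, 7, 8, 301, 9]
    print(token_accuracy(source_ids, decoded_ids))
```

Under this assumed definition, a score of 1.0 means the decoder regenerated every input token from the embedding, which is the sense in which a memory would be called perfect.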

Figure 1: Information retention experimental approach (left) and example training runs (right).

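Experimental Results

Research Questions

RQ1: How much input information do causal language models retain in their memory embeddings?
RQ2: Can memory models be trained to form accurate, information-rich memories that an independent decoder can decode?
RQ3: Do encoder-decoder memory architectures offer computational advantages and memory capacity comparable to full-context models?
RQ4: Which training strategies (such as encoder freezing, curriculum learning, and combined objectives) can optimize memory formation without sacrificing language modeling performance?
RQ5: Can pretrained large language models function effectively as decoders for memory-augmented encoders?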

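Key Findings

Memories from causal language models contain relatively little input information across data and compute scales.
Autoencoders trained for input regeneration form information-rich memories that are nearly perfect.
Parallelizable encoder-decoder memory architectures trained with combined objectives improve memory formation and enable arbitrary information access.
Frozen-encoder memory models combined with curriculum training achieve efficient training and robust memory capacity.
Memory models trained with combined causal and copy objectives show promise for both predicting next tokens and storing and using information-rich memories, although actual performance depends on architectural choices and training regime.
When decoders come from large pretrained LLMs, scaling up model size alone yields only limited gains in the memory model's information retention.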
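
The curriculum described in the abstract, where decoders first learn to process memories and then additionally learn to predict next tokens, can be sketched as a two-phase loss schedule. This is a rough illustration only: the hard switch at a fixed step, the unit loss weights, and the names `loss_weights`, `combined_loss`, and `copy_only_steps` are all assumptions, since this summary does not give the paper's actual schedule or weighting. Encoder freezing (not updating encoder parameters) is outside the scope of the sketch.

```python
from typing import Tuple


def loss_weights(step: int, copy_only_steps: int) -> Tuple[float, float]:
    """Return (copy_weight, causal_weight) for an assumed two-phase curriculum.

    Phase 1 (step < copy_only_steps): only the copy (information
    retention) loss is active, so the decoder learns to read memories.
    Phase 2 (step >= copy_only_steps): the causal next-token loss is
    added on top of the copy loss.
    """
    if step < 0 or copy_only_steps < 0:
        raise ValueError("step counts must be non-negative")
    causal_weight = 0.0 if step < copy_only_steps else 1.0
    return 1.0, causal_weight


def combined_loss(copy_loss: float, causal_loss: float,
                  step: int, copy_only_steps: int) -> float:
    """Weighted sum of the two objectives at a given training step."""
    copy_weight, causal_weight = loss_weights(step, copy_only_steps)
    return copy_weight * copy_loss + causal_weight * causal_loss


if __name__ == "__main__":
    # Hypothetical loss values, evaluated before and after the switch.
    for step in (0, 99, 100, 500):
        print(step, combined_loss(2.0, 3.0, step, copy_only_steps=100))
```

Keeping the copy term active in the second phase reflects the combined-objective idea from the findings above: the next-token loss is added to the retention objective rather than replacing it.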

Figure 2: Memory Model Architecture and $n_{ctx}=256$ per chunk, $s=4$ chunk causal training characteristics on FineWeb. Mixers are $d_{m}=512$ for encoders, $d_{m}=1024$ for decoders and Transformers $d_{m}=256$ and $d_{m}=512$ for compute equivalence.


This review was generated by AI and reviewed by a human editor.