QUICK REVIEW

[Paper Review] The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

Xiaoyuan Liu, Tian Liang | arXiv (Cornell University) | 2026-02-12

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

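StateLM introduces a learned self-context-engineering loop and a toolkit of memory operations so that the model manages its own context, and it outperforms baselines on long-document QA, chat memory, and deep research tasks.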

ABSTRACT

In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineer its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes demonstrate StateLM's effectiveness across diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs across all model scales; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the performance gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts struggle around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents where reasoning becomes a stateful and manageable process.

Research Motivation and Goals

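Motivate the shift from stateless LLMs to state-aware agents that manage their own memory and context.
Propose a general toolkit of memory and context management tools that lets a model engineer its own context.
Demonstrate cross-domain gains on long-document QA, multi-turn chat memory, and deep research tasks.
Show that learned context management scales with model size and outperforms external, human-driven context engineering.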

Proposed Method

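Introduces StateLM, a family of foundation models equipped with an internal reasoning loop and a Pensieve-style memory toolkit.
Formalizes a tool-augmented agentic reasoning process in which the interaction history is mutable, through deleteContext and a persistent external notebook.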
Defines a "spellbook" of tools covering perception, acquisition, and memory management (analyzeText, buildIndex, searchEngine, readChunk, note/updateNote, readNote, deleteContext, finish).
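Trains StateLM in two stages: supervised fine-tuning (SFT) on expert trajectories with outcome-based and process-based filtering, followed by reinforcement learning with trajectory rollouts and task-aware rewards.
Evaluates 4B, 8B, and 14B models on long-context benchmarks across three domains (long-document QA, chat memory, and deep research).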
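
The mutable-history idea above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `StatefulContext` and `Message` classes, the method signatures, and `run_scripted_episode` are hypothetical names chosen for illustration, and the hard-coded keyword rule stands in for what the paper describes as a trained policy. Only the tool names (note, readNote, deleteContext) come from the review.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    msg_id: int
    role: str       # e.g. "tool_result" in this sketch
    content: str


@dataclass
class StatefulContext:
    """Hypothetical container: a mutable history plus a persistent notebook."""
    history: list = field(default_factory=list)
    notebook: dict = field(default_factory=dict)
    _next_id: int = 0

    def append(self, role, content):
        """Add a message to the working context and return its id."""
        msg = Message(self._next_id, role, content)
        self._next_id += 1
        self.history.append(msg)
        return msg.msg_id

    # Tool names mirror the review; the signatures are assumptions.
    def note(self, key, text):
        self.notebook[key] = text

    def read_note(self, key):
        return self.notebook.get(key, "")

    def delete_context(self, msg_ids):
        """Remove the given messages from the working context."""
        drop = set(msg_ids)
        self.history = [m for m in self.history if m.msg_id not in drop]

    def active_chars(self):
        """Size of the working context, measured in characters."""
        return sum(len(m.content) for m in self.history)


def run_scripted_episode(chunks, keyword):
    """Read chunks one at a time, note the relevant ones, prune the rest.

    A fixed keyword rule replaces the learned policy (an assumption).
    """
    ctx = StatefulContext()
    peak = 0
    for i, chunk in enumerate(chunks):
        rid = ctx.append("tool_result", chunk)   # chunk enters the context
        peak = max(peak, ctx.active_chars())
        if keyword in chunk:
            ctx.note(f"chunk-{i}", chunk)        # persist what matters
        ctx.delete_context([rid])                # prune the raw chunk
    answer = " ".join(ctx.read_note(k) for k in sorted(ctx.notebook))
    return answer, peak, ctx.active_chars()


answer, peak, final = run_scripted_episode(
    ["alpha filler text", "the key is 42", "more filler"], keyword="key"
)
print(answer, peak, final)  # the key is 42 17 0
```

In this toy run the working context never holds more than one chunk at a time, while the notebook keeps the relevant fact outside the history, which is the separation the review attributes to deleteContext and the external notebook.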

Figure 1 : StateLM (right) maintains a “sawtooth” context-use profile, rather than monotonic accumulation (left).
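
The contrast in Figure 1 can be reproduced with a toy calculation. The sketch below is an illustration under stated assumptions, not the paper's measurement: `context_profile` is a hypothetical helper, token counts are made up, and pruning is modeled as replacing raw chunks with a fixed-size note every `prune_every` steps.

```python
def context_profile(chunk_tokens, note_tokens=0, prune_every=None):
    """Return the active-context size recorded right after each read step.

    prune_every=None models append-only accumulation. An integer k models
    dropping all raw chunks every k steps and keeping one note of
    `note_tokens` tokens in their place (both values are assumptions).
    """
    raw = 0       # tokens of raw chunks currently in context
    notes = 0     # tokens of condensed notes kept in context
    profile = []
    for step, tokens in enumerate(chunk_tokens, start=1):
        raw += tokens
        profile.append(raw + notes)
        if prune_every and step % prune_every == 0:
            raw = 0
            notes += note_tokens
    return profile


chunks = [100] * 6
print(context_profile(chunks))
# [100, 200, 300, 400, 500, 600]
print(context_profile(chunks, note_tokens=10, prune_every=2))
# [100, 200, 110, 210, 120, 220]
```

The first profile grows monotonically, while the second rises and falls in a sawtooth whose baseline creeps up only by the size of the retained notes.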

Experimental Results

Research Questions

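RQ1: Can a model autonomously engineer its own context with built-in memory tools to overcome the limits of a fixed context window?
RQ2: How does learned self-context engineering affect performance on long-document QA, multi-turn chat, and deep research tasks?
RQ3: Does a state-aware agent with Pensieve-inspired memory outperform external, scripted context-engineering baselines under a fixed budget?
RQ4: How does StateLM scale with model size and task difficulty in realistic long-context settings?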

Key Results

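StateLM outperforms instruction-tuned baselines on long-document QA while using only about one quarter of the active context.
On the chat memory task, StateLM achieves absolute accuracy gains of 10% to 20% over standard LLMs.
On the BrowseComp-Plus deep research task, StateLM reaches up to 52% accuracy while standard LLMs stay around 5%, an average gain of roughly 40 percentage points or more.
Across benchmarks, StateLM maintains robust performance even at extreme context lengths (for example, up to 2 million tokens in a Needle-in-a-Haystack setting).
Reinforcement learning on top of a well-trained StateLM yields further improvements (for example, +3 points for StateLM-8B-RL on some benchmarks).
Tool-use patterns shift toward more searches and fewer memory updates as tasks scale up, suggesting efficient, task-adaptive context management.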

Figure 2 : The self-context engineering workflow of StateLM. Given a query over a long context, StateLM engages in a multi-round, stateful reasoning loop that analyzes the input, builds an index, and iteratively searches, reads, takes notes, and prunes its working context. Messages highlighted in re

This review was generated by AI and reviewed by a human editor.