QUICK REVIEW

[Paper Review] Survival is the Only Reward: Sustainable Self-Training Through Environment-Mediated Selection

Jennifer Dodgson, Alfath Daryl Alhajir|arXiv (Cornell University)|Jan 18, 2026

Reinforcement Learning in Robotics | Cited by 0

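One-Sentence Summary

The paper presents a proof-of-concept self-training architecture in which learning is driven by environment-mediated survival signals (resource constraints) rather than external rewards, enabling sustainable open-ended self-improvement and negative-space learning.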

ABSTRACT

Self-training systems often degenerate due to the lack of an external criterion for judging data quality, leading to reward hacking and semantic drift. This paper provides a proof-of-concept system architecture for stable self-training under sparse external feedback and bounded memory, and empirically characterises its learning dynamics and failure modes. We introduce a self-training architecture in which learning is mediated exclusively by environmental viability, rather than by reward, objective functions, or externally defined fitness criteria. Candidate behaviours are executed under real resource constraints, and only those whose environmental effects both persist and preserve the possibility of future interaction are propagated. The environment does not provide semantic feedback, dense rewards, or task-specific supervision; selection operates solely through differential survival of behaviours as world-altering events, making proxy optimisation impossible and rendering reward-hacking evolutionarily unstable. Analysis of semantic dynamics shows that improvement arises primarily through the persistence of effective and repeatable strategies under a regime of consolidation and pruning, a paradigm we refer to as negative-space learning (NSL), and that models develop meta-learning strategies (such as deliberate experimental failure in order to elicit informative error messages) without explicit instruction. This work establishes that environment-grounded selection enables sustainable open-ended self-improvement, offering a viable path toward more robust and generalisable autonomous systems without reliance on human-curated data or complex reward shaping.

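Motivation and Objectives

Motivate and formalise the problem of endogenous selection in self-training, with the aim of avoiding reward hacking and semantic drift.
Propose an externally grounded, consequence-based selection mechanism that ties persistence to real-world resource constraints.
Demonstrate a sandboxed environment where candidate behaviours are evaluated by their impact on conserved resources, enabling sustainable self-improvement.
Characterise learning dynamics and failure modes, including the emergence of negative-space learning and meta-learning strategies without explicit instruction.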

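Proposed Method

Introduce a resource-constrained execution environment in which survival is determined by non-volatile memory footprint.
Define a simple agent-environment loop: generate executable code, execute it, observe its effect on the environment, and retain only trajectories with a net positive effect for training.
Use an incremental, recursive fine-tuning pipeline in which LoRA-based adapters are chained across iterations, carrying learning forward while avoiding catastrophic forgetting.
Adopt a modular prompt structure that separates exploration, strategy formation, and execution to improve interpretability and reproducibility.
Analyse the subtractive improvement of negative-space learning by tracking strategy diversity and clustering strategies across generations.
Compare several training regimes (Miri, Terese, Katalin) under memory constraints to study the long-term stability and robustness of learned behaviour.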
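The agent-environment loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dir_size`, `run_candidate`, `select_survivors`, `demo`), the use of a temporary directory as the "environment", bytes freed on disk as the survival signal, and a zero exit code as a crude stand-in for "preserving the possibility of future interaction" are all assumptions made for illustration. A working directory is also not a real security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def dir_size(root: Path) -> int:
    """Total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def run_candidate(code: str, sandbox: Path, timeout: float = 10.0) -> dict:
    """Run one candidate script with the sandbox as its working directory
    and record how much storage it freed (negative if it consumed space)."""
    before = dir_size(sandbox)
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=sandbox,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        ok = proc.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False
    after = dir_size(sandbox)
    return {"code": code, "ok": ok, "bytes_freed": before - after}


def select_survivors(candidates: list[str], sandbox: Path) -> list[dict]:
    """Keep only candidates that exit cleanly and leave a net positive
    effect (assumed here to mean: more free space than before).
    Effects persist in the sandbox from one candidate to the next."""
    kept = []
    for code in candidates:
        result = run_candidate(code, sandbox)
        if result["ok"] and result["bytes_freed"] > 0:
            kept.append(result)
    return kept


def demo() -> list[dict]:
    """Three toy candidates: one frees space, one consumes it, one crashes."""
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp)
        (sandbox / "junk.bin").write_bytes(b"x" * 1000)
        candidates = [
            "import os; os.remove('junk.bin')",
            "open('new.bin', 'wb').write(b'y' * 500)",
            "raise SystemExit(1)",
        ]
        return select_survivors(candidates, sandbox)
```

Calling `demo()` returns only the trajectory that deleted `junk.bin`. No score is assigned by a judge: a candidate is kept or dropped purely on the measured change it left in the directory, which is the property the summary attributes to environment-mediated selection.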

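Experimental Results

Research Questions

RQ1: Can environment-mediated selection prevent reward hacking and support sustainable, open-ended self-improvement without external supervision?
RQ2: How does memory-constrained survival (storage space) as a selection signal affect learning dynamics and long-term strategy stability?
RQ3: What learning dynamics (such as negative-space learning) emerge when the dataset is formed from surviving trajectories rather than explicit task rewards?
RQ4: How do different data selection mechanisms (temporal locality vs. performance-based top-k) affect convergence, stability, and generalisation?
RQ5: Is continued improvement possible without adding data or carefully curating the dataset?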

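Key Findings

Sustainable self-improvement is achievable when selection is based on environment-mediated survival rather than external reward.
The Miri regime (most recent successful trajectories) improves monotonically under strict memory constraints, showing sustained performance gains even as the data continues to grow.
Negative-space learning emerges as a subtractive mechanism: strategies are pruned and consolidated, yielding efficient, repeatable behaviour.
The Katalin regime (top-k by environmental impact) can destabilise learning because it mixes incompatible historical strategies, underscoring the need for consistency with temporally local data to maintain stability.
All three lineages improve on proxy metrics (e.g. space freed, composite improvement score), with different trade-offs in data efficiency, stability, and risk of divergence.
Human-evaluated coding performance remains competitive, indicating that the efficiency gains do not come at the expense of general coding ability.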
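The contrast between the two selection mechanisms named above (temporal locality vs. top-k by impact) can be sketched as follows. This is an illustrative sketch only: the `Trajectory` fields, the `window` and `k` parameters, and the sample values in `demo_pool` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    generation: int   # iteration in which the trajectory was produced
    bytes_freed: int  # assumed proxy for environmental impact
    code: str


def select_recent(pool: list[Trajectory], window: int) -> list[Trajectory]:
    """Temporal locality: keep trajectories from the last `window` generations."""
    if not pool:
        return []
    latest = max(t.generation for t in pool)
    return [t for t in pool if t.generation > latest - window]


def select_top_k(pool: list[Trajectory], k: int) -> list[Trajectory]:
    """Impact-based top-k: keep the k highest-impact trajectories,
    regardless of which generation produced them."""
    return sorted(pool, key=lambda t: t.bytes_freed, reverse=True)[:k]


# Hypothetical pool of successful trajectories across four generations.
demo_pool = [
    Trajectory(0, 900, "strategy_a"),
    Trajectory(1, 100, "strategy_b"),
    Trajectory(2, 300, "strategy_c"),
    Trajectory(3, 200, "strategy_d"),
]

recent = select_recent(demo_pool, window=2)
top = select_top_k(demo_pool, k=2)
```

With this toy pool, `recent` contains only generations 2 and 3, while `top` pairs a generation-0 trajectory with a generation-2 one. The second set mixes behaviour from different stages of the model's history, which is the kind of inconsistency the summary associates with instability under top-k selection.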

Figure 2: Chaining LoRAs to achieve incremental fine tuning without catastrophic forgetting

This review was generated by AI and checked by a human editor.