[Paper Review] The Termination Critic
This paper proposes an information-theoretic objective for learning option termination conditions, framing termination quality as the compressibility of the option's encoding rather than as reward-based value optimization. Using a learned option transition model as a 'critic' to compute gradients, the method avoids option collapse and produces non-trivial options that are efficient for planning, outperforming primitive actions and prior methods such as A2OC with deliberation cost.
In this work, we consider the problem of autonomously discovering behavioral abstractions, or options, for reinforcement learning agents. We propose an algorithm that focuses on the termination condition, as opposed to -- as is common -- the policy. The termination condition is usually trained to optimize a control objective: an option ought to terminate if another has better value. We offer a different, information-theoretic perspective, and propose that terminations should focus instead on the compressibility of the option's encoding -- arguably a key reason for using abstractions. To achieve this algorithmically, we leverage the classical options framework, and learn the option transition model as a "critic" for the termination condition. Using this model, we derive gradients that optimize the desired criteria. We show that the resulting options are non-trivial, intuitively meaningful, and useful for learning and planning.
Motivation & Objective
- To address the challenge of autonomously discovering useful behavioral abstractions (options) in reinforcement learning.
- To overcome option collapse in existing methods like Option-Critic, where options degenerate into single-action primitives.
- To shift the focus from reward-based termination objectives to information-theoretic compressibility of option encodings.
- To develop a training objective that encourages terminations to concentrate on a small, meaningful set of states for better planning efficiency.
- To decouple termination learning from reward optimization, enabling isolated study of termination quality.
Proposed method
- Proposes a new termination objective based on the predictability (compressibility) of the option's state trajectory, inspired by minimum description length principles.
- Leverages the classical options framework with a learned option transition model as a 'critic' to estimate the quality of termination conditions.
- Derives a termination gradient theorem that relates changes in the option model to changes in the termination condition, enabling end-to-end gradient-based optimization.
- Uses the derived gradient to train terminations via policy gradient methods, while policies are trained on the standard reward objective.
- Employs an online actor-critic termination-critic (ACTC) algorithm that jointly optimizes terminations and policies using the model-based critic.
- Introduces a loss function based on the entropy of the option model's transition dynamics, minimizing it to encourage predictable, compressible option behavior.
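The entropy-based idea in the last bullet can be illustrated in a small tabular setting. The sketch below is not the authors' ACTC implementation: it assumes a fixed option policy that induces a known state-to-state transition matrix `P`, an undiscounted option model in the style of the classical options framework, and a toy 4-state ring chain. The function names `option_model` and `entropy` are hypothetical. It only shows that concentrating terminations on a single state lowers the entropy of the option's termination-state distribution.

```python
import math

def option_model(P, beta, iters=500):
    """Tabular, undiscounted option model (an assumption of this sketch).

    P_o[x][y] is the probability that the option, started in x, terminates
    in y. It is computed by fixed-point iteration of
        P_o(y|x) = sum_z P(z|x) * (beta(z) * [z == y] + (1 - beta(z)) * P_o(y|z)).
    """
    n = len(P)
    P_o = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        new = [[0.0] * n for _ in range(n)]
        for x in range(n):
            for z in range(n):
                if P[x][z] == 0.0:
                    continue
                # Option terminates on arrival in z.
                new[x][z] += P[x][z] * beta[z]
                # Option continues from z.
                cont = P[x][z] * (1.0 - beta[z])
                for y in range(n):
                    new[x][y] += cont * P_o[z][y]
        P_o = new
    return P_o

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

# Toy chain: random walk on a ring of 4 states (hypothetical example).
P = [
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]

beta_diffuse = [0.5, 0.5, 0.5, 0.5]   # terminates anywhere with prob. 0.5
beta_focused = [0.0, 0.0, 0.0, 1.0]   # terminates only in state 3

P_diffuse = option_model(P, beta_diffuse)
P_focused = option_model(P, beta_focused)

# Entropy of the termination-state distribution when starting in state 0.
H_diffuse = entropy(P_diffuse[0])
H_focused = entropy(P_focused[0])

print("diffuse terminations:", H_diffuse)
print("focused terminations:", H_focused)
```

In this toy example the focused termination scheme yields a lower-entropy (more predictable) option outcome than the diffuse one. The paper, by contrast, learns the option model online and adjusts the terminations by gradient rather than comparing hand-picked termination functions.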
Experimental results
Research questions
- RQ1: Can a termination objective based on compressibility outperform reward-based objectives in preventing option collapse?
- RQ2: Does learning terminations via predictability lead to options that are more useful for planning and learning?
- RQ3: Can a model-based critic effectively guide termination learning without relying on reward shaping or hyperparameter-sensitive trade-offs?
- RQ4: How does the predictability of option trajectories correlate with downstream planning performance?
- RQ5: Can non-trivial, semantically meaningful options be learned without explicit supervision or reward-based termination signals?
Key findings
- The proposed ACTC algorithm successfully prevents option collapse, producing non-trivial options even when policies are trained on the same reward objective.
- Options learned with the compressibility objective achieve faster convergence in value iteration, with average policy value increasing as the predictability objective decreases.
- ACTC outperforms A2OC with deliberation cost in planning performance, matching or exceeding the performance of more deterministic random goal options.
- The information-theoretic termination objective correlates strongly with planning efficiency, suggesting that compressibility is a valid proxy for option quality.
- Using a learned model as a critic enables effective gradient computation for terminations and avoids the hyperparameter sensitivity common in prior methods.
- Qualitative analysis confirms that learned options exhibit intuitive, goal-directed behavior, focusing on a small set of states for termination.
This review was created by AI and reviewed by human editors.