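[Paper Digest] Program-Based Strategy Induction for Reinforcement Learning
This paper uses Bayesian program induction to discover interpretable, program-structured reinforcement learning strategies in multi-armed bandit tasks, revealing discrete heuristics and horizon-aware exploration.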
Typical models of learning assume incremental estimation of continuously-varying decision variables like expected rewards. However, this class of models fails to capture more idiosyncratic, discrete heuristics and strategies that people and animals appear to exhibit. Despite recent advances in strategy discovery using tools like recurrent networks that generalize the classic models, the resulting strategies are often onerous to interpret, making connections to cognition difficult to establish. We use Bayesian program induction to discover strategies implemented by programs, letting the simplicity of strategies trade off against their effectiveness. Focusing on bandit tasks, we find strategies that are difficult or unexpected with classical incremental learning, like asymmetric learning from rewarded and unrewarded trials, adaptive horizon-dependent random exploration, and discrete state switching.
Motivation and Objectives
- Motivate the need to identify discrete, interpretable strategies humans/animals use in RL beyond continuous incremental learning.
- Propose a Bayesian program induction framework to discover and compare simple, executable strategies.
- Show that the framework yields strategies matching known heuristics (e.g., WSLS, accumulators) and horizon-adaptive exploration.
- Demonstrate that strategy simplicity and effectiveness can explain behavior under resource-rational trade-offs.
Proposed Method
- Formalize strategies as programs built from a primitive operation set (arithmetic, logic, vectors, and task-specific signals).
- Define a memory update function f and a policy function g that together produce actions from memory and history.
- Specify a prior over programs via a grammar and a likelihood based on task value V(π); infer posteriors with MCMC (Metropolis-Hastings).
- Use a two-part evaluation: a generative task model p(h_{t+1}|h_t) over histories and a Bernoulli optimality indicator Ω with log p(Ω=1|π) ∝ β V(π), so that β controls how strongly task value outweighs the simplicity prior.
- Explore a wide strategy space and identify Pareto-optimal strategies in terms of simplicity (prior) and performance (likelihood).
- Implement sampling moves (subtree regeneration, primitive resampling) and run multiple chains across β values to map the trade-off frontier.
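The inference steps above can be sketched on a toy problem. This is a minimal illustration, not the paper's sampler: it swaps the grammar-based proposals (subtree regeneration, primitive resampling) for a uniform proposal over three hand-picked candidates. The names in `CANDIDATES`, their sizes and values, and the log(2)-per-primitive prior are all assumptions made for the example.

```python
import math
import random

# Hypothetical candidates: name -> (program size in primitives, task value V).
# These numbers are made up for illustration; they are not results from the paper.
CANDIDATES = {
    "random": (1, 0.50),
    "wsls": (4, 0.60),
    "accumulator": (9, 0.65),
}


def log_prior(name):
    """Simplicity prior: each extra primitive costs log(2) (an assumed form)."""
    size, _ = CANDIDATES[name]
    return -size * math.log(2.0)


def log_posterior(name, beta):
    """Unnormalized log posterior: log prior + beta * V(pi)."""
    _, value = CANDIDATES[name]
    return log_prior(name) + beta * value


def map_strategy(beta):
    """Candidate with the highest unnormalized posterior at this beta."""
    return max(sorted(CANDIDATES), key=lambda n: log_posterior(n, beta))


def metropolis_hastings(beta, steps, seed=0):
    """Visit counts from Metropolis-Hastings with a symmetric uniform proposal."""
    rng = random.Random(seed)
    names = sorted(CANDIDATES)
    current = names[0]
    counts = {n: 0 for n in names}
    for _ in range(steps):
        proposal = rng.choice(names)
        log_ratio = log_posterior(proposal, beta) - log_posterior(current, beta)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposal
        counts[current] += 1
    return counts


for beta in (0, 30, 100):
    print(beta, map_strategy(beta), metropolis_hastings(beta, steps=5000))
```

In this toy setup the highest-posterior candidate moves from `random` to `wsls` to `accumulator` as β goes from 0 to 30 to 100, which is the simplicity-versus-performance trade-off that running chains across β values is meant to map.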

Experimental Results
Research Questions
- RQ1: What discrete, executable strategies can explain reinforcement-learning behavior in bandit tasks that differ from classical incremental models?
- RQ2: How do simple program-structured strategies perform relative to each other when balancing simplicity and effectiveness?
- RQ3: Can resource-rational trade-offs account for observed phenomena like asymmetric learning, horizon-dependent exploration, and discrete state switching?
- RQ4: What interpretable strategies emerge (e.g., WSLS, accumulators, horizon-adaptive exploration) and under what task conditions?
- RQ5: How does the framework handle non-stationarity and non-Markovian patterns via discrete decision states?
Key Findings
| Primitive | Description |
|---|---|
| Arithmetic, Logic | |
| 0, …, 49 | Integers from 0 to 49 (inclusive) |
| + , * | Addition, multiplication |
| - , 1/(x) | Negation, multiplicative inverse |
| < , == | Less than, equals |
| && , \|\| , ! | And, or, negation |
| if(c,x,y) | Returns x if condition c is true, y otherwise |
| Vectors | |
| vec_full(x) | A vector filled with the value x |
| vec_n(x1, …, xn) | A vector where the first n entries are supplied and others are 0, e.g., vec_2(x,y)=[x,y,0,0] |
| v[i] | Returns ith entry of v |
| assign(v,i,x) | Updated copy of v , with v[i]=x |
| add_assign(v,i,x) | Updated copy of v , with v[i]=v[i]+x |
| Inputs | |
| prev_action | Previous action, a_t |
| reward | Previous reward, r_t |
| state | Memory from previous trial m_t for f or current trial m_{t+1} for g |
| Action probabilities | |
| logit(l) | For two-action tasks, l is the log-odds log[p(a=0)/p(a=1)] |
| softmax(w,v) | Uses unnormalized log probabilities in v , scaled by w |
| action(a) | Takes action a |
| argmax(v) | Takes the action with the maximum value in v; ties go to the earliest index |
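A few of the primitives in the table can be sketched in plain Python to make their semantics concrete. This is an illustrative reading of the table, not the authors' implementation. The vector length `VEC_LEN = 4` is inferred from the vec_2 example, and returning a list of action probabilities from `softmax` and `logit` (rather than sampling an action) is an assumption of this sketch.

```python
import math

VEC_LEN = 4  # assumed from the example vec_2(x, y) = [x, y, 0, 0]


def vec_full(x):
    """A vector filled with the value x."""
    return [x] * VEC_LEN


def vec_n(*xs):
    """A vector whose first entries are supplied and whose remaining entries are 0."""
    return list(xs) + [0] * (VEC_LEN - len(xs))


def assign(v, i, x):
    """Updated copy of v with v[i] = x; the input is not mutated."""
    out = list(v)
    out[i] = x
    return out


def add_assign(v, i, x):
    """Updated copy of v with v[i] = v[i] + x."""
    return assign(v, i, v[i] + x)


def softmax(w, v):
    """Action probabilities from unnormalized log probabilities v, scaled by w."""
    scaled = [w * x for x in v]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(v):
    """Index of the maximum value in v; ties go to the earliest index."""
    return v.index(max(v))


def logit(l):
    """Two-action probabilities [p(a=0), p(a=1)] given l = log(p(a=0)/p(a=1))."""
    p0 = 1.0 / (1.0 + math.exp(-l))
    return [p0, 1.0 - p0]


memory = add_assign(vec_n(1, 2), 1, 3)
print(memory, argmax(memory), softmax(0, [1, 2]), logit(0))
```

Because `assign` and `add_assign` return copies, a memory update f written with them stays a pure function of the previous memory, action, and reward.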
- Identified simple, interpretable strategies such as win-stay lose-shift (WSLS) implemented mainly through policy g.
- Found accumulator-style strategies that progressively integrate rewards to bias choice, yielding high performance under certain horizons.
- Revealed horizon-adaptive random exploration where the inverse temperature of softmax changes with horizon and memory accumulation.
- Discovered discrete decision-state strategies (state machines) that switch between exploration and exploitation, matching WSLS-like and more complex regimes.
- Demonstrated that a bias toward positive information (reward accumulation) can be optimal in limited strategy spaces, aligning with asymmetric learning observations.
- Showed that the framework yields Pareto-frontier strategies balancing prior simplicity and empirical value, offering an interpretable alternative to neural meta-learning.
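Two of the strategy families named above, WSLS and a reward accumulator, can be written as a memory update f and a policy g and scored by simulation. This is a hedged sketch: the function names, the uniformly random first action, and the Monte Carlo estimate of V(π) as mean reward per trial are assumptions for illustration, and the programs the paper actually discovered may differ in form.

```python
import random


def no_update(prev_action, reward, state):
    """f for WSLS: memory is left unchanged."""
    return state


def wsls_policy(prev_action, reward, state):
    """g for WSLS: if(reward == 1, action(prev_action), action(1 - prev_action))."""
    return prev_action if reward == 1 else 1 - prev_action


def accumulate(prev_action, reward, state):
    """f for an accumulator: add_assign(state, prev_action, reward)."""
    new_state = list(state)
    new_state[prev_action] += reward
    return new_state


def greedy_policy(prev_action, reward, state):
    """g for an accumulator: argmax(state), ties going to the earliest index."""
    return state.index(max(state))


def estimate_value(update, policy, arm_probs, horizon, episodes=200, seed=0):
    """Monte Carlo estimate of V(pi): mean reward per trial on a Bernoulli bandit."""
    rng = random.Random(seed)
    total = 0
    for _ in range(episodes):
        state = [0] * len(arm_probs)
        action = rng.randrange(len(arm_probs))  # assumed: uniform first action
        for _ in range(horizon):
            reward = 1 if rng.random() < arm_probs[action] else 0
            total += reward
            state = update(action, reward, state)
            action = policy(action, reward, state)
    return total / (episodes * horizon)


print(estimate_value(no_update, wsls_policy, [0.8, 0.2], horizon=10))
print(estimate_value(accumulate, greedy_policy, [0.8, 0.2], horizon=10))
```

Note that the WSLS sketch needs no memory at all, which matches the observation that it lives mainly in the policy g, while the accumulator only ever adds rewards and so learns nothing from unrewarded trials.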

This digest was generated by AI and reviewed by a human editor.