QUICK REVIEW

[Paper Review] Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

Ruicheng Ao, David Simchi-Levi | arXiv (Cornell University) | Jan 28, 2026

Formal Methods in Verification | Cited by 0

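One-Sentence Summary

Summary: The paper introduces two solver-in-the-loop benchmarks (OR-Debug-Bench and OR-Bias-Bench) that evaluate iterative self-correction and behavioral rationality of large language models in operations research. Domain-specific training outperforms frontier APIs, and curriculum learning reduces bias.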

ABSTRACT

Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (IIS), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation (given a problem description, generate solver code), ignoring this diagnostic loop entirely. We introduce two benchmarks that place the solver in the evaluation loop. OR-Debug-Bench evaluates iterative self-correction through 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and IIS recomputation, providing deterministic, verifiable feedback. OR-Bias-Bench evaluates behavioral rationality through 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3% vs 86.2% recovery rate (+9.1%), 62.4% vs 47.8% diagnostic accuracy (+14.6%), and 2.25 vs 3.78 steps to resolution (1.7× faster). On OR-Bias-Bench, curriculum training achieves the only negative ID→OOD bias drift among models evaluated (-9.6%), reducing systematic bias by 48% (from 20.0% to 10.4%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.
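The "closed-form optimal policies" mentioned in the abstract refer, for the classic newsvendor problem, to the critical-fractile rule: order the demand quantile at the critical ratio. The sketch below is only an illustration of that rule under assumed conditions (normally distributed demand, zero salvage value by default). The function names and the `pull_to_center` score are hypothetical and are not the paper's bias metric.

```python
from statistics import NormalDist


def critical_ratio(price: float, cost: float, salvage: float = 0.0) -> float:
    """Critical ratio CR = c_u / (c_u + c_o) for the classic newsvendor."""
    underage = price - cost    # margin lost per unit of unmet demand
    overage = cost - salvage   # loss per unsold unit
    return underage / (underage + overage)


def optimal_order(mean: float, std: float, cr: float) -> float:
    """Closed-form optimum q* = F^{-1}(CR), assuming normal demand."""
    return NormalDist(mean, std).inv_cdf(cr)


def pull_to_center(order: float, q_star: float, mean: float) -> float:
    """Hypothetical bias score: share of the distance from q* to the mean
    demand by which the order falls short of q*.
    0.0 means the order is optimal; 1.0 means it equals the mean demand."""
    if q_star == mean:
        return 0.0
    return (q_star - order) / (q_star - mean)


# Illustrative numbers only (not taken from the benchmark).
cr = critical_ratio(price=10.0, cost=2.5)
q_star = optimal_order(mean=100.0, std=20.0, cr=cr)
bias = pull_to_center(order=105.0, q_star=q_star, mean=100.0)
print(f"CR={cr:.2f}, q*={q_star:.1f}, pull-to-center={bias:.2f}")
```

An order that sits between the mean demand and q* gets a score strictly between 0 and 1, which is the pattern usually described as pull-to-center behavior.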

Research Motivation and Objectives

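Motivate and formalize the need to evaluate LLMs in operations research through iterative solver feedback rather than one-shot solving.
Define two benchmarks (OR-Debug-Bench and OR-Bias-Bench) that use verifiable solver feedback (IIS) and closed-form policies.
Demonstrate training methods (GRPO-based reinforcement learning combined with process rewards and curriculum learning) that improve reasoning, repair accuracy, and bias generalization.
Provide a comprehensive evaluation across 26 models and 12,000+ samples to quantify the gains from domain-specific training and structured evaluation.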

Proposed Method

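Two-stage benchmark framework: in Stage I, OR-Debug-Bench evaluates iterative debugging with Gurobi IIS feedback; in Stage II, OR-Bias-Bench evaluates inventory decisions against closed-form optimal policies.
Saboteur-driven data generation: feasible linear programs are corrupted with controlled infeasibility types and paired with ground-truth repairs; the IIS serves as the verification oracle.
Markov decision process (MDP) formulation for both benchmarks: states, action spaces, and a joint reward that balances outcome, diagnosis, and efficiency.
RLVR training with Group Relative Policy Optimization (GRPO) and LoRA fine-tuning, using a composite reward over outcome, diagnosis, and efficiency; a faithfulness penalty discourages masking the root cause.
Curriculum learning for OR-Bias-Bench to mitigate the "pull-to-center" bias, using staged critical-ratio (CR) distributions to improve OOD generalization.
A process reward model (PRM) provides step-level supervision to improve diagnostic quality without sacrificing outcomes.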
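The composite reward and the group-relative step of GRPO can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Episode` fields, the weights, the penalty size, and the 5-step budget are all assumptions chosen for readability. The advantage computation follows the commonly described GRPO form, in which each sampled response's reward is standardized against the other responses drawn for the same prompt.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Episode:
    """One debugging rollout (hypothetical fields for illustration)."""
    recovered: bool          # the final model is feasible
    diagnosis_correct: bool  # the model named the true faulty constraint
    steps: int               # repair actions used
    masked_root_cause: bool  # e.g. deleted a constraint instead of fixing it


def composite_reward(ep: Episode, max_steps: int = 5, w_outcome: float = 1.0,
                     w_diag: float = 0.5, w_eff: float = 0.2,
                     penalty: float = 0.5) -> float:
    """Weighted sum of outcome, diagnosis, and efficiency terms, minus a
    faithfulness penalty. All weights are assumed values."""
    outcome = 1.0 if ep.recovered else 0.0
    diagnosis = 1.0 if ep.diagnosis_correct else 0.0
    # Efficiency only counts when the episode actually recovered.
    efficiency = max(0.0, 1.0 - (ep.steps - 1) / max_steps) if ep.recovered else 0.0
    reward = w_outcome * outcome + w_diag * diagnosis + w_eff * efficiency
    if ep.masked_root_cause:
        reward -= penalty
    return reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


group = [
    Episode(recovered=True, diagnosis_correct=True, steps=2, masked_root_cause=False),
    Episode(recovered=True, diagnosis_correct=False, steps=4, masked_root_cause=False),
    Episode(recovered=True, diagnosis_correct=True, steps=1, masked_root_cause=True),
    Episode(recovered=False, diagnosis_correct=False, steps=5, masked_root_cause=False),
]
rewards = [composite_reward(ep) for ep in group]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

With these assumed weights, the third rollout reaches feasibility in a single step but still scores below the first one, because the faithfulness penalty outweighs its efficiency bonus.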

Experimental Results

Research Questions

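RQ1: Can LLMs self-correct infeasible OR formulations in an iterative loop by exploiting deterministic IIS feedback?
RQ2: Do domain-specific training and structured process supervision outperform general-purpose frontier APIs on OR debugging tasks?
RQ3: Does curriculum learning reduce downstream bias when models generalize from in-distribution to out-of-distribution inventory problems?
RQ4: How does diagnostic accuracy (DA) relate to the actual optimal repair in IIS-based debugging?
RQ5: What trade-offs exist between reasoning efficiency and generalization in solver-in-the-loop OR problems?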
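Since RQ1 and RQ4 both depend on IIS feedback, a toy example may help show what an IIS is and why it gives deterministic feedback. The sketch below applies the textbook deletion filter to interval constraints on a single variable. The feasibility check and the constraint names are hypothetical stand-ins for a real LP solver such as Gurobi; this is not the benchmark's code.

```python
import math
from typing import Dict, List, Tuple

Bounds = Tuple[float, float]  # a constraint of the form lo <= x <= hi


def is_feasible(constraints: Dict[str, Bounds]) -> bool:
    """Toy oracle: interval constraints on one variable are feasible
    when the largest lower bound does not exceed the smallest upper bound."""
    if not constraints:
        return True
    lo = max(b[0] for b in constraints.values())
    hi = min(b[1] for b in constraints.values())
    return lo <= hi


def deletion_filter(constraints: Dict[str, Bounds]) -> List[str]:
    """Return the names of one irreducible infeasible subsystem (IIS)."""
    if is_feasible(constraints):
        raise ValueError("model is feasible; no IIS exists")
    kept = dict(constraints)
    for name in list(constraints):
        trial = {k: v for k, v in kept.items() if k != name}
        if not is_feasible(trial):
            kept = trial  # still infeasible without it, so drop it for good
    return sorted(kept)


model = {
    "capacity": (-math.inf, 40.0),    # x <= 40
    "min_demand": (50.0, math.inf),   # x >= 50
    "nonneg": (0.0, math.inf),        # x >= 0
    "budget": (-math.inf, 80.0),      # x <= 80
}
iis = deletion_filter(model)
print("IIS:", iis)

# One repair action followed by re-checking, as in a solver-in-the-loop step.
repaired = dict(model, capacity=(-math.inf, 60.0))
print("feasible after repair:", is_feasible(repaired))
```

The filter keeps only the two constraints that actually conflict, which is the kind of signal a model can use to target its repair rather than editing unrelated constraints.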

Key Findings

Model	RR	RR@5	DA	Steps
Qwen3-8B - GRPO	100%	95.3%	62.4%	2.25
Qwen3-8B - Curriculum	100%	94.0%	61.7%	2.22
Qwen3-8B - DAPO	100%	93.8%	60.4%	2.31
Qwen3-8B - SFT	99.8%	93.1%	60.8%	2.34
o4-mini	97.8%	86.2%	47.8%	3.15
claude-sonnet-4	100%	86.2%	50.1%	3.71

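The domain-specific 8B model surpasses frontier APIs in recovery and diagnostic performance: RR@5 = 95.3% vs 86.2%, DA = 62.4% vs 47.8%.
With GRPO, problems are resolved in 2.25 steps versus 3.78 for API models, roughly a 1.7× gain in efficiency.
Curriculum learning is the only approach that yields a negative ID→OOD bias drift (-9.6%), reducing bias from 20.0% to 10.4% and achieving better OOD generalization.
PRM-based step-level supervision raises diagnostic accuracy by 4.7 percentage points (68.0% → 72.7%), at some cost to recovery rate.
Across 26 models and 12,000+ samples, domain-specific training yields larger gains on the harder error types (E–I), while performance on the easier types (A–D) is robust almost across the board.
Inference scaling shows that domain-specific models reach high recovery with fewer tokens, a 1.87× token-efficiency advantage over API models.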
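The headline figures mix absolute differences (percentage points) with relative changes, so a quick arithmetic check is useful. The snippet below recomputes them from the numbers quoted above. One assumption: reading 20.0% and 10.4% as the two endpoints of the reported drift is an inference from the arithmetic, not something the summary states.

```python
# Values quoted in the findings above.
rr_gain = 95.3 - 86.2                        # RR@5 gap, percentage points
da_gain = 62.4 - 47.8                        # DA gap, percentage points
speedup = 3.78 / 2.25                        # ratio of steps to resolution
bias_drift = 10.4 - 20.0                     # bias change, percentage points
relative_reduction = (20.0 - 10.4) / 20.0    # relative bias reduction
prm_gain = 72.7 - 68.0                       # DA gain from PRM, percentage points

print(f"RR@5 gain: {rr_gain:+.1f} pts")
print(f"DA gain: {da_gain:+.1f} pts")
print(f"Step speedup: {speedup:.2f}x")
print(f"Bias drift: {bias_drift:+.1f} pts")
print(f"Relative bias reduction: {relative_reduction:.0%}")
print(f"PRM DA gain: {prm_gain:+.1f} pts")
```

The "+9.1%", "+14.6%", and "-9.6%" figures are therefore differences in percentage points, while "48%" is a relative reduction and "1.7×" is a ratio of step counts.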

This review was generated by AI and checked by a human editor.