QUICK REVIEW

[Paper Digest] [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

Jorge Carrasco Pollo, Ioannis Kapetangeorgis | arXiv (Cornell University) | Feb 20, 2026

Topic Modeling | Cited by: 0

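ONE-SENTENCE SUMMARY

This paper reproduces and extends the Scoreable Games negotiation benchmark proposed by Abdelnabi et al., analyzing its generalizability, fairness, and potential biases, and adding broader model coverage and new evaluation metrics.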

ABSTRACT

Large Language Models (LLMs) demonstrate significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and thoroughness of the ablation study. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal insights that provide additional context to potential users. Our results highlight the importance of context in model-comparative evaluations.

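MOTIVATION AND OBJECTIVES

Evaluate the reproducibility and generalizability of the Scoreable Games negotiation benchmark across more models and settings.
Identify limitations of the original experimental design, including leakage detection and the ablation study.
Improve the fairness, reliability, and transparency of the evaluation through additional scoring metrics and code fixes.
Assess the diversity and tunability of the games, and examine how behavioral prompts (cooperative, greedy, adversarial) affect negotiation outcomes.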

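PROPOSED METHOD

Formalize the negotiation game of Abdelnabi et al. (N=6 players, 5 issues, 24 rounds by default).
Reproduce the original experiments on additional models, using quantized open-source models and GPT-4o mini/GPToo variants.
Identify and fix a bug in leakage handling along with several other code issues, so that all models are evaluated fairly.
Extend the evaluation with new metrics: Utilitarian Social Welfare (USW), Egalitarian Social Welfare (ESW), and Nash Social Welfare (NSW).
Run ablations, cross-game model comparisons, and extended game-tunability tests to assess robustness and generalizability.
Provide new baselines and additional experiments (Experiments 5–8) to test the benchmark's claims beyond the original setup.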
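
The three welfare metrics named above can be illustrated with a minimal sketch. It assumes the textbook definitions (USW as the sum of the parties' utilities, ESW as the minimum, NSW as the product); this digest does not state the exact normalization the paper uses, and NSW is sometimes reported as a geometric mean instead. The function names and the two example deals are hypothetical and are not results from the paper.

```python
import math
from typing import Sequence


def utilitarian_welfare(utilities: Sequence[float]) -> float:
    """USW: total utility across all parties (overall efficiency)."""
    return float(sum(utilities))


def egalitarian_welfare(utilities: Sequence[float]) -> float:
    """ESW: utility of the worst-off party."""
    return float(min(utilities))


def nash_welfare(utilities: Sequence[float]) -> float:
    """NSW: product of utilities; it is zero if any party scores zero."""
    return float(math.prod(utilities))


# Hypothetical final-deal scores for N=6 parties (illustrative numbers only).
deal_a = [60, 60, 60, 60, 60, 60]      # evenly distributed outcome
deal_b = [100, 100, 100, 40, 10, 10]   # same total, unevenly distributed

for name, deal in (("deal_a", deal_a), ("deal_b", deal_b)):
    print(
        name,
        "USW:", utilitarian_welfare(deal),
        "ESW:", egalitarian_welfare(deal),
        "NSW:", nash_welfare(deal),
    )
```

Both hypothetical deals have the same USW, but the uneven deal has a much lower ESW and NSW. This is the kind of efficiency-versus-fairness distinction that a total-score metric alone cannot show.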

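EXPERIMENTAL RESULTS

Research Questions

RQ1: Does the Scoreable Games benchmark generalize to a broader range of models than the original work tested?
RQ2: Given the inconsistencies across games and ablation configurations, are cross-model comparisons fair and objective?
RQ3: How do the added evaluation metrics (USW, ESW, NSW) affect the interpretation of negotiation quality and fairness?
RQ4: Are the provided games diverse and tunable, and is the diversity that the prompts are meant to induce actually achieved?
RQ5: How do behavioral prompts (greedy, adversarial, cooperative) affect negotiation outcomes for different models?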

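Key Findings

The benchmark is fairly complex, yet model comparison remains ambiguous because of inconsistencies across games and sensitivity to ablation settings.
The authors identify and fix leakage-related code issues, revealing that leakage measurements for smaller models have higher variance.
Evaluation on a broader set of models shows that game difficulty and model performance vary by model and by game, challenging the notion of a "single, universally fair benchmark".
Among the new metrics, ESW and NSW show patterns consistent with USW, and together they reveal fairness/efficiency dynamics in the negotiations.
Bias in the original game-construction prompts limited the true diversity of the games; the revised prompts produce a broader range of negotiation scenarios, but the scoring functions across games still show a persistent lack of diversity.
A robust reproducibility baseline is proposed as a more interpretable alternative to the original baseline.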

This digest was generated by AI and reviewed by a human editor.