[Paper Review] [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games
This paper reproduces and extends Abdelnabi et al.'s Scoreable Games negotiation benchmark, analyzes generalizability, fairness, and potential biases, and adds broader model coverage and new evaluation metrics.
Large Language Models (LLMs) demonstrate significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and thoroughness of the ablation study. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal insights that provide additional context to potential users. Our results highlight the importance of context in model-comparative evaluations.
Motivation and Objectives
- Assess the reproducibility and generalizability of the Scoreable Games negotiation benchmark across more models and settings.
- Identify limitations in the original experimental setup, including leakage detection and ablation studies.
- Improve transparency with supplementary scoring metrics and code fixes to enhance evaluation fairness and reliability.
- Evaluate diversity and adjustability of games, and examine behavioral prompts (cooperative, greedy, adversarial) on negotiation outcomes.
Proposed Method
- Formalize the Abdelnabi et al. negotiation games (N=6 players, 5 issues, 24 rounds by default).
- Replicate original experiments on additional models using quantized open-source models and GPT-4o mini/GPToo variants.
- Identify and fix leakage handling bugs and several code issues to ensure fair evaluation across models.
- Extend evaluation with new metrics: Utilitarian Social Welfare (USW), Egalitarian Social Welfare (ESW), and Nash Social Welfare (NSW).
- Conduct ablations, cross-game model comparisons, and extended game adjustability tests to assess robustness and generalizability.
- Provide a new baseline and extra experiments (Experiments 5–8) to probe benchmark claims beyond the original setup.
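To make the game formalization concrete, here is a minimal Python sketch of how a party's score for a deal could be computed. It is not the benchmark's actual code: the `Party` class, the per-party acceptance `threshold`, and the toy issues and point values are all hypothetical and chosen only for illustration. The default setting described above has 6 players and 5 issues; the example uses 2 of each to stay short.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Party:
    """One negotiating party (hypothetical structure, for illustration only)."""
    name: str
    # scores[issue][option] -> points this party receives if that option is chosen
    scores: Dict[str, Dict[str, int]]
    # Assumed: a party accepts a deal only if its total score reaches this value
    threshold: int

    def utility(self, deal: Dict[str, str]) -> int:
        """Sum the party's points over the option chosen for each issue."""
        return sum(self.scores[issue][option] for issue, option in deal.items())

    def accepts(self, deal: Dict[str, str]) -> bool:
        """Check the deal against the party's (assumed) minimum threshold."""
        return self.utility(deal) >= self.threshold


# Toy game with 2 issues ("A", "B"); all numbers are made up.
p1 = Party("p1", {"A": {"a1": 10, "a2": 0}, "B": {"b1": 5, "b2": 20}}, threshold=20)
p2 = Party("p2", {"A": {"a1": 0, "a2": 30}, "B": {"b1": 15, "b2": 10}}, threshold=15)

# A deal picks exactly one option per issue.
deal = {"A": "a1", "B": "b2"}

for party in (p1, p2):
    print(party.name, party.utility(deal), party.accepts(deal))
```

In this toy deal, `p1` clears its threshold while `p2` does not, which is the kind of conflict a multi-round negotiation has to resolve.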
Experimental Results
Research Questions
- RQ1: Does the Scoreable Games benchmark generalize across a wider range of models beyond those tested in the original work?
- RQ2: Are comparisons across models fair and objective given inconsistencies across games and ablation configurations?
- RQ3: How do additional evaluation metrics (USW, ESW, NSW) affect interpretation of negotiation quality and fairness?
- RQ4: To what extent are the provided games diverse and adjustable, and does prompt-induced diversity actually materialize?
- RQ5: How do behavioral prompts (greedy, adversarial, cooperative) impact negotiation outcomes across models?
Key Findings
- The benchmark is complex and model comparison remains ambiguous due to inconsistencies across games and sensitivity to ablation settings.
- The authors identify and fix leakage-related code issues, revealing higher variance in leakage measurements for smaller models.
- Evaluation across a broader set of models shows that game difficulty and model performance vary by model and game, challenging the idea of a single universally fair benchmark.
- Among the new metrics, ESW and NSW corroborate the patterns seen in USW and reveal fairness/efficiency dynamics in negotiations.
- Bias in the original game-construction prompt yields limited true diversity; revised prompts generate broader negotiation contexts, yet diversity in score functions across games remains limited.
- A robust reproducibility baseline is proposed as a more interpretable alternative to the original baseline.
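The three welfare metrics can be illustrated with a short Python sketch. It uses the textbook definitions (sum, minimum, and product of the parties' utilities); the paper may normalize or average them differently, so treat the function bodies as an assumption. The utility values are invented for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def usw(utilities: Sequence[float]) -> float:
    """Utilitarian Social Welfare: total utility across parties (efficiency)."""
    return sum(utilities)


def esw(utilities: Sequence[float]) -> float:
    """Egalitarian Social Welfare: utility of the worst-off party (fairness)."""
    return min(utilities)


def nsw(utilities: Sequence[float]) -> float:
    """Nash Social Welfare: product of utilities (balances both concerns)."""
    return math.prod(utilities)


# Two hypothetical deals with the same total utility but different spread.
lopsided = [90, 10, 50]
balanced = [50, 50, 50]

for name, deal in (("lopsided", lopsided), ("balanced", balanced)):
    print(name, usw(deal), esw(deal), nsw(deal))
```

Both deals have the same USW, but ESW and NSW favor the balanced one, which is why reporting all three gives a fuller picture than total welfare alone.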
This review was generated by AI and checked by a human editor.