QUICK REVIEW

[Paper Review] EVINCE: Optimizing Multi-LLM Dialogues Using Conditional Statistics and Information Theory

Edward Yi Chang|arXiv (Cornell University)|Aug 26, 2024
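Adversarial Robustness in Machine Learning|Cited by 6
One-Sentence Summary

EVINCE is a multi-LLM dialogue framework that uses conditional statistics and information-theoretic metrics to balance exploration and exploitation, improving diagnostic accuracy and robustness through adversarial debate between LLMs.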

ABSTRACT

EVINCE (Entropy and Variation IN Conditional Exchanges) is a novel framework for optimizing multi-LLM dialogues using conditional statistics and information theory. It addresses limitations in multi-agent debate (MAS) frameworks, where multiple LLMs "chat" without behavior modulation or mutual information quality assessment. Using dual entropy optimization to balance perspective diversity and prior knowledge, EVINCE provides quantitative tools to dynamically regulate LLM linguistic behaviors. When mutual information is low and both cross-entropy and Wasserstein distance are high, EVINCE promotes contentious dialogues to expose diverse perspectives and uncover inconsistencies. Conversely, as cross-entropy decreases and mutual information stabilizes, it transitions discussions into a conciliatory phase, encouraging compromise and acknowledgment of valid points. Using information-theoretic metrics and optimizing mutual information, EVINCE emerges as a structured and highly effective framework for multi-LLM collaboration.
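
The phase switch described in the abstract can be illustrated with a small rule. This is only a sketch under stated assumptions: the `RoundMetrics` container, the `choose_phase` function, and all threshold values are hypothetical and are not taken from the paper, and collapsing everything that is not clearly contentious into "conciliatory" is a simplification.

```python
from dataclasses import dataclass


@dataclass
class RoundMetrics:
    """Metrics measured between the two LLMs' predictions in one debate round."""
    mutual_information: float
    cross_entropy: float
    wasserstein: float


def choose_phase(m: RoundMetrics,
                 mi_low: float = 0.2,
                 ce_high: float = 1.0,
                 wd_high: float = 0.5) -> str:
    """Pick a dialogue phase from the round's metrics.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    # Low MI with high cross-entropy and high Wasserstein distance:
    # the models disagree strongly, so keep the debate contentious.
    if (m.mutual_information < mi_low
            and m.cross_entropy > ce_high
            and m.wasserstein > wd_high):
        return "contentious"
    # Otherwise treat the round as ready for a conciliatory exchange.
    return "conciliatory"


early = RoundMetrics(mutual_information=0.05, cross_entropy=2.0, wasserstein=0.9)
late = RoundMetrics(mutual_information=0.6, cross_entropy=0.4, wasserstein=0.1)
print(choose_phase(early), choose_phase(late))
```

In practice the thresholds would need to be tuned per task, and a gradual contentiousness level may fit the described behavior better than a binary switch.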

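Research Motivation and Goals

  • Advance characteristics of artificial general intelligence: the versatility, adaptability, and reasoning capability of LLMs.
  • Mitigate hallucinations and biases by fostering diverse, structured multi-agent debate.
  • Provide a theoretical and empirical foundation for collaborative LLM interaction that links conditional statistics with information theory.
  • Demonstrate empirical gains in healthcare diagnosis and discuss broader implications for decision-making.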

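Proposed Method

  • Introduce the EVINCE pillars: inclusive exploration, information flow dynamics, and reasoning quality and coherence.
  • Define and use information-theoretic metrics (entropy, mutual information, Jensen-Shannon divergence, cross-entropy, KL divergence, Wasserstein distance) to govern the debate.
  • Present the EVINCE algorithm, which runs a structured debate between two LLMs, starts at a high level of contentiousness, and iterates until convergence criteria based on Wasserstein distance (WD), mutual information (MI), and CRIT are met.
  • Incorporate CRIT to evaluate argument quality, integrating it with the reasoning from the earlier SocraSynth work (the CRIT algorithm).
  • Use a dual-entropy framework to balance exploration (high entropy) with exploitation (low entropy) for robust predictions.
  • State the theoretical Entropy Duality Theorem (EDT) for optimal LLM pairing under entropy conditions.
  • Aggregate final predictions using a weighting scheme based on argument quality and information metrics.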
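
For readers unfamiliar with the metrics named above, the following sketch computes each of them for discrete probability distributions using only the Python standard library. It uses the standard textbook definitions, not the paper's implementation. Two points are assumptions made for illustration: the Wasserstein distance treats the outcomes as points 0, 1, ..., n-1 with unit spacing (the paper's ground metric over diagnoses is not specified here), and mutual information is computed from a hypothetical joint probability table.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) in bits; eps guards against log(0)."""
    return -sum(x * math.log2(max(y, eps)) for x, y in zip(p, q) if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) in bits."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 bit."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(p, q):
    """1-Wasserstein distance, assuming outcomes sit at 0..n-1 (unit spacing).

    On a line this equals the summed absolute difference of the two CDFs.
    """
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for x, y in zip(p, q):
        cdf_p += x
        cdf_q += y
        total += abs(cdf_p - cdf_q)
    return total


def mutual_information(joint):
    """Mutual information I(X; Y) in bits from a joint probability table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


# Toy distributions over three candidate diagnoses (invented values).
llm_a = [0.7, 0.2, 0.1]
llm_b = [0.3, 0.4, 0.3]
print(entropy(llm_a), entropy(llm_b))
print(cross_entropy(llm_a, llm_b), kl_divergence(llm_a, llm_b))
print(js_divergence(llm_a, llm_b), wasserstein_1d(llm_a, llm_b))
```

A debate loop could recompute these values after every round and compare them with convergence thresholds; how the paper combines them with CRIT scores is not reproduced here.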

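Experimental Results

Research Questions

  • RQ1: Do structured adversarial LLM dialogues improve predictive accuracy on diagnostic tasks relative to single-model baselines?
  • RQ2: Can EVINCE's dual-entropy approach balance exploration and exploitation, thereby reducing bias and hallucination in multi-LLM debate?
  • RQ3: How do information-theoretic metrics (WD, MI, entropy, JS divergence) track dialogue progress and convergence?
  • RQ4: Does pairing a high-entropy LLM with a low-entropy LLM produce complementary errors and improve diagnostic accuracy?
  • RQ5: What empirical gains does EVINCE achieve in healthcare diagnosis and bias-detection scenarios?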

Key Findings

  • EVINCE-enabled pairing of GPT-4 with Claude-3 or Gemini-3 yields a 4-5 percentage point increase in diagnostic accuracy over pre-debate baselines.
  • In unconstrained predictions on 304 patient cases, GPT-4 led initial accuracy (82.8%), with EVINCE achieving 87.5% in the GPT-4/Claude-3 pairing.
  • Entropy stabilized, mutual information increased, and Wasserstein distance decreased over debate rounds, indicating convergence and effective information exchange.
  • Confusion-matrix analyses show complementary error patterns between LLMs, supporting the EDT idea of high-entropy vs low-entropy pairing improving robustness.
  • The study uses a 304-instance subset drawn from a Kaggle dataset (4,921 records total before deduplication) spanning 40 diseases, with top-5 predictions (k=5) used in evaluations.
  • EVINCE demonstrates potential for identifying possible misdiagnoses and guiding information-remediation through structured dialogue.
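
To make the evaluation numbers above concrete, here is a minimal sketch of a top-k accuracy computation and the percentage-point comparison. It assumes a prediction counts as correct when the true diagnosis appears among the top k ranked candidates; the paper's exact scoring rule may differ. The function names and the toy cases are hypothetical and invented for illustration.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=5):
    """Fraction of cases whose true label appears in the first k predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


def percentage_point_gain(before_pct, after_pct):
    """Absolute difference between two accuracies given in percent."""
    return after_pct - before_pct


# Toy example with invented cases (not data from the study).
ranked = [
    ["flu", "cold", "allergy"],
    ["migraine", "tension headache", "sinusitis"],
    ["gastritis", "ulcer", "reflux"],
    ["asthma", "bronchitis", "pneumonia"],
]
truth = ["cold", "sinusitis", "hepatitis", "asthma"]
print(top_k_accuracy(ranked, truth, k=3))

# Reported figures: GPT-4 alone at 82.8%, GPT-4/Claude-3 with EVINCE at 87.5%.
print(round(percentage_point_gain(82.8, 87.5), 1))
```

The difference between the two reported accuracies is 4.7 percentage points, which falls within the 4-5 point range stated in the first finding.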

This review was generated by AI and reviewed by human editors.