QUICK REVIEW

[Paper Review] EVINCE: Optimizing Multi-LLM Dialogues Using Conditional Statistics and Information Theory

Edward Yi Chang|arXiv (Cornell University)|Aug 26, 2024

Adversarial Robustness in Machine Learning | Cited by 6

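One-Sentence Summary

EVINCE is a multi-LLM dialogue framework that uses conditional statistics and information-theoretic metrics to balance exploration and exploitation, improving diagnostic accuracy and robustness through adversarial debate between LLMs.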

ABSTRACT

EVINCE (Entropy and Variation IN Conditional Exchanges) is a novel framework for optimizing multi-LLM dialogues using conditional statistics and information theory. It addresses limitations in multi-agent debate (MAS) frameworks, where multiple LLMs "chat" without behavior modulation or mutual information quality assessment. Using dual entropy optimization to balance perspective diversity and prior knowledge, EVINCE provides quantitative tools to dynamically regulate LLM linguistic behaviors. When mutual information is low and both cross-entropy and Wasserstein distance are high, EVINCE promotes contentious dialogues to expose diverse perspectives and uncover inconsistencies. Conversely, as cross-entropy decreases and mutual information stabilizes, it transitions discussions into a conciliatory phase, encouraging compromise and acknowledgment of valid points. Using information-theoretic metrics and optimizing mutual information, EVINCE emerges as a structured and highly effective framework for multi-LLM collaboration.
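
The regulation rule described in the abstract can be sketched as a small decision function. This is only an illustration: the names `RoundMetrics` and `next_phase`, and every threshold value, are assumptions made here and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RoundMetrics:
    """Metrics measured after one debate round (hypothetical container)."""
    mutual_information: float
    cross_entropy: float
    wasserstein: float


def next_phase(prev, curr, current_phase,
               mi_low=0.1, ce_high=1.0, wd_high=0.5, mi_tol=0.02):
    """Pick the dialogue phase for the next round.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    # Low shared information plus large disagreement: keep the debate contentious.
    if (curr.mutual_information < mi_low
            and curr.cross_entropy > ce_high
            and curr.wasserstein > wd_high):
        return "contentious"
    # Cross-entropy is falling and mutual information has stopped moving:
    # switch to the conciliatory phase.
    ce_decreasing = curr.cross_entropy < prev.cross_entropy
    mi_stable = abs(curr.mutual_information - prev.mutual_information) <= mi_tol
    if ce_decreasing and mi_stable:
        return "conciliatory"
    # Otherwise leave the phase unchanged.
    return current_phase


r1 = RoundMetrics(0.05, 1.8, 0.9)
r2 = RoundMetrics(0.30, 1.2, 0.6)
r3 = RoundMetrics(0.31, 0.7, 0.2)
print(next_phase(r1, r2, "contentious"))
print(next_phase(r2, r3, "contentious"))
```

In this toy trace the debate stays contentious while mutual information is still rising and turns conciliatory once it levels off and cross-entropy keeps falling.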

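Research Motivation and Objectives

Advance characteristics associated with artificial general intelligence in LLMs: versatility, adaptability, and reasoning ability.
Mitigate hallucination and bias by fostering diverse, structured multi-agent debate.
Provide a theoretical and empirical foundation for collaborative LLM interaction that links conditional statistics with information theory.
Demonstrate empirical gains in healthcare diagnosis and discuss broader implications for decision-making.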

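Proposed Method

Introduces the three EVINCE pillars: inclusive exploration, information flow dynamics, and reasoning quality and coherence.
Defines and uses information-theoretic metrics (entropy, mutual information, Jensen-Shannon divergence, cross-entropy, KL divergence, Wasserstein distance) to govern the debate.
Proposes the EVINCE algorithm, which runs a structured debate between two LLMs, starts at a high level of contentiousness, and iterates until convergence criteria based on WD, MI, and CRIT are met.
Incorporates CRIT to evaluate argument quality, integrating it with the reasoning of the earlier SocraSynth work (the CRIT algorithm).
Uses a dual-entropy framework to balance exploration (high entropy) against exploitation (low entropy) for robust predictions.
Presents the Entropy Duality Theorem (EDT), a theoretical result on optimal LLM pairing under entropy conditions.
Aggregates the final prediction using a weighting scheme based on argument quality and information metrics.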
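
Several of the metrics listed above can be computed directly from two LLMs' probability distributions over the same candidate diagnoses. The following is a minimal standard-library sketch, not the paper's implementation. Assumptions made here: both distributions are lists over an identical, ordered label set; the Wasserstein distance treats labels as unit-spaced points on a line; and `aggregate` uses hypothetical quality scores as weights, since this summary does not give the paper's weighting formula. Mutual information is left out because it requires a joint distribution, and the summary does not say how that is obtained.

```python
import math

EPS = 1e-12  # guards against log(0) when q assigns zero mass


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits."""
    return sum(x * math.log2(x / max(y, EPS)) for x, y in zip(p, q) if x > 0)


def cross_entropy(p, q):
    """H(p, q) in bits; equals entropy(p) + kl_divergence(p, q)."""
    return -sum(x * math.log2(max(y, EPS)) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, bounded by 1)."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(p, q):
    """1-D Wasserstein distance, assuming unit spacing between labels."""
    total, cdf_gap = 0.0, 0.0
    for x, y in zip(p, q):
        cdf_gap += x - y       # running difference of the two CDFs
        total += abs(cdf_gap)
    return total


def aggregate(p, q, weight_p, weight_q):
    """Weighted mixture of two distributions, renormalized to sum to 1.

    weight_p and weight_q stand in for argument-quality scores (hypothetical).
    """
    mixed = [weight_p * x + weight_q * y for x, y in zip(p, q)]
    total = sum(mixed)
    return [v / total for v in mixed]


llm_a = [0.7, 0.2, 0.1]   # toy distribution, not data from the paper
llm_b = [0.1, 0.3, 0.6]   # toy distribution, not data from the paper
print(js_divergence(llm_a, llm_b), wasserstein_1d(llm_a, llm_b))
print(aggregate(llm_a, llm_b, weight_p=3.0, weight_q=1.0))
```

Recomputing these values after each debate round gives the kind of trajectory a convergence check could monitor, for example stopping once the Wasserstein distance falls below a chosen tolerance.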

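Experimental Results

Research Questions

RQ1: Do structured adversarial LLM dialogues improve prediction accuracy on diagnostic tasks relative to single-model baselines?
RQ2: Can EVINCE's dual-entropy approach balance exploration and exploitation, thereby reducing bias and hallucination in multi-LLM debate?
RQ3: How do information-theoretic metrics (WD, MI, entropy, JS divergence) track dialogue progress and convergence?
RQ4: Does pairing a high-entropy LLM with a low-entropy LLM produce complementary errors and improve diagnostic accuracy?
RQ5: What empirical gains does EVINCE achieve in healthcare diagnosis and bias-detection scenarios?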

Key Findings

EVINCE-enabled pairing of GPT-4 with Claude-3 or Gemini-3 yields a 4-5 percentage point increase in diagnostic accuracy over pre-debate baselines.
In unconstrained predictions on 304 patient cases, GPT-4 led initial accuracy (82.8%), with EVINCE achieving 87.5% in the GPT-4/Claude-3 pairing.
Over the debate rounds, entropy stabilized, mutual information increased, and Wasserstein distance decreased, indicating convergence and information exchange.
Confusion-matrix analyses show complementary error patterns between LLMs, supporting the EDT idea that pairing a high-entropy LLM with a low-entropy one improves robustness.
The study uses a 304-instance subset drawn from a Kaggle dataset (4,921 records total before deduplication) spanning 40 diseases, with top-5 predictions (k=5) used in evaluations.
EVINCE demonstrates potential for identifying possible misdiagnoses and guiding information-remediation through structured dialogue.
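
Since the evaluations use top-5 predictions, one common way to score ranked outputs is top-k accuracy: a case counts as correct when the true label appears among the first k predictions. The sketch below illustrates that metric only; whether the paper scores in exactly this way is not stated in this summary, and the function name and toy labels are made up for illustration.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=5):
    """Fraction of cases whose true label is among the first k predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one case is required")
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy cases, not data from the paper.
toy_predictions = [
    ["flu", "cold", "covid"],
    ["migraine", "tension headache"],
    ["gerd", "ulcer"],
]
toy_truth = ["cold", "stroke", "gerd"]
print(top_k_accuracy(toy_predictions, toy_truth, k=5))
print(top_k_accuracy(toy_predictions, toy_truth, k=1))
```

Lowering k makes the metric stricter, which is why the same toy predictions score lower at k=1 than at k=5.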

This review was generated by AI and reviewed by a human editor.