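[Paper Review] EVINCE: Optimizing Multi-LLM Dialogues Using Conditional Statistics and Information Theory
EVINCE is a multi-LLM dialogue framework that uses conditional statistics and information-theoretic metrics to balance exploration and exploitation, improving diagnostic accuracy and robustness through adversarial debate between LLMs.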
EVINCE (Entropy and Variation IN Conditional Exchanges) is a novel framework for optimizing multi-LLM dialogues using conditional statistics and information theory. It addresses limitations in multi-agent debate (MAS) frameworks, where multiple LLMs "chat" without behavior modulation or mutual information quality assessment. Using dual entropy optimization to balance perspective diversity and prior knowledge, EVINCE provides quantitative tools to dynamically regulate LLM linguistic behaviors. When mutual information is low and both cross-entropy and Wasserstein distance are high, EVINCE promotes contentious dialogues to expose diverse perspectives and uncover inconsistencies. Conversely, as cross-entropy decreases and mutual information stabilizes, it transitions discussions into a conciliatory phase, encouraging compromise and acknowledgment of valid points. Using information-theoretic metrics and optimizing mutual information, EVINCE emerges as a structured and highly effective framework for multi-LLM collaboration.
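Research Motivation and Objectives
- Advance the traits associated with artificial general intelligence: the versatility, adaptability, and reasoning capability of LLMs.
- Mitigate hallucination and bias by fostering diverse, structured multi-agent debate.
- Provide a theoretical and empirical foundation for collaborative LLM interaction that links conditional statistics with information theory.
- Demonstrate empirical gains in healthcare diagnosis and discuss broader implications for decision-making.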
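Proposed Method
- Introduce the pillars of EVINCE: inclusive exploration, information-flow dynamics, and reasoning quality and coherence.
- Define and use information-theoretic metrics (entropy, mutual information (MI), Jensen-Shannon divergence, cross-entropy, KL divergence, Wasserstein distance (WD)) to govern the debate.
- Propose the EVINCE algorithm, which runs a structured debate between two LLMs, starts with high contentiousness, and iterates until convergence criteria based on WD, MI, and CRIT are met.
- Incorporate CRIT to evaluate argument quality, integrating it with the reasoning evaluation from the earlier SocraSynth work (the CRIT algorithm).
- Use a dual-entropy framework to balance exploration (high entropy) and exploitation (low entropy) for robust predictions.
- Present the Entropy Duality Theorem (EDT), a theoretical result on optimal LLM pairing under entropy conditions.
- Aggregate the final predictions using a weighting scheme based on argument quality and information metrics.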
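To make the metrics listed above concrete, here is a minimal sketch of how they could be computed for two LLMs' probability distributions over the same ordered list of candidate diagnoses. It uses only the standard textbook definitions; the function names are illustrative and not taken from the paper. Two assumptions to note: the Wasserstein distance here treats label positions as points on a line with unit spacing (the paper's ground metric is not given in this summary), and mutual information is computed from a joint probability table that the caller must supply.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) in bits; q is floored at eps to avoid log(0)."""
    return -sum(x * math.log2(max(y, eps)) for x, y in zip(p, q) if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) in bits."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, bounded by 1)."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(p, q):
    """1-Wasserstein distance between p and q on positions 0, 1, 2, ...

    Assumption: unit spacing between adjacent labels. With that ground
    metric the distance is the sum of absolute CDF differences.
    """
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for x, y in zip(p, q):
        cdf_p += x
        cdf_q += y
        total += abs(cdf_p - cdf_q)
    return total


def mutual_information(joint):
    """Mutual information in bits from a joint table joint[i][j] = P(X=i, Y=j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


if __name__ == "__main__":
    # Hypothetical top-3 distributions from two LLMs (made-up numbers).
    llm_a = [0.5, 0.25, 0.25]
    llm_b = [0.25, 0.25, 0.5]
    print("entropy A:", entropy(llm_a))
    print("cross-entropy A,B:", cross_entropy(llm_a, llm_b))
    print("KL(A||B):", kl_divergence(llm_a, llm_b))
    print("JS(A,B):", js_divergence(llm_a, llm_b))
    print("WD(A,B):", wasserstein_1d(llm_a, llm_b))
```

Read against the abstract, a high cross-entropy and Wasserstein distance between the two distributions would correspond to the contentious phase, while falling values would correspond to the conciliatory phase.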
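Experimental Results
Research Questions
- RQ1: Does structured adversarial LLM dialogue improve predictive accuracy on diagnostic tasks relative to single-model baselines?
- RQ2: Can EVINCE's dual-entropy approach balance exploration and exploitation, thereby reducing bias and hallucination in multi-LLM debate?
- RQ3: How do information-theoretic metrics (WD, MI, entropy, JS divergence) track dialogue progress and convergence?
- RQ4: Does pairing a high-entropy LLM with a low-entropy LLM produce complementary errors and improve diagnostic accuracy?
- RQ5: What empirical gains does EVINCE achieve in healthcare diagnosis and bias-detection scenarios?
Key Findings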
- EVINCE-enabled pairing of GPT-4 with Claude-3 or Gemini-3 yields a 4-5 percentage point increase in diagnostic accuracy over pre-debate baselines.
- In unconstrained predictions on 304 patient cases, GPT-4 had the highest initial accuracy (82.8%), and EVINCE reached 87.5% with the GPT-4/Claude-3 pairing, a gain of 4.7 percentage points.
- Entropy stabilized, mutual information increased, and Wasserstein distance decreased over the debate rounds, indicating convergence and effective information exchange.
- Confusion-matrix analyses show complementary error patterns between LLMs, supporting the EDT idea that pairing a high-entropy LLM with a low-entropy LLM improves robustness.
- The study uses a 304-instance subset drawn from a Kaggle dataset (4,921 records total before deduplication) spanning 40 diseases, with top-5 predictions (k=5) used in evaluations.
- EVINCE demonstrates potential for identifying possible misdiagnoses and guiding information-remediation through structured dialogue.
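The convergence behavior described in these findings (falling Wasserstein distance, mutual information leveling off) can be turned into a simple stopping check. The sketch below is only an illustration: the function names, the thresholds `wd_threshold` and `mi_delta`, and the per-round numbers are all hypothetical and are not values from the paper. The CRIT-based criterion is omitted because this summary does not describe how it is scored.

```python
def has_converged(history, wd_threshold=0.2, mi_delta=0.02):
    """Return True if the latest debate round looks converged.

    history: list of dicts, one per round, with keys "wd" (Wasserstein
    distance between the two LLMs' predictions) and "mi" (mutual information).
    Assumed rule: WD is at or below wd_threshold and MI changed by at most
    mi_delta since the previous round. Both thresholds are hypothetical.
    """
    if len(history) < 2:
        return False  # need two rounds to measure a change in MI
    prev, last = history[-2], history[-1]
    wd_small = last["wd"] <= wd_threshold
    mi_stable = abs(last["mi"] - prev["mi"]) <= mi_delta
    return wd_small and mi_stable


def first_converged_round(history, **thresholds):
    """Return the 1-based round at which the check first passes, or None."""
    for n in range(2, len(history) + 1):
        if has_converged(history[:n], **thresholds):
            return n
    return None


if __name__ == "__main__":
    # Synthetic per-round metrics, invented for illustration only.
    rounds = [
        {"wd": 1.20, "mi": 0.10},
        {"wd": 0.70, "mi": 0.30},
        {"wd": 0.30, "mi": 0.42},
        {"wd": 0.15, "mi": 0.43},
    ]
    print("first converged round:", first_converged_round(rounds))
```

A debate controller could call such a check after each round and, once it passes, move from the contentious phase to the final aggregation step.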
This review was generated by AI and reviewed by a human editor.