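[Paper Summary] Learning to Trust the Crowd: A Multi-Model Consensus Reasoning Engine for Large Language Models
This paper treats the outputs of multiple LLMs as input to a supervised meta-learner that learns to aggregate answers. Its graph-based consensus model improves accuracy over both the best single model and majority vote.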
Large language models (LLMs) achieve strong average performance yet remain unreliable at the instance level, with frequent hallucinations, brittle failures, and poorly calibrated confidence. We study reliability through the lens of multi-model consensus: given responses from several heterogeneous LLMs, can we learn which answer is most likely correct for a given query? We introduce a Multi-Model Consensus Reasoning Engine that treats the set of LLM outputs as input to a supervised meta-learner. The system maps natural language responses into structured features using semantic embeddings, pairwise similarity and clustering statistics, lexical and structural cues, reasoning-quality scores, confidence estimates, and model-specific priors, and then applies gradient-boosted trees, listwise ranking, and graph neural networks over similarity graphs of answers. Using three open-weight LLMs evaluated on compact, resource-constrained subsets of GSM8K, ARC-Challenge, HellaSwag, and TruthfulQA, our best graph-attention-based consensus model improves macro-average accuracy by 4.6 percentage points over the strongest single LLM and by 8.1 points over majority vote, while also yielding lower Brier scores and fewer TruthfulQA hallucinations. Ablation and feature-importance analyses show that semantic agreement and clustering features are most influential, with reasoning-quality and model-prior features providing complementary gains, suggesting supervised multi-model consensus is a practical route toward more reliable LLM behavior, even in a modest single-machine setup.
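Research Motivation and Goals
- Improve LLM reliability by exploiting agreement and disagreement across models.
- Propose a supervised meta-learning consensus framework that operates on structured features extracted from the outputs of multiple LLMs.
- Instantiate and evaluate several consensus architectures (independent classifiers, listwise ranking, graph networks) on compact multi-task benchmarks.
- Demonstrate gains in accuracy and calibration, along with fewer hallucinations, using three open-weight LLMs on small datasets.
- Provide ablation analyses to understand feature contributions and the limitations of the consensus approach.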
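Proposed Method
- Treat the set of responses from M base models as input to a meta-model f_theta, which outputs the probability that each model's answer is correct.
- Extract rich per-answer features from embeddings, pairwise similarity, clustering, lexical/structural cues, reasoning-quality scores, confidence estimates, and model priors.
- Build a similarity graph over the answers and apply either a graph-based meta-model (GCN/GAT) or independent/listwise learners to the features.
- Use three open-weight LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct, Qwen2-7B-Instruct) on compact subsets of GSM8K, ARC-Challenge, HellaSwag, and TruthfulQA.
- Parse and normalize the final answer from each free-form output with a simple final-answer extraction protocol, so that correctness can be labeled.
- Train the meta-models with early stopping, standardize continuous features, and evaluate with accuracy, MRR, and Brier score.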
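The agreement and clustering features in the list above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the 0.8 similarity threshold are all assumptions chosen for illustration, and connected components of the thresholded graph stand in for whatever clustering method the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(sim, threshold):
    """Neighbor lists: answers i and j are linked when sim[i][j] >= threshold."""
    n = len(sim)
    return [[j for j in range(n) if j != i and sim[i][j] >= threshold]
            for i in range(n)]


def component_sizes(neighbors):
    """Size of the connected component each answer belongs to (a simple cluster proxy)."""
    component = [-1] * len(neighbors)
    sizes = []
    for start in range(len(neighbors)):
        if component[start] != -1:
            continue
        cid = len(sizes)
        component[start] = cid
        stack, count = [start], 0
        while stack:
            node = stack.pop()
            count += 1
            for nb in neighbors[node]:
                if component[nb] == -1:
                    component[nb] = cid
                    stack.append(nb)
        sizes.append(count)
    return [sizes[c] for c in component]


def agreement_features(embeddings, threshold=0.8):
    """Per-answer agreement features; the threshold is a hypothetical value."""
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    neighbors = similarity_graph(sim, threshold)
    sizes = component_sizes(neighbors)
    features = []
    for i in range(n):
        others = [sim[i][j] for j in range(n) if j != i]
        features.append({
            "mean_similarity": sum(others) / len(others) if others else 0.0,
            "degree": len(neighbors[i]),
            "cluster_fraction": sizes[i] / n,
        })
    return features


# Toy embeddings for M = 3 answers: the first two agree, the third is an outlier.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_features = agreement_features(toy_embeddings)
for row in toy_features:
    print(row)
```

In a full pipeline, rows like these would be concatenated with the other feature families and passed to the meta-model. A graph-based learner would also consume the neighbor lists directly.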

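Experimental Results
Research Questions
- RQ1: Can a supervised meta-learner interpret cross-model outputs to predict the correct answer for a given query?
- RQ2: Which feature families (semantic agreement, clustering, reasoning quality, confidence and model priors) contribute most to consensus performance?
- RQ3: Do graph-based consensus models outperform independent classifiers and ranking models across diverse tasks?
- RQ4: How does consensus affect calibration and the hallucination tendency of the LLM ensemble?
- RQ5: What are the practical limits and failure modes of running multi-model consensus on small, commodity hardware?
Main Findings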
| Method (accuracy, %) | GSM8K | ARC | HellaSwag | TruthfulQA |
|---|---|---|---|---|
| Random model | 49.0 | 32.5 | 63.4 | 35.2 |
| Majority vote | 57.8 | 38.7 | 70.1 | 42.3 |
| Self-consistency | 61.3 | 40.9 | 72.0 | 44.0 |
| Best single model | 62.5 | 41.8 | 73.2 | 45.1 |
| Consensus (logreg) | 65.8 | 44.0 | 74.4 | 47.6 |
| Consensus (GBDT) | 67.1 | 45.2 | 75.1 | 48.9 |
| Consensus (RankNet) | 67.4 | 45.6 | 75.4 | 49.3 |
| Consensus (GAT) | 68.2 | 46.7 | 76.0 | 50.1 |
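The macro-average comparison can be recomputed directly from the table above. The sketch below assumes the macro-average is the unweighted mean of the four per-task accuracies. The dictionary and function names are illustrative. With the rounded table entries, the GAT gain over the best single model comes out at about 4.6 points and the gain over majority vote at about 8.0 points. The latter is slightly below the reported 8.1, presumably because the table values are rounded.

```python
# Per-task accuracies (%) copied from the table above, in the order
# GSM8K, ARC, HellaSwag, TruthfulQA.
ACCURACY = {
    "Random model": [49.0, 32.5, 63.4, 35.2],
    "Majority vote": [57.8, 38.7, 70.1, 42.3],
    "Self-consistency": [61.3, 40.9, 72.0, 44.0],
    "Best single model": [62.5, 41.8, 73.2, 45.1],
    "Consensus (logreg)": [65.8, 44.0, 74.4, 47.6],
    "Consensus (GBDT)": [67.1, 45.2, 75.1, 48.9],
    "Consensus (RankNet)": [67.4, 45.6, 75.4, 49.3],
    "Consensus (GAT)": [68.2, 46.7, 76.0, 50.1],
}


def macro_average(scores):
    """Unweighted mean over tasks (assumed definition of macro-average)."""
    return sum(scores) / len(scores)


def gain(method, baseline):
    """Macro-average accuracy difference in percentage points."""
    return macro_average(ACCURACY[method]) - macro_average(ACCURACY[baseline])


for name, scores in ACCURACY.items():
    print(f"{name}: {macro_average(scores):.3f}")

print("GAT vs best single model:", round(gain("Consensus (GAT)", "Best single model"), 1))
print("GAT vs majority vote:", round(gain("Consensus (GAT)", "Majority vote"), 1))
```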
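- The graph-attention consensus model (GAT) improves macro-average accuracy by 4.6 percentage points over the best single LLM and by 8.1 points over majority vote.
- GAT consistently outperforms all baselines on the compact GSM8K, ARC-Challenge, HellaSwag, and TruthfulQA subsets.
- Ablation analyses show that semantic agreement and clustering features are the most influential, with reasoning-quality and model-prior features providing complementary gains.
- Calibration improves under consensus: Brier scores are lower and TruthfulQA hallucinations are fewer.
- Graph-based methods exploit structured disagreement, amplifying minority-but-correct answers in high-disagreement cases.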
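For readers unfamiliar with the calibration and ranking metrics behind these findings, the sketch below shows the standard definitions of the Brier score and mean reciprocal rank (MRR). How exactly the paper applies them (for example, per answer or per selected answer) is not stated in this summary, so the input layout and the toy numbers here are assumptions.

```python
def brier_score(probabilities, outcomes):
    """Mean squared gap between predicted correctness probability and the 0/1 outcome.

    Lower is better; 0.0 means perfectly confident and perfectly correct.
    """
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)


def mean_reciprocal_rank(score_lists, correct_lists):
    """Average of 1/rank of the first correct answer per query (0 if none is correct).

    Answers within a query are ranked by descending meta-model score.
    """
    total = 0.0
    for scores, correct in zip(score_lists, correct_lists):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order, start=1):
            if correct[idx]:
                total += 1.0 / rank
                break
    return total / len(score_lists)


# Toy example (made-up numbers): three scored answers, one of them correct.
toy_brier = brier_score([0.9, 0.2, 0.6], [1, 0, 0])

# Two queries with three candidate answers each; the correct answer is
# ranked first in query 1 and second in query 2.
toy_mrr = mean_reciprocal_rank(
    [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],
    [[True, False, False], [False, True, False]],
)
print(toy_brier, toy_mrr)
```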

This summary was generated by AI and reviewed by a human editor.