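[Paper Digest] Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale
JiSi is a training-free framework for open-source LLM collaboration. It combines query-response routing, support-set-based aggregator selection, and an adaptive routing-aggregation switch. By orchestrating ten open-source LLMs, it outperforms Gemini-3-Pro on average across nine benchmarks at roughly 47% of the cost.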
Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs' collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks; (3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs' collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro with only 47% costs by orchestrating ten open-source LLMs, while outperforming mainstream baselines. It suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).
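Motivation and goals
- Position collective intelligence as an alternative path toward AGI-like capability, challenging the limits of monolithic scaling.
- Identify the bottlenecks of existing routing and aggregation methods when they are scaled to a large pool of open-source LLMs.
- Propose JiSi, a minimal framework that uses deep semantics, task difficulty, and domain knowledge to improve routing and aggregation.
- Show that coordinating ten open-source LLMs with JiSi can surpass closed-source models and baselines while reducing cost.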
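Proposed method
- Query-Response Mixed Routing: the first of three core innovations. It captures deep semantics and task difficulty from LLM-generated responses and their token costs, rather than from query text alone.
- Support-Set-based Aggregator Selection: uses a larger embedding support set to choose the aggregator dynamically, weighing domain-specific capability against general capability.
- Adaptive Routing-Aggregation Switch: switches between routing and aggregation based on refined prior scores and response quality, in order to suppress noise.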
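The three components above can be illustrated with a minimal sketch. It is not the authors' implementation: the `BankEntry` layout, the mixing weights `alpha` and `beta`, the `margin` threshold, and all function names are assumptions made for illustration, and the token-cost signal is omitted for brevity. The idea shown is only that priors come from neighbours found by query and response similarity together, that an aggregator is picked by a combined skill score, and that a single model is used when one prior clearly dominates.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class BankEntry:
    """One record in a hypothetical embedding bank (layout is an assumption)."""
    query_emb: List[float]     # embedding of a past query
    response_emb: List[float]  # embedding of a draft response to that query
    scores: Dict[str, float]   # per-model quality on that query (1.0 = solved)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def prior_scores(q_emb, r_emb, bank, k=2, alpha=0.5) -> Dict[str, float]:
    """Average each model's score over the k entries closest in mixed similarity."""
    def mixed(entry: BankEntry) -> float:
        # alpha weights query similarity, (1 - alpha) weights response similarity.
        return (alpha * cosine(q_emb, entry.query_emb)
                + (1 - alpha) * cosine(r_emb, entry.response_emb))

    neighbours = sorted(bank, key=mixed, reverse=True)[:k]
    return {m: sum(e.scores[m] for e in neighbours) / len(neighbours)
            for m in neighbours[0].scores}


def select_aggregator(domain_skill, aggregation_skill, beta=0.5) -> str:
    """Pick the model with the best blend of domain and aggregation skill."""
    return max(domain_skill,
               key=lambda m: beta * domain_skill[m] + (1 - beta) * aggregation_skill[m])


def route_or_aggregate(priors, margin=0.2, top_n=3) -> Tuple[str, List[str]]:
    """Route to one model if its prior leads by `margin`, else aggregate the top_n."""
    ranked = sorted(priors, key=priors.get, reverse=True)
    if len(ranked) == 1 or priors[ranked[0]] - priors[ranked[1]] >= margin:
        return ("route", ranked[:1])
    return ("aggregate", ranked[:top_n])


# Toy bank with 2-D embeddings and two hypothetical models, "A" and "B".
bank = [
    BankEntry([1.0, 0.0], [1.0, 0.0], {"A": 1.0, "B": 0.0}),
    BankEntry([0.0, 1.0], [0.0, 1.0], {"A": 0.0, "B": 1.0}),
    BankEntry([1.0, 0.0], [0.0, 1.0], {"A": 1.0, "B": 1.0}),
]
priors = prior_scores([1.0, 0.0], [1.0, 0.0], bank, k=2)
decision = route_or_aggregate(priors, margin=0.2)
```

In this toy setup the third entry shares the query embedding with the first but differs in its response embedding, so a query-only router could not tell them apart while the mixed similarity ranks them differently. The margin rule is one simple way to realise a switch; the paper describes its switch in terms of refined prior scores and response quality, which this sketch does not reproduce.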
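Experimental results
Research questions
- RQ1: Can open-source LLMs coordinated by JiSi surpass leading closed-source LLMs such as Gemini-3-Pro across diverse benchmarks?
- RQ2: Do routing, aggregation, and their combination benefit from adaptive, task-aware mechanisms rather than static one-shot strategies?
- RQ3: Is a training-free approach, built on an embedding bank and query-response signals, sufficient to scale across many open-source models while reducing cost?
- RQ4: How does each proposed component affect accuracy, efficiency, and scalability across a broad range of tasks?
- RQ5: What does JiSi's cost efficiency imply relative to proprietary LLMs?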
Main findings
| Model | AIME | Arena-Hard | GPQA | HLE | LiveCodeBench | LiveMathBench | MMLU-Pro | SimpleQA | SWE-bench | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-0528 | 72.22 | 64.89 | 78.33 | 16.67 | 76.03 | 72.97 | 84.67 | 28.66 | 25.33 | 57.75 |
| DeepSeek-V3-0324 | 38.89 | 59.56 | 68.33 | 3.70 | 61.51 | 59.46 | 78.44 | 26.43 | 24.00 | 46.70 |
| DeepSeek-V3.1-Terminus | 55.56 | 64.67 | 78.33 | 8.64 | 64.67 | 67.57 | 84.56 | 25.12 | 26.00 | 52.79 |
| GLM-4.6 | 88.89 | 69.56 | 80.00 | 14.20 | 58.99 | 64.86 | 80.89 | 25.89 | 22.67 | 56.22 |
| Intern-S1 | 38.89 | 68.00 | 70.00 | 9.72 | 46.69 | 59.46 | 83.00 | 14.33 | 8.00 | 44.23 |
| Kimi-K2-0905 | 72.22 | 72.22 | 71.67 | 5.09 | 62.15 | 75.68 | 80.78 | 30.66 | 24.00 | 54.94 |
| DeepSeek-V3.2-Thinking | 88.89 | 62.44 | 88.33 | 24.69 | 83.91 | 78.38 | 87.33 | 27.81 | 24.67 | 62.94 |
| DeepSeek-V3.2-Speciale | 94.44 | 55.33 | 83.33 | 27.16 | 86.75 | 75.68 | 87.44 | 39.52 | 40.67 | 65.59 |
| Qwen3-235B-A22B-2507 | 77.78 | 75.33 | 55.00 | 9.41 | 58.36 | 72.97 | 83.78 | 54.01 | 16.67 | 55.92 |
| Qwen3-235B-A22B-Thinking-2507 | 72.22 | 77.78 | 80.00 | 7.56 | 75.71 | 48.65 | 80.56 | 49.31 | 20.00 | 56.87 |
| Claude-Sonnet-4 | 41.11 | 55.47 | 71.33 | 4.60 | 56.85 | 62.16 | 83.58 | 15.58 | 35.33 | 47.33 |
| Claude-Sonnet-4.5 | 27.78 | 64.00 | 71.67 | 7.56 | 60.57 | 59.46 | 86.33 | 16.18 | 34.00 | 47.51 |
| Grok-4 | 88.89 | 56.89 | 88.33 | 24.42 | 81.03 | 75.68 | 86.56 | 48.38 | 27.33 | 64.17 |
| GPT-5 | 83.33 | 67.11 | 88.33 | 25.77 | 84.54 | 78.38 | 87.22 | 48.00 | 16.00 | 64.30 |
| GPT-5.2-Thinking | 83.33 | 85.78 | 93.33 | 29.94 | 90.50 | 78.38 | 86.67 | 35.21 | 12.67 | 66.20 |
| Gemini-3-Pro | 94.44 | 74.55 | 91.67 | 33.02 | 89.59 | 78.38 | 89.33 | 70.03 | 18.00 | 71.00 |
| JiSi w/o Adaptive Aggregation | 94.44 | 86.44 | 85.00 | 30.09 | 89.27 | 78.38 | 87.44 | 51.46 | 37.33 | 71.09 |
| JiSi (ours) | 94.44 | 88.44 | 86.67 | 27.62 | 89.27 | 81.08 | 86.78 | 53.70 | 41.33 | 72.15 |
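
The Average column can be checked directly from the per-benchmark scores. The short script below recomputes it for three rows copied from the table and also counts, benchmark by benchmark, where JiSi is ahead of Gemini-3-Pro. The names `ROWS`, `averages`, and `head_to_head` are illustrative only.

```python
from statistics import mean

BENCHMARKS = ["AIME", "Arena-Hard", "GPQA", "HLE", "LiveCodeBench",
              "LiveMathBench", "MMLU-Pro", "SimpleQA", "SWE-bench"]

# Scores copied from the table above, in BENCHMARKS order.
ROWS = {
    "Gemini-3-Pro": [94.44, 74.55, 91.67, 33.02, 89.59, 78.38, 89.33, 70.03, 18.00],
    "JiSi w/o Adaptive Aggregation": [94.44, 86.44, 85.00, 30.09, 89.27, 78.38, 87.44, 51.46, 37.33],
    "JiSi": [94.44, 88.44, 86.67, 27.62, 89.27, 81.08, 86.78, 53.70, 41.33],
}

# Unweighted mean over the nine benchmarks, rounded as in the table.
averages = {name: round(mean(scores), 2) for name, scores in ROWS.items()}


def head_to_head(a: str, b: str):
    """Return the benchmarks where model a wins, ties, or loses against model b."""
    pairs = list(zip(BENCHMARKS, ROWS[a], ROWS[b]))
    wins = [name for name, x, y in pairs if x > y]
    ties = [name for name, x, y in pairs if x == y]
    losses = [name for name, x, y in pairs if x < y]
    return wins, ties, losses


wins, ties, losses = head_to_head("JiSi", "Gemini-3-Pro")
```

The recomputed averages match the table (71.00, 71.09, and 72.15). The head-to-head split is 3 wins (Arena-Hard, LiveMathBench, SWE-bench), 1 tie (AIME), and 5 losses, so JiSi's lead in the average comes mainly from large margins on Arena-Hard and SWE-bench rather than from winning most benchmarks.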
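- JiSi surpasses Gemini-3-Pro in average performance across the nine benchmarks (72.15 vs. 71.00) while cutting cost by 53.23%.
- JiSi outperforms every open-source LLM, router baseline, and multi-agent baseline in the reported results.
- The router-only variant already beats other routers, and adding dynamic aggregator selection improves it further (+1.41% from aggregation, +1.06% from adaptive aggregation).
- Through aggregation, JiSi can exceed the theoretical "best LLM" bound by 1.6%, which illustrates the potential of collective intelligence.
- The cost table shows JiSi achieving competitive or better performance at markedly lower cost across benchmarks (for example, JiSi versus Grok-4, GPT-5, and Gemini-3-Pro).
- JiSi's performance improves steadily as new open-source LLMs are added, which suggests it scales well in an evolving model ecosystem.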
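The cost and gain figures quoted in this digest are related by simple arithmetic, sketched below. The helper names are hypothetical, and the gains are assumed to be absolute differences in percentage points on the nine-benchmark average.

```python
def cost_fraction(savings_pct: float) -> float:
    """Remaining cost as a percentage of the reference, given percent savings."""
    return round(100.0 - savings_pct, 2)


def gain(score_a: float, score_b: float) -> float:
    """Absolute difference between two average scores, in percentage points."""
    return round(score_a - score_b, 2)


# A 53.23% saving leaves 46.77% of Gemini-3-Pro's cost, i.e. about 47%.
remaining = cost_fraction(53.23)

# Full JiSi vs. the variant without adaptive aggregation (averages from the table).
adaptive_gain = gain(72.15, 71.09)

# Full JiSi vs. Gemini-3-Pro (averages from the table).
lead_over_gemini = gain(72.15, 71.00)
```

This reconciles the "47% of the cost" figure in the abstract with the 53.23% saving reported in the findings, and confirms that the +1.06 attributed to adaptive aggregation equals the gap between the last two table rows.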
This digest was generated by AI and reviewed by human editors.