[Paper Review] Large Language Model Routing with Benchmark Datasets
Summary: The paper learns to route among many open-source LLMs for a new task by converting benchmark data into per-model binary correctness predictors, enabling an efficient model selection strategy that often outperforms single-model baselines and reduces inference costs.
There is a rapidly growing number of open-source Large Language Models (LLMs) and benchmark datasets to compare them. While some models dominate these benchmarks, no single model typically achieves the best accuracy in all tasks and use cases. In this work, we address the challenge of selecting the best LLM out of a collection of models for new tasks. We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a "router" model for this LLM selection, and we show that this problem can be reduced to a collection of binary classification tasks. We demonstrate the utility and limitations of learning model routers from various benchmark datasets, where we consistently improve performance upon using any single model for all tasks.
Research Motivation and Objectives
- Objective 1: Formalize learning LLM strengths for downstream routing as a collection of binary classification problems.
- Objective 2: Propose three routing scores (S1, S2, S3) that account for correctness predictors and out-of-distribution (OOD) dynamics.
- Objective 3: Investigate routing performance on the HELM and MixInstruct benchmarks across diverse tasks and domains.
- Objective 4: Examine the effects of reducing the OOD gap and the potential of using smaller LLMs through routing.
Proposed Method
- Method 1: Represent the correctness of LLM m on input x as y(x,m) in {0,1}, based on a thresholded performance metric.
- Method 2: Train a correctness predictor g_m for each LLM to estimate P(y=1|x) using embedding-based features and kNN classification.
- Method 3: Define routing scores S1, S2, S3 to compute an LLM selection score for a new task d': S1 uses the predicted probability, S2 uses binary decisions, and S3 combines S2 with p(d',m), the accuracy of the correctness predictor on the out-of-distribution task d'.
- Method 4: Model p(d',m) as the probability that the binarized predictor is correct on task d', estimated via a task descriptor u(d) and a non-parametric regressor (Gaussian kernel).
- Method 5: Derive S3 as the expectation S3(m,d') = S2(m,d') p(d',m) + (1 - S2(m,d')) (1 - p(d',m)).
- Method 6: Compare routing strategies against baselines such as the best model on average (BMA) and Oracle, and analyze the efficiency gained by reducing model calls at test time.
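The scoring steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the toy 2-D "embeddings", the 0.5 binarization threshold, the one-dimensional task descriptors, and the Nadaraya-Watson form of the Gaussian-kernel regressor are all assumptions made for the example, and every function name is hypothetical.

```python
import math

def knn_prob(x, train_X, train_y, k=3):
    """Estimate P(y=1 | x) for one LLM as the mean correctness label
    of the k training inputs closest to x in embedding space."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def s1(task_X, train_X, train_y, k=3):
    """S1: mean predicted correctness probability over the new task's inputs."""
    return sum(knn_prob(x, train_X, train_y, k) for x in task_X) / len(task_X)

def s2(task_X, train_X, train_y, k=3, threshold=0.5):
    """S2: fraction of inputs whose binarized prediction is 'correct'.
    The 0.5 threshold is an assumption of this sketch."""
    hits = sum(1 for x in task_X if knn_prob(x, train_X, train_y, k) > threshold)
    return hits / len(task_X)

def kernel_p(u_new, descriptors, predictor_accs, bandwidth=1.0):
    """Estimate p(d', m) by Gaussian-kernel-weighted averaging of the
    predictor's accuracy on benchmark tasks (assumed Nadaraya-Watson form)."""
    weights = [math.exp(-math.dist(u_new, u) ** 2 / (2 * bandwidth ** 2))
               for u in descriptors]
    return sum(w * a for w, a in zip(weights, predictor_accs)) / sum(weights)

def s3(s2_value, p):
    """S3 = S2 * p + (1 - S2) * (1 - p), as given in Method 5."""
    return s2_value * p + (1 - s2_value) * (1 - p)

# Toy benchmark: six inputs with hypothetical embeddings and correctness labels.
train_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
train_labels = {
    "model_a": [1, 1, 1, 0, 0, 0],
    "model_b": [0, 0, 1, 1, 1, 1],
}
# Hypothetical task descriptors u(d) and per-task predictor accuracies.
descriptors = [(0.0,), (1.0,)]
predictor_accs = {"model_a": [0.8, 0.6], "model_b": [0.9, 0.7]}

# A new task whose inputs lie near the first cluster.
new_task_X = [(0.05, 0.05), (0.02, 0.08)]
new_task_u = (0.5,)

scores = {}
for m, labels in train_labels.items():
    p = kernel_p(new_task_u, descriptors, predictor_accs[m])
    scores[m] = s3(s2(new_task_X, train_X, labels), p)

best_model = max(scores, key=scores.get)
print(best_model, scores)
```

In this toy setup the router picks `model_a`, because its predictor marks every new input as likely correct. Note that when p(d',m) is close to 0.5, S3 also moves toward 0.5 regardless of S2, which is how the score discounts predictors that are unreliable on the new task.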

Experimental Results
Research Questions
- RQ1: Do benchmark-derived correctness predictors generalize to unseen tasks well enough to route reliably among multiple LLMs?
- RQ2: Do routing scores S1, S2, and especially S3 improve task-specific model selection over using a single best model or log-likelihood-based scoring?
- RQ3: How does the OOD gap between benchmarks and new tasks affect routing performance, and can small amounts of in-distribution data improve routing?
- RQ4: What is the impact of routing through smaller LLMs on accuracy and cost across benchmarks like HELM and MixInstruct?
Key Findings
| Metric | S1 (Eq. 3) | S2 (Eq. 4) | S3 (Eqs. 7, 8) | S3 with true p | LL | BMA | Oracle | Notes |
|---|---|---|---|---|---|---|---|---|
| Acc. | 0.662 | 0.676 | 0.694 | 0.735 | 0.684 | 0.688 | 0.773 | HELM routing performance across 29 datasets; higher is better |
| Ratio to Best | 0.855 | 0.868 | 0.898 | 0.944 | 0.869 | 0.884 | — | Relative performance vs. best per-task model |
| Pearson | 0.685 | 0.636 | 0.727 | 0.799 | 0.714 | — | — | Linear correlation between predicted scores and actual accuracies |
| Spearman | 0.465 | 0.468 | 0.492 | 0.596 | 0.459 | — | — | Rank correlation with model performance |
| %BMA | 0.17 | 0.10 | 0.48 | 0.22 | 0.10 | — | 0.21 | Fraction of tasks where BMA is chosen by the method |
| # Params | 40.3B | 44.3B | 49.8B | 33.8B | — | 70.0B | — | Model size of the selected LLMs across methods |
| Rank | 6.172 | 5.897 | 5.310 | 3.800 | — | 6.069 | 1.000 | Relative ranking across models |
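To make the table's rows concrete, the sketch below computes Acc., Ratio to Best, Rank, and %BMA for a router's choices on a toy accuracy matrix. The definitions follow one plausible reading of the Notes column (per-task values averaged over tasks); the data, the function names, and the exact averaging scheme are assumptions, not the paper's evaluation code.

```python
def best_model_on_average(acc):
    """Return the model with the highest mean accuracy across all tasks (BMA)."""
    models = list(next(iter(acc.values())))
    return max(models, key=lambda m: sum(acc[t][m] for t in acc) / len(acc))

def routing_metrics(acc, chosen):
    """Summarize a routing decision.

    acc[task][model] is the true accuracy of each model on each task;
    chosen[task] is the model the router selected for that task.
    """
    tasks = list(acc)
    n = len(tasks)
    bma = best_model_on_average(acc)
    # Acc.: mean accuracy of the selected models.
    mean_acc = sum(acc[t][chosen[t]] for t in tasks) / n
    # Ratio to Best: selected accuracy divided by the best per-task accuracy.
    ratio = sum(acc[t][chosen[t]] / max(acc[t].values()) for t in tasks) / n
    # Rank: 1 + number of models strictly better than the selected one.
    rank = sum(
        1 + sum(1 for a in acc[t].values() if a > acc[t][chosen[t]])
        for t in tasks
    ) / n
    # %BMA: fraction of tasks where the router picked the BMA model.
    pct_bma = sum(1 for t in tasks if chosen[t] == bma) / n
    return {"acc": mean_acc, "ratio_to_best": ratio, "rank": rank, "pct_bma": pct_bma}

# Hypothetical accuracies for three models on two tasks.
acc = {
    "t1": {"a": 0.8, "b": 0.6, "c": 0.5},
    "t2": {"a": 0.4, "b": 0.7, "c": 0.5},
}
chosen = {"t1": "a", "t2": "c"}

metrics = routing_metrics(acc, chosen)
print(best_model_on_average(acc), metrics)
```

Under these definitions an oracle router has Ratio to Best and Rank both equal to 1, which matches the Oracle column's Rank of 1.000 in the table.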
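- Key finding 1: Using correctness predictors, S3 with an estimated p(d',m) outperforms the best single model on average (BMA) and often selects smaller, more cost-efficient models.
- Key finding 2: S3 with the true p attains the highest accuracy among the routing scores, showing the potential of the routing approach when the predictors' true accuracy is available.
- Key finding 3: On HELM (29 datasets, 18 models), S3 reaches Acc. 0.694, Ratio to Best 0.898, and Pearson 0.727, exceeding S1 (Acc. 0.662) and S2 (Acc. 0.676).
- Key finding 4: S3 with the true p raises Acc. to 0.735, Ratio to Best to 0.944, and Pearson to 0.799, exceeding BMA (Acc. 0.688) while remaining below Oracle (Acc. 0.773).
- Key finding 5: Log-likelihood (LL) scoring is also competitive (Acc. 0.684), but it requires generating with every model, which increases cost.
- Key finding 6: On MixInstruct, the method is competitive at per-instance model selection and substantially reduces the number of model calls.
- Key finding 7: Reducing the OOD gap with a small amount of in-distribution data improves routing performance for all scores, and with enough in-distribution data S1 can even outperform S3.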

This review was generated by AI and checked by a human editor.