[Paper Review] ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs
ReLE introduces a scalable evaluation system and a domain×capability benchmark to diagnose capability anisotropy in Chinese LLMs, achieving significant cost reductions while revealing high ranking instability across dimensions.
Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains challenged by benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain $\times$ Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) A Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; (2) A Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which reduces compute costs by 70\% compared to full-pass evaluations while maintaining a ranking correlation of $ρ=0.96$. Our analysis reveals that aggregate rankings are highly sensitive to weighting schemes: models exhibit a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus $\sim$5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior. We position ReLE not as a replacement for comprehensive static benchmarks, but as a high-frequency diagnostic monitor for the evolving model landscape.
Motivation & Objective
- Diagnose capability anisotropy in Chinese LLMs by decomposing performance across orthogonal domain and capability dimensions.
- Develop a scalable, cost-efficient evaluation pipeline suitable for 300+ models in industrial settings.
- Provide a structured benchmark with fresh data to mitigate saturation and reveal orthogonal capability trade-offs.
- Quantify ranking stability and provide diagnostic metrics to inform model selection beyond aggregate scores.
Proposed method
- Implement a Unified Prompt Schema to standardize inputs across 12 task types and 7 domains with a model-specific adapter layer.
- Use a three-tier Hybrid Verification scoring pipeline to balance precision and scalability with bias mitigation.
- Adopt Stratified Sequential Variance-Reduction Sampling (based on Neyman allocation) to reduce evaluation costs while controlling error via Hoeffding-Serfling bounds.
- Construct a Domain × Capability matrix with 22 dimensions and 317 sub-tasks to decouple knowledge domains from cognitive capabilities.
- Define and compute metrics such as Rank Stability Amplitude (RSA), Capability Inconsistency (CI), and Anisotropy Index to diagnose instability and anisotropy.
- Evaluate 304 models (189 commercial, 115 open-source) on 207,843 samples using dynamic, cost-aware scheduling and fresh data.
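The Neyman allocation step named above can be illustrated with a minimal sketch. This is the textbook form only, in which each stratum receives a share of the sample budget proportional to its size times its score standard deviation. It does not reproduce the paper's noise correction, sequential updates, or Hoeffding-Serfling stopping rule, and the function name, arguments, and fallback behaviour are assumptions made for illustration.

```python
def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Split `budget` samples across strata in proportion to N_h * S_h.

    stratum_sizes: number of available items N_h in each stratum
    stratum_stds:  estimated score standard deviation S_h in each stratum
    budget:        total number of samples to draw

    Hypothetical helper: textbook Neyman allocation, not the ReLE scheduler.
    """
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    total = sum(weights)
    if total == 0:
        # No estimated variance anywhere: fall back to proportional allocation.
        weights = list(stratum_sizes)
        total = sum(weights)
    raw = [budget * w / total for w in weights]
    # Round to whole samples and never request more items than a stratum holds.
    return [min(size, round(r)) for size, r in zip(stratum_sizes, raw)]


if __name__ == "__main__":
    # Two equally sized strata; the second is twice as noisy,
    # so it receives twice as many samples.
    print(neyman_allocation([1000, 1000], [0.25, 0.5], 300))
```

Because of rounding and the per-stratum cap, the allocations may not sum exactly to the budget. A production scheduler would presumably redistribute the remainder.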
Experimental results
Research questions
- RQ1: How does model capability anisotropy manifest across diverse Chinese LLMs when evaluated on an orthogonal Domain × Capability matrix?
- RQ2: What are the cost and accuracy trade-offs of a variance-aware dynamic sampling strategy for large-scale LLM evaluation?
- RQ3: Can structured, domain-capability decomposition reveal instability in rankings that aggregate scores mask?
- RQ4: To what extent do commercial versus open-source models differ across professional and reasoning domains in Chinese NLP?
- RQ5: How effective is the proposed scoring and decontamination framework at reducing bias and contamination while maintaining ranking fidelity?
Key findings
- Rankings under the ReLE framework are highly unstable under weight perturbations, with a mean RSA of 11.4 versus ~5.0 for traditional benchmarks.
- The Anisotropy Index, calculated as 1 minus the average inter-dimension correlation, is 0.74, indicating strong capability anisotropy across dimensions.
- Commercial models lead in professional domains, but top open-source models close the gap in general reasoning; multi-agent tool-use models outperform general models in Tool Use (74.8 vs 62.4).
- Cost-efficient dynamic sampling reduces evaluation costs by ~70% (from $69,000 to $20,700 for 304 models) while maintaining a ranking correlation ρ = 0.96 with full-set evaluations.
- Ranking instability is significantly higher than in the baselines (p<0.001); RSA distributions differ between ReLE and C-Eval/CLUE, with non-overlapping bootstrap 95% confidence intervals.
- A full-set control shows dynamic sampling preserves 94.8% of the capability signal, indicating robustness to sampling strategy.
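The Anisotropy Index described above (1 minus the average inter-dimension correlation) can be sketched as follows. The review does not say which correlation coefficient the paper uses, so Pearson correlation over per-model scores is an assumption here, as are the function names and the toy scores.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def anisotropy_index(scores_by_dimension):
    """Return 1 minus the mean pairwise correlation between dimensions.

    scores_by_dimension maps a dimension name to a list of scores,
    one per model, in the same model order for every dimension.
    Hypothetical helper; Pearson correlation is an assumption.
    """
    pairs = list(combinations(scores_by_dimension.values(), 2))
    mean_corr = sum(pearson(x, y) for x, y in pairs) / len(pairs)
    return 1.0 - mean_corr


if __name__ == "__main__":
    # Toy scores for four models on three made-up dimensions.
    toy = {
        "math": [60.0, 70.0, 80.0, 90.0],
        "law": [55.0, 65.0, 75.0, 85.0],
        "tool_use": [90.0, 60.0, 85.0, 70.0],
    }
    print(anisotropy_index(toy))
```

Under this definition an index near 0 means every dimension ranks models almost identically, while larger values mean a model's strength in one dimension says little about its strength in another.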
This review was created by AI and reviewed by human editors.