[Paper Review] Models Know Models Best: Evaluation via Model-Preferred Formats
The paper shows that LLM evaluation outcomes depend on format (symbol-based vs. cloze) and introduces a dynamic, model-driven format-alignment method that improves zero-shot accuracy by using model-preference signals to choose the best format per problem instance.
Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
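To make the two formats concrete, here is a minimal sketch of how each one might turn the same multiple-choice item into a prediction. It is not the paper's implementation. The `logprob` callable is a hypothetical stand-in for whatever model wrapper returns log P(continuation | prompt), and the prompt templates, the letter labels, and the per-word length normalization are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: returns log P(continuation | prompt) under some LLM.
LogProbFn = Callable[[str, str], float]

LETTERS = "ABCDEFGH"


def _argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def cloze_predict(question: str, options: Sequence[str], logprob: LogProbFn,
                  length_normalize: bool = True) -> int:
    """Cloze style: score each option's text as a continuation of the question."""
    scores = []
    for opt in options:
        score = logprob(question + " ", opt)
        if length_normalize:
            # Assumed normalization: divide by word count so that longer
            # options are not penalized merely for having more tokens.
            score /= max(len(opt.split()), 1)
        scores.append(score)
    return _argmax(scores)


def symbol_predict(question: str, options: Sequence[str], logprob: LogProbFn) -> int:
    """Symbol style: list all options with letters, then score each letter."""
    if len(options) > len(LETTERS):
        raise ValueError("too many options for the available letter labels")
    listing = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))
    prompt = f"{question}\n{listing}\nAnswer:"
    scores = [logprob(prompt, f" {LETTERS[i]}") for i in range(len(options))]
    return _argmax(scores)
```

In the cloze variant the model never sees the competing options, so its score reflects how natural each continuation is. In the symbol variant all options are visible at once, which is what allows explicit comparison.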
Motivation & Objective
- Understand how evaluation format affects LLM performance on multiple-choice tasks.
- Identify task characteristics that favor likelihood-based continuation versus explicit comparison.
- Develop a format-alignment method guided by model-preference signals to improve evaluation accuracy.
- Demonstrate model-agnostic applicability of the approach across decoder-based LLMs.
Proposed method
- Compare symbol-based and cloze-style evaluation formats across multiple LLMs and benchmarks.
- Introduce a lightweight classifier trained on latent model-preference signals to select problem-specific formats.
- Use a dynamic format-alignment strategy to determine the optimal evaluation format for each instance.
- Demonstrate zero-shot accuracy improvements using the model-preference-driven format selection.
- Show that the approach is model-agnostic and outperforms human-designed heuristics, which often degrade performance.
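The per-instance format selection described above can be sketched as a small binary classifier that routes each problem to one of the two formats. This review does not describe the paper's actual preference signals or classifier, so everything below is an assumption: the surface features in `featurize`, the logistic-regression router, and the toy labels (1 meaning the model did better with the symbol format on that item, 0 meaning cloze) are hypothetical placeholders.

```python
import math
from typing import List, Sequence


def featurize(question: str, options: Sequence[str]) -> List[float]:
    """Hypothetical surface features; the paper's real signals are not given here."""
    mean_len = sum(len(opt.split()) for opt in options) / len(options)
    return [
        1.0,                                               # bias term
        1.0 if question.rstrip().endswith("?") else 0.0,   # explicit question vs. open stem
        min(mean_len, 20.0) / 20.0,                        # normalized mean option length
    ]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_router(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 500) -> List[float]:
    """Fit logistic regression by batch gradient descent; label 1 = prefer symbol."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - label
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def choose_format(w: List[float], question: str, options: Sequence[str]) -> str:
    """Return 'symbol' or 'cloze' for one problem instance."""
    z = sum(wi * xi for wi, xi in zip(w, featurize(question, options)))
    return "symbol" if sigmoid(z) >= 0.5 else "cloze"


# Toy training set with made-up preference labels (illustrative only).
TOY = [
    ("Which of these is a prime number?", ["4", "6", "7", "9"], 1),
    ("Which statement is correct?", ["a b", "c d"], 1),
    ("She opened the door and", ["walked in", "flew away"], 0),
    ("The capital of France is", ["Paris", "Rome"], 0),
]
weights = train_router([featurize(q, o) for q, o, _ in TOY], [lab for _, _, lab in TOY])
```

In a real setup the labels would come from the evaluated model's own behavior on held-out items rather than from hand-written rules, which is the distinction the paper draws against human-designed heuristics.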
Experimental results
Research questions
- RQ1: How do evaluation formats influence LLM performance on knowledge and reasoning tasks?
- RQ2: Can a lightweight classifier leverage model-preference signals to choose the best evaluation format for a given problem?
- RQ3: Does a dynamic, format-aligned evaluation strategy improve zero-shot accuracy across decoder-based LLMs?
- RQ4: Are model-preference-driven formats more effective than human-designed heuristics for evaluating LLMs?
- RQ5: Is the approach robust across different benchmarks and model families?
Key findings
- Symbol-based and cloze-style formats yield differing performance due to task characteristics.
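- Likelihood scoring (cloze) benefits tasks that resemble natural language continuation; symbol-based selection suits tasks that require explicit comparison among options.
- A lightweight classifier trained on latent model-preference signals can predict which format suits each instance and guide evaluation.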
- The dynamic format-alignment method yields substantial zero-shot accuracy gains across benchmarks.
- The results suggest model-agnostic benefits and reveal latent capabilities more accurately.
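One way to check a claim like the accuracy gain above is to compare four strategies on the same items: always cloze, always symbol, the routed choice, and an oracle that counts an item as solved if either format solves it. The sketch below assumes per-item correctness flags are already available for both formats. The function name, the dictionary keys, and the idea of reporting an oracle upper bound are illustrative choices, not taken from the paper.

```python
from typing import Dict, Sequence


def compare_strategies(cloze_ok: Sequence[bool], symbol_ok: Sequence[bool],
                       routed_format: Sequence[str]) -> Dict[str, float]:
    """Accuracy of fixed formats, the routed choice, and an either-format oracle."""
    n = len(cloze_ok)
    if n == 0 or not (n == len(symbol_ok) == len(routed_format)):
        raise ValueError("inputs must be non-empty and of equal length")

    # For each item, take the correctness flag of the format the router picked.
    routed_ok = [s if fmt == "symbol" else c
                 for c, s, fmt in zip(cloze_ok, symbol_ok, routed_format)]
    # Oracle: an item counts as solved if at least one format solves it.
    oracle_ok = [c or s for c, s in zip(cloze_ok, symbol_ok)]

    def acc(flags: Sequence[bool]) -> float:
        return sum(flags) / n

    return {
        "cloze": acc(cloze_ok),
        "symbol": acc(symbol_ok),
        "routed": acc(routed_ok),
        "oracle": acc(oracle_ok),
    }
```

The gap between the better fixed format and the oracle bounds how much any router could gain, and the routed score shows how much of that gap a given classifier actually recovers.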
This review was created by AI and reviewed by human editors.