QUICK REVIEW

[Paper Review] Models Know Models Best: Evaluation via Model-Preferred Formats

Joonhak Lee, Sungmok Jung|arXiv (Cornell University)|Jan 30, 2026
Topic Modeling|0 citations
TL;DR

The paper shows that LLM evaluation outcomes depend on format (symbol-based vs. cloze) and introduces a dynamic, model-driven format-alignment method that improves zero-shot accuracy by using model-preference signals to choose the best format per problem instance.

ABSTRACT

Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.

Motivation & Objective

  • Understand how evaluation format affects LLM performance on multiple-choice tasks.
  • Identify task characteristics that favor likelihood-based continuation versus explicit comparison.
  • Develop a format-alignment method guided by model-preference signals to improve evaluation accuracy.
  • Demonstrate model-agnostic applicability of the approach across decoder-based LLMs.

Proposed method

  • Compare symbol-based and cloze-style evaluation formats across multiple LLMs and benchmarks.
  • Introduce a lightweight classifier trained on latent model-preference signals to select problem-specific formats.
  • Use a dynamic format-alignment strategy to determine the optimal evaluation format for each instance.
  • Demonstrate zero-shot accuracy improvements using the model-preference-driven format selection.
  • Show that the approach is model-agnostic and outperforms human-designed heuristics, which often degrade performance.
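
The two formats compared above can be illustrated with a minimal Python sketch. It assumes the per-option log-probabilities have already been obtained from a model; the function names, the length-normalization option, and the toy numbers are hypothetical and are not taken from the paper.

```python
def symbol_choice(label_logprobs):
    """Symbol-based format: the prompt lists the options with labels (A, B, ...)
    and the prediction is the label token with the highest log-probability."""
    return max(label_logprobs, key=label_logprobs.get)


def cloze_choice(option_token_logprobs, length_normalize=True):
    """Cloze format: each option's text is scored as a continuation of the
    question. The score is the sum of its token log-probabilities, optionally
    divided by the token count (an assumed, commonly used normalization)."""
    def score(token_logprobs):
        total = sum(token_logprobs)
        return total / len(token_logprobs) if length_normalize else total

    return max(option_token_logprobs,
               key=lambda label: score(option_token_logprobs[label]))


# Toy numbers, for illustration only.
label_logprobs = {"A": -1.2, "B": -0.4, "C": -2.0}
option_token_logprobs = {
    "A": [-0.5, -0.6, -0.7, -0.4],  # long option, high per-token likelihood
    "B": [-1.0, -1.5, -0.9],
    "C": [-1.5],                    # short option, low per-token likelihood
}

print(symbol_choice(label_logprobs))
print(cloze_choice(option_token_logprobs))
print(cloze_choice(option_token_logprobs, length_normalize=False))
```

With these toy numbers the same question can receive different predicted answers depending on the format and on the scoring convention. That is the kind of format sensitivity the review describes.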

Experimental results

Research questions

  • RQ1: How do evaluation formats influence LLM performance on knowledge and reasoning tasks?
  • RQ2: Can a lightweight classifier leverage model-preference signals to choose the best evaluation format for a given problem?
  • RQ3: Does a dynamic, format-aligned evaluation strategy improve zero-shot accuracy across decoder-based LLMs?
  • RQ4: Are model-preference-driven formats more effective than human-designed heuristics for evaluating LLMs?
  • RQ5: Is the approach robust across different benchmarks and model families?

Key findings

  • Symbol-based and cloze-style formats yield markedly different performance, and the gap is systematically attributable to task characteristics.
  • Likelihood scoring benefits natural language continuation tasks, whereas explicit comparison tasks are better suited to symbol-based selection.
  • A lightweight classifier trained on latent model-preference signals can select the format to use for each problem instance.
  • The dynamic format-alignment method yields substantial zero-shot accuracy gains across benchmarks.
  • The trends are consistent across decoder-based LLMs, suggesting model-agnostic effects, and format-aligned evaluation better reveals the models' latent capabilities.
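
As a rough illustration of per-instance format selection, the sketch below trains a small logistic-regression selector in plain Python. Everything specific here is an assumption made for illustration: the review does not say which signals or classifier the authors use, so the two features (a hypothetical confidence margin under each format), the labels, and the function names are placeholders rather than the paper's method.

```python
import math


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent.
    Labels: 1 = the instance is better served by cloze, 0 = by symbol-based."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def choose_format(x, weights, bias):
    """Route one instance to the format the selector predicts is better."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return "cloze" if z > 0 else "symbol"


# Hypothetical features per instance: [symbol_margin, cloze_margin].
train_x = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]]
train_y = [1, 1, 0, 0]  # assumed preference labels, for illustration only

weights, bias = train_logistic(train_x, train_y)
print(choose_format([0.05, 0.95], weights, bias))
print(choose_format([0.95, 0.05], weights, bias))
```

In a real pipeline the selected format would then decide whether an instance is scored by label likelihood or by continuation likelihood. How the preference labels are derived is the part this sketch leaves open.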

This review was created by AI and reviewed by human editors.