[Paper Review] Specializing Smaller Language Models towards Multi-Step Reasoning
The paper shows that small models (≤11B parameters) can be specialized to excel at multi-step math reasoning by distilling CoT data from a large teacher model, trading off generic abilities for target task performance and revealing a log-linear scaling curve after specialization.
The surprising ability of Large Language Models (LLMs) to perform well on complex reasoning with only few-shot chain-of-thought prompts is believed to emerge only in very large-scale models (100+ billion parameters). We show that such abilities can, in fact, be distilled down from GPT-3.5 ($\ge$ 175B) to T5 variants ($\le$ 11B). We propose model specialization, to specialize the model's ability towards a target task. The hypothesis is that large models (commonly viewed as larger than 100B) have strong modeling power, but their capacity is spread over a large spectrum of tasks. Small models (commonly viewed as smaller than 10B) have limited model capacity, but if we concentrate their capacity on a specific target task, the model can achieve decently improved performance. We use multi-step math reasoning as our testbed because it is a very typical emergent ability. We show two important aspects of model abilities: (1) there exists a very complex balance/tradeoff between language models' multi-dimensional abilities; (2) by paying the price of decreased generic ability, we can clearly lift up the scaling curve of models smaller than 10B towards a specialized multi-step math reasoning ability. We further give comprehensive discussions about important design choices for better generalization, including the tuning data format, the start model checkpoint, and a new model selection method. We hope our practice and discoveries can serve as an important attempt towards specialized smaller models in the new research paradigm set by LLMs.
Motivation & Objective
- Demonstrate that small language models can achieve strong multi-step math reasoning via specialization.
- Investigate how distillation and data formats influence CoT capability in small models.
- Characterize the tradeoffs between generic (BigBench Hard, BBH) and target-specific (math) abilities.
- Examine scaling behavior and generalization (in-distribution vs. out-of-distribution) after specialization.
- Provide design recommendations for effective specialized small-model training.
Proposed method
- Fine-tune FlanT5 and T5 baselines with distillation data generated by a large teacher (code-davinci-002) to produce CoT-enabled outputs.
- Explore data formats: in-context answer-only, in-context CoT, and zero-shot formats to study their effects on abilities.
- Apply distribution matching as the distillation objective, aligning the student's per-step output distributions with the teacher's.
- Align the GPT and T5 tokenizations with a dynamic-programming alignment method, since teacher and student use different tokenizers.
- Evaluate on GSM8K (in-distribution) and three out-of-distribution math datasets (MultiArith, ASDiv, SVAMP), plus BigBench Hard for generic ability.
- Analyze the tradeoffs between specialization progress and retention of generic abilities across tuning stages.
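The distribution-matching and tokenizer-alignment steps in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`align_tokens`, `kl_divergence`, `distribution_matching_loss`) are hypothetical, the alignment uses a plain longest-common-subsequence dynamic program, the per-step distributions are assumed to already live on a shared toy vocabulary, and the KL direction (teacher relative to student) is an assumption.

```python
import math

def align_tokens(teacher_tokens, student_tokens):
    """Return (i, j) index pairs where the two tokenizations agree.

    Uses a longest-common-subsequence dynamic program (an assumption;
    the review only says the alignment is done by dynamic programming).
    """
    n, m = len(teacher_tokens), len(student_tokens)
    # lcs[i][j] = LCS length of teacher_tokens[i:] and student_tokens[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if teacher_tokens[i] == student_tokens[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the start to recover the matched positions.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if teacher_tokens[i] == student_tokens[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def distribution_matching_loss(teacher_dists, student_dists, pairs):
    """Average per-step KL over the aligned positions only."""
    if not pairs:
        return 0.0
    total = sum(kl_divergence(teacher_dists[i], student_dists[j]) for i, j in pairs)
    return total / len(pairs)

# Toy usage: the two tokenizers split the final number differently.
teacher_tokens = ["The", "answer", "is", "4", "2"]
student_tokens = ["The", "answer", "is", "42"]
pairs = align_tokens(teacher_tokens, student_tokens)

# Made-up two-word vocabulary distributions, one per token position.
teacher_dists = [[0.7, 0.3] for _ in teacher_tokens]
student_same = [[0.7, 0.3] for _ in student_tokens]
student_diff = [[0.3, 0.7] for _ in student_tokens]
loss_same = distribution_matching_loss(teacher_dists, student_same, pairs)
loss_diff = distribution_matching_loss(teacher_dists, student_diff, pairs)
```

In this sketch only positions where both tokenizers produce the same token contribute to the loss; positions that fail to align (here the split number) are simply skipped, which is one possible design choice among several.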

Experimental results
Research questions
- RQ1: Can small models (≤11B) achieve enhanced multi-step math reasoning by specializing toward CoT tasks?
- RQ2: What is the impact of using distillation data formats and an instruction-tuned base model on specialization performance?
- RQ3: How does specialization affect in-distribution vs. out-of-distribution performance and zero-shot vs. in-context abilities?
- RQ4: What tradeoffs occur between preserving generic abilities (BigBench Hard) and improving target-task CoT math reasoning?
- RQ5: How does model selection based on different validation signals influence final performance on in-distribution and OOD tasks?
Key findings
- Specialization improves small-model math reasoning by about 10 accuracy points on GSM8K on average, with the 3B and 11B FlanT5 models achieving strong results.
- Specialized small models can reach or approach the performance of much larger models on the target math tasks (GSM8K and OOD datasets) at the cost of degraded generic abilities on BigBench Hard.
- The scaling curve for specialized small models becomes log-linear (not flat), indicating that multi-step reasoning can scale smoothly with model size after specialization.
- Instruction-tuned bases (FlanT5) generally outperform raw pretrained bases (T5) after specialization, underscoring the benefit of starting from an instruction-tuned checkpoint.
- There are clear tradeoffs between in-distribution and out-of-distribution performance and between in-context and zero-shot abilities, with model selection depending on the desired generalization goal.
- The two distillation strategies differ in convergence speed (distribution matching converges faster than sampling matching) but not substantially in final performance.
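To make the "log-linear scaling curve" finding concrete, the sketch below fits accuracy as a linear function of the logarithm of parameter count. It is only an illustration: `fit_log_linear` is a hypothetical helper, and the model sizes and accuracies are synthetic values constructed to lie exactly on a line, not results from the paper.

```python
import math

def fit_log_linear(param_counts, accuracies):
    """Least-squares fit of accuracy = intercept + slope * log10(param_count)."""
    xs = [math.log10(n) for n in param_counts]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(accuracies) / len(accuracies)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Synthetic placeholder data (NOT from the paper): each tenfold increase
# in size adds exactly 8 accuracy points by construction.
sizes = [1e8, 1e9, 1e10]
synthetic_acc = [5.0, 13.0, 21.0]
slope, intercept = fit_log_linear(sizes, synthetic_acc)
```

Under a log-linear curve, the fitted slope reads as "accuracy points gained per tenfold increase in parameters"; a flat curve would correspond to a slope near zero.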

This review was created by AI and reviewed by human editors.