QUICK REVIEW

[Paper Review] Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

Ido Pinto, Yizhak Yisrael Elboher|arXiv (Cornell University)|Mar 16, 2026

Logic, programming, and type systems | Citations: 0

One-Sentence Summary

The paper presents Wonda, a data-curation pipeline that turns noisy verifier-generated invariants into high-quality training signals for fine-tuning small language models, yielding substantial verification performance gains on hard instances.

ABSTRACT

The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. While fine-tuning is a natural route to mitigate this limitation, obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose Wonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset results in consistent and significant performance gains. In particular, a fine-tuned 4B parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness and speedup rates of base models and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.

Motivation and Objectives

Formalize the requirements for high-quality training invariants in program verification.
Design and validate a data-curation pipeline (Wonda) to transform noisy verifier outputs into learnable training signals.
Demonstrate that fine-tuning Small Language Models (SLMs) on curated data improves invariant generation performance.
Evaluate across multiple base models and benchmark suites to quantify speedups and correctness gains.

Proposed Method

Define non-degenerate, correct, useful, and compact invariants as training targets.
Ground training data in verifier-generated invariants from UAutomizer and apply AST-based normalization to canonicalize structure.
Use an LLM-driven semantic rewriting step to produce compact, interpretable invariant candidates.
Verify transformed invariants with formal verification tools to ensure correctness and sufficiency.
Introduce a quality grading mechanism to filter golden training samples (Q ≥ 2) based on parallel correctness and sufficiency checks.
Evaluate SLMs fine-tuned on the V2 dataset (7,283 samples) against large models using a portfolio/Virtual Best Performance (VBP) framework.
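
To make the normalization step concrete, here is a minimal sketch of AST-based canonicalization. It is not the Wonda implementation: the review does not describe the actual rewrite rules, and the real pipeline works on C invariants emitted by UAutomizer. As a stand-in, this sketch assumes invariants written in Python expression syntax, parses them with the standard `ast` module, flattens nested `and`/`or`, drops duplicate operands, and sorts the rest. The names `normalize` and `_Canonicalize` are hypothetical.

```python
import ast


class _Canonicalize(ast.NodeTransformer):
    """Hypothetical canonicalizer for boolean expressions (illustration only)."""

    def visit_BoolOp(self, node):
        # Normalize the operands first so nested expressions are already canonical.
        self.generic_visit(node)
        operands = []
        for value in node.values:
            # Flatten nested operators of the same kind: (a and b) and c -> a and b and c
            if isinstance(value, ast.BoolOp) and type(value.op) is type(node.op):
                operands.extend(value.values)
            else:
                operands.append(value)
        # Deduplicate by printed form, then sort so operand order is deterministic.
        unique = {ast.unparse(v): v for v in operands}
        ordered = [unique[key] for key in sorted(unique)]
        if len(ordered) == 1:
            return ordered[0]
        return ast.BoolOp(op=node.op, values=ordered)


def normalize(invariant: str) -> str:
    """Return a canonical string form of a boolean invariant expression."""
    tree = ast.parse(invariant, mode="eval")
    return ast.unparse(_Canonicalize().visit(tree))


print(normalize("y == x + 1 and (x >= 0 and x >= 0)"))
```

Two invariants that differ only in operand order or redundant conjuncts map to the same string, which is the kind of structural noise a normalization pass is meant to remove before training. Semantic simplification (the LLM rewriting step) and the formal re-verification that follows it are outside the scope of this sketch.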

Experimental Results

Research Questions

RQ1: Can high-quality training data improve invariant generation for SLMs beyond raw solver outputs?
RQ2: What data-curation methods (normalization and semantic simplification) best enhance learnability and usefulness of invariants?
RQ3: Do fine-tuned SLMs on curated data achieve competitive verification performance compared to large LLMs on hard benchmarks?
RQ4: How does the Virtual Best Performance metric reveal practical gains when LLMs run in parallel with symbolic verifiers?

Key Findings

Model	R_valid (%)	R_correct (%)	R_speedup (%)	S_bar (>1) (x)	VBP (s)	VBP_E2E (s)	Solved
GPT-5.2	94.0 ± 1.7	72.4 ± 2.2	37.1 ± 1.2	10.7 ± 0.4	155.6 ± 3.0	163.4 ± 3.0	3, 2, 3
GPT-OSS-120B	92.1 ± 1.2	58.0 ± 1.2	27.4 ± 2.9	7.0 ± 1.4	165.8 ± 5.6	167.6 ± 5.7	3, 2, 1
Qwen3-8B (Base)	89.4 ± 7.8	23.9 ± 3.1	10.8 ± 0.5	8.5 ± 5.2	181.6 ± 4.3	181.7 ± 4.2	0, 0, 3
Qwen3-8B-V2 (Ours)	100.0 ± 0.0	42.8 ± 4.6	21.7 ± 1.7	10.7 ± 2.3	166.5 ± 4.3	166.7 ± 4.3	2, 1, 4
Qwen3-4B (Base)	99.2 ± 0.0	22.8 ± 2.2	11.1 ± 1.0	8.9 ± 2.5	185.6 ± 2.3	185.7 ± 2.3	1, 0, 1
Qwen3-4B-V2 (Ours)	100.0 ± 0.0	44.4 ± 2.3	24.7 ± 1.2	12.4 ± 2.2	165.5 ± 3.2	165.7 ± 3.2	3, 2, 2
Qwen3-0.6B (Base)	88.4 ± 0.5	28.5 ± 2.8	12.2 ± 2.2	5.3 ± 3.3	182.9 ± 5.7	183.0 ± 5.7	2, 0, 1
Qwen3-0.6B-V2 (Ours)	99.7 ± 0.5	27.9 ± 0.5	14.1 ± 2.5	8.5 ± 3.1	174.0 ± 5.6	174.1 ± 5.6	2, 2, 1
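
The fine-tuning gains can be read directly off the table. The short script below is only an arithmetic check on the reported means (error bars ignored): it divides each V2 model's R_correct and R_speedup by the corresponding base model's value. The dictionary layout and the helper name `gain_ratios` are illustrative choices, not anything from the paper.

```python
# (R_correct %, R_speedup %) means copied from the table above.
results = {
    "Qwen3-8B": {"base": (23.9, 10.8), "v2": (42.8, 21.7)},
    "Qwen3-4B": {"base": (22.8, 11.1), "v2": (44.4, 24.7)},
    "Qwen3-0.6B": {"base": (28.5, 12.2), "v2": (27.9, 14.1)},
}


def gain_ratios(entry):
    """Return (correctness ratio, speedup-rate ratio) of V2 over base."""
    (base_correct, base_speedup), (v2_correct, v2_speedup) = entry["base"], entry["v2"]
    return round(v2_correct / base_correct, 2), round(v2_speedup / base_speedup, 2)


ratios = {name: gain_ratios(entry) for name, entry in results.items()}
for name, (correct, speedup) in ratios.items():
    print(f"{name}: correctness x{correct}, speedup rate x{speedup}")
```

On these means, the 4B and 8B models roughly double both rates (about 1.8x to 2.2x), while the 0.6B model gains little: its speedup rate rises by about 1.16x and its correctness rate is essentially unchanged.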

A carefully curated data pipeline (Wonda) yields significant performance gains for fine-tuned SLMs on invariant generation.
A 4B-parameter SLM fine-tuned on Wonda data matches the utility of GPT-OSS-120B and approaches GPT-5.2 in verification tasks without added reasoning-time overhead.
On InvBench hard instances, fine-tuned models double invariant correctness and speedup rates compared to baselines.
VBP (portfolio-based) results show substantial wall-clock improvements: per the table, fine-tuning lowers VBP by about 9 to 20 seconds relative to the corresponding base model (e.g., Qwen3-4B: 185.6 s to 165.5 s).
SLMs fine-tuned on V2 data substantially outperform their V0/V1 counterparts across multiple model scales; against the base model, for example, Qwen3-4B-V2 reaches 44.4% correctness vs 22.8% and a 24.7% speedup rate vs 11.1%.
Large-model baselines (GPT-5.2, GPT-OSS-120B) remain competitive in VBP, but V2-trained 4B models provide comparable end-to-end performance when considering inference latency.
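
The portfolio idea behind VBP can be sketched as follows. This review does not restate the paper's exact definition, so the code rests on assumptions: for each instance, the portfolio runs the plain verifier and the LLM-assisted verifier in parallel and takes the faster of the two; unsolved runs are charged the timeout; and the end-to-end variant adds model inference time to the assisted run. The function name `virtual_best`, the 300-second timeout, and the sample timings are all hypothetical.

```python
from statistics import mean


def virtual_best(baseline, assisted, inference=None, timeout=300.0):
    """Mean per-instance time of an idealized two-way portfolio.

    baseline / assisted: verification times in seconds, None if unsolved.
    inference: optional model latencies added to the assisted run (E2E variant).
    timeout: assumed time limit charged to unsolved runs (hypothetical value).
    """
    if inference is None:
        inference = [0.0] * len(baseline)
    per_instance = []
    for base_t, assist_t, infer_t in zip(baseline, assisted, inference):
        base_t = timeout if base_t is None else min(base_t, timeout)
        assist_t = timeout if assist_t is None else min(assist_t + infer_t, timeout)
        # The portfolio finishes as soon as the faster configuration finishes.
        per_instance.append(min(base_t, assist_t))
    return mean(per_instance)


# Made-up timings for three instances, for illustration only.
baseline = [100.0, None, 40.0]
assisted = [20.0, 150.0, None]
latency = [2.0, 3.0, 1.0]

vbp = virtual_best(baseline, assisted)
vbp_e2e = virtual_best(baseline, assisted, latency)
print(vbp, vbp_e2e)
```

Under this reading, a model helps the portfolio only on instances where its invariant makes verification faster than the plain verifier, and inference latency matters only on those same instances. That is consistent with the small gap between the VBP and VBP_E2E columns for the fine-tuned SLMs in the table.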

This review was generated by AI and checked by a human editor.