[Paper Review] Synchromesh: Reliable code generation from pre-trained language models
Synchromesh introduces Target Similarity Tuning (TST) for selecting semantically relevant few-shot examples and Constrained Semantic Decoding (CSD) to enforce language-specific constraints during code generation, improving reliability of LLMs like GPT-3 and Codex across SQL, Vega-Lite, and SMCalFlow without fine-tuning.
Large pre-trained language models have been used to generate code, providing a flexible interface for synthesizing programs from natural language specifications. However, they often violate syntactic and semantic rules of their output language, limiting their practical usability. In this paper, we propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation. Synchromesh comprises two components. First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection. TST learns to recognize utterances that describe similar target programs despite differences in surface natural language features. Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD): a general framework for constraining the output to a set of valid programs in the target language. CSD leverages constraints on partial outputs to sample complete correct programs, and needs neither re-training nor fine-tuning of the language model. We evaluate our methods by synthesizing code from natural language descriptions using GPT-3 and Codex in three real-world languages: SQL queries, Vega-Lite visualizations and SMCalFlow programs. These domains showcase rich constraints that CSD is able to enforce, including syntax, scope, typing rules, and contextual logic. We observe substantial complementary gains from CSD and TST in prediction accuracy and in effectively preventing run-time errors.
Motivation & Objective
- Address the failure of large language models to generate syntactically and semantically valid code from natural language descriptions.
- Propose Target Similarity Tuning (TST) to select semantically relevant few-shot examples that align with the intended program structure.
- Propose Constrained Semantic Decoding (CSD) to enforce rich language-specific constraints during decoding without retraining the model.
- Demonstrate the framework on real-world languages (SQL, Vega-Lite, SMCalFlow) showing improvements in accuracy and validity.
Proposed method
- Develop Target Similarity Tuning (TST): fine-tune a similarity model to predict similarity between target programs from their descriptions, optimizing for program-structure similarity (based on AST tree edit distance).
- Introduce Completion Engines (CEs) as an abstraction to encode syntactic and semantic constraints for a target language and derive valid next-token completions.
- Formalize Constrained Semantic Decoding (CSD): sample next tokens from the set of tokens that maintain partial programs within the language’s completion set, using Brzozowski derivatives to decide prefix-closure membership of partial programs.
- Derive completions from grammars: use ANTLR-derived parser states to enumerate allowed next tokens and build context-free and context-sensitive constraint layers within CEs.
- Provide a decoding procedure: construct a decision procedure for the language’s prefix-closure L^c, then constrain the LLM’s token sampling to V_M(s) = {t | st ∈ L^c}, ensuring generated programs satisfy constraints.
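The decoding rule V_M(s) = {t | st ∈ L^c} can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the schema, the vocabulary, and the fixed score table standing in for a language model are hypothetical, the target language is small enough to enumerate, and the prefix-closure check is a brute-force scan rather than a grammar-driven completion engine. The sketch also decodes greedily, whereas the paper samples, and it ignores the mismatch between model tokens and grammar tokens.

```python
# Hypothetical schema: each table maps to the columns it actually has.
SCHEMA = {"singer": ["name", "age"], "concert": ["year"]}

# Toy target language L: every query whose column belongs to its table.
LANGUAGE = {f"SELECT {col} FROM {table};"
            for table, cols in SCHEMA.items() for col in cols}

# Hypothetical model vocabulary. "title" is not a column of any table.
VOCAB = ["SELECT ", "FROM ", "name", "age", "year", "title",
         " ", "singer", "concert", ";"]

# Stand-in for an LLM: a fixed preference per token, ignoring context.
# It prefers the invalid column "title" and the table "concert".
SCORES = {"title": 5.0, "concert": 4.0, "SELECT ": 3.0, "FROM ": 3.0,
          "name": 2.0, "year": 1.5, "age": 1.0, "singer": 1.0,
          " ": 0.5, ";": 0.5}


def in_prefix_closure(s: str) -> bool:
    """Decide s ∈ L^c by brute force: s is a prefix of some program in L."""
    return any(program.startswith(s) for program in LANGUAGE)


def valid_tokens(s: str) -> list[str]:
    """V_M(s): vocabulary tokens t such that s + t stays in L^c."""
    return [t for t in VOCAB if in_prefix_closure(s + t)]


def constrained_greedy_decode(max_steps: int = 20) -> str:
    """Pick the highest-scoring token among the valid ones at each step."""
    s = ""
    for _ in range(max_steps):
        if s in LANGUAGE:          # a complete valid program was produced
            return s
        allowed = valid_tokens(s)
        if not allowed:
            raise ValueError(f"no valid continuation for {s!r}")
        s += max(allowed, key=SCORES.__getitem__)
    raise ValueError("step budget exhausted before completing a program")


if __name__ == "__main__":
    print(valid_tokens("SELECT "))          # "title" is filtered out
    print(valid_tokens("SELECT name FROM "))  # only tables that have "name"
    print(constrained_greedy_decode())
```

Although the stand-in model scores "title" and "concert" highest, neither is ever emitted: "title" is filtered out as an unknown column, and "concert" is filtered out once "name" has been chosen, which is a small instance of a context-sensitive (scope) constraint.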
Experimental results
Research questions
- RQ1: Can dynamic, semantic-aware selection of few-shot examples (TST) improve the semantic alignment between user utterances and target programs?
- RQ2: Can a constraint-driven decoding framework (CSD) enforce syntax, scope, typing, and domain-specific semantics to reduce runtime and semantic errors without retraining LLMs?
- RQ3: Do TST and CSD provide complementary gains across multiple real-world target languages (SQL, Vega-Lite, SMCalFlow) when using GPT-3 and Codex?
- RQ4: How does the reliability and accuracy of language-to-code generation improve with constrained decoding compared to generate-then-test approaches?
Key findings
- TST significantly boosts performance for GPT-3 and Codex by retrieving semantically relevant examples based on target program similarity rather than surface-language similarity.
- CSD enforces constraints during decoding, dramatically increasing output validity and reducing runtime errors across all three domains (SQL, Vega-Lite, SMCalFlow).
- Combining TST and CSD yields the best results, with complementary benefits: TST guides toward structurally similar targets, while CSD guarantees constraint-satisfying completions.
- CSD adds modest overhead (around 8%) during sampling, but substantially increases validity and execution success, especially for longer programs.
- Across models and domains, results with Synchromesh approach supervised baselines and improve over generate-then-test methods.
- Longer programs benefit most from Synchromesh: accuracy decays more slowly and validity remains high compared to baselines.
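The first finding depends on how TST's training signal is built: pairs of utterances are labeled with the similarity of their target programs, not of their wording. The sketch below shows that pairing step only. It is a rough illustration under stated assumptions: the example bank is made up, and the token-level ratio from Python's `difflib` is a stand-in for the AST tree edit distance that the method is described as using. Fine-tuning the similarity model on these targets is not shown.

```python
from difflib import SequenceMatcher
from itertools import combinations


def program_similarity(p1: str, p2: str) -> float:
    """Stand-in similarity in [0, 1] over whitespace-separated tokens.

    Assumption: this replaces an AST tree-edit-distance similarity
    purely to keep the example short.
    """
    return SequenceMatcher(None, p1.split(), p2.split()).ratio()


def tst_training_pairs(bank):
    """Return (utterance_a, utterance_b, target) for every pair in the bank.

    The target is the similarity of the two *programs*; a similarity model
    over utterances would then be fine-tuned to predict it.
    """
    return [(ua, ub, program_similarity(pa, pb))
            for (ua, pa), (ub, pb) in combinations(bank, 2)]


# Hypothetical (utterance, program) bank.
BANK = [
    ("How many singers are there?", "SELECT COUNT ( * ) FROM singer"),
    ("Give me the number of concerts.", "SELECT COUNT ( * ) FROM concert"),
    ("List singer names by age.", "SELECT name FROM singer ORDER BY age"),
]

if __name__ == "__main__":
    for ua, ub, target in tst_training_pairs(BANK):
        print(f"{target:.2f}  {ua!r} vs {ub!r}")
```

In this toy bank the two counting queries receive a higher target than the first and third utterances, even though the latter pair shares the word "singer". That is the behavior the finding attributes to TST: retrieval driven by program structure instead of surface overlap.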
This review was created by AI and reviewed by human editors.