QUICK REVIEW

[Paper Review] RobustFill: Neural Program Learning under Noisy I/O

Jacob Devlin, Jonathan Uesato|arXiv (Cornell University)|Mar 21, 2017

Advanced Neural Network Applications|28 references|108 citations

TL;DR

The paper directly compares neural program synthesis and neural program induction on a real-world string-transformation task (FlashFill), introduces an attentional RNN that encodes variable-sized sets of I/O examples, reaches 92% generalization accuracy on FlashFillTest, and shows that the neural models stay robust to noisy examples where a hand-crafted rule-based system fails.

ABSTRACT

The problem of automatically generating a computer program from some specification has been studied since the early days of AI. Recently, two competing approaches for automatic program learning have received significant attention: (1) neural program synthesis, where a neural network is conditioned on input/output (I/O) examples and learns to generate a program, and (2) neural program induction, where a neural network generates new outputs directly using a latent program representation. Here, for the first time, we directly compare both approaches on a large-scale, real-world learning task. We additionally contrast to rule-based program synthesis, which uses hand-crafted semantics to guide the program generation. Our neural models use a modified attention RNN to allow encoding of variable-sized sets of I/O pairs. Our best synthesis model achieves 92% accuracy on a real-world test set, compared to the 34% accuracy of the previous best neural synthesis approach. The synthesis model also outperforms a comparable induction model on this task, but we more importantly demonstrate that the strength of each approach is highly dependent on the evaluation metric and end-user application. Finally, we show that we can train our neural models to remain very robust to the type of noise expected in real-world data (e.g., typos), while a highly-engineered rule-based system fails entirely.

Motivation & Objective

Directly compare neural program synthesis and neural program induction on a real-world string-transformation task with noisy I/O examples.
Develop an attention-based neural architecture capable of encoding variable-sized sets of I/O examples.
Evaluate end-to-end performance against a hand-crafted rule-based system and an induction-based approach.
Assess robustness to realistic noise (typos) in I/O examples.
Quantify how evaluation metrics (all-example vs. average-example) influence observed strengths of each approach.

Proposed method

Propose a novel variant of the attention-based RNN to encode variable-length, unordered I/O example sets via late pooling.
Represent the program in a domain-specific language (DSL) for string transformations including nested expressions and regex-based extractions.
Train end-to-end on synthetically generated I/O-Program pairs and decode with beam search, validating consistency against observed I/O pairs.
Compare program synthesis (generate a program P and execute it on new inputs) with program induction (generate the assessment output O^y directly) and with a hand-crafted rule-based system.
Introduce a dynamic programming-like constraint (DP-Beam) during decoding to prune inconsistent partial programs based on observed outputs.
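
The synthesis pipeline listed above (a DSL program, a consistency check against the observed examples, and pruning of partial programs) can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the three operators (ConstStr, SubStr, GetToken) are a simplified stand-in for the full DSL, the tuple encoding and the helper names run, is_consistent, prefix_ok, and first_consistent are hypothetical, and the hand-written candidate list stands in for the output of beam search. The prefix check assumes a program is a top-level concatenation of expressions, so a partial program must produce a prefix of every observed output.

```python
import re

WORD = r"[A-Za-z]+"  # hypothetical token class used by GetToken below


def eval_expr(expr, inp):
    """Evaluate one toy DSL expression on an input string (None on failure)."""
    op = expr[0]
    if op == "ConstStr":
        return expr[1]
    if op == "SubStr":
        _, start, end = expr
        return inp[start:end]
    if op == "GetToken":
        _, pattern, index = expr
        matches = re.findall(pattern, inp)
        if not -len(matches) <= index < len(matches):
            return None
        return matches[index]
    raise ValueError(f"unknown operator: {op}")


def run(program, inp):
    """A program is a list of expressions whose results are concatenated."""
    parts = []
    for expr in program:
        piece = eval_expr(expr, inp)
        if piece is None:
            return None
        parts.append(piece)
    return "".join(parts)


def is_consistent(program, examples):
    """True if the full program reproduces every observed output."""
    return all(run(program, inp) == out for inp, out in examples)


def prefix_ok(partial_program, examples):
    """Pruning test: a partial program must emit a prefix of each output."""
    for inp, out in examples:
        produced = run(partial_program, inp)
        if produced is None or not out.startswith(produced):
            return False
    return True


def first_consistent(candidates, examples):
    """Return the highest-ranked candidate that matches all observed examples."""
    for program in candidates:
        if is_consistent(program, examples):
            return program
    return None


# Observed I/O examples (made up for illustration).
examples = [("Jacob Devlin", "Devlin, J."), ("Jonathan Uesato", "Uesato, J.")]

correct = [("GetToken", WORD, 1), ("ConstStr", ", "), ("SubStr", 0, 1), ("ConstStr", ".")]
wrong = [("GetToken", WORD, 0), ("ConstStr", ", "), ("SubStr", 0, 1), ("ConstStr", ".")]
memorized = [("ConstStr", "Devlin, J.")]  # fits the first example only

# Candidates in the order a decoder might have ranked them.
candidates = [memorized, wrong, correct]
chosen = first_consistent(candidates, examples)
```

In this sketch, prefix_ok(wrong[:1], examples) is already false after the first expression, so a search that applies the check at every step could discard that branch long before the program is complete, whereas is_consistent only rejects it at the end.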

Experimental results

Research questions

RQ1: Can neural program synthesis outperform neural program induction on real-world FlashFill-like tasks?
RQ2: How does encoding a variable-sized set of I/O examples with attention affect synthesis accuracy?
RQ3: What is the impact of noise (typos) in I/O examples on synthesis, induction, and rule-based systems?
RQ4: How do different evaluation metrics (all-example vs. average-example accuracy) shape perceived strengths of synthesis vs. induction?
RQ5: Does the DSL’s expressiveness (e.g., GetSpan) contribute to generalization on real-world instances?
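
For RQ2, the idea of handling a variable-sized, unordered set of I/O examples by pooling late can be illustrated with a small pure-Python sketch. It assumes that each example yields its own decoder hidden state at a given step, that each state passes through a shared tanh layer, and that the results are max-pooled element-wise before a single softmax over program tokens. The function names, the tiny weight matrices, and the vector sizes are all hypothetical, and attention itself is omitted.

```python
import math


def dense_tanh(vector, weights):
    """Compute tanh(W v) for a weight matrix given as a list of rows."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in weights]


def late_pool(per_example_states, weights):
    """Project each example's state, then take an element-wise max across examples."""
    projected = [dense_tanh(state, weights) for state in per_example_states]
    return [max(column) for column in zip(*projected)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(per_example_states, pool_weights, output_weights):
    """One shared distribution over program tokens for the whole example set."""
    pooled = late_pool(per_example_states, pool_weights)
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in output_weights]
    return softmax(logits)


# Made-up numbers: two examples, hidden size 3, pooled size 2, vocabulary of 4 tokens.
STATES = [[0.2, -0.1, 0.5], [0.9, 0.3, -0.4]]
POOL_W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.6]]
OUT_W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.2]]

distribution = next_token_distribution(STATES, POOL_W, OUT_W)
```

Because the max is taken per dimension across examples, the pooled vector has a fixed size regardless of how many examples are supplied and does not change when their order is shuffled, which is the property that makes the encoding suitable for sets.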

Key findings

System	Beam	Generalization Accuracy (test)	All-Example Accuracy (test)	Average-Example Accuracy (test)
Parisotto et al. 2017 (neural synthesis baseline)	100	34%	—	—
Basic Seq-to-Seq	100	56%	—	—
Attention-C	100	86%	—	—
Attention-C-DP	1000	92%	—	—
Induction (synthesis architecture variant)	3	—	53%	—

Attentional architectures significantly outperform basic seq-to-seq baselines (≈25 percentage points gain).
Best synthesis model achieves 92% generalization accuracy on FlashFillTest, outperforming the previous best neural approach (34%).
The neural synthesis model is far more robust to noise than a hand-crafted rule-based system (with noise, 80% vs. 6% accuracy).
Compared to neural induction, synthesis provides higher all-example generalization, while induction can offer partial correctness across assessment examples; both have complementary strengths depending on metric.
DP-Beam decoding and late pooling with double attention yield the strongest results (Attention-C-DP with Beam=1000 achieves 92% generalization).
Induction (generating O^y directly) achieves 53% generalization vs. 81% for synthesis under similar settings; induction performs better on average-example accuracy but lags on all-example accuracy.
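
The difference between the two metrics discussed above can be made concrete with a short sketch. It assumes that all-example accuracy counts a task as solved only when every assessment output is correct, while average-example accuracy gives partial credit per assessment output. The function names and the toy predictions are hypothetical and are not results from the paper.

```python
def all_example_correct(predicted, expected):
    """1 if every assessment output matches, else 0."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    return int(all(p == e for p, e in zip(predicted, expected)))


def average_example_score(predicted, expected):
    """Fraction of assessment outputs that match."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def evaluate(tasks):
    """tasks: list of (predicted_outputs, expected_outputs), one pair per task."""
    n = len(tasks)
    all_acc = sum(all_example_correct(p, e) for p, e in tasks) / n
    avg_acc = sum(average_example_score(p, e) for p, e in tasks) / n
    return all_acc, avg_acc


# Toy data, invented for illustration: two tasks, four assessment outputs each.
expected_1 = ["a", "b", "c", "d"]
expected_2 = ["e", "f", "g", "h"]

# A synthesis-like system: one program fully right, one fully wrong.
synthesis_like = [(["a", "b", "c", "d"], expected_1), (["x", "x", "x", "x"], expected_2)]

# An induction-like system: mostly right on both tasks, never perfect.
induction_like = [(["a", "b", "c", "?"], expected_1), (["e", "f", "g", "?"], expected_2)]

synthesis_scores = evaluate(synthesis_like)  # (0.5, 0.5)
induction_scores = evaluate(induction_like)  # (0.0, 0.75)
```

On this toy data the synthesis-like system leads on all-example accuracy and the induction-like system leads on average-example accuracy, mirroring the qualitative pattern described in the findings.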

This review was created by AI and reviewed by human editors.