QUICK REVIEW

[Paper Review] When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

Bogdan Zagribelnyy, Ivan Ilin | arXiv (Cornell University) | Feb 3, 2026
Machine Learning in Materials Science | 0 citations
TL;DR

The paper introduces ChemCensor, a plausibility-based metric for single-step retrosynthesis, builds the CREED dataset (~6.4M reactions) and the URSA-expert-2026 benchmark, and shows that training a Chemistry Constraint–Consistent Language Model (C3LM) on CREED improves retrosynthesis performance over the LLM baselines.

ABSTRACT

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

Motivation & Objective

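  • Highlight the limitations of exact-match Top-K metrics in single-step retrosynthesis evaluation: they credit only the one recorded route, although a target usually admits several valid disconnections.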
  • Propose ChemCensor to evaluate chemical plausibility using reaction precedents and functional-group context.
  • Create CREED as a large, validated reaction dataset for model training and benchmarking.
  • Introduce URSA-expert-2026 as an expert-annotated out-of-domain benchmark for single-step retrosynthesis (SSRS).
  • Demonstrate that training C3LM on CREED yields superior performance on both URSA-expert-2026 and USPTO-50K-test compared to baselines.
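
The exact-match limitation can be made concrete with a small sketch. It is a toy illustration, not the paper's evaluation code: the helper names `reactant_set` and `top_k_exact_match` are hypothetical, and the SMILES strings are assumed to be already canonicalized so that plain string comparison is meaningful.

```python
from typing import FrozenSet, List


def reactant_set(smiles: str) -> FrozenSet[str]:
    """Split a dot-separated reactant string into an order-independent set."""
    return frozenset(part for part in smiles.split(".") if part)


def top_k_exact_match(predictions: List[str], ground_truth: str, k: int) -> bool:
    """Return True if any of the first k predictions equals the recorded reactants."""
    target = reactant_set(ground_truth)
    return any(reactant_set(p) == target for p in predictions[:k])


# Toy target: N-benzylacetamide. The recorded route uses acetyl chloride + benzylamine.
ground_truth = "CC(=O)Cl.NCc1ccccc1"
predictions = [
    "CC(=O)O.NCc1ccccc1",   # acetic acid + benzylamine: a different but reasonable route
    "NCc1ccccc1.CC(=O)Cl",  # the recorded route, written in a different order
]

hit_at_1 = top_k_exact_match(predictions, ground_truth, k=1)
hit_at_2 = top_k_exact_match(predictions, ground_truth, k=2)
print(hit_at_1, hit_at_2)
```

Under Top-1 the first prediction scores as a miss even though a chemist might accept it, which is the gap a plausibility-based metric is meant to close.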

Proposed method

  • Define ChemCensor as a precedent-based score (0-5) based on reaction centers (RC) and functional-group (FG) context with alignment to a USPTO-full-derived knowledge base.
  • Decompose reactions into RC and FG signatures and compare them against a curated precedent library to assign ChemCensor (CC) scores.
  • Construct CREED (~6.4M reactions) using forward and retro generation pipelines, CC-based verification, and decontamination against USPTO-50K-test.
  • Assemble URSA-expert-2026 as a benchmark of 100 novel, expert-annotated targets with synthetic feasibility validation by chemists.
  • Train C3LM on CREED (with or without USPTO-full data) using supervised fine-tuning with optional reasoning traces, then apply reinforcement learning fine-tuning with a ChemCensor reward and Molecular Transformer (MT)-based signals.
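
The idea of a precedent lookup over RC and FG signatures can be sketched as follows. This is only a toy under stated assumptions: the summary does not give ChemCensor's actual rubric, so the signature encoding (plain strings and sets of FG names), the library contents, and the tier values returned by the hypothetical `toy_plausibility_score` are all invented for illustration.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical precedent library: RC signature -> FG contexts seen with that RC.
PrecedentLibrary = Dict[str, Set[FrozenSet[str]]]


def toy_plausibility_score(
    rc: str, fg_context: FrozenSet[str], library: PrecedentLibrary
) -> int:
    """Map precedent support to a 0-5 score using an invented tiering rule."""
    contexts = library.get(rc)
    if not contexts:
        return 0  # the reaction center has no precedent at all
    if fg_context in contexts:
        return 5  # the same RC was seen with exactly this FG context
    if any(fg_context & seen for seen in contexts):
        return 3  # the RC is known and part of the FG context has precedent
    return 1  # the RC is known, but never alongside these functional groups


library: PrecedentLibrary = {
    "amide_coupling": {
        frozenset({"ester"}),
        frozenset({"aryl_halide", "ether"}),
    },
}

exact = toy_plausibility_score("amide_coupling", frozenset({"ester"}), library)
partial = toy_plausibility_score("amide_coupling", frozenset({"ether", "nitrile"}), library)
unseen_fg = toy_plausibility_score("amide_coupling", frozenset({"aldehyde"}), library)
unseen_rc = toy_plausibility_score("suzuki_coupling", frozenset({"ester"}), library)
print(exact, partial, unseen_fg, unseen_rc)
```

The point of the sketch is the two-level lookup (reaction center first, functional-group context second); a real implementation would derive both signatures from atom-mapped reactions rather than from hand-written labels.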

Experimental results

Research questions

  • RQ1: Can chemical plausibility-based metrics better capture retrosynthetic quality than Top-K exact-match metrics?
  • RQ2: How well do LLMs perform on SSRS when evaluated with ChemCensor and URSA-expert-2026 compared to traditional benchmarks?
  • RQ3: Does training on a large, plausibility-verified dataset (CREED) improve generalization and RC/FG compatibility in SSRS?
  • RQ4: What is the impact of reasoning traces and reinforcement learning rewards on plausibility-oriented retrosynthesis outputs?
  • RQ5: How transferable are improvements from CREED-trained models to standard benchmarks like USPTO-50K?

Key findings

  • ChemCensor provides a 0-5 plausibility score reflecting precedent support and RC/FG compatibility for retrosynthetic steps.
  • URSA-expert-2026 presents a harder out-of-domain benchmark where model performance drops compared with USPTO-50K; many baselines struggle with plausibility.
  • CREED contains ~6.4 million reactions and is validated for chemical plausibility, enabling diverse yet plausible SSRS outputs.
  • C3LM trained on CREED (CREED-only or CREED+USPTO-full) achieves the best or near-best ChemCensor scores on both URSA-expert-2026 and USPTO-50K-test.
  • Reasoning-enabled fine-tuning improves Avg. PT-Max CC, indicating better coverage of plausible reaction contexts; RL with ChemCensor reward boosts plausibility but may affect diversity.
  • Molecular-Transformer rewards can degrade URSA-expert-2026 performance due to domain generalization limits.
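
The summary does not define "Avg. PT-Max CC". If it is read as the per-target maximum ChemCensor score averaged over all targets (an assumption, not something stated here), the aggregation could be sketched as below. The function name `avg_per_target_max` and the convention of scoring a target with no predictions as 0 are hypothetical.

```python
from typing import Dict, List


def avg_per_target_max(scores_by_target: Dict[str, List[int]]) -> float:
    """Average, over targets, of the best score among each target's predictions."""
    # Assumption: a target with no scored predictions contributes 0.
    per_target = [max(scores) if scores else 0 for scores in scores_by_target.values()]
    return sum(per_target) / len(per_target) if per_target else 0.0


toy_scores = {
    "target_1": [2, 5, 1],  # best prediction scores 5
    "target_2": [0, 3],     # best prediction scores 3
    "target_3": [],         # nothing scored, counted as 0
}

result = avg_per_target_max(toy_scores)
print(round(result, 3))
```

Under this reading, the metric rewards a model for producing at least one well-supported suggestion per target, regardless of how its other suggestions score.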


This review was created by AI and reviewed by human editors.