QUICK REVIEW

[Paper Review] TSSR: Two-Stage Swap-Reward-Driven Reinforcement Learning for Character-Level SMILES Generation

Jacob Ede Levine, Yun Lyan Luo|arXiv (Cornell University)|Jan 8, 2026

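Machine Learning in Materials Science | Cited by 0

One-Sentence Summary

This paper proposes TSSR, a two-stage reinforcement learning framework for character-level SMILES generation: stage one rewards local token swaps that repair syntax errors, and stage two uses RDKit diagnostics to reward reductions in chemical errors. On the MOSES benchmark, it improves validity and novelty both when training from scratch and when fine-tuning.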

ABSTRACT

The design of reliable, valid, and diverse molecules is fundamental to modern drug discovery, as improved molecular generation supports efficient exploration of the chemical space for potential drug candidates and reduces the cost of early design efforts. Despite these needs, current chemical language models that generate molecules as SMILES strings are vulnerable to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluated TSSR on the MOSES benchmark using a GRU policy trained with PPO in both pure RL (P-RL) from random initialization and fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit detected errors. TSSR converts a sparse terminal objective into a denser and more interpretable reward, improving both syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.

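Motivation and Objectives

Generate reliable, valid, and diverse molecules de novo as SMILES strings
Provide dense, interpretable feedback to guide character-level SMILES generation
Develop a model- and dataset-agnostic RL framework that can be applied from scratch or via fine-tuning
Demonstrate improvements in syntactic and chemical validity and in novelty on the MOSES benchmark
Show compatibility with standard RL methods, without hand-crafted grammar rules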

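Proposed Method

Proposes a two-stage reward: stage one rewards local token swaps that repair syntax so that the SMILES string becomes parseable
Stage two, applied after syntax repair, rewards reductions in the chemical problems detected by RDKit
Uses a model-agnostic reward decomposition into swap efficiency, error reduction, and distance to validity
Trains a GRU-based chemical language model with PPO in two modes: P-RL (random initialization) and F-RL (pretrained model)
Operates on MOSES data, with token priors drawn from global token frequencies and a standard SMILES vocabulary
Provides token-level analysis and reports swap counts, repair rates, and chemical error reductions to explain the learning dynamics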
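
The reward structure listed above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: this summary does not give the paper's actual reward formulas, weights, or swap procedure. A "swap" is taken here to mean substituting one character with another from a small vocabulary, a toy bracket and ring-digit counter stands in for the RDKit parser, and the names `count_syntax_errors`, `greedy_repair`, and `shaped_reward` are hypothetical.

```python
from typing import Dict, Tuple

# Assumed toy vocabulary and edit budget; neither comes from the paper.
VOCAB = ["C", "N", "O", "(", ")", "1", "="]
MAX_SWAPS = 3


def count_syntax_errors(smiles: str) -> int:
    """Toy stand-in for a SMILES parser (not RDKit): counts unbalanced
    parentheses and unpaired single-digit ring-closure labels."""
    errors = 0
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                errors += 1  # closes a branch that was never opened
            else:
                depth -= 1
    errors += depth  # branches left open at the end
    for digit in "123456789":
        if smiles.count(digit) % 2 == 1:
            errors += 1  # ring bond opened but never closed
    return errors


def greedy_repair(smiles: str, max_swaps: int = MAX_SWAPS) -> Tuple[str, int]:
    """Stage-one sketch: repeatedly apply the single-character substitution
    that lowers the toy error count the most, within the swap budget."""
    current, swaps = smiles, 0
    while swaps < max_swaps and count_syntax_errors(current) > 0:
        best, best_err = None, count_syntax_errors(current)
        for i in range(len(current)):
            for tok in VOCAB:
                if tok == current[i]:
                    continue
                cand = current[:i] + tok + current[i + 1:]
                err = count_syntax_errors(cand)
                if err < best_err:
                    best, best_err = cand, err
        if best is None:
            break  # no single substitution reduces the error count
        current, swaps = best, swaps + 1
    return current, swaps


def shaped_reward(before: str, after: str, swaps: int,
                  max_swaps: int = MAX_SWAPS) -> Dict[str, float]:
    """Three illustrative terms with equal weights (an assumption)."""
    err_before = count_syntax_errors(before)
    err_after = count_syntax_errors(after)
    terms = {
        # fraction of the original errors that were removed
        "error_reduction": (err_before - err_after) / max(err_before, 1),
        # unused share of the swap budget, credited only on full repair
        "swap_efficiency": (max_swaps - swaps) / max_swaps if err_after == 0 else 0.0,
        # equals 1.0 when no errors remain and decays as errors remain
        "distance_to_validity": 1.0 / (1.0 + err_after),
    }
    terms["total"] = sum(terms.values())
    return terms


repaired, n_swaps = greedy_repair("C1CCCC(")
reward = shaped_reward("C1CCCC(", repaired, n_swaps)
```

Even on a string that is not fully repaired, the three terms return graded, non-binary values, which is the sense in which such a reward is denser than a single valid/invalid signal at the end of generation. In a real pipeline the toy counter would presumably be replaced by an RDKit parse for stage one and by RDKit diagnostics (for example `Chem.DetectChemistryProblems`) for stage two; how the paper wires these in is not stated here.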

Figure 1: Example of a Two-Stage, Swap-Reward-driven (TSSR) reinforcement learning (RL) framework for character-level SMILES generation.

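Experimental Results

Research Questions

RQ1: Does a two-stage swap-reward RL framework improve the syntactic validity of character-level SMILES generation?
RQ2: Does the chemistry-aware feedback of stage two reduce RDKit-detected errors after stage-one repair?
RQ3: Do validity and novelty improve after TSSR optimization, both for models trained from scratch and for pretrained models?
RQ4: How does TSSR affect the drug-likeness, synthesizability, diversity, and scaffold diversity of the generated molecules?
RQ5: Is the method dataset- and model-agnostic, and compatible with standard RL pipelines such as PPO?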

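Key Findings

Under P-RL, TSSR substantially improves syntactic validity and also raises chemical validity and novelty relative to the untrained baseline
In P-RL, syntactic validity rises from 6.14% to 35.03% and chemical validity from 4.77% to 9.61%, accompanied by a marked gain in novelty
In F-RL, the average validity gain is smaller (about 0.83%) while novelty stays high (about 99.6%); overall chemical validity rises to 19.20%
Stage-one swaps and stage-two fixes work together: syntax repair makes the subsequent chemical corrections possible and reduces RDKit-detected errors
TSSR provides a denser, more interpretable reward signal, improving both syntactic and chemical quality without sacrificing diversity
P-RL shows higher peak reward and learning efficiency, while F-RL benefits from the pretrained prior, with higher throughput but slightly smaller validity gains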
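
To make the reported percentages concrete, the sketch below shows one common way to compute validity, uniqueness, and novelty over a batch of generated strings. It is an illustration under stated assumptions, not the paper's evaluation code: `generation_metrics` and `toy_is_valid` are hypothetical names, the validity check is a toy parenthesis test rather than an RDKit parse, and strings are compared verbatim, whereas a real benchmark would typically compare canonical SMILES.

```python
from typing import Callable, Dict, Iterable, Set


def generation_metrics(samples: Iterable[str],
                       training_set: Set[str],
                       is_valid: Callable[[str], bool]) -> Dict[str, float]:
    """Validity over all samples, uniqueness over valid samples,
    novelty over unique valid samples absent from the training set."""
    samples = list(samples)
    valid = [s for s in samples if is_valid(s)]
    unique_valid = set(valid)
    novel = unique_valid - set(training_set)
    return {
        "validity": len(valid) / len(samples) if samples else 0.0,
        "uniqueness": len(unique_valid) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique_valid) if unique_valid else 0.0,
    }


def toy_is_valid(smiles: str) -> bool:
    """Toy validity check (assumption): non-empty with matching
    counts of opening and closing parentheses."""
    return smiles != "" and smiles.count("(") == smiles.count(")")


samples = ["CCO", "CCO", "C(C", "CCN", "c1ccccc1"]
training = {"CCO"}
metrics = generation_metrics(samples, training, toy_is_valid)
```

If the percentages above are taken over the 10,000 SMILES per run mentioned in the abstract, a syntactic validity of 6.14% would correspond to 614 parseable strings and 35.03% to 3,503.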

Figure 2: Examples of TSSR Stage Two fixes: invalid SMILES to chemically valid, with up to 3 fixes each.


This review was generated by AI and reviewed by a human editor.