QUICK REVIEW

[Paper Review] TSSR: Two-Stage Swap-Reward-Driven Reinforcement Learning for Character-Level SMILES Generation

Jacob Ede Levine, Yun Lyan Luo | arXiv (Cornell University) | 2026. 01. 08.

Machine Learning in Materials Science | Citations: 0

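One-Line Summary

The paper introduces TSSR, a two-stage reinforcement learning framework for character-level SMILES generation: Stage One repairs syntax errors through local token swaps, and Stage Two uses RDKit diagnostics to improve chemical validity. On MOSES, it improves validity and novelty in both regimes, training from scratch and fine-tuning.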

ABSTRACT

The design of reliable, valid, and diverse molecules is fundamental to modern drug discovery, as improved molecular generation supports efficient exploration of the chemical space for potential drug candidates and reduces the cost of early design efforts. Despite these needs, current chemical language models that generate molecules as SMILES strings are vulnerable to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluated TSSR on the MOSES benchmark using a GRU policy trained with PPO in both pure RL (P-RL) from random initialization and fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit detected errors. TSSR converts a sparse terminal objective into a denser and more interpretable reward, improving both syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.

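Research Motivation and Goals

Promote the generation of reliable, valid, and diverse drug-candidate molecules as SMILES strings
Provide dense, interpretable feedback to guide token-level SMILES generation
Develop a model- and dataset-agnostic RL framework that can be applied either from scratch or through fine-tuning
Demonstrate improvements in syntactic validity, chemical validity, and novelty on the MOSES benchmark
Show compatibility with standard RL methods without hand-crafted grammars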

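Proposed Method

Stage One rewards local token swaps that repair syntax, moving invalid strings toward parseable SMILES
Stage Two, applied after syntax repair, rewards reductions in the chemical issues detected by RDKit
The reward is model agnostic and decomposes into terms that include swap efficiency, error reduction, and distance to validity
A GRU-based chemical language model trained with PPO is used in two regimes: P-RL (random initialization) and F-RL (pretrained model)
The method operates on MOSES data, using token priorities derived from global token frequencies and the standard SMILES vocabulary
Token-level analysis is provided, and swap counts, fix rates, and chemical error reductions are reported to help interpret the training dynamics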
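The Stage One idea above can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_syntax_ok` is a stand-in check (balanced parentheses and paired ring-closure digits) rather than a real SMILES parser, and `repair_by_swaps`, `stage_one_reward`, the `window` size, and the reward formula are all hypothetical choices made for illustration. In practice the validity check would be replaced by an actual parser such as RDKit.

```python
from typing import Callable, Optional, Tuple


def toy_syntax_ok(s: str) -> bool:
    """Stand-in syntax check, NOT a SMILES parser (assumption for this sketch).

    Accepts a string if parentheses are balanced, no branch is empty,
    and every ring-closure digit occurs an even number of times.
    """
    if not s or s[0] in "()" or "()" in s:
        return False
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and all(s.count(d) % 2 == 0 for d in "123456789")


def repair_by_swaps(
    s: str,
    is_valid: Callable[[str], bool],
    max_swaps: int = 2,
    window: int = 3,
) -> Tuple[Optional[str], int]:
    """Breadth-first search over local swaps (positions at most `window` apart).

    Returns (repaired_string, swaps_used), or (None, max_swaps) if no repair
    is found within the swap budget.
    """
    if is_valid(s):
        return s, 0
    frontier, seen = [s], {s}
    for n_swaps in range(1, max_swaps + 1):
        next_frontier = []
        for cand in frontier:
            chars = list(cand)
            for i in range(len(chars)):
                for j in range(i + 1, min(i + window + 1, len(chars))):
                    if chars[i] == chars[j]:
                        continue  # swapping identical tokens changes nothing
                    chars[i], chars[j] = chars[j], chars[i]
                    new = "".join(chars)
                    chars[i], chars[j] = chars[j], chars[i]  # undo the swap
                    if new in seen:
                        continue
                    if is_valid(new):
                        return new, n_swaps
                    seen.add(new)
                    next_frontier.append(new)
        frontier = next_frontier
    return None, max_swaps


def stage_one_reward(
    s: str, is_valid: Callable[[str], bool], max_swaps: int = 2
) -> float:
    """Hypothetical swap-efficiency reward: fewer swaps to validity scores higher."""
    repaired, n_swaps = repair_by_swaps(s, is_valid, max_swaps=max_swaps)
    if repaired is None:
        return 0.0
    return 1.0 - n_swaps / (max_swaps + 1)


if __name__ == "__main__":
    print(repair_by_swaps("CC)C(O", toy_syntax_ok))
    print(stage_one_reward("CC)C(O", toy_syntax_ok))
```

The point of the sketch is that a string one swap away from parseable earns partial credit instead of the zero it would get from a purely terminal validity reward, which is the sense in which the signal becomes denser.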

Figure 1: Example of a Two-Stage, Swap-Reward-driven (TSSR) reinforcement learning (RL) framework for character-level SMILES generation.

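Experimental Results

Research Questions

RQ1: Can a two-stage swap-reward RL framework improve the syntactic validity of character-level SMILES?
RQ2: Does the chemistry-aware feedback in Stage Two reduce RDKit-detected errors after Stage One repair?
RQ3: Does optimizing with TSSR improve validity and novelty for both models trained from scratch and pretrained models?
RQ4: How does TSSR affect the drug-likeness, synthesizability, diversity, and scaffold diversity of the generated molecules?
RQ5: Is the approach dataset- and model-agnostic, and compatible with standard RL pipelines such as PPO?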

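Key Results

In P-RL, TSSR substantially improves syntactic validity and raises chemical validity and novelty relative to the untrained baseline
In P-RL, syntactic validity rose from 6.14% to 35.03% and chemical validity from 4.77% to 9.61%, with a notable gain in novelty
In F-RL, validity rose modestly (an average gain of 0.83%), novelty stayed high (~99.6%), and overall chemical validity rose to 19.20%
Stage One swaps and Stage Two fixes work together: syntax repair makes chemistry corrections possible, which reduces RDKit-detected errors
TSSR provides a denser, more interpretable reward signal, improving both syntactic and chemical quality without harming diversity
P-RL reaches a higher peak reward and better learning efficiency; F-RL, aided by pretrained priors, achieves higher throughput but slightly smaller validity gains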
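To make the validity and novelty figures above concrete, here is a minimal sketch of how such metrics are commonly computed. It assumes a MOSES-style definition (novelty as the fraction of unique valid samples absent from the training set); the paper's exact definitions may differ. The function name `validity_and_novelty` and the pluggable `is_valid` predicate are hypothetical, and plain string comparison is used here where a real pipeline would compare canonicalized SMILES.

```python
from typing import Callable, Iterable, Tuple


def validity_and_novelty(
    generated: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Tuple[float, float]:
    """Return (validity, novelty) for a batch of generated strings.

    validity: valid samples / all samples
    novelty:  unique valid samples not seen in training / unique valid samples
    (Assumed definitions; strings are compared verbatim, not canonicalized.)
    """
    samples = list(generated)
    valid = [s for s in samples if is_valid(s)]
    unique_valid = set(valid)
    validity = len(valid) / len(samples) if samples else 0.0
    novel = unique_valid - set(training_set)
    novelty = len(novel) / len(unique_valid) if unique_valid else 0.0
    return validity, novelty


if __name__ == "__main__":
    # Toy predicate for demonstration only: balanced parentheses.
    balanced = lambda s: s.count("(") == s.count(")")
    print(validity_and_novelty(["CCO", "CC(", "CCN", "CCO"], {"CCO"}, balanced))
```

As a sense of scale, if the reported P-RL percentages were measured on a single 10,000-sample run, 6.14% and 35.03% syntactic validity would correspond to 614 and 3,503 parseable strings respectively.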

Figure 2: Examples of TSSR Stage Two fixes: invalid SMILES converted to chemically valid SMILES with up to 3 fixes each

This review was generated by AI and reviewed by a human editor.