QUICK REVIEW

[Paper Review] RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction

Hanbum Ko, Chanhui Lee | arXiv (Cornell University) | 2026-03-13

Machine Learning in Materials Science | Citations: 0

One-Line Summary

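RetroReasoner is a reasoning LLM whose rationales follow chemist-style bond-disconnection strategies. It is trained on synthetic rationale data and then reinforced with a round-trip reward to improve the feasibility and diversity of the predicted reactants.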

ABSTRACT

Retrosynthesis prediction is a core task in organic synthesis that aims to predict reactants for a given product molecule. Traditionally, chemists select a plausible bond disconnection and derive corresponding reactants, which is time-consuming and requires substantial expertise. While recent advancements in molecular large language models (LLMs) have made progress, many methods either predict reactants without strategic reasoning or conduct only a generic product analysis, rather than reason explicitly about bond-disconnection strategies that logically lead to the choice of specific reactants. To overcome these limitations, we propose RetroReasoner, a retrosynthetic reasoning model that leverages chemists' strategic thinking. RetroReasoner is trained using both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we introduce SyntheticRetro, a framework that generates structured disconnection rationales alongside reactant predictions. In the case of RL, we apply a round-trip accuracy as reward, where predicted reactants are passed through a forward synthesis model, and predictions are rewarded when the forward-predicted product matches the original input product. Experimental results show that RetroReasoner not only outperforms prior baselines but also generates a broader range of feasible reactant proposals, particularly in handling more challenging reaction instances.

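Motivation and Objectives

Ground retrosynthesis prediction in explicit strategic reasoning that mirrors chemists' bond-disconnection strategies.
Develop a data-generation framework (SyntheticRetro) that produces structured reasoning alongside reactant predictions.
Train RetroReasoner by supervised fine-tuning (SFT) on SyntheticRetro data, then refine it with reinforcement learning (RL) using a round-trip reward.
Demonstrate improved accuracy and a broader set of feasible reactant proposals, especially on difficult and rare reaction types.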

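Proposed Method

SyntheticRetro converts chemists' strategies into reasoning data by generating structured reasoning steps (R1–R4) together with linking text that connects them.
RetroReasoner is initialized from Qwen3-8B and trained in two stages: SFT on SyntheticRetro-derived targets, followed by RL with a round-trip accuracy reward.
RL uses GRPO (Group Relative Policy Optimization) with a forward-synthesis verifier, rewarding reactant sets that reproduce the original product.
The forward-model verifier predicts the product from the proposed reactants, and that prediction is used to compute the round-trip reward for policy updates.
Evaluation covers both greedy-decoding and sampling-based metrics, emphasizing the feasibility and diversity of the proposed reaction routes.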
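
The round-trip reward and the group-relative step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the binary reward, the `normalize` placeholder (a real pipeline would canonicalize SMILES with a cheminformatics toolkit), and the lookup-table `toy_forward_model` are all assumptions made for the example. Only the standardization of rewards within a sampled group reflects how GRPO is generally defined.

```python
from statistics import mean, pstdev
from typing import Callable, List


def normalize(smiles: str) -> str:
    """Placeholder canonicalizer (assumption): order-insensitive fragment list.

    It only strips whitespace and sorts dot-separated fragments; it does NOT
    canonicalize individual molecules the way a chemistry toolkit would.
    """
    return ".".join(sorted(part.strip() for part in smiles.split(".")))


def round_trip_reward(product: str, reactants: str,
                      forward_model: Callable[[str], str]) -> float:
    """Return 1.0 if the forward model maps the reactants back to the product."""
    predicted = forward_model(reactants)
    return 1.0 if normalize(predicted) == normalize(product) else 0.0


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical lookup table standing in for a trained forward-synthesis model.
TOY_FORWARD = {
    normalize("CC(=O)O.OCC"): "CC(=O)OCC",   # acetic acid + ethanol
    normalize("CC(=O)Cl.OCC"): "CC(=O)OCC",  # acetyl chloride + ethanol
}


def toy_forward_model(reactants: str) -> str:
    """Look up the product; unknown reactant sets yield an empty string."""
    return TOY_FORWARD.get(normalize(reactants), "")


product = "CC(=O)OCC"  # ethyl acetate
candidates = ["OCC.CC(=O)O", "CC(=O)Cl.OCC", "CCO.CC", "CC(=O)O.OC"]
rewards = [round_trip_reward(product, c, toy_forward_model) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Note that two different reactant sets earn the reward here, which is the property that lets a round-trip signal credit feasible answers other than the single reference in the dataset.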

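Experimental Results

Research Questions

RQ1: Can explicit, chemist-like strategic reasoning improve retrosynthesis prediction over prediction-only LLMs?
RQ2: Does augmenting training with SyntheticRetro reasoning data and applying round-trip RL yield a broader set of feasible reactant proposals?
RQ3: How do RetroReasoner's accuracy and diversity hold up on rare reaction templates and rare atom/token instances?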

Key Results

Model	Exact@1	Round-trip@1	Exact@100	Round-trip@100	Feasible ratio	Template diversity
Prediction-Only (SFT)	0.482	0.784	0.678	0.950	0.774	2.562
Prediction-Only (RL)	0.486	0.802	0.662	0.936	0.785	2.324
RetroReasoner (SFT)	0.512	0.812	0.734	0.944	0.765	3.898
RetroReasoner (RL)	0.526	0.826	0.724	0.952	0.786	3.186
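
To make the trade-offs in the table easier to read, the short script below computes per-column differences between pairs of rows. The numbers are copied from the table; the column keys are English shorthand chosen for this example, and the `delta` helper is hypothetical.

```python
# Values copied from the results table above.
COLUMNS = ["exact@1", "round_trip@1", "exact@100", "round_trip@100",
           "feasible_ratio", "template_diversity"]

RESULTS = {
    "Prediction-Only (SFT)": [0.482, 0.784, 0.678, 0.950, 0.774, 2.562],
    "Prediction-Only (RL)":  [0.486, 0.802, 0.662, 0.936, 0.785, 2.324],
    "RetroReasoner (SFT)":   [0.512, 0.812, 0.734, 0.944, 0.765, 3.898],
    "RetroReasoner (RL)":    [0.526, 0.826, 0.724, 0.952, 0.786, 3.186],
}


def delta(before: str, after: str) -> dict:
    """Per-column difference (after minus before), rounded to 3 decimals."""
    return {
        col: round(a - b, 3)
        for col, b, a in zip(COLUMNS, RESULTS[before], RESULTS[after])
    }


# Effect of adding RL on top of SFT for RetroReasoner.
rl_effect = delta("RetroReasoner (SFT)", "RetroReasoner (RL)")
# Effect of reasoning versus prediction-only, both after SFT.
reasoning_effect = delta("Prediction-Only (SFT)", "RetroReasoner (SFT)")

for name, diff in [("RL vs. SFT (RetroReasoner)", rl_effect),
                   ("Reasoning vs. prediction-only (SFT)", reasoning_effect)]:
    print(name)
    for col, value in diff.items():
        print(f"  {col:>20}: {value:+.3f}")
```

In these numbers, RL raises both top-1 metrics and the feasible ratio for RetroReasoner while Exact@100 and template diversity fall; reasoning alone gives the largest gains in Exact@100 and template diversity.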

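RetroReasoner outperforms the prediction-only baselines on exact-match and round-trip metrics (the one exception in the table is Round-trip@100 after SFT, 0.944 vs. 0.950), with the clearest gains in Exact@100 (0.734 vs. 0.678 after SFT) and template diversity (3.898 vs. 2.562).
RL after SFT raises accuracy (Exact@1: 0.512 to 0.526) and yields a broader feasible reactant space (feasible ratio: 0.765 to 0.786), but it reduces reasoning diversity (template diversity: 3.898 to 3.186) as the policy concentrates on feasible regions.
RetroReasoner remains robust on hard subsets, including rare templates and rare atom/token instances.
Introducing linking text between the structured reasoning steps substantially improves both exact match and diversity.
The round-trip reward expands the feasible reactant space, but maintaining high exact-match metrics requires the round-trip framework.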
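
The exact-match and round-trip metrics referenced above can be illustrated with a small sketch. The review does not define them, so the reading used here is an assumption: a test case counts as a hit at k if any of the first k candidates matches the reference reactants (exact) or is mapped back to the product by a forward model (round-trip). The `canon` placeholder, the toy forward table, and the two test cases are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def canon(smiles: str) -> str:
    """Placeholder canonicalizer (assumption): sorts dot-separated fragments."""
    return ".".join(sorted(part.strip() for part in smiles.split(".")))


def exact_at_k(candidates: Sequence[str], reference: str, k: int) -> bool:
    """True if any of the first k candidates equals the reference reactants."""
    ref = canon(reference)
    return any(canon(c) == ref for c in candidates[:k])


def round_trip_at_k(candidates: Sequence[str], product: str,
                    forward_model: Callable[[str], str], k: int) -> bool:
    """True if any of the first k candidates is mapped back to the product."""
    target = canon(product)
    return any(canon(forward_model(c)) == target for c in candidates[:k])


# Hypothetical forward model: a lookup table, not a trained predictor.
FORWARD: Dict[str, str] = {
    canon("CC(=O)O.OCC"): "CC(=O)OCC",
    canon("CC(=O)Cl.OCC"): "CC(=O)OCC",
    canon("CC(=O)O.CN"): "CC(=O)NC",
}


def forward_model(reactants: str) -> str:
    return FORWARD.get(canon(reactants), "")


# Each case: product, reference reactants, ranked model candidates.
CASES: List[dict] = [
    {"product": "CC(=O)OCC", "reference": "CC(=O)O.OCC",
     "candidates": ["CC(=O)Cl.OCC", "OCC.CC(=O)O"]},
    {"product": "CC(=O)NC", "reference": "CC(=O)O.CN",
     "candidates": ["CN.CC", "CC(=O)O.CN"]},
]


def score(k: int) -> Dict[str, float]:
    """Average both hit rates over all cases at cutoff k."""
    exact = [exact_at_k(c["candidates"], c["reference"], k) for c in CASES]
    trip = [round_trip_at_k(c["candidates"], c["product"], forward_model, k)
            for c in CASES]
    return {"exact": sum(exact) / len(CASES),
            "round_trip": sum(trip) / len(CASES)}


scores_at_1 = score(1)
scores_at_2 = score(2)
```

In the first toy case the top candidate is a feasible alternative that differs from the reference, so it counts for round-trip but not for exact match. Under this reading, that is one way round-trip scores can sit above exact-match scores, as they do in the table.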

This review was generated by AI and reviewed by a human editor.