QUICK REVIEW

[Paper Review] From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation

Qianwei Yang, Dong Xu | arXiv (Cornell University) | January 29, 2026

Computational Drug Discovery Methods | Citations: 0

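One-Line Summary

SoftMol presents SoftBD, a block-diffusion molecular language model that pairs a soft-fragment SMILES representation with a gated MCTS for target-aware generation, reaching state-of-the-art de novo and target-specific results with 100% validity and faster sampling.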

ABSTRACT

Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLM) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol

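Motivation and Objectives

Motivates moving beyond autoregressive token prediction so that molecular generation better captures graph structure.
Co-designs the molecular representation, model architecture, and search strategy for target-aware design.
Introduces a diffusion-based block modeling approach that operates on fixed-length blocks to ensure chemical validity while staying within a bounded molecular space.
Demonstrates state-of-the-art performance in de novo generation and protein-target design under pharmacological constraints.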

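Proposed Method

Defines soft fragments by splitting each SMILES string into contiguous fixed-length blocks, with no heuristic fragmentation rules.
Implements SoftBD, a block-diffusion transformer that models local chemical substructures using bidirectional attention within each block and causal attention across blocks.
Trains SoftBD on ZINC-Curated to promote drug-likeness and synthetic accessibility.
Generates blocks semi-autoregressively with adaptive confidence decoding, which combines First-Hitting sampling with greedy, confidence-ordered token unmasking.
Integrates a gated Monte Carlo tree search (MCTS) with a tunable feasibility gate to assemble fragments toward a target protein.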
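
The first two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the character-level tokenization, the `[PAD]` token, the block size of 4, and the function names `soft_fragments` and `block_attention_mask` are all assumptions made here for clarity. It shows how a rule-free split into fixed-length blocks works and how an attention mask can be bidirectional inside a block while remaining causal across blocks.

```python
from typing import List

PAD = "[PAD]"  # hypothetical padding token, not taken from the paper


def soft_fragments(tokens: List[str], block_size: int) -> List[List[str]]:
    """Cut a token sequence into contiguous fixed-length blocks.

    No chemistry rules are applied: the cut points depend only on position.
    The last block is padded so that every block has the same length.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        blocks.append(block + [PAD] * (block_size - len(block)))
    return blocks


def block_attention_mask(num_blocks: int, block_size: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Positions in the same block see each other (bidirectional), and every
    position sees all earlier blocks but no later block (causal across blocks).
    """
    n = num_blocks * block_size
    return [
        [(j // block_size) <= (i // block_size) for j in range(n)]
        for i in range(n)
    ]


# Character-level tokens are used for illustration only; a real SMILES
# tokenizer would keep multi-character atoms such as "Cl" together.
tokens = list("CC(=O)Oc1ccccc1")
blocks = soft_fragments(tokens, block_size=4)
mask = block_attention_mask(len(blocks), block_size=4)

print(blocks[0])    # first block of four tokens
print(mask[0][:8])  # position 0 sees its own block but not the next one
```

A block cut this way can end in the middle of a ring or branch, which is why the fragments are "soft": validity has to come from the model and the decoding procedure rather than from the split itself.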

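Experimental Results

Research Questions

RQ1: How does the block-diffusion representation affect chemical validity and model robustness compared with token-based MLMs?
RQ2: Can a target-aware generation pipeline that combines diffusion-based modeling with constrained search improve binding affinity and drug-likeness?
RQ3: What is the trade-off between representation granularity (soft-fragment length) and generation/inference efficiency?
RQ4: Does combining a feasibility gate with MCTS improve hit ratio and diversity in target-specific molecular design?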

Key Results

Method	Validity (%)	Uniqueness (%)	Quality (%)	Docking-Filter (%)	Diversity
SAFE-GPT (Noutahi et al., 2024)	93.2±0.1	100.0±0.0	54.4±0.6	78.3±0.5	0.879±0.000
GenMol (Lee et al., 2025)	99.9±0.1	96.0±0.3	85.2±0.4	97.8±0.1	0.817±0.000
SoftBD (p=1.0, τ=0.9)	99.8±0.0	100.0±0.0	87.1±0.2	98.5±0.1	0.871±0.000
SoftBD (p=1.0, τ=1.0)	99.6±0.0	100.0±0.0	84.7±0.2	97.8±0.1	0.878±0.000
SoftBD (p=1.0, τ=1.1)	99.1±0.0	100.0±0.0	81.7±0.3	96.5±0.1	0.883±0.000
SoftBD (p=1.0, τ=1.2)	98.3±0.0	100.0±0.0	77.7±0.3	94.2±0.2	0.888±0.000
SoftBD (p=1.0, τ=1.3)	96.7±0.1	100.0±0.0	72.9±0.3	91.1±0.2	0.893±0.000
SoftBD (p=0.95, τ=0.9)	100.0±0.0	98.4±0.1	93.5±0.2	99.8±0.0	0.844±0.000
SoftBD (p=0.95, τ=1.0)	100.0±0.0	99.4±0.1	92.8±0.0	99.7±0.0	0.851±0.000
SoftBD (p=0.95, τ=1.1)	100.0±0.0	99.6±0.1	91.9±0.1	99.6±0.0	0.858±0.000
SoftBD (p=0.95, τ=1.2)	99.9±0.0	99.8±0.0	90.8±0.1	99.3±0.1	0.867±0.000
SoftBD (p=0.95, τ=1.3)	99.9±0.0	99.8±0.1	88.9±0.2	98.9±0.1	0.871±0.000
SoftBD (p=0.9, τ=0.9)	100.0±0.0	90.0±0.2	94.9±0.2	99.9±0.0	0.829±0.000
SoftBD (p=0.9, τ=1.0)	100.0±0.0	96.0±0.1	94.0±0.2	99.8±0.0	0.839±0.000
SoftBD (p=0.9, τ=1.1)	100.0±0.0	98.0±0.1	93.3±0.3	99.8±0.0	0.846±0.000
SoftBD (p=0.9, τ=1.2)	100.0±0.0	99.1±0.1	92.4±0.2	99.7±0.1	0.852±0.000
SoftBD (p=0.9, τ=1.3)	100.0±0.0	99.3±0.1	91.7±0.2	99.6±0.0	0.858±0.000
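
The table varies two sampling parameters, p and τ, which this review does not define. Assuming they are the nucleus (top-p) threshold and the softmax temperature, as is common in language-model sampling, the sketch below shows how the two interact. The function name `nucleus_distribution` and the toy logits are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict


def nucleus_distribution(
    logits: Dict[str, float], p: float = 0.95, tau: float = 1.0
) -> Dict[str, float]:
    """Temperature-scale the logits, keep the smallest set of top tokens
    whose cumulative probability reaches p, and renormalise over that set."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if tau <= 0.0:
        raise ValueError("tau must be positive")

    # Softmax with temperature; subtracting the max keeps exp() stable.
    top = max(logits.values())
    weights = {tok: math.exp((v - top) / tau) for tok, v in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    kept: Dict[str, float] = {}
    mass = 0.0
    for prob, tok in ranked:
        kept[tok] = prob
        mass += prob
        if mass >= p - 1e-12:  # small tolerance for floating-point sums
            break
    return {tok: prob / mass for tok, prob in kept.items()}


# Toy next-token logits (assumed values, for illustration only).
toy_logits = {"C": 2.0, "N": 1.0, "O": 0.0, "F": -3.0}
sharp = nucleus_distribution(toy_logits, p=0.9, tau=1.0)
flat = nucleus_distribution(toy_logits, p=0.9, tau=1.3)
print(sorted(sharp), sorted(flat))
```

Under this reading, a higher τ flattens the distribution so more tokens survive the top-p cut, which is consistent with the trend in the table: diversity rises with τ while validity and quality fall, and lowering p from 1.0 to 0.9 trades uniqueness for validity.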

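SoftBD reaches 100% chemical validity in 8 of the 15 sampling configurations in the table above: every p=0.9 setting, and p=0.95 with τ ≤ 1.1.
SoftMol improves binding affinity by 9.7% over the baselines in de novo and target-aware settings.
Diversity increases 2-3x compared with the leading baselines.
When sampling 10k molecules, inference is about 6.6x faster than GenMol (discrete diffusion).
SoftMol maintains high uniqueness in target-specific tasks, retaining nearly 3,000 unique candidates per 3,000 attempts.
Combining the high-quality ZINC-Curated training set with block-diffusion modeling yields state-of-the-art performance in de novo and target-specific molecular design.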

This review was generated by AI and reviewed by a human editor.