QUICK REVIEW

[Paper Review] Learning Molecular Representation in a Cell

Gang Liu, Srijit Seal | PubMed | 2024-06-17

Computational Drug Discovery Methods | Citations: 7

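One-Line Summary

InfoAlign integrates molecular structures and cellular response data into a context graph and uses an information bottleneck with multiple decoders to align each molecule with the biological features of its graph neighbors, learning a minimal sufficient molecular representation that improves molecular property prediction and zero-shot molecule-morphology matching.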

ABSTRACT

Predicting drug efficacy and safety in vivo requires information on biological responses (e.g., cell morphology and gene expression) to small molecule perturbations. However, current molecular representation learning methods do not provide a comprehensive view of cell states under these perturbations and struggle to remove noise, hindering model generalization. We introduce the Information Alignment (InfoAlign) approach to learn molecular representations through the information bottleneck method in cells. We integrate molecules and cellular response data as nodes into a context graph, connecting them with weighted edges based on chemical, biological, and computational criteria. For each molecule in a training batch, InfoAlign optimizes the encoder's latent representation with a minimality objective to discard redundant structural information. A sufficiency objective decodes the representation to align with different feature spaces from the molecule's neighborhood in the context graph. We demonstrate that the proposed sufficiency objective for alignment is tighter than existing encoder-based contrastive methods. Empirically, we validate representations from InfoAlign in two downstream applications: molecular property prediction against up to 19 baseline methods across four datasets, plus zero-shot molecule-morphology matching.

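Motivation and Objectives

Combine cellular responses (cell morphology and gene expression) with molecular structure to support holistic molecular representation learning.
Develop a context-graph-based framework that links molecules to cellular perturbations and learns robust bottleneck representations.
Demonstrate the theoretical and empirical advantages of decoder-based alignment over encoder-based contrastive methods.
Evaluate InfoAlign on molecular property prediction and zero-shot molecule-morphology matching across multiple datasets.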

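Proposed Method

Build a cellular context graph whose nodes are molecules, cell morphology profiles, and gene expression profiles, connected by edges weighted according to chemical, biological, and computational criteria.
Use random walks on the context graph to identify the neighboring nodes of each training molecule $X$.
Train the encoder $f_\theta$ to produce a latent representation $Z$ from $X$ while multiple decoders $g_\phi$ reconstruct the features of the neighboring nodes along the walk path.
The optimization approximates the minimality objective $I(X;Z)$ and the sufficiency objective $\sum_{v \in P_X} I(Z; \psi(v))$ with variational bounds ($I_{\mathrm{DLB}}$ and $I_{\mathrm{EUB}}$), combined with a cross-entropy loss and KL regularization.
The authors argue that the decoder-based bound $I_{\mathrm{DLB}}$ is a tighter mutual information bound than the encoder-based InfoNCE bound ($I_{\mathrm{ELB}}$).
The encoder and decoders are fine-tuned on downstream tasks, including molecular property prediction and zero-shot molecule-morphology matching.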
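
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dense adjacency matrix, the linear decoders, the standard normal prior, the squared-error reconstruction term (used here in place of the loss the review mentions), and every function and parameter name (`random_walk`, `bottleneck_loss`, `beta`) are assumptions made only for illustration.

```python
import numpy as np


def random_walk(adj, start, length, rng):
    """Weighted random walk on a dense adjacency matrix.

    Returns the visited node indices (excluding the start node).
    The walk stops early if it reaches a node with no outgoing weight.
    """
    path, node = [], start
    for _ in range(length):
        weights = adj[node]
        total = weights.sum()
        if total == 0:
            break
        # Move to a neighbor with probability proportional to edge weight.
        node = int(rng.choice(len(weights), p=weights / total))
        path.append(node)
    return path


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))


def bottleneck_loss(mu, log_var, decoders, targets, beta, rng):
    """Reconstruction (sufficiency proxy) plus beta * KL (minimality proxy).

    decoders: list of weight matrices, one hypothetical linear decoder
              per neighbor on the walk path.
    targets:  list of feature vectors of those neighbors.
    """
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps  # reparameterized sample of Z
    recon = sum(float(np.mean((z @ w - y) ** 2)) for w, y in zip(decoders, targets))
    return recon + beta * kl_to_standard_normal(mu, log_var)


# Toy context graph: nodes 0-2 are connected, node 3 is isolated.
rng = np.random.default_rng(0)
adj = np.array([
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.5, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
walk = random_walk(adj, start=0, length=3, rng=rng)

# One hypothetical decoder and target feature vector per visited neighbor.
latent_dim, feature_dim = 4, 6
decoders = [rng.standard_normal((latent_dim, feature_dim)) for _ in walk]
targets = [rng.standard_normal(feature_dim) for _ in walk]
loss = bottleneck_loss(np.zeros(latent_dim), np.zeros(latent_dim),
                       decoders, targets, beta=0.1, rng=rng)
```

In this sketch `beta` plays the role of the prior strength: a larger value pushes $Z$ toward the prior (more compression), while a smaller value favors reconstructing neighbor features.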

Figure 1: Molecular Representation Learning via the Information Bottleneck: (a) Existing contrastive learning methods utilize two encoders—one for molecules and another for cell morphology or gene expression features, lacking a holistic view of molecular representation learning in cells. (b) In cont

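Experimental Results

Research Questions

RQ1: Does InfoAlign produce molecular representations that generalize across modalities by removing redundant information while preserving sufficient biological signal?
RQ2: Does a context-graph-based information bottleneck with multiple decoders outperform encoder-only contrastive methods on molecular property prediction and zero-shot cross-modal matching?
RQ3: How do hyperparameters such as the random-walk length and the prior strength affect the balance between minimality and sufficiency, and downstream performance?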

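Key Findings

InfoAlign outperforms up to 19 baselines on three classification datasets and one regression dataset for molecular property prediction.
Relative to the best baseline, InfoAlign improves by +10.58% on Broad6K classification and +6.33% on Biogen3K regression.
InfoAlign shows strong zero-shot molecule-morphology matching on two molecule-morphology datasets, outperforming CLOOME and InfoCORE in several settings.
Decoder-based alignment gives a tighter mutual information bound than the encoder-based InfoNCE bound, supporting the theoretical advantage of the proposed approach.
In the experiments, cell morphology and gene expression features complement molecular structure, and InfoAlign captures a bottleneck representation across modalities, improving generalization.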
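
Zero-shot molecule-morphology matching is typically scored as a retrieval problem. The sketch below shows one common way to do that, assuming each molecule has already been mapped to a predicted morphology vector and that row i of `profiles` is the true match for row i of `pred`. The function names and the cosine-similarity, top-k scoring are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def top_k_accuracy(pred, profiles, k):
    """Fraction of molecules whose true profile ranks among the k most similar."""
    sims = cosine_similarity_matrix(pred, profiles)
    # Indices of the k most similar candidate profiles for each molecule.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.arange(len(pred))[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example: three candidate morphology profiles and slightly
# perturbed predictions for the three matching molecules.
profiles = np.eye(3)
pred = profiles + 0.1
score = top_k_accuracy(pred, profiles, k=1)
```

Because no matching-specific training is involved in this scoring step, the same function can compare any two models that output vectors in the morphology feature space.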

Figure 2: Representation Learning Over Walk Paths in Context Graphs: (a) In Section 4.1, we construct the graph with various interaction, perturbation, and cosine similarities among molecules $X$, cell morphology profiles $C$, and gene expression profiles $E$. Given a training batch of molecules

This review was generated by AI and reviewed by a human editor.