QUICK REVIEW

[Paper Review] AutoProteinEngine: A Large Language Model Driven Agent Framework for Multimodal AutoML in Protein Engineering

Y. Liu, Zan Chen | arXiv (Cornell University) | 2024. 11. 07.

Software Engineering Research | Citations: 5

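One-Line Summary

AutoProteinEngine uses an LLM-driven agent framework to perform multimodal AutoML for protein engineering, and it outperforms zero-shot and manual fine-tuning baselines on classification and regression tasks.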

ABSTRACT

Protein engineering is important for biomedical applications, but conventional approaches are often inefficient and resource-intensive. While deep learning (DL) models have shown promise, their training or implementation into protein engineering remains challenging for biologists without specialized computational expertise. To address this gap, we propose AutoProteinEngine (AutoPE), an agent framework that leverages large language models (LLMs) for multimodal automated machine learning (AutoML) for protein engineering. AutoPE innovatively allows biologists without DL backgrounds to interact with DL models using natural language, lowering the entry barrier for protein engineering tasks. Our AutoPE uniquely integrates LLMs with AutoML to handle model selection for both protein sequence and graph modalities, automatic hyperparameter optimization, and automated data retrieval from protein databases. We evaluated AutoPE through two real-world protein engineering tasks, demonstrating substantial performance improvements compared to traditional zero-shot and manual fine-tuning approaches. By bridging the gap between DL and biologists' domain expertise, AutoPE empowers researchers to leverage DL without extensive programming knowledge. Our code is available at https://github.com/tsynbio/AutoPE.

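Motivation and Objectives

Lower the entry barrier to protein engineering by letting biologists without DL expertise use DL models through a natural-language interface.
Automate model selection for protein sequence and graph modalities, hyperparameter optimization, and data retrieval from public databases.
Demonstrate performance gains over zero-shot and manual fine-tuning baselines on real-world protein tasks.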

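Proposed Method

An LLM-driven AutoML pipeline that validates the task, plans data preprocessing, selects a model from a predefined repository (e.g., ESM, AlphaFold), and configures training.
Late fusion over multimodal (sequence and graph) protein data to exploit complementary information.
Automated hyperparameter optimization with the Tree-structured Parzen Estimator (TPE) and ASHA through Ray Tune, guided by natural-language interaction.
Automated data retrieval from UniProt and PDB driven by LLM-generated queries, with a fallback dialogue when data is missing.
Interactive natural-language feedback that turns metrics into user-friendly summaries to improve interpretability.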
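
The late-fusion step above can be illustrated with a minimal sketch. It assumes that late fusion means combining the class probabilities of a sequence model and a graph model by a weighted average, which is one common form; the source does not state the exact fusion rule AutoPE uses. The function names, the `seq_weight` parameter, and the example logits are all hypothetical.

```python
from math import exp


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(seq_logits, graph_logits, seq_weight=0.5):
    """Weighted average of per-modality class probabilities.

    Assumption: each modality has its own model that emits one logit per
    class; fusion happens only on the outputs, never on the features.
    """
    if len(seq_logits) != len(graph_logits):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= seq_weight <= 1.0:
        raise ValueError("seq_weight must lie in [0, 1]")
    p_seq = softmax(seq_logits)
    p_graph = softmax(graph_logits)
    return [
        seq_weight * a + (1.0 - seq_weight) * b
        for a, b in zip(p_seq, p_graph)
    ]


# Hypothetical logits for a two-class task (e.g. "not sweet" vs "sweet").
fused = late_fusion([2.0, 0.5], [0.2, 1.1], seq_weight=0.6)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

A weight of 0.5 treats both modalities equally; the weight could itself be treated as one more hyperparameter to tune.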

Figure 1: The overview of AutoProteinEngine (AutoPE) framework. It illustrates the end-to-end workflow of AutoPE, integrating LLM-driven AutoML for protein engineering tasks. The framework consists of three main components: (1) A user-friendly chat interface for the natural language task specificati

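Experimental Results

Research Questions

RQ1: Can an LLM-based agent framework perform effective multimodal AutoML for protein engineering tasks that involve sequence and graph data?
RQ2: Does LLM-guided automated hyperparameter optimization improve performance over zero-shot and manual fine-tuning baselines?
RQ3: How effectively does automated data retrieval from protein databases support training DL models for protein engineering?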

Key Results

Method	F1-score	SRCC	Accuracy
Zero-Shot	0.4764 ± 0.11	0.3769 ± 0.05	0.6917 ± 0.04
Manual Fine-Tuning	0.5709 ± 0.05	0.3098 ± 0.06	0.9137 ± 0.01
AutoPE (w/o HPO)	0.6396 ± 0.06	0.4405 ± 0.04	0.7988 ± 0.05
AutoPE (w/ HPO)	0.7306 ± 0.04	0.4621 ± 0.03	0.8908 ± 0.01
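
For readers who want to compute the three metrics in the table themselves, here is a small dependency-free sketch of binary F1, accuracy, and the Spearman rank correlation coefficient (SRCC). The labels and scores in the example are made up for illustration and are not the paper's data. How the paper aggregates runs into the ± figures is not covered here.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one input is constant
    return cov / sqrt(vx * vy)


# Illustrative values only.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
example_f1 = f1_binary(y_true, y_pred)
example_acc = accuracy(y_true, y_pred)
example_srcc = spearman([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0])
```

Accuracy and F1 can diverge when classes are imbalanced, so it is worth comparing the two columns side by side rather than relying on either alone.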

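AutoPE with automated HPO achieves the best trade-off between performance and robustness, outperforming both the zero-shot baseline and the variant without HPO.
On brazzein sweetness classification, AutoPE w/ HPO reaches F1 0.7306, SRCC 0.4621, and accuracy 0.8908, higher than the zero-shot baseline and the w/o-HPO variant.
On STM1221 enzyme activity regression, AutoPE w/ HPO reaches RMSE 0.3488, MAE 0.1999, and R² 0.6805, outperforming the zero-shot baseline and the w/o-HPO variant.
Manual fine-tuning can reach high accuracy (0.9137 versus 0.8908 for AutoPE w/ HPO on brazzein) but can be weaker in F1 and robustness; AutoPE with HPO offers a better balance and better generalization.
The framework supports multimodal data fusion and automated data retrieval, reducing the need for specialized DL knowledge.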
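
Since the results credit much of the gain to automated HPO, a rough sketch of the early-stopping idea behind ASHA may help. The code below implements plain synchronous successive halving with random sampling, a simpler relative of ASHA; it is not Ray Tune's API and it does not implement TPE. `toy_validation_loss`, the search ranges, and all parameter names are hypothetical stand-ins for real model training.

```python
import math
import random


def toy_validation_loss(config, epochs):
    """Hypothetical stand-in for 'train for `epochs` epochs, return val loss'."""
    lr_penalty = (math.log10(config["lr"]) + 3.0) ** 2  # lowest near lr = 1e-3
    dropout_penalty = (config["dropout"] - 0.1) ** 2    # lowest near 0.1
    return lr_penalty + dropout_penalty + 1.0 / epochs  # more epochs help


def sample_config(rng):
    """Draw one configuration: log-uniform lr, uniform dropout."""
    return {"lr": 10 ** rng.uniform(-5, -1), "dropout": rng.uniform(0.0, 0.5)}


def successive_halving(n_configs=16, min_epochs=1, eta=2, seed=0):
    """Keep the best 1/eta of the configs each round, with eta times the budget."""
    rng = random.Random(seed)
    survivors = [sample_config(rng) for _ in range(n_configs)]
    epochs = min_epochs
    history = []  # (epochs per config, number of configs evaluated)
    while True:
        scored = sorted(
            ((toy_validation_loss(c, epochs), c) for c in survivors),
            key=lambda pair: pair[0],
        )
        history.append((epochs, len(scored)))
        if len(scored) == 1:
            best_loss, best_config = scored[0]
            return best_config, best_loss, history
        keep = max(1, len(scored) // eta)
        survivors = [c for _, c in scored[:keep]]
        epochs *= eta


best_config, best_loss, history = successive_halving()
```

The design choice is to spend little budget on many configurations and progressively more on the few that look promising. ASHA applies the same promotion rule asynchronously, so workers do not wait for a whole round to finish.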

Figure 2: Case study between the AutoPE in a conversational interface with conventional code-based DL workflow for brazzein protein sweetness classification task. This figure demonstrates the end-to-end process and improved usability of AutoPE: (a) A biologist without DL background uploads protein m

This review was generated by AI and reviewed by a human editor.