QUICK REVIEW

[Paper Review] BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

Dionizije Fa, Marko Čuljak | arXiv (Cornell University) | Jan 29, 2026

Cancer Genomics and Diagnostics | Cited by 0

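One-Sentence Summary

BioAgent Bench provides a benchmark dataset and evaluation suite for measuring how well AI agents execute end-to-end bioinformatics pipelines, assessing their robustness under perturbations, and comparing open-weight and closed-weight models across multiple agent harnesses.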

ABSTRACT

This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed-source models can be unsuitable under strict privacy constraints; in such settings, open-weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.

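Motivation and Goals

Provide a benchmark dataset of end-to-end bioinformatics tasks suited to AI agents.
Compare frontier closed-source models with open-weight models in agent-driven workflows.
Evaluate the robustness of agent pipelines under controlled perturbations and data corruption.
Provide an evaluation framework that records agent transcripts, assesses pipeline progress, and scores outcomes.
Support privacy-conscious deployment by highlighting settings in which open-weight models are the better fit.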

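Proposed Method

Define end-to-end bioinformatics tasks spanning RNA-seq, variant calling, metagenomics, and others.
Form each evaluation unit from a task prompt plus the required input and reference data, with concrete output-format requirements (e.g., CSV).
Evaluate agents' step completion and final artifacts using agent harnesses (Claude Code, Codex CLI, OpenCode) and an LLM-based grader.
Incorporate perturbation tests (corrupted inputs, decoy files, prompt bloat) to assess robustness.
Use completion rate as the primary metric, and analyze planning quality and failure modes.
Report results as task-level and model-level heatmaps together with robustness statistics.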
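
As a rough illustration of the evaluation unit and the three perturbation types listed above, the sketch below models a task as a prompt plus input and reference files, then derives perturbed variants from it. Every name here (`EvalUnit`, `corrupt_input`, `add_decoy`, `bloat_prompt`, the `results.csv` artifact name) is hypothetical and not taken from the BioAgent Bench code; the released suite may construct its perturbations quite differently.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class EvalUnit:
    """Hypothetical evaluation unit: task prompt + input data + reference data."""
    prompt: str
    inputs: Dict[str, bytes] = field(default_factory=dict)
    references: Dict[str, bytes] = field(default_factory=dict)
    expected_artifact: str = "results.csv"  # assumed output file name


def corrupt_input(unit: EvalUnit, name: str, keep_fraction: float = 0.5) -> EvalUnit:
    """Simulate a corrupted input by truncating one input file."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    data = unit.inputs[name]
    cut = int(len(data) * keep_fraction)
    return replace(unit, inputs={**unit.inputs, name: data[:cut]})


def add_decoy(unit: EvalUnit, decoy_name: str, decoy_data: bytes) -> EvalUnit:
    """Add an irrelevant file next to the real inputs."""
    if decoy_name in unit.inputs:
        raise ValueError(f"{decoy_name!r} would overwrite a real input")
    return replace(unit, inputs={**unit.inputs, decoy_name: decoy_data})


def bloat_prompt(unit: EvalUnit, filler: str, repeats: int = 3) -> EvalUnit:
    """Append task-irrelevant text to the prompt, leaving the task itself intact."""
    padding = "\n".join([filler] * repeats)
    return replace(unit, prompt=f"{unit.prompt}\n\n{padding}")
```

Each helper returns a new unit rather than mutating the original, so a baseline run and its perturbed variants can be evaluated side by side.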

Figure 1: An overview of BioAgent Bench. Inputs to LLM agents consist of a task prompt, input data, and reference data. While solving the provided task, an agent can use general-purpose packages or specialized bioinformatics tools. After the agent finishes generation, LLM judge compares its outputs

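Experimental Results

Research Questions

RQ1: Can frontier closed-source models complete multi-step bioinformatics pipelines end to end with minimal scaffolding?
RQ2: How do open-weight models compare with closed-source models on bioinformatics tasks in terms of completion rate and robustness?
RQ3: How does planning quality relate to pipeline completion in agent-driven bioinformatics workflows?
RQ4: Which failure modes appear in agentic bioinformatics pipelines under corrupted inputs, decoy files, or prompt bloat?
RQ5: How does robustness to perturbations vary across tasks and harnesses?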

Key Findings

Task	Trials	Jaccard	Pearson
alzheimer-mouse	4	0.160	0.219
comparative-genomics	4	0.004	NA
cystic-fibrosis	3	1.000	NA
deseq	4	0.978	0.995
evolution	4	0.000	NA
metagenomics	4	0.395	0.746
single-cell	4	0.114	0.395
transcript-quant	4	1.000	1.000
viral-metagenomics	4	0.667	1.000
perturbation-overview	-	-	-
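
The table reports a Jaccard index and a Pearson correlation per task, but this summary does not say exactly what is being compared. One plausible reading, stated here purely as an assumption, is that each output is treated as a mapping from item identifiers (for example gene names) to numeric values: Jaccard measures the overlap of the identifier sets, and Pearson is computed over the values of the shared identifiers, yielding NA when it cannot be computed. A minimal sketch under that assumption, with hypothetical function names:

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation; returns None (NA) if it is undefined."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def compare_outputs(
    out_a: Dict[str, float], out_b: Dict[str, float]
) -> Tuple[float, Optional[float]]:
    """Assumed comparison: Jaccard on keys, Pearson on values of shared keys."""
    shared = sorted(out_a.keys() & out_b.keys())
    j = jaccard(set(out_a), set(out_b))
    r = pearson([out_a[k] for k in shared], [out_b[k] for k in shared])
    return j, r
```

Under this reading, a task could show a high Pearson value alongside a modest Jaccard value when the outputs agree on the items they share but differ in which items they report.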

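Frontier models achieve high pipeline completion rates: Claude Opus 4.5 reaches 100%, and Gemini 3 Pro, GPT-5.2, and Sonnet 4.5 all exceed 90%.
Open-weight models lag overall: the best, GLM-4.7 in Codex CLI, reaches an 82.5% completion rate, while most others sit around 65%.
Planning quality correlates with completion rate (Pearson r = 0.61), but it does not deterministically predict success for every model.
Robustness tests reveal fragile step-level reasoning, including sensitivity to corrupted inputs, decoy files, and prompt bloat; prompt bloat lowers completion rates by 28% on average across tasks.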
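Open-weight models may be more prone to getting stuck in error-correction loops, whereas frontier models more often recover after hitting a problem and go on to complete the pipeline.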
In privacy-constrained settings, open-weight models may be preferable despite their lower completion rates.
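
To make the completion-rate metric and the reported drop under perturbation concrete, the sketch below aggregates per-trial outcomes into completion rates and averages the baseline-minus-perturbed difference across tasks. The trial tuple layout, the condition labels, and the choice to express the drop in percentage points are all assumptions; the summary does not specify whether the 28% figure is absolute or relative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical trial record: (task, condition, completed)
Trial = Tuple[str, str, bool]


def completion_rates(trials: Iterable[Trial]) -> Dict[Tuple[str, str], float]:
    """Completion rate in percent for each (task, condition) pair."""
    counts = defaultdict(lambda: [0, 0])  # [completed, total]
    for task, condition, completed in trials:
        counts[(task, condition)][0] += int(completed)
        counts[(task, condition)][1] += 1
    return {key: 100.0 * done / total for key, (done, total) in counts.items()}


def mean_drop(
    rates: Dict[Tuple[str, str], float],
    baseline: str = "baseline",
    perturbed: str = "prompt_bloat",
) -> float:
    """Average drop (percentage points) over tasks that have both conditions."""
    base_tasks = {task for task, cond in rates if cond == baseline}
    pert_tasks = {task for task, cond in rates if cond == perturbed}
    tasks = base_tasks & pert_tasks
    if not tasks:
        raise ValueError("no task has both baseline and perturbed trials")
    drops = [rates[(t, baseline)] - rates[(t, perturbed)] for t in tasks]
    return sum(drops) / len(drops)


if __name__ == "__main__":
    # Made-up toy outcomes for illustration only, not results from the paper.
    toy_trials = [
        ("task-a", "baseline", True), ("task-a", "baseline", True),
        ("task-a", "prompt_bloat", True), ("task-a", "prompt_bloat", False),
        ("task-b", "baseline", True), ("task-b", "baseline", False),
        ("task-b", "prompt_bloat", False), ("task-b", "prompt_bloat", False),
    ]
    rates = completion_rates(toy_trials)
    print(rates)
    print(mean_drop(rates))
```

Averaging per-task drops weights every task equally regardless of how many trials it has, which is one reasonable choice among several.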

Figure 2: Model-task completion heatmap. The left panel shows a pairwise completion matrix: rows and columns correspond to models and tasks, respectively, and each cell reports the completion rate (in %) for each model and task pair. Cell color encodes the completion rate, with numeric annotations s

This review was generated by AI and reviewed by a human editor.