QUICK REVIEW

[Paper Review] BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

Dionizije Fa, Marko Čuljak|arXiv (Cornell University)|Jan 29, 2026

Cancer Genomics and Diagnostics|Citations: 0

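One-line summary

BioAgent Bench provides a benchmark dataset and evaluation suite for measuring how well AI agents execute end-to-end bioinformatics pipelines, assessing their robustness to perturbations, and comparing open-weight and closed-weight models across multiple harnesses.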

ABSTRACT

This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed-source models can be unsuitable under strict privacy constraints; in such settings, open-weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.

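Motivation and objectives

Provide a benchmark dataset of end-to-end bioinformatics tasks suited to AI agents.
Compare frontier closed-source models with open-weight models on agent-driven workflows.
Evaluate the robustness of agent pipelines under controlled perturbations and data corruption.
Provide an evaluation harness that records transcripts, assesses progress, and scores outcomes.
Encourage privacy-conscious deployment by highlighting open-weight models.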

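Proposed method

Define end-to-end bioinformatics tasks covering RNA-seq, variant calling, metagenomics, and more.
Form each evaluation unit from a task prompt plus the required input and reference data, with the prompt specifying a concrete output artifact such as a CSV file.
Evaluate agents using harnesses including Claude Code, Codex CLI, and OpenCode, with an LLM grader that assesses step completion and the final artifacts.
Incorporate perturbation tests (corrupted inputs, decoy files, and prompt bloat) to assess robustness.
Measure completion rate as the primary metric, and analyze planning quality and failure modes.
Report results using per-task and per-model heatmaps and robustness statistics.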
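
The evaluation unit and the three perturbation types listed above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `EvalUnit` class, its fields, and the three helper functions are hypothetical names, and the specific perturbation choices (truncating a file, adding one extra file, repeating filler text) are assumptions made for the example.

```python
from dataclasses import dataclass, replace
import random


@dataclass(frozen=True)
class EvalUnit:
    """One benchmark item (hypothetical layout, not the paper's schema)."""
    prompt: str              # task instructions given to the agent
    inputs: dict             # file name -> text content the agent may read
    expected_artifact: str   # output file the prompt asks for, e.g. "results.csv"


def corrupt_input(unit, name, rng):
    """Truncate one input file at a random point to simulate corruption."""
    text = unit.inputs[name]
    cut = rng.randrange(1, max(2, len(text)))
    return replace(unit, inputs={**unit.inputs, name: text[:cut]})


def add_decoy(unit, decoy_name, decoy_text):
    """Add a plausible-looking file that the task does not need."""
    return replace(unit, inputs={**unit.inputs, decoy_name: decoy_text})


def bloat_prompt(unit, filler, repeats):
    """Pad the prompt with irrelevant text while keeping the real instructions."""
    padding = "\n".join([filler] * repeats)
    return replace(unit, prompt=unit.prompt + "\n" + padding)
```

Each helper returns a new unit and leaves the original untouched, so the clean and perturbed variants of a task can be run side by side and their completion rates compared.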

Figure 1: An overview of BioAgent Bench. Inputs to LLM agents consist of a task prompt, input data, and reference data. While solving the provided task, an agent can use general-purpose packages or specialized bioinformatics tools. After the agent finishes generation, LLM judge compares its outputs

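Experimental results

Research questions

RQ1: Can frontier closed-source models complete end-to-end, multi-step bioinformatics pipelines with minimal scaffolding?
RQ2: How do open-weight models compare with closed-source models in completion rate and robustness?
RQ3: How does planning quality relate to pipeline completion in agent-based bioinformatics workflows?
RQ4: What failure modes arise under input corruption, decoys, and prompt bloat?
RQ5: How does robustness to perturbations vary across tasks and harnesses?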

Key findings

Task	Trials	Jaccard	Pearson
alzheimer-mouse	4	0.160	0.219
comparative-genomics	4	0.004	NA
cystic-fibrosis	3	1.000	NA
deseq	4	0.978	0.995
evolution	4	0.000	NA
metagenomics	4	0.395	0.746
single-cell	4	0.114	0.395
transcript-quant	4	1.000	1.000
viral-metagenomics	4	0.667	1.000
perturbation-overview	-	-	-
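
The review does not say what the Jaccard and Pearson columns compare. The sketch below assumes one plausible reading: each trial yields a mapping from item identifiers (for example gene names) to numeric values, Jaccard measures overlap of the reported identifiers, and Pearson measures agreement of the values on the shared identifiers. The function names and the input format are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson correlation; None when undefined (under 2 points or zero variance)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def compare_trials(trial_a, trial_b):
    """Compare two trial outputs given as {item_id: numeric value} dicts."""
    shared = sorted(set(trial_a) & set(trial_b))
    r = pearson([trial_a[k] for k in shared], [trial_b[k] for k in shared])
    return jaccard(trial_a, trial_b), r
```

Under this reading, an undefined correlation (returned as `None` here) would be one way an NA entry could arise, for instance when two outputs share too few items, though the source does not confirm this.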

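Frontier models achieve high pipeline completion rates: Claude Opus 4.5 reaches 100%, and Gemini 3 Pro, GPT-5.2, and Sonnet 4.5 reach 90% or higher.
Open-weight models lag on average: GLM-4.7 on Codex CLI reaches 82.5% completion, while the others reach roughly 65%.
Planning quality correlates with completion rate (Pearson r = 0.61), but it does not deterministically predict success for every model.
Robustness tests show that step-level reasoning is fragile under corrupted inputs, decoys, and prompt bloat; prompt bloat reduced overall task completion by 28% on average.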
Open-weight models are more prone to getting stuck in error-recovery loops, whereas frontier models more often recover and complete the pipeline.
Open-weight models can be advantageous in privacy-constrained settings despite their lower completion rates.
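
Completion figures like those above could be aggregated from per-run grader scores along the following lines. This is only a sketch under stated assumptions: it assumes each run produces a score in [0, 1] (the fraction of pipeline steps judged complete), and the record format and function names are hypothetical rather than taken from the released evaluation suite.

```python
from collections import defaultdict


def completion_rates(runs):
    """Average (model, task, score) records into percent completion per pair.

    Each score is assumed to lie in [0, 1].
    """
    scores = defaultdict(list)  # (model, task) -> list of run scores
    for model, task, score in runs:
        scores[(model, task)].append(score)
    return {pair: 100.0 * sum(v) / len(v) for pair, v in scores.items()}


def model_average(rates):
    """Unweighted mean completion rate per model across its tasks."""
    per_model = defaultdict(list)
    for (model, _task), rate in rates.items():
        per_model[model].append(rate)
    return {model: sum(v) / len(v) for model, v in per_model.items()}
```

The per-pair dictionary corresponds to the cells of a model-by-task heatmap, and the per-model average to a single headline completion figure. Weighting every task equally is a design choice of this sketch, not something the review specifies.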

Figure 2: Model-task completion heatmap. The left panel shows a pairwise completion matrix: rows and columns correspond to models and tasks, respectively, and each cell reports the completion rate (in %) for each model and task pair. Cell color encodes the completion rate, with numeric annotations s

This review was generated by AI and checked by a human editor.