QUICK REVIEW

[Paper Review] LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Jon M. Laurent, Joseph D. Janizek|arXiv (Cornell University)|Jul 14, 2024

Biomedical Text Mining and Ontologies | Cited by 14

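One-Line Summary

LAB-Bench is a large multi-task benchmark (over 2,400 multiple-choice questions) that evaluates frontier language models on practical biology research tasks, including literature recall, figure and table interpretation, database access, protocol planning, and DNA/protein sequence manipulation. It includes a comparison against human experts and provides a public subset.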

ABSTRACT

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench

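Motivation and Objectives

Evaluate the ability of frontier LLMs to perform practical biology research tasks that go beyond textbook-style questions.
Assess recall, reasoning, and manipulation across literature, figures, tables, databases, protocols, and sequences.
Compare model performance against PhD-level biologists and identify gaps where tool integration and distractor design need improvement.
Provide a public subset for community use and outline benchmarks for future AI-assisted biology workflows.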

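Proposed Method

Build a dataset of over 2,400 questions spanning LitQA2, SuppQA, FigQA, TableQA, DbQA, ProtocolQA, SeqQA, and CloningScenarios.
Combine manual generation by experts for the harder categories with programmatic generation for scalable tasks.
Evaluate frontier models without tools using zero-shot chain-of-thought prompting, allowing them to decline to answer when information is insufficient.
Compare human biology PhDs and models on selected subsets, reporting accuracy and precision metrics.
Provide prompts, code, and a public data subset to enable a reproducible benchmark.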
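
The evaluation setup described above (multiple-choice questions, a zero-shot chain-of-thought prompt, and an option to decline) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration and not taken from the benchmark's released code: the `Question` fields, the wording of the refusal option, the prompt text, and the metric definitions (accuracy as correct over all questions, precision as correct over answered questions, coverage as answered over all questions).

```python
import random
from dataclasses import dataclass

# Hypothetical wording for the "decline to answer" option.
REFUSE = "Insufficient information to answer the question"


@dataclass
class Question:
    # Hypothetical schema: one correct answer plus distractors.
    text: str
    ideal: str
    distractors: list


def build_prompt(q, rng):
    """Return (prompt, letter -> option mapping) with shuffled options."""
    options = [q.ideal, *q.distractors, REFUSE]
    rng.shuffle(options)
    letters = {chr(ord("A") + i): opt for i, opt in enumerate(options)}
    lines = [q.text, ""]
    lines += [f"({letter}) {opt}" for letter, opt in letters.items()]
    lines += ["", "Think step by step, then answer with a single letter."]
    return "\n".join(lines), letters


def score(picks, keys, questions):
    """Compute accuracy, precision, and coverage under the assumed definitions."""
    total = len(questions)
    if total == 0:
        raise ValueError("no questions to score")
    answered = 0
    correct = 0
    for pick, letters, q in zip(picks, keys, questions):
        choice = letters.get(pick)
        if choice is None or choice == REFUSE:
            continue  # a refusal or an unparseable reply counts as unanswered
        answered += 1
        if choice == q.ideal:
            correct += 1
    return {
        "accuracy": correct / total,
        "precision": correct / answered if answered else 0.0,
        "coverage": answered / total,
    }
```

Under these assumed definitions, a model that declines often can show high precision with low accuracy, which is why the two numbers are worth reading together.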

Figure 1: Sample questions for each of the categories provided in this work. Note that DbQA and SeqQA consist of many different subtasks, and only one task is presented here. The font size of the distance annotations in the FigQA example has been increased for legibility here.

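Experimental Results

Research Questions

RQ1: How do frontier language models perform on practical biology research tasks without external tools?
RQ2: Where do model and human expert performance differ across LitQA2, SuppQA, FigQA, TableQA, DbQA, ProtocolQA, SeqQA, and CloningScenarios?
RQ3: To what extent do models rely on retrieval, reasoning, or test-taking strategies on these tasks?
RQ4: How does model performance compare with that of human lab researchers on sequence manipulation and cloning workflows?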

Key Findings

Category	# Subtasks	# Questions	Human evaluation coverage
LitQA2	-	248	100%
SuppQA	-	102	100%
FigQA	-	226	100%
TableQA	-	305	82%
DbQA	10	650	35%
ProtocolQA	-	135	100%
SeqQA	15	750	64%
CloningScenarios	-	41	100%
Total	-	2,457	69%
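
As a quick consistency check on the table, the Total row can be recomputed from the per-category rows. The sketch below assumes the final column is a per-category percentage that aggregates by question count; that reading is an inference from the numbers, not something the review states.

```python
# (category, question count, final-column percentage), copied from the table above
rows = [
    ("LitQA2", 248, 100),
    ("SuppQA", 102, 100),
    ("FigQA", 226, 100),
    ("TableQA", 305, 82),
    ("DbQA", 650, 35),
    ("ProtocolQA", 135, 100),
    ("SeqQA", 750, 64),
    ("CloningScenarios", 41, 100),
]

total_questions = sum(n for _, n, _ in rows)

# Question-weighted average of the final column.
weighted = sum(n * pct for _, n, pct in rows) / total_questions

# Unweighted mean across the eight categories, for contrast.
simple_mean = sum(pct for _, _, pct in rows) / len(rows)

print(total_questions)        # 2457
print(round(weighted, 1))     # 69.6
print(round(simple_mean, 1))  # 85.1
```

The question count matches the Total row exactly. The weighted average of about 69.6% is close to the reported 69% (the small difference is plausibly rounding in the per-category values), while the unweighted mean of about 85% is not, which suggests the Total is weighted by question count.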

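Models vary widely across LAB-Bench tasks: they answer readily in some categories but refuse heavily on retrieval-dependent tasks.
On LitQA2, models score above random when paired with retrieval; without retrieval, the margin above chance shrinks for some frontier models.
FigQA and DbQA are especially hard, with most models staying near random accuracy (exceptions include Claude 3.5 Sonnet on TableQA).
Overall SeqQA accuracy is around 40-50%, although some subtasks, such as the easier primer design tasks, exceed 90% accuracy.
On CloningScenarios, models fall well below human performance, leaving a large gap in the complex reasoning required for real-world molecular cloning.
Humans consistently outperform models on most tasks, though the gap narrows on some (e.g., Claude 3.5 Sonnet on TableQA).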

This review was generated by AI and checked by a human editor.