QUICK REVIEW

[Paper Review] Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

Lalitha Pranathi Pulavarthy, Raajitha Muthyala | arXiv (Cornell University) | Feb 23, 2026

Machine Learning in Healthcare | Citations: 0

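One-Sentence Summary

This paper examines how human-guided agentic AI improves multimodal clinical prediction across the three AgentDS benchmark challenges, and reports that domain-informed feature engineering, multimodal integration, and diverse ensembles yield robust, reproducible results.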

ABSTRACT

Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points: multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment that no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.

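Motivation and Objectives

Motivate human-AI collaboration for clinical prediction tasks, which require domain knowledge and interpretability.
Develop a workflow in which agentic AI handles routine data science work while humans guide the key decisions.
Demonstrate performance gains on the three AgentDS healthcare challenges through domain-informed feature engineering and curated model ensembles.
Provide reproducible documentation of human decision points to ensure clinical validity and auditability.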

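Proposed Method

An iterative human-AI collaboration in which the agentic AI performs data loading, preprocessing, and baseline modeling, and humans intervene at key decision points.
Multimodal feature extraction, guided by clinical knowledge, across structured data, unstructured notes, PDFs, and time-series vital signs.
Ensembles of tree-based and linear models with human-informed configurations, rather than reliance on automated hyperparameter search.
A rigorous validation strategy, including nested cross-validation and holdout validation, to ensure reproducibility and avoid overfitting.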
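
The nested cross-validation mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy 1-D k-nearest-neighbour classifier, the synthetic data, the fold counts, and the helper names (`folds`, `split`, `nested_cv`) are hypothetical and are not the authors' models, data, or code. The point is only the structure: the inner loop picks a hyperparameter using the outer-training rows alone, and the outer loop scores that choice on rows the selection never saw.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def score(train, test, k):
    """Accuracy of the k-NN rule fitted on `train` and evaluated on `test`."""
    return mean(1.0 if knn_predict(train, x, k) == y else 0.0 for x, y in test)

def folds(rows, n_folds, seed):
    """Shuffle a copy of the rows and deal them into n_folds near-equal folds."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    return [rows[i::n_folds] for i in range(n_folds)]

def split(all_folds, held_out):
    """Return (training rows, held-out rows) for one fold index."""
    train = [r for j, f in enumerate(all_folds) if j != held_out for r in f]
    return train, all_folds[held_out]

def nested_cv(rows, k_grid, outer=5, inner=3, seed=0):
    """Return one held-out accuracy per outer fold."""
    outer_folds = folds(rows, outer, seed)
    outer_scores = []
    for o in range(outer):
        dev, test = split(outer_folds, o)
        inner_folds = folds(dev, inner, seed + 1)

        def inner_score(k):
            # Mean validation accuracy over the inner folds (uses dev rows only).
            return mean(score(*split(inner_folds, i), k) for i in range(inner))

        best_k = max(k_grid, key=inner_score)
        # Refit on all dev rows with the selected k, then score on the outer test fold.
        outer_scores.append(score(dev, test, best_k))
    return outer_scores

# Synthetic two-class data: class 0 centred at 0.0, class 1 centred at 2.0.
rng = random.Random(42)
rows = [(rng.gauss(label * 2.0, 1.0), label) for label in (0, 1) for _ in range(30)]
scores = nested_cv(rows, k_grid=[1, 3, 5, 7])
print(len(scores), round(mean(scores), 3))
```

Because the hyperparameter is never chosen using an outer test fold, the mean of `scores` estimates the performance of the whole selection procedure rather than of one tuned configuration.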

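Experimental Results

Research Questions

RQ1: Does human guidance at key decision points in clinical prediction tasks yield performance gains beyond a fully automated agentic workflow?
RQ2: Which types of human-led feature engineering and data integration yield the largest improvements on the readmission, ED cost, and discharge readiness tasks?
RQ3: Do domain-informed, diverse ensembles outperform automated hyperparameter optimization on small- to medium-sized clinical datasets?
RQ4: How does cross-modal data integration affect predictive performance and interpretability in healthcare prediction?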

Key Findings

Challenge	Metric	Our Score	Rank	1st Place
Challenge 1: Readmission	Macro-F1	0.8986	5th	0.9044
Challenge 2: ED Cost	MAE (USD)	$465.13	6th	$448.75
Challenge 3: Discharge Readiness	Macro-F1	0.7939	3rd	0.8006
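
For readers less familiar with the two metrics in the table, the following sketch shows how they are conventionally computed, plus the gap to first place on each challenge. It assumes the standard definitions (Macro-F1 as the unweighted mean of per-class F1 scores, MAE as the mean absolute difference); whether the benchmark's scorer matches these in every detail is an assumption. The toy labels and costs are made up for illustration, and only the `gaps` values are derived from the table.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example (made-up labels): the minority class counts as much as the majority.
toy_f1 = macro_f1([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
# Toy example (made-up costs in USD).
toy_mae = mae([100.0, 250.0, 400.0], [120.0, 230.0, 460.0])

# Gap to first place, taken from the table (higher is better for F1, lower for MAE).
gaps = {
    "readmission_f1": round(0.9044 - 0.8986, 4),
    "ed_cost_mae_usd": round(465.13 - 448.75, 2),
    "discharge_f1": round(0.8006 - 0.7939, 4),
}
print(toy_f1, toy_mae, gaps)
```

Macro-F1 weights every class equally regardless of how often it occurs, which is why it is a common choice for imbalanced outcomes such as readmission.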

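Human-guided decisions yielded a cumulative improvement of +0.065 F1 across tasks over the automated agentic baselines.
Multimodal feature extraction made the largest single contribution, +0.041 F1, on Challenge 1.
No single extraction strategy worked across all data types; text, PDFs, and time-series data each required a task-specific, domain-informed approach.
Deliberate ensemble diversity with task-appropriate model configurations outperformed random hyperparameter search.
Performance was consistent across tasks: 5th overall in the healthcare domain, with per-task ranks of 5th, 6th, and 3rd.
Validation scores closely matched the test leaderboard results, suggesting robust generalization.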

This review was generated by AI and checked by a human editor.