QUICK REVIEW

[Paper Review] UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis

Yali Zhu, Kang Zhou|arXiv (Cornell University)|Mar 11, 2026

AI in cancer detection | Citations: 0

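One-line summary

Summary: The paper proposes a hierarchical two-agent framework (a main agent for localization and diagnosis, a sub-agent for fine-grained attributes) that decouples training and applies trajectory self-distillation, producing verifiable evidence and improving BI-RADS and malignancy prediction in breast ultrasound.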

ABSTRACT

Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI-RADS category and determine benignity or malignancy. Many existing methods rely on end-to-end prediction or provide only weakly grounded evidence, which can miss fine-grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi-agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop-and-zoom operation. A sub-agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence-based reasoning and output the BI-RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi-agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning, yielding a deployable end-to-end policy. Experiments show consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.

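Motivation and objectives

Mimic the clinical coarse-to-fine workflow by separating lesion localization from attribute recognition and diagnosis.
Provide a verifiable ROI -> attributes -> BI-RADS/malignancy evidence chain to support robust, traceable reasoning.
Improve training stability in hierarchical reinforcement learning through an oracle-guided curriculum and trajectory self-distillation.
Improve diagnostic accuracy and attribute agreement on public BUS datasets, and improve OOD generalization.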

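Proposed method

A two-agent architecture: a main agent A_M that performs localization and evidence integration, and a sub-agent A_S that performs local attribute recognition on the crop-and-zoom view.
Three-stage training. Stage 1 trains A_S with RL to predict four clinical attributes (echogenicity, calcification, boundary type, edge) and to generate interpretable traces.
Stage 2 trains A_M with curriculum RL on GT attributes, stabilizing high-level reasoning independently of perception noise.
Stage 3 performs trajectory refinement and SFT, using corrective trajectory self-distillation to produce a deployable end-to-end policy.
An explicit ROI -> attributes -> diagnosis evidence chain, in which crop-and-zoom supplies structured evidence to the main agent.
Evaluation uses AUROC, accuracy, BI-RADS accuracy, and Cohen's κ on multiple BUS datasets, with ablations that include GTbox/GTattr upper bounds and a localization analysis.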
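The ROI -> attributes -> diagnosis chain can be pictured as a small data flow. The following is a minimal sketch, not the authors' implementation: `crop_and_zoom`, `AttributeEvidence`, `Diagnosis`, `run_evidence_chain`, and the `pad` and `scale` parameters are hypothetical names, the image is a plain nested list rather than a real ultrasound frame, and the three agent calls are passed in as ordinary callables standing in for A_M and A_S.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), lower-right corner exclusive


def crop_and_zoom(image: Image, box: Box, pad: float = 0.1, scale: int = 2) -> Image:
    """Crop `box` with a relative margin, then upsample by nearest neighbour."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    # Expand the box by `pad` of its size on each side, clamped to the image.
    dx, dy = int((x2 - x1) * pad), int((y2 - y1) * pad)
    x1, y1 = max(0, x1 - dx), max(0, y1 - dy)
    x2, y2 = min(width, x2 + dx), min(height, y2 + dy)
    crop = [row[x1:x2] for row in image[y1:y2]]
    # Nearest-neighbour zoom: repeat every pixel `scale` times along both axes.
    zoomed: Image = []
    for row in crop:
        wide = [pixel for pixel in row for _ in range(scale)]
        zoomed.extend(list(wide) for _ in range(scale))
    return zoomed


@dataclass
class AttributeEvidence:
    """The four attributes named in the paper; label values are placeholders."""
    echogenicity: str
    calcification: str
    boundary: str
    edge: str


@dataclass
class Diagnosis:
    box: Box
    evidence: AttributeEvidence
    birads: str
    malignant: bool


def run_evidence_chain(
    image: Image,
    localize: Callable[[Image], Box],                            # stands in for A_M, step 1
    describe: Callable[[Image], AttributeEvidence],              # stands in for A_S
    diagnose: Callable[[Image, AttributeEvidence], Tuple[str, bool]],  # A_M, step 2
) -> Diagnosis:
    box = localize(image)
    local_view = crop_and_zoom(image, box)
    evidence = describe(local_view)
    birads, malignant = diagnose(image, evidence)
    # Every intermediate result is kept, so the chain can be reviewed afterwards.
    return Diagnosis(box, evidence, birads, malignant)
```

Returning the box and the attribute record together with the final labels is what makes the output auditable: a reviewer can check each link of the chain rather than only the final prediction.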

Figure 1 : Hierarchical multi-agent architecture. The main agent analyzes the full image to localize the lesion, triggers a crop-and-zoom operation, and queries the sub-agent on the zoomed view to obtain structured attribute evidence. The main agent then integrates the global context and the attribu

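Experimental results

Research questions

RQ1: Can a hierarchical multi-agent system improve the interpretability and traceability of breast ultrasound diagnosis compared with end-to-end models?
RQ2: Do crop-and-zoom and explicit attribute reasoning improve localizable evidence and downstream BI-RADS and malignancy prediction?
RQ3: How do oracle-guided curriculum RL and corrective trajectory self-distillation affect training stability and final policy performance?
RQ4: What are the main error sources (localization vs. attribute noise), and how do they affect in-domain and out-of-domain performance?

Key findings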

Method	BUSBRA AUC	BUSBRA Acc	BUSBRA Bi-Acc	BUSBRA κ	BUSI AUC	BUSI Acc	BUSI Bi-Acc	BUSI κ	BUDIAT AUC	BUDIAT Acc	BUDIAT Bi-Acc	BUDIAT κ	BrEaST (OOD) AUC	BrEaST (OOD) Acc	BrEaST (OOD) Bi-Acc	BrEaST (OOD) κ	Overall AUC	Overall Acc	Overall Bi-Acc	Overall κ
Qwen2.5-3B-Zero-Shot	0.458	0.588	0.091	0.003	0.493	0.573	0.156	0.066	0.500	0.743	0.200	-0.053	0.484	0.608	0.078	-0.065	0.476	0.602	0.117	0.014
Qwen2.5-3B-COT-SFT	0.668	0.722	0.563	0.258	0.835	0.833	0.460	0.245	0.722	0.857	0.400	0.129	0.586	0.627	0.176	0.060	0.71	0.751	0.468	0.204
Think-with-Image	0.500	0.693	0.048	0.003	0.528	0.660	0.080	0.019	0.500	0.743	0.314	-0.001	0.526	0.647	0.196	0.005	0.512	0.683	0.101	0.004
ours	0.723	0.813	0.620	0.300	0.784	0.833	0.542	0.244	0.778	0.886	0.400	0.145	0.685	0.725	0.157	0.037	0.741	0.813	0.515	0.224
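The κ columns report Cohen's κ, which corrects raw agreement for the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). A value near 0 means chance-level agreement, and a negative value (as in some baseline cells) means agreement below chance. The sketch below computes the unweighted form; whether the paper uses a weighted variant, and exactly which labels it is computed over, is not stated in this review, so treat both as assumptions. The function name `cohens_kappa` is illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa between reference labels and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the two marginal label frequencies, summed.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Both sides use a single identical label; kappa is undefined.
        # This sketch returns 0.0 in that case.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)
```

For example, with reference labels `["3", "3", "4", "4"]` and predictions `["3", "4", "4", "4"]`, observed agreement is 0.75 and chance agreement is 0.5, giving κ = 0.5.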

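The proposed method achieves the best overall diagnostic performance among the compared baselines (Overall AUC 0.741, Acc 0.813, Bi-Acc 0.515, κ 0.224); per dataset, Qwen2.5-3B-COT-SFT still has the higher AUC on BUSI (0.835 vs. 0.784).
The ROI -> attribute -> diagnosis evidence chain and crop-and-zoom improve attribute-evidence quality and diagnostic consistency over the baselines.
Oracle-guided curriculum RL substantially improves performance and localization alignment; removing it drops AUC to 0.535 and κ to 0.018.
Corrective trajectory self-distillation improves both IoU and diagnostic accuracy: overall IoU rises from 0.299 to 0.610 and AUC from 0.726 to 0.741.
The OA analysis shows GTbox and GTattr upper bounds reaching AUC of up to 0.782 and 0.804, respectively, suggesting that localization and attribute noise are the main bottlenecks for BI-RADS consistency.
Lesion crops generally yield higher attribute F1 scores (Boundary, Edge, Echo) than the full image, supporting the value of fine-grained evidence.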
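The attribute comparison is reported as per-attribute F1 scores. As a reminder of what such a score measures, here is a small sketch of macro-averaged F1 over the classes of one attribute. The averaging scheme is an assumption (the review does not say whether macro, micro, or weighted F1 is used), and `macro_f1` is an illustrative name.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Macro-averaged F1: per-class F1, then an unweighted mean over classes."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because
        # every label here occurs in y_true or y_pred at least once.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)
```

Running the same function once on predictions made from lesion crops and once on predictions made from the full image would give the kind of crop-versus-full-image comparison described above.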

Figure 2 : Three-stage training. Stage 1: RL trains $A_{S}$ for attribute recognition. Stage 2: oracle-guided RL trains $A_{M}$ with GT attributes for stable reasoning. Stage 3: refine trajectories and distill via SFT for robustness.
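Stage 3 is described only at a high level here: trajectories are refined with spatial supervision and then distilled via SFT. One possible reading, sketched below purely as an illustration, is to keep rollouts whose final diagnosis is correct and to overwrite a poorly localized box with the ground-truth box before using the trajectory as an SFT target. The discard rule, the `iou_threshold` value, and the names `Trajectory`, `box_iou`, and `build_sft_set` are all assumptions, not details taken from the paper.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass(frozen=True)
class Trajectory:
    pred_box: Box
    pred_malignant: bool
    gt_box: Box
    gt_malignant: bool


def build_sft_set(rollouts: List[Trajectory], iou_threshold: float = 0.5) -> List[Trajectory]:
    """Filter and correct rollouts (hypothetical rule, see lead-in)."""
    kept = []
    for traj in rollouts:
        if traj.pred_malignant != traj.gt_malignant:
            continue  # assumed rule: drop rollouts with a wrong final diagnosis
        if box_iou(traj.pred_box, traj.gt_box) < iou_threshold:
            # Assumed spatial correction: replace the weak box with the GT box.
            traj = replace(traj, pred_box=traj.gt_box)
        kept.append(traj)
    return kept
```

Under this reading, the jump in overall IoU after distillation would follow from the policy being fine-tuned on trajectories whose boxes are either already well aligned or have been corrected.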

This review was generated by AI and checked by a human editor.