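[Paper Review] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Lingshu is a generalist multimodal foundation model for the medical domain, trained with a comprehensive data curation and synthesis pipeline. It achieves state-of-the-art results on multimodal and text-based medical tasks, and it uses RLVR (reinforcement learning with verifiable rewards) to elicit medical reasoning.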
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...
Research Motivation and Objectives
- Improve medical multimodal understanding beyond imaging by incorporating extensive medical texts and general-domain data.
- Curate and synthesize high-quality medical captions, VQA, and chain-of-thought data to reduce hallucinations and enhance reasoning.
- Develop Lingshu and Lingshu-RL through a staged training regime to infuse medical knowledge progressively.
- Create MedEvalKit to standardize evaluation across medical multimodal benchmarks.
- Demonstrate strong performance across medical VQA, text-based QA, and medical report generation.
Proposed Method
- Build on the Qwen2.5-VL architecture, using its 7B and 32B parameter variants as base models.
- Develop a four-stage training pipeline: Medical Shallow Alignment, Medical Deep Alignment, Medical Instruction Tuning, and Medical-oriented Reinforcement Learning.
- Assemble a large, diverse data corpus including medical multimodal data, medical text, and general-domain data, plus synthetic long-form captions, VQA, OCR-based data, and CoT reasoning samples.
- Apply rigorous data cleaning (image/text deduplication, token-based filtering) and modality labeling (BiomedCLIP) to ensure data quality.
- Explore RLVR for medical reasoning to create Lingshu-RL.
- Provide MedEvalKit to unify and standardize evaluation across major medical benchmarks.
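The text-side cleaning step listed above (deduplication plus token-based filtering) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the whitespace tokenizer, and the `min_tokens`/`max_tokens` thresholds are all assumptions chosen for illustration, and the review does not state how image deduplication or BiomedCLIP labeling were implemented.

```python
import hashlib
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def clean_corpus(
    samples: Iterable[str],
    min_tokens: int = 3,    # hypothetical threshold, not from the paper
    max_tokens: int = 512,  # hypothetical threshold, not from the paper
) -> list[str]:
    """Drop samples outside the token range, then drop normalized duplicates."""
    seen: set[str] = set()
    kept: list[str] = []
    for text in samples:
        norm = normalize(text)
        # Whitespace tokens stand in for a real model tokenizer (assumption).
        n_tokens = len(norm.split())
        if not (min_tokens <= n_tokens <= max_tokens):
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)  # keep the first occurrence in its original form
    return kept


if __name__ == "__main__":
    captions = [
        "Chest X-ray shows no acute findings.",
        "chest  x-ray shows no acute findings.",  # duplicate after normalization
        "Normal.",                                # too short, filtered out
        "MRI of the brain demonstrates a small lesion.",
    ]
    for caption in clean_corpus(captions):
        print(caption)
```

Hashing the normalized text only catches exact duplicates up to case and spacing; near-duplicate detection would need a different technique and is outside this sketch.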
Experimental Results
Research Questions
- RQ1: How can a medical-focused multimodal foundation model be trained to integrate extensive medical knowledge from imaging, text, and general-domain data?
- RQ2: What data curation and synthesis strategies reduce hallucinations and enhance medical reasoning in MLLMs?
- RQ3: Can a staged training pipeline (shallow to deep alignment, instruction tuning, and RL-based reasoning) yield state-of-the-art performance on medical VQA and report generation?
- RQ4: How does a unified evaluation framework (MedEvalKit) enable fair, standardized assessment across medical multimodal benchmarks?
- RQ5: What is the impact of reinforcement learning with verifiable rewards on medical reasoning capabilities?
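To make "verifiable rewards" concrete, the following is a minimal sketch of a rule-based reward for multiple-choice medical QA. It is an illustration only: the review does not describe Lingshu-RL's actual reward design, and the `<answer>...</answer>` output format, the A to E option range, and the function name `verifiable_reward` are assumptions.

```python
import re

# Hypothetical output convention: the model wraps its final option letter
# in <answer>...</answer> tags. This format is an assumption for the sketch.
ANSWER_PATTERN = re.compile(r"<answer>\s*([A-Ea-e])\s*</answer>")


def verifiable_reward(response: str, gold: str) -> float:
    """Return 1.0 if the response has exactly one tagged answer matching gold.

    Responses with no tagged answer, or with more than one, score 0.0, so the
    reward can be checked mechanically without a learned reward model.
    """
    matches = ANSWER_PATTERN.findall(response)
    if len(matches) != 1:
        return 0.0
    return 1.0 if matches[0].upper() == gold.strip().upper() else 0.0


if __name__ == "__main__":
    sample = "The opacity is in the right lower lobe. <answer>B</answer>"
    print(verifiable_reward(sample, "B"))
```

The point of the design is that the reward depends only on a check against a known label, which is what distinguishes this paradigm from preference-based reward modeling.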
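Key Findings
- Lingshu achieves state-of-the-art performance in both its 7B and 32B configurations across multiple medical VQA and text-based QA tasks and in report generation.
- Lingshu-32B outperforms the second-best model by an average of 7.2 accuracy points across seven medical VQA tasks and surpasses proprietary models such as GPT-4.1 and Claude Sonnet 4.
- A rigorous data curation and synthesis pipeline covering long-form captions, OCR data, VQA, and CoT reasoning is shown to broaden knowledge coverage and reduce hallucinations.
- The unified MedEvalKit framework consolidates major benchmarks, enabling standardized and efficient evaluation of medical AI.
- Case studies illustrate practical applicability to medical report generation, clinical support, and surgical assistance.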
This review was generated by AI and checked by a human editor.