QUICK REVIEW

[Paper Review] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

LASA Team, Weiwen Xu|ArXiv.org|2025. 06. 08.

Multimodal Machine Learning Applications | Citations: 4

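One-Line Summary

Lingshu is a generalist multimodal foundation model for medicine, trained with a comprehensive data curation and synthesis pipeline. It achieves state-of-the-art performance on multimodal and text-based medical tasks, and its medical reasoning is further strengthened with reinforcement learning with verifiable rewards (RLVR).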

ABSTRACT

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...

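Motivation and Objectives

Extend medical multimodal understanding beyond imaging by incorporating extensive medical text and general-domain data.
Curate and synthesize high-quality medical captions, VQA, and chain-of-thought (CoT) data to reduce hallucinations and improve reasoning.
Develop Lingshu and Lingshu-RL through a multi-stage training scheme that injects medical knowledge progressively.
Build MedEvalKit to standardize evaluation across medical multimodal benchmarks.
Demonstrate strong performance across medical VQA, text-based QA, and medical report generation.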

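Proposed Method

Builds on the Qwen2.5-VL architecture, in 7B and 32B parameter variants.
Develops a four-stage training pipeline: medical shallow alignment, medical deep alignment, medical instruction tuning, and medical-oriented reinforcement learning.
Constructs a large, diverse corpus covering medical multimodal data, medical text, and general-domain data, augmented with synthetic long-form captions, VQA, OCR-based data, and CoT reasoning samples.
Applies strict data cleaning (image/text deduplication, token-based filtering) and modality labeling (BiomedCLIP) to ensure data quality.
Explores RLVR for medical reasoning, producing Lingshu-RL.
Provides MedEvalKit to consolidate and standardize evaluation across major medical benchmarks.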
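As an illustration of the text side of the data-cleaning step (deduplication plus token-based filtering), here is a minimal sketch. This review does not give the paper's actual deduplication method or thresholds, so everything below is an assumption: the function names, the `min_tokens`/`max_tokens` values, the use of exact hashing after normalization, and whitespace splitting as a stand-in for a real tokenizer.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def clean_samples(samples, min_tokens=3, max_tokens=512):
    """Drop out-of-range samples and exact duplicates (after normalization).

    min_tokens and max_tokens are illustrative thresholds, not values from
    the paper. Token counts use whitespace splitting, not a model tokenizer.
    """
    seen = set()
    kept = []
    for text in samples:
        norm = normalize(text)
        n_tokens = len(norm.split())
        # Token-based filter: skip samples that are too short or too long.
        if not (min_tokens <= n_tokens <= max_tokens):
            continue
        # Deduplication: hash the normalized text and skip repeats.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Near-duplicate detection (for example MinHash for text or perceptual hashing for images) would be a natural extension, but whether the authors use such techniques is not stated here.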

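Experimental Results

Research Questions

RQ1: How can a medicine-focused multimodal foundation model be trained to integrate extensive medical knowledge from imaging, text, and general-domain data?
RQ2: Which data curation and synthesis strategies reduce hallucinations and improve medical reasoning in MLLMs?
RQ3: Can a staged training pipeline, progressing from shallow alignment to deep alignment, instruction tuning, and RL-based reasoning, achieve state-of-the-art performance on medical VQA and report generation?
RQ4: How does a unified evaluation framework (MedEvalKit) enable fair, standardized evaluation across medical multimodal benchmarks?
RQ5: What effect does reinforcement learning with verifiable rewards have on medical reasoning ability?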
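To make "verifiable rewards" in RQ5 concrete: in RLVR the reward comes from a rule that can be checked automatically against a known answer, rather than from a learned reward model. The sketch below shows one generic rule for multiple-choice questions. It is not the reward used for Lingshu-RL, which this review does not describe; the `<answer>...</answer>` output format and the function name are assumptions made for illustration.

```python
import re

# Hypothetical output convention: the model wraps its final choice in
# <answer>...</answer> tags. This format is an assumption, not from the paper.
ANSWER_RE = re.compile(r"<answer>\s*([A-Za-z])\s*</answer>")


def verifiable_reward(response: str, gold: str) -> float:
    """Return 1.0 if the response has exactly one tagged answer matching gold.

    Responses with no tagged answer, or with several, receive 0.0 so that
    the policy cannot gain reward by hedging across options.
    """
    matches = ANSWER_RE.findall(response)
    if len(matches) != 1:
        return 0.0
    return 1.0 if matches[0].upper() == gold.strip().upper() else 0.0
```

Because the check is a deterministic string comparison, the same response always receives the same reward, which is what makes the signal "verifiable".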

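Key Results

Lingshu achieves state-of-the-art performance on numerous multimodal medical VQA tasks, text-based medical QA tasks, and report generation, in both the 7B and 32B configurations.
Lingshu-32B leads the second-best model by an average of 7.2 accuracy points across seven medical VQA tasks and outperforms proprietary models such as GPT-4.1 and Claude Sonnet 4.
The rigorous data curation and synthesis pipeline, which includes long-form captions, OCR data, VQA, and CoT reasoning, is reported to improve domain knowledge coverage and reduce hallucinations.
The unified MedEvalKit framework consolidates major benchmarks, enabling standardized and efficient model evaluation in medical AI.
Case studies show potential for practical application in medical report generation, clinical support, and surgical assistance.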
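For readers who want to see how an "average accuracy points" lead over the runner-up can be computed, here is a minimal sketch. It assumes the margin is the difference between macro-averaged accuracies (the unweighted mean over tasks); the review does not say which averaging the authors used. The function names are hypothetical, and the code does not use MedEvalKit or any figures from the paper.

```python
from statistics import mean


def macro_average(per_task: dict) -> float:
    """Unweighted mean of per-task accuracies (in percentage points)."""
    return mean(per_task.values())


def margin_over_runner_up(scores: dict, target: str) -> float:
    """Target model's macro-average minus the best macro-average among the rest.

    scores maps model name -> {task name -> accuracy}. All names are
    placeholders supplied by the caller.
    """
    target_avg = macro_average(scores[target])
    runner_up_avg = max(
        macro_average(per_task)
        for name, per_task in scores.items()
        if name != target
    )
    return target_avg - runner_up_avg
```

Macro-averaging weights every benchmark equally regardless of size; a sample-weighted average would give a different margin, so the choice should be stated whenever such a number is reported.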


This review was generated by AI and reviewed by a human editor.