QUICK REVIEW

[Paper Review] StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks

Yishan Wang, Tsai-Ning Wang | arXiv (Cornell University) | 2026. 02. 27.

Phonocardiography and Auscultation Techniques | Citations: 0

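One-Line Summary

StethoLM is an audio-language model specialized for cardiopulmonary auscultation that performs instruction-driven clinical tasks across seven task categories. It is trained on StethoBench, which comprises 77,027 instruction-response pairs derived from 16,125 recordings.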

ABSTRACT

Listening to heart and lung sounds - auscultation - is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support. We present StethoLM, the first audio-language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction-response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories: binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis. Through multi-stage training that combines supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data. Our work establishes a foundation for instruction-following AI systems in clinical auscultation.

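Motivation and Objectives

Establish the need for scalable, instruction-driven auscultation analysis that goes beyond classification-centric approaches to cardiopulmonary sounds.
Develop an audio-language model tailored to fine-grained cardiopulmonary acoustics and clinical workflows.
Build StethoBench, a diverse multi-task benchmark covering seven clinical task categories.
Show that specialized training improves robustness on out-of-distribution data.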

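Proposed Method

StethoLM consists of an audio encoder, a projection network, and a language-model backbone; the projection maps audio features to language-compatible prefix tokens for text generation.
Supervised fine-tuning (SFT) with LoRA adapts the medical LLM backbone efficiently.
Direct Preference Optimization (DPO) and multimodal DPO (mDPO) are explored to improve response quality, including under degraded-audio scenarios.
Seven cardiopulmonary datasets are converted into 77,027 instruction-response pairs covering seven task types to form StethoBench.
Training proceeds in two stages (SFT followed by (m)DPO), and evaluation uses clinically oriented metrics such as BERTScore and LLM-judged clinical accuracy.
Evaluation on both in-domain and out-of-domain data assesses robustness and generalization.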
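
The preference-optimization stage listed above can be illustrated with the standard per-example DPO loss. This is a minimal sketch, not the authors' implementation: the function names, the `beta` value, and the example log-probabilities are illustrative assumptions, and the multimodal variant (mDPO) is not shown.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the review does not state the one used
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference (here, the SFT model).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Hypothetical log-probabilities, for illustration only.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy favors the chosen response more
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss drops as the policy assigns relatively more probability to the preferred response than the reference does.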

Figure 1: Overview of StethoLM and StethoBench. A. Automated benchmark creation pipeline, where off-the-shelf LLMs generate 77,027 task–response pairs from 16,125 cardiopulmonary recordings and associated annotations. B. Distribution of audio type and the examples of disease that StethoLM covers. C.

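Experimental Results

Research Questions

RQ1: Can an audio-language model specialized for cardiopulmonary auscultation perform multi-task, instruction-driven clinical reasoning beyond classification?
RQ2: Does domain-specific training on medical audio outperform general-purpose audio-language models on both in-domain and out-of-domain data?
RQ3: How does StethoLM perform across the seven clinical task categories (binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis)?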
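
To make the seven task categories concrete, the sketch below models an instruction-response pair as a small record and tallies pairs per category. The category names come from the abstract; the field names, the `InstructionPair` type, the `count_by_task` helper, and the sample records are hypothetical and do not reflect the actual StethoBench schema or data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Task(Enum):
    """The seven clinical task categories named in the abstract."""
    BINARY_CLASSIFICATION = "binary classification"
    DETECTION = "detection"
    REPORTING = "reporting"
    REASONING = "reasoning"
    DIFFERENTIAL_DIAGNOSIS = "differential diagnosis"
    COMPARISON = "comparison"
    LOCATION_BASED_ANALYSIS = "location-based analysis"


@dataclass(frozen=True)
class InstructionPair:
    """Hypothetical record layout; the real benchmark schema may differ."""
    recording_id: str
    task: Task
    instruction: str
    response: str


def count_by_task(pairs: Iterable[InstructionPair]) -> Dict[Task, int]:
    """Count instruction-response pairs per task category."""
    counts = {task: 0 for task in Task}
    for pair in pairs:
        counts[pair.task] += 1
    return counts


# Made-up examples for illustration only.
samples = [
    InstructionPair("rec-001", Task.BINARY_CLASSIFICATION,
                    "Is this heart sound normal?", "No."),
    InstructionPair("rec-001", Task.REPORTING,
                    "Describe the recording.", "A murmur is audible."),
    InstructionPair("rec-002", Task.BINARY_CLASSIFICATION,
                    "Is this lung sound normal?", "Yes."),
]
counts = count_by_task(samples)
```

Keeping the recording identifier on each pair reflects that one recording can yield several pairs, consistent with 77,027 pairs being derived from 16,125 recordings.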

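Key Findings

StethoLM substantially outperforms general-purpose multimodal and audio-language baselines across multiple tasks on in-domain data.
StethoLM is more robust on out-of-distribution datasets, indicating better generalization to deployment scenarios.
Specialized instruction-based training (SFT, optionally followed by DPO/mDPO) yields performance gains over backbones trained on general audio tasks.
StethoBench provides a comprehensive benchmark of 77,027 instruction-response pairs derived from 16,125 recordings, enabling evaluation beyond simple classification.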

Figure 2: Diverse clinical tasks supported by StethoLM. Instructions (left) represent realistic clinical queries, while responses (right) provide task-appropriate outputs ranging from binary decisions to complex diagnostic reasoning.

This review was generated by AI and reviewed by a human editor.