QUICK REVIEW

[Paper Review] StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks

Yishan Wang, Tsai-Ning Wang|arXiv (Cornell University)|Feb 27, 2026

Phonocardiography and Auscultation Techniques | Cited by 0

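One-Sentence Summary

StethoLM is an audio-language model specialized for cardiopulmonary auscultation that performs instruction-driven tasks across seven clinical categories and is trained on StethoBench, a benchmark of 77,027 instruction-response pairs derived from 16,125 recordings.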

ABSTRACT

Listening to heart and lung sounds - auscultation - is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support. We present StethoLM, the first audio-language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction-response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories: binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis. Through multi-stage training that combines supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data. Our work establishes a foundation for instruction-following AI systems in clinical auscultation.

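Research Motivation and Goals

Enable scalable, instruction-driven auscultation analysis that overcomes the limits of classification-only approaches to cardiopulmonary sounds.
Develop an audio-language model tailored to fine-grained cardiopulmonary acoustics and clinical workflows.
Create StethoBench, a diverse multi-task benchmark covering seven clinical task categories.
Show that specialized training improves robustness on out-of-distribution data.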

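Proposed Method

Propose StethoLM, which combines an audio encoder, a projection network, and a language model backbone; the projection maps audio features to prefix tokens that condition text generation.
Adapt the medical LLM backbone efficiently through supervised fine-tuning (SFT) with LoRA.
Explore Direct Preference Optimization (DPO) and multimodal DPO (mDPO) to improve response quality in noise-reduced audio scenarios.
Build StethoBench by converting seven cardiopulmonary datasets into 77,027 instruction-response pairs spanning seven task types.
Train in two stages (SFT followed by (m)DPO) and evaluate with clinically oriented metrics, including BERTScore and LLM-judged clinical accuracy.
Evaluate on both in-domain and out-of-distribution data to assess robustness and generalization.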
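The prefix-conditioning idea in the method list above can be illustrated with a minimal sketch. It assumes the projection network is a single linear layer and uses plain Python lists in place of tensors; the function names, dimensions, and the single-layer choice are illustrative assumptions, not details taken from the paper.

```python
from typing import List

Vector = List[float]


def project(frames: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Linearly map each audio frame from d_audio to d_model dimensions.

    `weight` has d_model rows of length d_audio; `bias` has length d_model.
    (Assumption: a single linear layer stands in for the projection network.)
    """
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def build_prefixed_input(
    audio_frames: List[Vector],
    text_embeddings: List[Vector],
    weight: List[Vector],
    bias: Vector,
) -> List[Vector]:
    """Place projected audio frames in front of the text token embeddings.

    The language model would then attend over the combined sequence, so the
    generated text is conditioned on the audio prefix.
    """
    prefix = project(audio_frames, weight, bias)
    return prefix + text_embeddings


# Toy example: one 2-dimensional audio frame, a 3-dimensional model space,
# and one text token embedding.
frames = [[1.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
text = [[9.0, 9.0, 9.0]]

sequence = build_prefixed_input(frames, text, weight, bias)
print(len(sequence), sequence[0])
```

In a real system the encoder output, the projection, and the embeddings would be tensors in a deep learning framework; the sketch only shows how the sequence fed to the language model is assembled.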

Figure 1: Overview of StethoLM and StethoBench. A. Automated benchmark creation pipeline, where off-the-shelf LLMs generate 77,027 task–response pairs from 16,125 cardiopulmonary recordings and associated annotations. B. Distribution of audio type and the examples of disease that StethoLM covers. C.

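Experimental Results

Research Questions

RQ1: Can an audio-language model specialized for cardiopulmonary auscultation perform multi-task, instruction-driven clinical reasoning beyond classification?
RQ2: Does domain-specific training on medical audio outperform general-purpose audio-language models on in-domain and out-of-distribution data?
RQ3: How does StethoLM perform across the seven clinical task categories (binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis)?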

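Key Findings

On in-domain data, StethoLM substantially outperforms general-purpose multimodal and audio-language baselines on multiple tasks.
StethoLM is more robust on out-of-distribution datasets, suggesting better generalization to deployment settings.
Specialized instruction-driven training (SFT, optionally with DPO/mDPO) improves performance over backbones trained on general audio tasks.
StethoBench provides a comprehensive benchmark of 77,027 instruction-response pairs derived from 16,125 recordings, enabling evaluation beyond simple classification.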
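For readers unfamiliar with the DPO stage named above, the sketch below computes the standard DPO loss for a single preference pair, as defined in the general DPO literature. The value `beta=0.1` and the example log-probabilities are placeholders, since this summary does not report the paper's hyperparameters, and the additional terms that mDPO introduces are not shown.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # placeholder value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a form that stays finite for large |margin|.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference signal has been learned yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model, which is why the SFT checkpoint is typically used as that reference.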

Figure 2: Diverse clinical tasks supported by StethoLM. Instructions (left) represent realistic clinical queries, while responses (right) provide task-appropriate outputs ranging from binary decisions to complex diagnostic reasoning.


This review was generated by AI and reviewed by a human editor.