QUICK REVIEW

[Paper Review] Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Theodore Aptekarev, Vladimir Sokolovsky|arXiv (Cornell University)|Jan 20, 2026

Phonocardiography and Auscultation Techniques|Cited by 0

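One-Sentence Summary

This paper applies the Audio Spectrogram Transformer (AST) to asthma screening from respiratory sounds and evaluates a multimodal Vision-Language Model (VLM) that incorporates structured patient metadata, reporting approximately 97% accuracy for AST and VLM accuracy (86-87%) comparable to the CNN baseline.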

ABSTRACT

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in classifying respiratory sounds. In this work, we (i) adapt the Audio Spectrogram Transformer (AST) for respiratory sound analysis and (ii) evaluate a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata. AST is initialized from publicly available weights and fine-tuned on a medical dataset containing hundreds of recordings per diagnosis. The VLM experiment uses a compact Moondream-type model that processes spectrogram images alongside a structured text prompt (sex, age, recording site) to output a JSON-formatted diagnosis. Results indicate that AST achieves approximately 97% accuracy with an F1-score around 97% and ROC AUC of 0.98 for asthma detection, significantly outperforming both the internal CNN baseline and typical external benchmarks. The VLM reaches 86-87% accuracy, performing comparably to the CNN baseline while demonstrating the capability to integrate clinical context into the inference process. These results confirm the effectiveness of self-attention for acoustic screening and highlight the potential of multimodal architectures for holistic diagnostic tools.
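
The structured prompt and JSON-formatted output mentioned in the abstract can be pictured with a minimal sketch. The prompt wording, the `diagnosis` key, the label set, and the helper names `build_prompt` and `parse_diagnosis` are all assumptions made for illustration; the paper's actual prompt template and output schema are not given here.

```python
import json
from typing import Optional

# Hypothetical label set; the summary only names asthma, healthy, and other pathologies.
LABELS = ("asthma", "healthy", "other")


def build_prompt(sex: str, age: int, site: str) -> str:
    """Format structured patient metadata into an instruction prompt (assumed wording)."""
    return (
        f"Patient metadata: sex={sex}; age={age}; recording site={site}.\n"
        "Classify the respiratory sound shown in the spectrogram. "
        'Answer with JSON only, e.g. {"diagnosis": "<label>"}, '
        f"where <label> is one of: {', '.join(LABELS)}."
    )


def parse_diagnosis(raw: str) -> Optional[str]:
    """Extract the diagnosis from a model reply; return None if the reply is malformed."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    label = payload.get("diagnosis") if isinstance(payload, dict) else None
    # Reject labels outside the expected set instead of guessing.
    return label if label in LABELS else None
```

Validating the reply against a fixed label set is one way to keep free-text model output from leaking into downstream metrics; replies that fail to parse can then be counted as errors rather than silently dropped.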

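Motivation and Objectives

Assess whether transformer-based architectures improve on a CNN baseline for asthma screening from respiratory sounds.
Adapt the Audio Spectrogram Transformer (AST) to medical respiratory data by fine-tuning it on a dataset with hundreds of recordings per class.
Evaluate a diagnostic multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata.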

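Proposed Method

Fine-tune a pretrained Audio Spectrogram Transformer (AST) on a medical respiratory sound dataset of 1,613 recordings covering asthma, healthy subjects, and other pathologies.
Compute mel spectrograms with several sliding-window sizes and convert them into 3-channel, RGB-like images as AST input.
Compare AST against the DenseNet201 CNN baseline on the same data splits.
Develop a Moondream-style Vision-Language Model (VLM) that takes a spectrogram-based image, structured metadata, and an instruction prompt, and outputs a JSON diagnosis.
Fine-tune the VLM with Low-Rank Adaptation (LoRA) adapters, keeping the core weights frozen, and train the final classification head.
Evaluate 5-second and 10-second input durations, selecting 5 seconds for the final evaluation.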
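
The choice between 5-second and 10-second inputs affects how many training clips a recording yields. The sketch below shows one plausible way to cut a recording into fixed-length clips. The function name `clip_bounds`, the non-overlapping default, the dropping of a short trailing segment, and the 16 kHz sample rate in the usage note are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Optional, Tuple


def clip_bounds(num_samples: int, sample_rate: int, clip_seconds: float,
                hop_seconds: Optional[float] = None) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of fixed-length clips.

    Clips do not overlap unless hop_seconds is given. Trailing audio shorter
    than one clip is dropped (an assumption; zero-padding is an equally
    plausible choice).
    """
    clip_len = int(round(clip_seconds * sample_rate))
    hop = clip_len if hop_seconds is None else int(round(hop_seconds * sample_rate))
    if clip_len <= 0 or hop <= 0:
        raise ValueError("clip and hop lengths must be positive")
    # range() is empty when the recording is shorter than a single clip.
    return [(s, s + clip_len) for s in range(0, num_samples - clip_len + 1, hop)]
```

For a hypothetical 25-second recording sampled at 16 kHz, `clip_bounds(25 * 16000, 16000, 5.0)` yields five clips while `clip_bounds(25 * 16000, 16000, 10.0)` yields two, which illustrates why shorter clips enlarge the training set.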

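Experimental Results

Research Questions

RQ1: Can AST deliver higher accuracy than the CNN baseline for asthma screening from respiratory sounds?
RQ2: Does a multimodal VLM that combines spectrograms with structured metadata improve diagnostic performance, or at least match a conventional CNN in some settings?
RQ3: How does including clinical context (age, sex, recording site) in the multimodal setting affect asthma classification?
RQ4: How efficient is AST and VLM inference in practice on CPU and GPU for clinical deployment?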

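Key Findings

AST reaches approximately 97% accuracy and an F1-score of approximately 97%, with a ROC AUC of 0.98 for asthma vs. non-asthma, outperforming the CNN baseline.
The VLM achieves approximately 86-87% accuracy on the asthma vs. non-asthma task, comparable to the DenseNet baseline in terms of Youden's index.
An ablation study shows that removing the metadata causes a marked drop in performance; text conditioning is essential for stable VLM inference.
AST performs comparably with 5-second and 10-second clips, while the shorter clips increase the number of training samples.
For reference, the DenseNet baseline reaches approximately 87% accuracy, 93% sensitivity, and 82-86% specificity on the same task.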
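
The findings above are stated in terms of accuracy, F1, sensitivity, specificity, and Youden's index. As a reminder of how these relate, here is a minimal sketch that derives them from binary confusion-matrix counts, treating asthma as the positive class. The function name `screening_metrics` and the counts in the usage note are invented for illustration and are not results from the paper.

```python
def _ratio(num: float, den: float) -> float:
    """Safe division: return 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute screening metrics from confusion-matrix counts (positive class = asthma)."""
    sensitivity = _ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)   # true negative rate
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        # Youden's J = sensitivity + specificity - 1
        "youden_j": sensitivity + specificity - 1.0,
    }


# Made-up counts for illustration only; they are not results from the paper.
example = screening_metrics(tp=45, fn=5, tn=40, fp=10)
```

With these made-up counts, sensitivity is 0.90, specificity is 0.80, accuracy is 0.85, and Youden's J is 0.70. Because J weighs sensitivity and specificity equally, two models with similar accuracy can still differ in J when their error types are distributed differently.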

This review was generated by AI and reviewed by a human editor.