QUICK REVIEW

[Paper Review] Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Theodore Aptekarev, Vladimir Sokolovsky | arXiv (Cornell University) | 2026. 01. 20.

Phonocardiography and Auscultation Techniques | Citations: 0

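One-Line Summary

This paper applies the Audio Spectrogram Transformer (AST) to asthma screening from respiratory sounds and evaluates a multimodal Vision-Language Model (VLM) that incorporates structured patient metadata, reporting high accuracy for AST and VLM performance comparable to the CNN baseline.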

ABSTRACT

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in classifying respiratory sounds. In this work, we (i) adapt the Audio Spectrogram Transformer (AST) for respiratory sound analysis and (ii) evaluate a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata. AST is initialized from publicly available weights and fine-tuned on a medical dataset containing hundreds of recordings per diagnosis. The VLM experiment uses a compact Moondream-type model that processes spectrogram images alongside a structured text prompt (sex, age, recording site) to output a JSON-formatted diagnosis. Results indicate that AST achieves approximately 97% accuracy with an F1-score around 97% and ROC AUC of 0.98 for asthma detection, significantly outperforming both the internal CNN baseline and typical external benchmarks. The VLM reaches 86-87% accuracy, performing comparably to the CNN baseline while demonstrating the capability to integrate clinical context into the inference process. These results confirm the effectiveness of self-attention for acoustic screening and highlight the potential of multimodal architectures for holistic diagnostic tools.

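Motivation and Objectives

Assess whether transformer-based architectures improve asthma screening from respiratory sounds beyond a CNN baseline.
Adapt the Audio Spectrogram Transformer (AST) to medical respiratory data and fine-tune it on a dataset with hundreds of recordings per class.
Evaluate, for diagnosis, a multimodal Vision-Language Model (VLM) that fuses spectrograms with structured patient metadata.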

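Proposed Method

Fine-tune a pretrained Audio Spectrogram Transformer (AST) on a medical respiratory sound dataset of 1,613 recordings covering asthma, healthy, and other pathologies.
Convert mel-spectrogram inputs computed at several window sizes into 3-channel, RGB-like images for AST.
Compare AST with the DenseNet201 CNN baseline on the same data split.
Develop a Moondream-type Vision-Language Model (VLM) that takes a spectrogram-derived image, structured metadata, and an instruction prompt as input and outputs a diagnosis.
Fine-tune the VLM with LoRA adapters while keeping the core weights frozen, and train the final classifier head.
Evaluate 5-second and 10-second input durations and select 5 seconds for the final evaluation.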
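
The text side of the VLM input described in the method list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `diagnosis` JSON key, the label names in `ALLOWED_LABELS`, and the helper names `build_prompt` and `parse_diagnosis` are all assumptions made for the example. Only the three metadata fields (sex, age, recording site) and the JSON-formatted output come from the abstract.

```python
import json

# Hypothetical label set; the review only says the data covers asthma,
# healthy, and other pathologies.
ALLOWED_LABELS = ("asthma", "healthy", "other")


def build_prompt(sex: str, age: int, recording_site: str) -> str:
    """Build the text half of the multimodal input (wording is an assumption)."""
    labels = ", ".join(ALLOWED_LABELS)
    return (
        "You are given a spectrogram image of a respiratory sound recording.\n"
        f"Patient sex: {sex}\n"
        f"Patient age: {age}\n"
        f"Recording site: {recording_site}\n"
        f'Answer with JSON only, e.g. {{"diagnosis": "<one of: {labels}>"}}'
    )


def parse_diagnosis(model_output: str) -> str:
    """Extract and validate the diagnosis from the model's JSON reply."""
    # Tolerate extra text around the JSON object by slicing from the
    # first "{" to the last "}".
    start, end = model_output.find("{"), model_output.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(model_output[start : end + 1])
    label = str(payload.get("diagnosis", "")).strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected diagnosis label: {label!r}")
    return label


# Example metadata values are invented for illustration.
prompt = build_prompt(sex="female", age=9, recording_site="trachea")
# The reply below is a stand-in for what a fine-tuned VLM might return.
reply = 'Result: {"diagnosis": "Asthma"}'
print(prompt)
print(parse_diagnosis(reply))
```

Validating the parsed label against a fixed set is one way to turn free-form generated text into a classification decision. A reply that is not valid JSON or names an unknown label raises `ValueError` instead of being silently counted as a prediction.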

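Experimental Results

Research Questions

RQ1: Can AST deliver higher accuracy than the CNN baseline for asthma screening from respiratory sounds?
RQ2: Does a multimodal VLM that combines spectrograms with structured metadata improve diagnostic performance or deliver results competitive with the existing CNN?
RQ3: How does including clinical context (age, sex, recording site) affect asthma classification in the multimodal setting?
RQ4: How efficient is AST and VLM inference in practice on CPU/GPU for clinical deployment?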

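Key Findings

AST achieved about 97% accuracy, about 97% F1, and an ROC AUC of 0.98 on Asthma vs Not Asthma, outperforming the CNN baseline.
The VLM achieved about 86-87% accuracy on Asthma vs Not Asthma, comparable to the DenseNet baseline in Youden index.
Removing certain metadata fields caused a severe drop in performance, confirming that text conditioning is essential for stable VLM inference.
AST performs strongly on 5-second clips, keeping pace with its 10-second-clip performance while increasing the number of training samples.
The DenseNet baseline serves as a reference point on the same task, with about 87% accuracy, about 93% sensitivity, and about 82-86% specificity.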
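
The metrics quoted in the list above (accuracy, sensitivity, specificity, F1, and the Youden index) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas with asthma as the positive class. The function name `screening_metrics` is hypothetical, and the counts are made up for illustration; they are not the paper's results.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Binary screening metrics, with asthma as the positive class."""
    sensitivity = tp / (tp + fn)  # share of asthma recordings detected
    specificity = tn / (tn + fp)  # share of non-asthma recordings cleared
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        # Youden's J = sensitivity + specificity - 1
        "youden_j": sensitivity + specificity - 1,
    }


# Made-up counts for illustration only; they are NOT the paper's results.
m = screening_metrics(tp=93, fp=16, tn=84, fn=7)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Applying the Youden formula to the approximate DenseNet figures quoted above (sensitivity about 0.93, specificity about 0.82 to 0.86) gives J of roughly 0.75 to 0.79. That range is the scale on which the VLM is described as comparable.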

This review was generated by AI and reviewed by a human editor.