QUICK REVIEW

[Paper Review] Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Theodore Aptekarev, Vladimir Sokolovsky|arXiv (Cornell University)|Jan 20, 2026

Phonocardiography and Auscultation Techniques | Citations: 0

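One-Sentence Summary

The paper applies the Audio Spectrogram Transformer (AST) to asthma screening from respiratory sounds and evaluates a multimodal Vision-Language Model (VLM) that integrates structured patient metadata. It reports high accuracy for AST and VLM performance comparable to the CNN baseline.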

ABSTRACT

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in classifying respiratory sounds. In this work, we (i) adapt the Audio Spectrogram Transformer (AST) for respiratory sound analysis and (ii) evaluate a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata. AST is initialized from publicly available weights and fine-tuned on a medical dataset containing hundreds of recordings per diagnosis. The VLM experiment uses a compact Moondream-type model that processes spectrogram images alongside a structured text prompt (sex, age, recording site) to output a JSON-formatted diagnosis. Results indicate that AST achieves approximately 97% accuracy with an F1-score around 97% and ROC AUC of 0.98 for asthma detection, significantly outperforming both the internal CNN baseline and typical external benchmarks. The VLM reaches 86-87% accuracy, performing comparably to the CNN baseline while demonstrating the capability to integrate clinical context into the inference process. These results confirm the effectiveness of self-attention for acoustic screening and highlight the potential of multimodal architectures for holistic diagnostic tools.

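Motivation and Objectives

Evaluate whether transformer-based architectures improve asthma screening from respiratory sounds over a CNN baseline.
Adapt the Audio Spectrogram Transformer (AST) to medical respiratory data and fine-tune it on a dataset containing hundreds of recordings per class.
Develop a diagnostic Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata.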

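Proposed Method

Fine-tune AST on a medical respiratory sound dataset of 1,613 recordings covering asthma, healthy, and other pathologies.
Convert mel spectrograms computed with multiple window sizes into 3-channel RGB-like images and use them as AST input.
Compare AST with the DenseNet201 CNN baseline on the same data split.
Develop a Moondream-type VLM that accepts a spectrogram-derived image, structured metadata, and an instruction prompt asking for the diagnosis as JSON output.
Fine-tune the VLM with low-rank adaptation (LoRA) adapters, training the final classification head while keeping the core weights frozen.
Evaluate input durations of 5 s and 10 s, and adopt 5 s for the final evaluation.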
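The VLM step above pairs a spectrogram image with a structured text prompt and expects a JSON-formatted diagnosis back. The following is a minimal sketch of the text side only, under stated assumptions: the prompt wording, the `diagnosis` key, the label names in `ALLOWED_LABELS`, and the helper names `build_prompt` and `parse_diagnosis` are all hypothetical and are not taken from the paper. The model call itself is omitted.

```python
import json
from typing import Optional

# Hypothetical label set; the review only names asthma, healthy, and other pathologies.
ALLOWED_LABELS = {"asthma", "healthy", "other_pathology"}


def build_prompt(sex: str, age: int, recording_site: str) -> str:
    """Assemble a structured text prompt to accompany the spectrogram image."""
    return (
        "Patient metadata:\n"
        f"- sex: {sex}\n"
        f"- age: {age}\n"
        f"- recording site: {recording_site}\n"
        "Look at the spectrogram image and answer with JSON only, "
        'in the form {"diagnosis": "<label>"}, where <label> is one of: '
        + ", ".join(sorted(ALLOWED_LABELS))
        + "."
    )


def parse_diagnosis(raw_output: str) -> Optional[str]:
    """Extract the diagnosis from model text; return None if it is unusable."""
    # Tolerate extra text around the JSON object by slicing from the first
    # opening brace to the last closing brace.
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(raw_output[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("diagnosis")
    # Reject anything outside the expected label set instead of guessing.
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None
```

Validating the parsed label against a fixed set is one way to keep malformed or off-vocabulary generations from being counted as predictions; how the authors handle such outputs is not stated in this review.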

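Experimental Results

Research Questions

RQ1: Can AST deliver higher accuracy than the CNN baseline for asthma screening?
RQ2: Does a multimodal VLM that combines spectrograms with structured metadata improve diagnostic performance, or at least achieve results competitive with a conventional CNN?
RQ3: How does incorporating clinical context (age, sex, recording site) affect asthma classification in the multimodal setting?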
RQ4: What is the practical inference efficiency of AST and the VLM on CPU/GPU for clinical deployment?

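Key Findings

AST achieved approximately 97% accuracy, an F1-score of approximately 97%, and a ROC AUC of 0.98 on Asthma vs. Not Asthma, outperforming the CNN baseline.
The VLM achieved approximately 86-87% accuracy on Asthma vs. Not Asthma and matched the DenseNet baseline in terms of the Youden index.
In ablation, removing the metadata degraded performance markedly, indicating that text conditioning is essential for stable VLM inference.
AST performs well even on 5-second clips, matching its performance on 10-second clips while increasing the number of training samples.
The DenseNet baseline, with approximately 87% accuracy, approximately 93% sensitivity, and approximately 82-86% specificity, serves as the reference point for the same task.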
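The findings above compare models by accuracy, sensitivity, specificity, and the Youden index. As a reminder of how these relate, here is a small sketch that derives them from binary confusion-matrix counts, using the standard definition J = sensitivity + specificity - 1. The function name `screening_metrics` and the counts in the usage note are illustrative assumptions, not data from the paper.

```python
def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute binary screening metrics from confusion-matrix counts.

    tp/fn count positive (e.g. Asthma) cases; tn/fp count negative cases.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Youden index: 0 means no better than chance, 1 means perfect separation.
        "youden_j": sensitivity + specificity - 1.0,
    }
```

For example, `screening_metrics(tp=45, fn=5, tn=40, fp=10)` uses made-up counts with 50 positive and 50 negative cases. Because the Youden index weights sensitivity and specificity equally, two models with similar accuracy can still be compared on how they trade missed cases against false alarms.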

This review was generated by AI and checked by a human editor.