QUICK REVIEW

[Paper Review] Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?

Sedigheh Eslami, Gerard de Melo|arXiv (Cornell University)|Dec 27, 2021

Multimodal Machine Learning Applications|Cited by 41

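One-Sentence Summary

PubMedCLIP is a version of CLIP fine-tuned for the medical domain; it improves MedVQA overall accuracy by up to 3% over MAML-based visual encoders and outperforms general-domain CLIP in several settings.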

ABSTRACT

Contrastive Language--Image Pre-training (CLIP) has shown remarkable success in learning with cross-modal supervision from extensive amounts of image--text pairs collected online. Thus far, the effectiveness of CLIP has been investigated primarily in general-domain multimodal problems. This work evaluates the effectiveness of CLIP for the task of Medical Visual Question Answering (MedVQA). To this end, we present PubMedCLIP, a fine-tuned version of CLIP for the medical domain based on PubMed articles. Our experiments are conducted on two MedVQA benchmark datasets and investigate two MedVQA methods, MEVF (Mixture of Enhanced Visual Features) and QCR (Question answering via Conditional Reasoning). For each of these, we assess the merits of visual representation learning using PubMedCLIP, the original CLIP, and state-of-the-art MAML (Model-Agnostic Meta-Learning) networks pre-trained only on visual data. We open source the code for our MedVQA pipeline and pre-training PubMedCLIP. CLIP and PubMedCLIP achieve improvements in comparison to MAML's visual encoder. PubMedCLIP achieves the best results with gains in the overall accuracy of up to 3%. Individual examples illustrate the strengths of PubMedCLIP in comparison to the previously widely used MAML networks. Visual representation learning with language supervision in PubMedCLIP leads to noticeable improvements for MedVQA. Our experiments reveal distributional differences in the two MedVQA benchmark datasets that have not been imparted in previous work and cause different back-end visual encoders in PubMedCLIP to exhibit different behavior on these datasets. Moreover, we witness fundamental performance differences of VQA in general versus medical domains.

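Motivation and Objectives

Improve vision-language representations for medical VQA, which suffers from data scarcity and domain-specific challenges.
Evaluate whether CLIP-based representations offer the same benefits in the medical domain as in the general domain.
Develop PubMedCLIP by fine-tuning CLIP on ROCO, a collection of medical image-caption pairs derived from PubMed.
Integrate PubMedCLIP into established MedVQA backbones (MEVF and QCR) and measure the performance gains.
Release PubMedCLIP and the MedVQA pipeline as open source for reproducibility.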

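Proposed Method

Fine-tune CLIP on ROCO medical image-caption pairs to obtain PubMedCLIP, using ViT-B/32, RN50, and RN50x4 back ends.
Replace the MAML visual encoder in MEVF with PubMedCLIP features while keeping the CDAE, the GloVe and LSTM question encoder, and BAN fusion.
Train with cross-entropy losses over vision and language and an image reconstruction loss; the cross-entropy and reconstruction losses are averaged.
Evaluate PubMedCLIP within MEVF and QCR on the VQA-RAD and SLAKE datasets, reporting mean accuracy over 10 repeated runs.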
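
Fine-tuning CLIP on image-caption pairs relies on a contrastive objective: a cross-entropy loss in the image-to-text direction and another in the text-to-image direction, averaged. The pure-Python sketch below illustrates that general idea on toy vectors. It is not the authors' training code; the function names, the temperature of 0.07, and the toy embeddings are assumptions made for illustration only.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss for a batch where image k is paired with caption k.

    temperature=0.07 is an assumed illustrative value, not one from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarities scaled by the temperature: logits[i][j] compares
    # image i with caption j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text cross-entropy: each row should pick its own caption.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image cross-entropy: each column should pick its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    # Average the two directions.
    return (loss_i2t + loss_t2i) / 2


# Toy batch (hypothetical embeddings): correct pairing vs. swapped captions.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy batch the correctly paired embeddings yield a much lower loss than the swapped ones, which is the signal that pulls matching images and captions together during fine-tuning.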

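Experimental Results

Research Questions

RQ1: Does PubMedCLIP improve MedVQA performance over previous visual encoders, including general-domain CLIP and MAML-based models?
RQ2: How does PubMedCLIP perform on MedVQA datasets with different answer distributions (localization vs. holistic understanding)?
RQ3: When using PubMedCLIP, are there dataset-specific differences between back-end visual encoders?
RQ4: For VQA in medicine, what are the relative gains from medical-domain supervision versus general-domain supervision?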

Key Findings

Model	Encoder	Open (VQA-RAD)	Closed (VQA-RAD)	Overall (VQA-RAD)	Open (SLAKE)	Closed (SLAKE)	Overall (SLAKE)
MAML + AE	(not PubMedCLIP)	60.8%	73.2%	69.2%	76.8%	80.6%	78.3%
MEVF	CLIP-ViT-B + AE	65.4%	75.0%	65.4%?	80.5%	77.7%	79.5%
MEVF	CLIP-RN50 + AE	71.3%	80.0%	71.3%	81.5%	79.7%	79.7%
MEVF	CLIP-RN50x4 + AE	71.3%	79.4%	71.3%	80.5%	78.7%	78.7%
PubMedCLIP-ViT-B + AE	PubMedCLIP-ViT-B + AE	71.1%	79.5%	71.1%	82.5%	80.1%	80.1%
PubMedCLIP-RN50 + AE	PubMedCLIP-RN50 + AE	72.1%	80.0%	72.1%	81.4%	79.3%	79.3%
PubMedCLIP-RN50x4 + AE	PubMedCLIP-RN50x4 + AE	71.8%	79.7%	71.8%	81.3%	79.1%	79.1%

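PubMedCLIP and CLIP-based encoders improve MedVQA accuracy over MAML-based visual encoders.
PubMedCLIP achieves the best results in several settings, with overall gains of up to 3% on SLAKE and VQA-RAD depending on the back end.
PubMedCLIP-RN50x4 can underperform PubMedCLIP-RN50 on certain datasets due to overfitting, while the ViT-based back end performs well on SLAKE.
Differences in dataset distribution (VQA-RAD vs. SLAKE) determine which visual back end (ResNet vs. ViT) is more effective.
In qualitative examples, given the same question-image input, PubMedCLIP provides more accurate and more relevant answers than the MAML-based MEVF.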
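
To make the reported metrics concrete, the sketch below shows one common way such numbers are computed: overall accuracy pooled over open-ended and closed-ended questions, a mean over repeated runs, and an absolute gain between two encoders. The function names and all counts and accuracies are hypothetical and are not taken from the paper; the sketch also assumes that reported gains are absolute differences in accuracy.

```python
from statistics import mean


def overall_accuracy(open_correct, open_total, closed_correct, closed_total):
    """Pooled accuracy over open-ended and closed-ended questions."""
    total = open_total + closed_total
    if total == 0:
        raise ValueError("at least one question is required")
    return (open_correct + closed_correct) / total


def mean_accuracy_over_runs(run_accuracies):
    """Average accuracy across repeated training runs."""
    return mean(run_accuracies)


def absolute_gain(candidate_accuracy, baseline_accuracy):
    """Difference in accuracy between a candidate and a baseline encoder."""
    return candidate_accuracy - baseline_accuracy


# Hypothetical counts for a single run: 100 open and 200 closed questions.
single_run = overall_accuracy(open_correct=50, open_total=100,
                              closed_correct=150, closed_total=200)

# Hypothetical per-run overall accuracies for two encoders.
baseline = mean_accuracy_over_runs([0.69, 0.70, 0.71])
candidate = mean_accuracy_over_runs([0.72, 0.73, 0.74])
print(single_run, absolute_gain(candidate, baseline))
```

Because the pooled figure weights each question type by its number of questions, the overall accuracy always lies between the open and closed accuracies.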


This review was generated by AI and checked by a human editor.