QUICK REVIEW

[Paper Review] Quilt-1M: One Million Image-Text Pairs for Histopathology

Wisdom Oluchi Ikezogwo, Mehmet Saygın Seyfioğlu|PubMed|Jun 20, 2023

AI in cancer detection | References: 21 | Cited by: 54

One-sentence summary

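Quilt-1M is a large-scale, open-source histopathology image-text dataset (1M image-text pairs) built from Quilt and additional sources. It is used to train QuiltNet, a CLIP-style model that outperforms prior state-of-the-art models on zero-shot classification and linear probing across 13 external histopathology datasets, as well as on cross-modal retrieval.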

ABSTRACT

Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of analogous data in the medical field, specifically in histopathology, has halted comparable progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 768,826 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: Quilt-1M, with 1M paired image-text samples, marking it as the largest vision-language histopathology dataset to date. We demonstrate the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new histopathology images across 13 diverse patch-level datasets of 8 different sub-pathologies and cross-modal retrieval tasks.

Motivation and objectives

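Motivate the need for large-scale vision-language data in histopathology, enabling richer representations than single-label annotations allow.
Build Quilt from histopathology videos on YouTube, using automated curation and quality control to produce rich image-text pairs.
Extend Quilt with additional public pathology data sources (LAION, Twitter, PubMed) to form Quilt-1M and increase its diversity.
Demonstrate the usefulness of Quilt-1M by fine-tuning a CLIP-style model (QuiltNet) and evaluating it on a range of downstream histopathology datasets.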

Proposed method

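Curate Quilt from 1,087 hours of histopathology videos on YouTube, yielding 437,878 images and 802,144 aligned text pairs at magnifications from 10x to 40x.
Combine a histopathology image classifier, automatic speech recognition (ASR), LLMs, and domain databases such as UMLS to extract image-text pairs from video frames and narration and to remove noise.
Apply a four-stage text denoising and quality-control pipeline that combines ASR, RAKE keyword extraction, UMLS validation, and LLM-based correction to obtain clean medical text and region-of-interest (ROI) text.
Align the image and language modalities by splitting each video into scene chunks, extracting chunk-level medical and ROI text from the ASR output, selecting representative images, and mapping images to related text through keyword overlap.
Build Quilt-1M by combining Quilt with PubMed Open Access, histopathology data drawn from LAION-5B, and OpenPath Twitter data, for a total of one million image-text pairs.
Fine-tune an OpenAI CLIP baseline to create QuiltNet, then evaluate zero-shot classification and linear probing on 13 external histopathology datasets, along with cross-modal retrieval.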
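
The keyword-overlap step in the alignment stage can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `min_overlap` threshold, and the toy vocabulary are all hypothetical, and simple token matching against a fixed term list stands in for the RAKE and UMLS stages. It assumes each representative frame already carries the set of keywords spoken while it was on screen.

```python
def extract_keywords(text, vocabulary):
    """Return the vocabulary terms that appear in `text`.

    Hypothetical stand-in for RAKE extraction plus UMLS validation:
    here we only lowercase, strip punctuation, and intersect with a term list.
    """
    tokens = {tok.strip(".,;:()").lower() for tok in text.split()}
    return tokens & vocabulary


def align(frames, texts, vocabulary, min_overlap=1):
    """Pair each frame with every text snippet sharing enough keywords.

    frames: dict mapping a frame id to the set of keywords tied to that frame
            (assumed to be available from an earlier step).
    texts:  list of denoised text snippets from the same scene chunk.
    Returns a list of (frame_id, text, shared_keywords) tuples.
    """
    pairs = []
    for frame_id, frame_keywords in frames.items():
        for text in texts:
            shared = frame_keywords & extract_keywords(text, vocabulary)
            if len(shared) >= min_overlap:
                pairs.append((frame_id, text, sorted(shared)))
    return pairs


# Toy example with made-up frames and narration.
vocabulary = {"carcinoma", "stroma", "necrosis"}
frames = {"frame_001": {"carcinoma"}, "frame_002": {"necrosis"}}
texts = [
    "Invasive carcinoma with desmoplastic stroma.",
    "Central necrosis is visible.",
]
for frame_id, text, shared in align(frames, texts, vocabulary):
    print(frame_id, shared, text)
```

Raising `min_overlap` trades recall for precision: fewer pairs survive, but each surviving pair shares more terms between the frame and the text.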

Experimental results

Research questions

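RQ1: Can a large-scale, multi-source vision-language histopathology dataset improve representation learning for histopathology tasks in zero-shot and few-shot regimes?
RQ2: Does a CLIP-style model trained on Quilt-1M (QuiltNet) outperform existing histopathology vision-language models across different sub-pathologies and retrieval tasks?
RQ3: How does using narrated YouTube video data with domain-specific text processing affect the quality and usefulness of image-text alignment in histopathology?
RQ4: How does combining Quilt with additional data sources (LAION, PubMed, Twitter) affect downstream classification and retrieval performance in histopathology?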

Key findings

Quilt-1M is the largest open vision-language histopathology dataset to date, with 1,000,000 image-text pairs. Quilt itself contributes 437,878 images and 802,144 text pairs; Quilt-1M combines Quilt with the other sources.
QuiltNet, fine-tuned from a pre-trained CLIP model on Quilt-1M, outperforms the CLIP, BiomedCLIP, and PLIP baselines on zero-shot classification and linear probing across 13 external histopathology datasets.
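
For readers unfamiliar with how CLIP-style models are scored on these tasks, the following is a minimal sketch of zero-shot classification and a Recall@K retrieval metric using cosine similarity. It is an illustration under stated assumptions, not the paper's evaluation code: the embeddings are toy vectors, whereas in practice they would come from the model's image and text encoders, and the function names are hypothetical. All vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_predict(image_emb, class_embs):
    """Pick the class whose text-prompt embedding is closest to the image.

    class_embs maps a class name to the embedding of a prompt describing it.
    """
    return max(class_embs, key=lambda name: cosine(image_emb, class_embs[name]))


def recall_at_k(query_embs, gallery_embs, k):
    """Fraction of queries whose true match ranks in the top k.

    Assumes query i's correct match is gallery item i (paired data).
    """
    hits = 0
    for i, query in enumerate(query_embs):
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda j: cosine(query, gallery_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy 2-D embeddings, made up for illustration only.
class_embs = {"tumor": [1.0, 0.0], "normal": [0.0, 1.0]}
print(zero_shot_predict([0.9, 0.1], class_embs))

text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8]]
print(recall_at_k(text_embs, image_embs, k=1))
```

No labeled training data is used in the zero-shot case: the class names are turned into text prompts, and the prediction is simply the nearest prompt in the shared embedding space. Linear probing, by contrast, trains a linear classifier on frozen image embeddings.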

This review was generated by AI and checked by a human editor.