QUICK REVIEW

[Paper Review] Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers

Bo Dong, Wenhai Wang|arXiv (Cornell University)|Aug 16, 2021

Advanced Neural Network Applications|References: 120|Citations: 170

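ONE-LINE SUMMARY

Polyp-PVT pairs a Pyramid Vision Transformer encoder with three additional modules (CFM, CIM, and SAM) to improve polyp segmentation. It reports state-of-the-art or competitive Dice scores on multiple benchmarks.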

ABSTRACT

Most polyp segmentation methods use CNNs as their backbone, leading to two key issues when exchanging information between the encoder and decoder: 1) taking into account the differences in contribution between different-level features and 2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the image acquisition influence and elusive properties of polyps, we introduce three standard modules, including a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM is used to collect the semantic and location information of polyps from high-level features; the CIM is applied to capture polyp information disguised in low-level features, and the SAM extends the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noises in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.

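Research Motivation and Objectives

Address the limitations of CNN backbones in cross-level feature fusion for polyp segmentation.
Introduce a Transformer-based encoder (PVT) that learns robust multi-scale representations.
Propose three modules (CFM, CIM, and SAM) that fuse high-level and low-level features and suppress noise.
Evaluate Polyp-PVT on five challenging datasets and compare it with state-of-the-art methods.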

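Proposed Method

Adopt the Pyramid Vision Transformer (PVTv2) as the encoder, extracting multi-scale features X1–X4 from the input image.
Use the Cascaded Fusion Module (CFM) to progressively fuse the high-level features and produce T1.
Apply the Camouflage Identification Module (CIM), which uses channel and spatial attention, to enhance the low-level feature X1 into T2.
Introduce the Similarity Aggregation Module (SAM), which combines non-local and graph convolution operations, to fuse T1 and T2 into the final feature Z.
Predict the segmentation with a 1x1 conv head, and train with a main loss (IoU + BCE) plus an auxiliary loss on the intermediate output.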
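The training objective in the last step can be illustrated with a minimal sketch. It assumes a plain (unweighted) soft IoU term plus mean binary cross-entropy, applied equally to the main and intermediate predictions. The function names, the smoothing constant, and the equal weighting are illustrative assumptions, not the authors' implementation, which may weight pixels or terms differently.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def soft_iou_loss(probs, targets, smooth=1.0):
    """1 - soft IoU; `smooth` (an assumed constant) keeps the ratio defined."""
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t for p, t in zip(probs, targets)) - inter
    return 1.0 - (inter + smooth) / (union + smooth)

def segmentation_loss(probs, targets):
    """IoU + BCE for one predicted map."""
    return soft_iou_loss(probs, targets) + bce_loss(probs, targets)

def total_loss(main_probs, aux_probs, targets):
    """Main loss plus auxiliary loss, equally weighted (an assumption)."""
    return segmentation_loss(main_probs, targets) + segmentation_loss(aux_probs, targets)

# Toy 4-pixel example: the main output is closer to the mask than the auxiliary one.
mask = [1, 0, 1, 0]
print(total_loss([0.9, 0.1, 0.8, 0.2], [0.6, 0.4, 0.6, 0.4], mask))
```

Supervising the intermediate map alongside the final one gives the high-level branch its own gradient signal, which is the usual motivation for an auxiliary loss.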

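Experimental Results

Research Questions

RQ1: How does Polyp-PVT perform on polyp segmentation compared with CNN-based backbones on standard benchmarks?
RQ2: How do CFM, CIM, and SAM contribute to overall performance and to robustness under difficult conditions such as noise, camouflage, and cross-domain data?
RQ3: How does the transformer-based encoder cope with appearance changes, small polyps, and rotation in endoscopic images?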

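Key Findings

Polyp-PVT achieves strong performance across datasets, for example on Kvasir-SEG (mDic 0.917) and ClinicDB (mDic 0.937).
On ColonDB, Polyp-PVT achieves mDic 0.808, outperforming SANet (as reported in the paper).
On ETIS, Polyp-PVT achieves mDic 0.787, outperforming SANet by a notable margin.
On Endoscene, Polyp-PVT reaches mDic 0.900 and mIoU 0.833, showing robust performance under difficult conditions.
Overall, Polyp-PVT is robust to appearance changes, small objects, and rotation, and outperforms representative baselines such as SANet and PraNet.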
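For readers unfamiliar with the metrics above, the following sketch shows how mean Dice and mean IoU can be computed. It assumes that mDic and mIoU are per-image Dice and IoU scores averaged over a dataset, computed on already-binarized masks. The helper names are hypothetical, and the paper's evaluation toolbox may threshold or average differently.

```python
def dice_and_iou(pred, truth):
    """Dice and IoU for one pair of flattened binary masks (0/1 values)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)  # overlapping pixels
    pred_sum = sum(1 for p in pred if p)
    truth_sum = sum(1 for t in truth if t)
    if pred_sum + truth_sum == 0:
        # Both masks are empty: treat as a perfect match (a convention, assumed here).
        return 1.0, 1.0
    dice = 2 * tp / (pred_sum + truth_sum)
    iou = tp / (pred_sum + truth_sum - tp)
    return dice, iou

def mean_scores(pairs):
    """Average Dice and IoU over (prediction, ground truth) pairs."""
    scores = [dice_and_iou(p, t) for p, t in pairs]
    n = len(scores)
    return sum(d for d, _ in scores) / n, sum(i for _, i in scores) / n

# Two toy 4-pixel "images": one over-segmented, one predicted exactly.
pairs = [([1, 1, 0, 0], [1, 0, 0, 0]),
         ([0, 1, 1, 0], [0, 1, 1, 0])]
m_dice, m_iou = mean_scores(pairs)
print(f"mDic={m_dice:.3f} mIoU={m_iou:.3f}")
```

Dice is never lower than IoU for the same prediction, which is consistent with the Endoscene figures quoted above (mDic 0.900 against mIoU 0.833).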

This review was generated by AI and checked by a human editor.