QUICK REVIEW

[Paper Review] Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

Xiaokang Chen, Kwan-Yee Lin|arXiv (Cornell University)|Jul 17, 2020

Advanced Neural Network Applications|References: 55|Citations: 51

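One-Sentence Summary

The paper proposes a bi-directional cross-modality encoder, built around a Separation-and-Aggregation (SA) Gate and Bi-direction Multi-step Propagation (BMP), that robustly fuses RGB with noisy depth (HHA) signals for RGB-D semantic segmentation. Plugged into existing backbones, it achieves state-of-the-art results on NYU Depth V2 and CityScapes.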

ABSTRACT

Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images for providing a geometric counterpart to the RGB representation. Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and models the problem as a cross-modal feature fusion to obtain better feature representations to achieve more accurate segmentation. This, however, may not lead to satisfactory results as actual depth data are generally noisy, which might worsen the accuracy as the networks go deeper. In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternatively. The key of the proposed architecture is a novel Separation-and-Aggregation Gating operation that jointly filters and recalibrates both representations before cross-modality aggregation. Meanwhile, a Bi-direction Multi-step Propagation strategy is introduced, on the one hand, to help to propagate and fuse information between the two modalities, and on the other hand, to preserve their specificity along the long-term propagation process. Besides, our proposed encoder can be easily injected into the previous encoder-decoder structures to boost their performance on RGB-D semantic segmentation. Our model outperforms state-of-the-arts consistently on both in-door and out-door challenging datasets. Code of this work is available at https://charlescxk.github.io/

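Motivation and Objectives

Motivate robust RGB-D fusion for real-world conditions in which depth data is noisy and misaligned with the RGB image.
Develop a cross-modality guided encoder that recalibrates each modality before fusion.
Introduce the Separation-and-Aggregation Gate (SA-Gate), which filters out depth noise and fuses the modalities adaptively.
Incorporate Bi-direction Multi-step Propagation (BMP), which preserves modality specificity during encoding.
Show plug-and-play compatibility with existing RGB segmentation decoders and improve their performance.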

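Proposed Method

SA-Gate consists of two parts: Feature Separation (FS), which uses cross-modality attention to filter noisy depth features, and Feature Aggregation (FA), which fuses RGB and depth through spatial gates.
FS applies global pooling to the concatenated RGB and depth features to build a cross-modality attention vector, filters the depth features by channel-wise scaling, and recalibrates the RGB input as RGB_rec = HHA_filtered + RGB_in.
FA generates spatial gates from the recalibrated RGB and HHA features and uses them as additive weights to form M, a weighted fusion of RGB_in and HHA_in. A_rgb and A_hha act as softmax-normalized spatial weights.
A final residual-like fusion yields RGB_out and HHA_out, which are passed forward through the encoder (bi-directional propagation).
BMP propagates the fused features from layer to layer, refining the representations throughout the encoder while maintaining modality specificity.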
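
The data flow of the gate can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the attention MLP is reduced to a single linear layer plus sigmoid, the HHA branch is assumed to mirror the RGB branch, the residual-like output is assumed to be an average of each input with M, and `np.tanh` stands in for a backbone stage. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_separation(src, dst, w, b):
    """Filter `src` by cross-modality channel attention, then add it to `dst`.

    src, dst: (C, H, W) feature maps. w: (C, 2C) and b: (C,) are a
    single-layer stand-in (assumption) for the attention MLP.
    """
    # Global average pooling over both modalities, concatenated to (2C,).
    pooled = np.concatenate([dst.mean(axis=(1, 2)), src.mean(axis=(1, 2))])
    attn = sigmoid(w @ pooled + b)            # (C,) channel weights in (0, 1)
    filtered = src * attn[:, None, None]      # channel-wise scaling
    return filtered + dst                     # e.g. RGB_rec = HHA_filtered + RGB_in

def feature_aggregation(rgb_in, hha_in, rgb_rec, hha_rec, w_gate):
    """w_gate: (2, 2C), acting like a 1x1 convolution that emits two gate maps."""
    stacked = np.concatenate([rgb_rec, hha_rec], axis=0)    # (2C, H, W)
    logits = np.einsum("kc,chw->khw", w_gate, stacked)      # (2, H, W)
    logits = logits - logits.max(axis=0, keepdims=True)     # numerically stable softmax
    gates = np.exp(logits)
    gates = gates / gates.sum(axis=0, keepdims=True)        # A_rgb + A_hha = 1 per pixel
    a_rgb, a_hha = gates[0], gates[1]
    merged = a_rgb * rgb_in + a_hha * hha_in                # M, broadcast over channels
    return merged, a_rgb, a_hha

def sa_gate(rgb_in, hha_in, p):
    rgb_rec = feature_separation(hha_in, rgb_in, p["w_hha"], p["b_hha"])
    hha_rec = feature_separation(rgb_in, hha_in, p["w_rgb"], p["b_rgb"])  # assumed symmetric
    merged, a_rgb, a_hha = feature_aggregation(rgb_in, hha_in, rgb_rec, hha_rec, p["w_gate"])
    # Assumed residual-like update: average each stream with the fused map M.
    rgb_out = 0.5 * (rgb_in + merged)
    hha_out = 0.5 * (hha_in + merged)
    return rgb_out, hha_out, merged, a_rgb, a_hha

def encode(rgb, hha, gate_params_per_stage, layer=np.tanh):
    """Multi-step propagation: both streams are updated after every stage."""
    fused = []
    for p in gate_params_per_stage:
        rgb, hha = layer(rgb), layer(hha)     # placeholder for a backbone stage
        rgb, hha, merged, _, _ = sa_gate(rgb, hha, p)
        fused.append(merged)                  # fused features handed to a decoder
    return fused

def make_params(c, rng):
    # Random weights purely for the demo; a real model would learn these.
    return {
        "w_rgb": rng.normal(size=(c, 2 * c)), "b_rgb": np.zeros(c),
        "w_hha": rng.normal(size=(c, 2 * c)), "b_hha": np.zeros(c),
        "w_gate": rng.normal(size=(2, 2 * c)),
    }

rng = np.random.default_rng(0)
C, H, W = 4, 3, 5
rgb = rng.normal(size=(C, H, W))
hha = rng.normal(size=(C, H, W))
rgb_out, hha_out, merged, a_rgb, a_hha = sa_gate(rgb, hha, make_params(C, rng))
fused = encode(rgb, hha, [make_params(C, rng) for _ in range(3)])
print(merged.shape, len(fused))
```

Because the two gates are softmax-normalized per pixel, M is a convex combination of the two inputs at every location, so neither modality can dominate everywhere unless the gate learns to prefer it.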

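Experimental Results

Research Questions

RQ1: Does a cross-modality gate that explicitly separates features before aggregating them improve RGB-D semantic segmentation under depth noise?
RQ2: Does bi-directional feature propagation enable effective cross-modality fusion while preserving modality-specific information?
RQ3: How well does the proposed encoder fit into existing RGB-based backbones, and does it improve performance on indoor and outdoor datasets?
RQ4: How do SA-Gate and BMP affect accuracy and efficiency compared with the RGB-D baseline and existing RGB-D methods?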

Key Findings

Method	mIoU (%)	Pixel Acc. (%)
RGB-D baseline	46.7	-
Ours	52.4	77.9
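
For readers less familiar with the two metrics, the following sketch shows how mIoU and pixel accuracy are commonly computed from a confusion matrix. The toy label maps and function names are illustrative only, and the paper's exact evaluation protocol (for example, how unlabeled pixels are ignored) is not reproduced here.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    # Rows index the ground-truth class, columns the predicted class.
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    # Fraction of pixels whose predicted class matches the ground truth.
    return np.trace(cm) / cm.sum()

def mean_iou(cm):
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0                         # skip classes absent from both maps
    return (tp[valid] / union[valid]).mean()

# Toy 2x3 label maps with three classes.
target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
cm = confusion_matrix(pred, target, 3)
print(pixel_accuracy(cm))   # 4 of 6 pixels correct
print(mean_iou(cm))         # per-class IoU of 1/3, 2/3 and 1/2, averaged
```

mIoU weights every class equally, so it penalizes errors on rare classes more strongly than pixel accuracy does.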

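On NYU Depth V2, the proposed method reaches 52.4 mIoU and 77.9 pixel accuracy, a 5.7-point mIoU gain over the RGB-D baseline (46.7 mIoU).
The approach yields large improvements across decoders, demonstrating its plug-and-play capability.
On CityScapes, it shows strong gains even though the depth is noisy: state-of-the-art performance on the validation set, competitive results on the test set, and a large improvement over the RGB baseline.
SA-Gate and BMP together bring a larger gain than either component alone, indicating complementary roles in cross-modality feature propagation.
The model improves accuracy while using less memory and computation than the RGB-D baseline (e.g., Table 1 reports better mIoU at lower FLOPs than the RGB-D baseline).
Qualitative visualizations show that SA-Gate learns modality-specific focus (RGB on fine details, HHA on illumination-robust regions), improving the handling of boundaries and textures.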

This review was generated by AI and checked by a human editor.