QUICK REVIEW

[Paper Review] Weakly-supervised DCNN for RGB-D Object Recognition in Real-World Applications Which Lack Large-scale Annotated Training Data

Sun Li, Cheng Zhao|arXiv (Cornell University)|Mar 19, 2017

Advanced Neural Network Applications | References: 3 | Citations: 54

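One-Sentence Summary

This paper proposes DCNN-GPC, a weakly-supervised RGB-D object recognition framework that uses a small amount of labeled data together with large-scale unlabeled RGB-D data. It propagates labels with Gaussian Process Classification, which allows a multi-modal DCNN to be trained end-to-end without bounding-box annotations. It also introduces depth pretraining on synthetic depth maps rendered from CAD models and a boundary-aware 3D object detector for real-time detection.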

ABSTRACT

This paper addresses the problem of RGBD object recognition in real-world applications, where large amounts of annotated training data are typically unavailable. To overcome this problem, we propose a novel, weakly-supervised learning architecture (DCNN-GPC) which combines parametric models (a pair of Deep Convolutional Neural Networks (DCNN) for RGB and D modalities) with non-parametric models (Gaussian Process Classification). Our system is initially trained using a small amount of labeled data, and then automatically propagates labels to large-scale unlabeled data. We first run 3D-based objectness detection on RGBD videos to acquire many unlabeled object proposals, and then employ DCNN-GPC to label them. As a result, our multi-modal DCNN can be trained end-to-end using only a small amount of human annotation. Finally, our 3D-based objectness detection and multi-modal DCNN are integrated into a real-time detection and recognition pipeline. In our approach, bounding-box annotations are not required and boundary-aware detection is achieved. We also propose a novel way to pretrain a DCNN for the depth modality, by training on virtual depth images projected from CAD models. We pretrain our multi-modal DCNN on public 3D datasets, achieving performance comparable to state-of-the-art methods on Washington RGBD Dataset. We then finetune the network by further training on a small amount of annotated data from our novel dataset of industrial objects (nuclear waste simulants). Our weakly supervised approach has been demonstrated to be highly effective in solving a novel RGBD object recognition application which lacks human annotations.
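The abstract's "virtual depth images projected from CAD models" can be pictured with a minimal sketch. This is not the authors' rendering procedure, which the review does not describe. It assumes the CAD surface has already been sampled into 3D points in camera coordinates and uses a simple pinhole projection with a z-buffer. The function name `render_depth` and the parameters `focal` and `background` are hypothetical.

```python
def render_depth(points, width, height, focal, background=0.0):
    """Project 3D points (x, y, z) in camera coordinates to a depth image.

    Assumption: a pinhole camera with the principal point at the image
    centre. Each pixel keeps the nearest depth that projects onto it.
    """
    cx = (width - 1) / 2.0
    cy = (height - 1) / 2.0
    depth = [[background] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # point is behind the camera
        u = int(round(focal * x / z + cx))
        v = int(round(focal * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            current = depth[v][u]
            # z-buffer test: keep the closest surface point per pixel
            if current == background or z < current:
                depth[v][u] = z
    return depth


# Hypothetical toy surface: two points on the optical axis, one off-axis.
toy_points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
toy_depth = render_depth(toy_points, width=5, height=5, focal=2.0)
```

In the toy call, the two points on the optical axis land on the same pixel and the nearer one (depth 1.0) wins, which is the occlusion behaviour a rendered depth map needs.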

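Motivation and Objectives

Motivate RGB-D object recognition in real-world settings where annotated data is scarce.
Develop a weakly-supervised learning architecture that combines RGB and depth DCNNs with Gaussian Process Classification.
Enable end-to-end training of a multi-modal DCNN from a small labeled set and a large set of unlabeled proposals.
Pretrain the depth network on synthetic depth images generated from CAD models so that it can exploit 3D information.
Demonstrate boundary-aware real-time detection on industrial RGB-D data and compare it against fully supervised baselines.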

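Proposed Method

Use a three-component architecture: RGB-Net, Depth-Net, and a non-parametric Gaussian Process Classifier (GPC).
Pretrain RGB-Net on ImageNet, and pretrain Depth-Net on synthetic depth maps rendered from Model-Net CAD models.
Use DCNN-GPC with a multi-modal kernel to propagate labels from a small set of manually labeled objectness proposals to a large unlabeled set.
Train the multi-modal DCNN end-to-end with a softmax loss that incorporates the GP-labeled data (weak supervision).
Employ a real-time 3D objectness detector that generates boundary-aware RGB-D proposals without bounding boxes.
Integrate GP label propagation with DCNN finetuning for end-to-end training: fuse RGB and depth features through a product of kernels, tune the kernel hyperparameters, and optimize the EP-based (expectation propagation) inference.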
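The label-propagation step above can be sketched roughly as follows. This is an illustrative simplification, not the paper's implementation. It uses GP regression on ±1 labels (least-squares classification) in place of the EP-based GPC, handles only two classes, and treats short hand-written vectors as stand-ins for DCNN features. What it keeps from the description is the product of an RGB kernel and a depth kernel, and the propagation of labels from a small labeled set to unlabeled proposals. All function names, the length scales, `noise`, and `threshold` are hypothetical.

```python
import math


def rbf(a, b, length_scale):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * length_scale ** 2))


def product_kernel(s, t, ls_rgb=1.0, ls_depth=1.0):
    """Multi-modal kernel: product of an RGB kernel and a depth kernel.

    Each sample is a pair (rgb_feature, depth_feature).
    """
    return rbf(s[0], t[0], ls_rgb) * rbf(s[1], t[1], ls_depth)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def propagate_labels(labeled, labels, unlabeled, noise=0.1, threshold=0.5):
    """Assign +1/-1 to unlabeled samples, or None when the GP mean is weak.

    Stand-in for EP-based GPC: the GP posterior mean of a regression on
    +/-1 targets is thresholded instead of a class probability.
    """
    n = len(labeled)
    K = [[product_kernel(labeled[i], labeled[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, labels)
    result = []
    for u in unlabeled:
        mean = sum(product_kernel(u, labeled[i]) * alpha[i] for i in range(n))
        if abs(mean) >= threshold:
            result.append(1 if mean > 0 else -1)
        else:
            result.append(None)  # too uncertain: leave unlabeled
    return result


# Hypothetical toy data: (rgb_feature, depth_feature) pairs in two clusters.
labeled = [([0.0, 0.0], [0.0]), ([0.1, 0.0], [0.1]),
           ([3.0, 3.0], [3.0]), ([3.1, 3.0], [2.9])]
labels = [1.0, 1.0, -1.0, -1.0]
unlabeled = [([0.05, 0.0], [0.05]), ([3.0, 3.05], [3.0]), ([1.5, 1.5], [1.5])]
propagated = propagate_labels(labeled, labels, unlabeled)
```

On the toy data, the two proposals near a labeled cluster inherit that cluster's label and the proposal halfway between the clusters is left unlabeled, so `propagated` is `[1, -1, None]`. Because the kernels are multiplied, a proposal must be close to a labeled example in both the RGB and the depth feature space to receive a confident label.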

Experimental Results

Research Questions

RQ1: Can a weakly-supervised RGB-D object recognition system achieve competitive performance with minimal labeled data compared to fully supervised methods?
RQ2: Does leveraging synthetic-depth pretraining for the depth network improve transfer to real-world RGB-D data without color-mapping?
RQ3: Can a 3D-based objectness detector provide boundary-aware proposals suitable for end-to-end DCNN-GPC training?
RQ4: How well does a multi-modal DCNN trained with GP-labeled data perform on real industrial RGB-D recognition tasks?
RQ5: What are the end-to-end training benefits when integrating GP label propagation with DCNN finetuning?

Key Findings

Depth-Net pretrained from synthetic CAD-depth maps enables effective end-to-end learning on raw depth data without color-mapped input.
On the Washington RGB-D dataset, the proposed multi-modal DCNN achieves 91.8% recognition accuracy across 51 categories, outperforming most DCNN-based methods.
Depth pretraining on Model-Net yields competitive results for 3D-depth related tasks and facilitates transfer to Kinect-derived RGB-D data.
In industrial RGB-D data, the system achieves 80.85% instance-wise precision, 83.53% recall, and 82.17% F-score, with 75.52% precision, 70.39% recall, and 72.87% F-score pixel-wise.
The pipeline runs near real-time at 2-3 Hz (with down-sampling and lighter networks increasing to ~5 Hz), substantially faster than prior bounding-box based methods.
Compared with fully supervised R-CNN baselines, the weakly-supervised approach shows robustness to scale and pose variation due to automatic GP-driven labeling.
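The F-scores reported for the industrial data can be cross-checked against the precision and recall figures. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which this review does not state explicitly. The function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed metric)."""
    return 2.0 * precision * recall / (precision + recall)


# Figures quoted in the findings above, in percent.
instance_f = f_score(80.85, 83.53)  # reported instance-wise F-score: 82.17
pixel_f = f_score(75.52, 70.39)     # reported pixel-wise F-score: 72.87
```

Under that assumption both reported values are reproduced to within 0.01 percentage points. The small gap in the pixel-wise case is consistent with the precision and recall having been rounded before being reported.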


This review was generated by AI and checked by a human editor.