QUICK REVIEW

[Paper Review] Segmenting Transparent Object in the Wild with Transformer

Enze Xie, Wenjia Wang|arXiv (Cornell University)|2021. 01. 21.

Advanced Neural Network Applications | References: 41 | Citations: 25

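One-Line Summary

The paper introduces Trans10K-v2, a fine-grained transparent object segmentation dataset with 11 categories, and Trans2Seg, a transformer-based segmentation model that achieves state-of-the-art performance on Trans10K-v2 and also transfers to ADE20K.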

ABSTRACT

This work presents a new fine-grained transparent object segmentation dataset, termed Trans10K-v2, extending Trans10K-v1, the first large-scale transparent object segmentation dataset. Unlike Trans10K-v1 that only has two limited categories, our new dataset has several appealing benefits. (1) It has 11 fine-grained categories of transparent objects, commonly occurring in the human domestic environment, making it more practical for real-world application. (2) Trans10K-v2 brings more challenges for the current advanced segmentation methods than its former version. Furthermore, a novel transformer-based segmentation pipeline termed Trans2Seg is proposed. Firstly, the transformer encoder of Trans2Seg provides the global receptive field in contrast to CNN's local receptive field, which shows excellent advantages over pure CNN architectures. Secondly, by formulating semantic segmentation as a problem of dictionary look-up, we design a set of learnable prototypes as the query of Trans2Seg's transformer decoder, where each prototype learns the statistics of one category in the whole dataset. We benchmark more than 20 recent semantic segmentation methods, demonstrating that Trans2Seg significantly outperforms all the CNN-based methods, showing the proposed algorithm's potential ability to solve transparent object segmentation.

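Motivation and Objectives

Motivate robust segmentation of highly transparent objects in real-world scenes, for use in robotics and vision systems.
Provide a large, diverse, fine-grained dataset (Trans10K-v2) with high-quality masks and functional categories.
Propose a transformer-based segmentation architecture (Trans2Seg) that uses global context and learnable category prototypes to predict accurate masks.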

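Proposed Method

Proposes Trans2Seg, an architecture that combines a CNN backbone with an encoder-decoder transformer.
Uses a CNN backbone such as ResNet-50, with dilation in the last stage, to extract features and supply the feature map to the transformer encoder.
Uses a transformer decoder whose queries are a set of learnable class prototypes that attend to the encoder features, enabling dictionary-style category look-up.
Upsamples the decoder attention maps, fuses them with high-resolution CNN features through a small convolutional head, and performs the final pixel-wise classification with argmax.
Adds positional embeddings in the transformer encoder to recover the spatial information lost when the feature map is flattened.
Compares the transformer encoder-decoder design with SETR and DETR, highlighting the category prototype queries as the key difference for semantic segmentation.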
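
The dictionary look-up idea in the list above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: it assumes a tiny 2 x 2 feature map, zero-valued positional embeddings (learned in practice), and raw scaled dot-product scores in place of the full multi-head attention, upsampling, and convolutional fusion head. All function and variable names here are hypothetical.

```python
import math

def flatten_with_positions(feature_map, pos_embed):
    """Flatten an H x W grid of d-dim features (row-major) into H*W tokens
    and add one positional vector to each token."""
    tokens = [vec for row in feature_map for vec in row]
    return [[f + p for f, p in zip(tok, pos)]
            for tok, pos in zip(tokens, pos_embed)]

def prototype_scores(prototypes, tokens):
    """Scaled dot-product score of every class prototype (query)
    against every token (key). Returns one score list per class."""
    d = len(tokens[0])
    return [[sum(q * k for q, k in zip(proto, tok)) / math.sqrt(d)
             for tok in tokens]
            for proto in prototypes]

def classify_pixels(score_maps, height, width):
    """Pick, for each location, the class whose prototype scores highest."""
    n_classes = len(score_maps)
    flat = [max(range(n_classes), key=lambda c: score_maps[c][i])
            for i in range(height * width)]
    return [flat[r * width:(r + 1) * width] for r in range(height)]

# Toy input: a 2 x 2 feature map with 2-dim features (assumed values).
feature_map = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.0, 1.0], [1.0, 0.0]]]
pos_embed = [[0.0, 0.0] for _ in range(4)]   # zeros for clarity; learned in practice
prototypes = [[1.0, 0.0], [0.0, 1.0]]        # one hypothetical prototype per class

tokens = flatten_with_positions(feature_map, pos_embed)
score_maps = prototype_scores(prototypes, tokens)
labels = classify_pixels(score_maps, height=2, width=2)
print(labels)  # [[0, 1], [1, 0]]
```

Each prototype acts as a query that is matched against every spatial location, so the number of queries equals the number of categories rather than the number of objects, which is the contrast with DETR that the review highlights.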

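Experimental Results

Research Questions

RQ1: Can a transformer-based pipeline improve fine-grained transparent object segmentation over CNN-based methods?
RQ2: Does modeling segmentation as a dictionary look-up with learnable category prototypes improve mask quality and category discrimination?
RQ3: How does Trans2Seg perform on a large-scale, fine-grained transparent object dataset and on a general segmentation benchmark such as ADE20K?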

Key Findings

Trans10K-v2 contains 10,428 images with 11 fine-grained categories (shelf, jar, freezer, window, glass door, eyeglass, cup, glass wall, glass bowl, water bottle, storage box).
Trans2Seg significantly outperforms CNN-based methods on Trans10K-v2, achieving 72.15% mIoU and 94.14% pixel accuracy (vs. 69.00 mIoU for the previous SOTA TransLab).
Transformer encoder provides a larger global receptive field than CNNs, improving segmentation of transparent objects.
Replacing a CNN decoder with a Transformer decoder that uses learnable category prototypes as queries yields a further mIoU improvement (up to 72.1% in ablations).
On ADE20K, Trans2Seg reaches 39.7 mIoU, demonstrating transferability to general segmentation tasks.
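
For readers unfamiliar with the two metrics quoted above, the following sketch computes pixel accuracy and mIoU from a confusion matrix using their standard definitions. The review does not state the paper's exact evaluation protocol (for example, how background or ignored pixels are handled), so this is an assumption-laden illustration with hypothetical names and toy data, not the benchmark code.

```python
def confusion_matrix(pred, target, n_classes):
    """cm[t][p] counts pixels whose true class is t and predicted class is p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            cm[t][p] += 1
    return cm

def pixel_accuracy(cm):
    """Fraction of all pixels that were labeled correctly."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN),
    skipping classes that appear in neither prediction nor target."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious)

# Toy 2 x 2 label maps with two classes (assumed values).
target = [[0, 0],
          [1, 1]]
pred = [[0, 1],
        [1, 1]]

cm = confusion_matrix(pred, target, n_classes=2)
print(pixel_accuracy(cm))  # 0.75
print(mean_iou(cm))        # (1/2 + 2/3) / 2, about 0.583
```

Because mIoU averages over classes rather than pixels, a model can score high on pixel accuracy while scoring much lower on mIoU when small or rare categories are segmented poorly.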

This review was generated by AI and reviewed by a human editor.