QUICK REVIEW

[Paper Review] SelfReformer: Self-Refined Network with Transformer for Salient Object Detection

Yi Ke Yun, Weisi Lin | arXiv (Cornell University) | May 23, 2022

Visual Attention and Saliency Detection | Citations: 36

One-Line Summary

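SelfReformer combines a Transformer-based encoder, a patch-wise global context branch, Pixel Shuffle that leaves the ground truth untouched, and a Context Refinement Module that fuses global and local context, and it reports state-of-the-art salient object detection (SOD) results on five benchmark datasets.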

ABSTRACT

The global and local contexts significantly contribute to the integrity of predictions in Salient Object Detection (SOD). Unfortunately, existing methods still struggle to generate complete predictions with fine details. There are two major problems in conventional approaches: first, for global context, high-level CNN-based encoder features cannot effectively catch long-range dependencies, resulting in incomplete predictions. Second, downsampling the ground truth to fit the size of predictions will introduce inaccuracy as the ground truth details are lost during interpolation or pooling. Thus, in this work, we developed a Transformer-based network and framed a supervised task for a branch to learn the global context information explicitly. Besides, we adopt Pixel Shuffle from Super-Resolution (SR) to reshape the predictions back to the size of ground truth instead of the reverse. Thus details in the ground truth are untouched. In addition, we developed a two-stage Context Refinement Module (CRM) to fuse global context and automatically locate and refine the local details in the predictions. The proposed network can guide and correct itself based on the global and local context generated, thus is named, Self-Refined Transformer (SelfReformer). Extensive experiments and evaluation results on five benchmark datasets demonstrate the outstanding performance of the network, and we achieved the state-of-the-art.

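Motivation and Goals

Improve SOD by exploiting long-range dependencies through a Transformer backbone.
Explicitly model and fuse global and local context to improve the completeness and detail of predictions.
Use Pixel Shuffle to reshape predictions to the ground-truth size, rather than downsampling the ground truth, so that ground-truth details are not lost to interpolation or pooling.
Develop a two-stage Context Refinement Module (CRM) that refines predictions using global and local cues.
Demonstrate state-of-the-art performance on multiple SOD benchmark datasets.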

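Proposed Method

Use Pyramid Vision Transformer (PVT) as the encoder to model long-range dependencies.
Introduce a global context branch that learns an explicit global context map through patch-wise saliency prediction.
Apply Pixel Shuffle for up/downsampling between decoder stages to preserve fine details in the predictions.
Develop a Context Refinement Module (CRM) that fuses the global context with decoder features and outputs an enhanced local context map.
Train with a joint loss that combines a patch-wise global context loss with weighted BCE losses on the decoder stages, together with a two-stage CRM refinement process.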
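
The joint loss in the last item can be sketched as follows. This is a minimal illustration, not the authors' implementation: the review does not state how the BCE terms are weighted or what form the global context loss takes, so the sketch assumes one scalar weight per decoder stage and a plain BCE for the patch-wise branch. The names `bce`, `joint_loss`, and `stage_weights` are hypothetical.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def joint_loss(global_pred, global_target, stage_preds, stage_target, stage_weights):
    """Patch-wise global context BCE plus a weighted sum of per-stage BCE terms.

    Assumption: one scalar weight per decoder stage; the weighting scheme
    used in the paper is not given in this review.
    """
    if len(stage_preds) != len(stage_weights):
        raise ValueError("need exactly one weight per decoder stage")
    loss = bce(global_pred, global_target)
    for pred, weight in zip(stage_preds, stage_weights):
        loss += weight * bce(pred, stage_target)
    return loss


# Toy example: every prediction is 0.5 against a positive label, so each
# BCE term is ln 2 and the total is (1 + 1.0 + 0.5) * ln 2.
loss = joint_loss(
    global_pred=[0.5],
    global_target=[1],
    stage_preds=[[0.5], [0.5]],
    stage_target=[1],
    stage_weights=[1.0, 0.5],
)
```

In this sketch every decoder stage is compared against the same full-resolution target, which matches the idea of reshaping predictions to the ground-truth size instead of shrinking the ground truth.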

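Experimental Results

Research Questions

RQ1: Can a Transformer-based encoder improve global structural integrity in SOD compared with CNN backbones?
RQ2: Can a supervised patch-wise global context task provide explicit, controllable global context to guide decoding?
RQ3: Does Pixel Shuffle preserve ground-truth details across decoder stages in SOD better than interpolation or pooling?
RQ4: Can the Context Refinement Module effectively fuse global context with decoder features to improve the accuracy of local details?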

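Key Results

The proposed SelfReformer achieves state-of-the-art performance on standard SOD metrics across five benchmark datasets.
The global context branch, trained through patch-wise saliency prediction, provides a measurable benefit for saliency localization.
Pixel Shuffle preserves fine ground-truth structure better than interpolation or pooling, improving the detail fidelity of predictions.
The Context Refinement Module (CRM) uses global and local context to improve the completeness of predictions and refine local details.
Qualitative results show improved structural completeness and richer detail, including in challenging scenes with small objects or multiple salient regions.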
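
The Pixel Shuffle result above can be illustrated with a toy example. The sketch below is not taken from the paper: the 4x4 mask and the helper names `average_pool` and `pixel_shuffle` are hypothetical, and the channel ordering follows the usual sub-pixel convention (channel `i * r + j` fills row offset `i` and column offset `j`). It shows that average pooling a binary ground truth turns a one-pixel-wide line into fractional labels, while a low-resolution prediction with `r * r` channels can be rearranged to full resolution and compared with the untouched ground truth.

```python
def average_pool(mask, r):
    """Downsample an (H, W) mask by averaging each r x r block."""
    rows, cols = len(mask) // r, len(mask[0]) // r
    return [
        [
            sum(mask[y * r + i][x * r + j] for i in range(r) for j in range(r)) / (r * r)
            for x in range(cols)
        ]
        for y in range(rows)
    ]


def pixel_shuffle(channels, r):
    """Rearrange r*r channels of shape (H, W) into one (H*r, W*r) map."""
    if len(channels) != r * r:
        raise ValueError("expected exactly r*r channels")
    rows, cols = len(channels[0]), len(channels[0][0])
    out = [[0] * (cols * r) for _ in range(rows * r)]
    for i in range(r):
        for j in range(r):
            src = channels[i * r + j]  # channel for sub-pixel offset (i, j)
            for y in range(rows):
                for x in range(cols):
                    out[y * r + i][x * r + j] = src[y][x]
    return out


# Hypothetical 4x4 ground truth with a one-pixel-wide vertical line.
gt = [[0, 1, 0, 0] for _ in range(4)]

# Pooling the ground truth blurs the line into fractional labels.
pooled_gt = average_pool(gt, 2)

# A 2x2 prediction with 4 channels carries one value per full-resolution
# pixel, so after Pixel Shuffle it lines up with the original 4x4 mask.
pred_channels = [
    [[0, 0], [0, 0]],  # offset (0, 0)
    [[1, 0], [1, 0]],  # offset (0, 1)
    [[0, 0], [0, 0]],  # offset (1, 0)
    [[1, 0], [1, 0]],  # offset (1, 1)
]
pred_full = pixel_shuffle(pred_channels, 2)
```

Here `pooled_gt` contains 0.5 where the line used to be, so a loss computed at low resolution no longer sees a crisp boundary, whereas `pred_full` can reproduce the line exactly. The rearrangement itself only moves values around, so no ground-truth information is discarded.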

This review was generated by AI and reviewed by a human editor.