QUICK REVIEW

[Paper Review] CCNet: Criss-Cross Attention for Semantic Segmentation

Zilong Huang, Xinggang Wang | arXiv (Cornell University) | November 28, 2018

Advanced Neural Network Applications | References: 77 | Citations: 349

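One-Line Summary

CCNet introduces a recurrent criss-cross attention module to capture full-image contextual information efficiently, achieving state-of-the-art segmentation performance with less memory and computation than the non-local approach.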

ABSTRACT

Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the non-local block. 3) The state-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, video segmentation benchmark CamVid. In particular, our CCNet achieves the mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source codes are available at https://github.com/speedinghzl/CCNet.

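Research Motivation and Goals

Capture and model full-image contextual information for dense semantic segmentation.
Design a lightweight attention module that accumulates context along the criss-cross path.
Improve feature discriminability with a category consistent loss.
Extend the approach to 3D to handle video tasks and temporal context.
Demonstrate state-of-the-art performance on multiple segmentation benchmarks.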

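Proposed Method

Propose a criss-cross attention module that attends along each pixel's row and column, reducing the number of attention weights per position from N (as in a non-local block) to about 2√N, where N is the number of spatial positions.
Stack two criss-cross attention modules to form recurrent criss-cross attention (RCCA), which propagates information to all pixels.
Share parameters across the RCCA loops and fuse the resulting dense context with local features for segmentation prediction.
Introduce a category consistent loss that encourages intra-class feature cohesion and inter-class separation.
Extend RCCA to 3D to integrate temporal context for video data.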
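
The first two steps of the method can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes that queries, keys, and values are the raw input features, omitting the learned 1x1 projections a real module would use. It shows that each position attends to only H + W - 1 positions, and that a second pass lets information reach pixels outside the original criss-cross path.

```python
import math

def criss_cross_attention(feat):
    """One criss-cross attention pass; feat[h][w] is a list of C floats.

    Assumption: queries, keys, and values are the input features
    themselves (no learned projections), to keep the sketch short.
    """
    H, W = len(feat), len(feat[0])
    out = []
    for i in range(H):
        row_out = []
        for j in range(W):
            # Criss-cross path of (i, j): its row plus its column,
            # counting (i, j) once -> H + W - 1 positions.
            path = [(i, x) for x in range(W)]
            path += [(y, j) for y in range(H) if y != i]
            q = feat[i][j]
            # Dot-product affinity between the query and each path position.
            scores = [sum(a * b for a, b in zip(q, feat[y][x])) for y, x in path]
            # Numerically stable softmax over the path.
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            weights = [e / z for e in exps]
            # Weighted sum of path features, then a residual connection.
            ctx = [
                sum(wt * feat[y][x][c] for wt, (y, x) in zip(weights, path))
                for c in range(len(q))
            ]
            row_out.append([ctx[c] + q[c] for c in range(len(q))])
        out.append(row_out)
    return out

def recurrent_criss_cross_attention(feat, loops=2):
    """Apply the same pass `loops` times (the pass has no parameters here)."""
    for _ in range(loops):
        feat = criss_cross_attention(feat)
    return feat

# A 3x3 single-channel map with one active pixel at (0, 0).
feat = [[[0.0] for _ in range(3)] for _ in range(3)]
feat[0][0] = [1.0]
one_pass = recurrent_criss_cross_attention(feat, loops=1)
two_pass = recurrent_criss_cross_attention(feat, loops=2)
```

In this toy input, position (1, 1) shares neither a row nor a column with (0, 0), so it receives nothing after one pass. After the second pass it picks up the signal through (1, 0) and (0, 1), which is the propagation effect that motivates stacking two modules.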

Experimental Results

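Research Questions

RQ1: Can criss-cross attention efficiently capture full-image context for dense prediction?
RQ2: Does recurrently stacking criss-cross attention achieve full-image dependencies with reduced computation and memory?
RQ3: Does the category consistent loss improve the discriminative power of RCCA features?
RQ4: How does CCNet compare with non-local and other context-aggregation methods on major segmentation benchmarks?
RQ5: Can the approach be extended to 3D to handle temporal context in video data?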

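Key Results

CCNet achieves state-of-the-art mIoU scores of 81.9% on the Cityscapes test set, 45.76% on the ADE20K validation set, and 55.47% on the LIP validation set.
Compared with the non-local block, the recurrent criss-cross attention module uses about 11x less GPU memory and about 85% fewer FLOPs.
RCCA gathers dense contextual information through two sequential criss-cross attention passes that share parameters.
The category consistent loss, combined with RCCA, improves feature discriminability and segmentation performance.
3D criss-cross attention extends the approach to temporal context for video segmentation tasks.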
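
To give a rough sense of where the efficiency gain comes from, the sketch below counts attention weights for a non-local block versus criss-cross attention. The function name and the 97 x 97 feature-map size are assumptions chosen for illustration. The count covers the attention maps only, so it does not reproduce the measured 11x memory and 85% FLOPs figures, which reflect the whole module.

```python
def attention_weight_counts(height, width):
    """Return (non_local, criss_cross) attention-weight counts for one pass."""
    n = height * width
    # Non-local: every position attends to all N positions.
    non_local = n * n
    # Criss-cross: every position attends to its row and column (H + W - 1).
    criss_cross = n * (height + width - 1)
    return non_local, criss_cross

# Hypothetical feature-map size, used only for illustration.
non_local, criss_cross = attention_weight_counts(97, 97)

# RCCA runs two criss-cross passes, so compare against twice the count.
ratio = non_local / (2 * criss_cross)
```

Because the criss-cross count grows as N(H + W - 1) rather than N², the gap between the two widens as the feature map gets larger.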

This review was generated by AI and reviewed by a human editor.