[Paper Review] CCNet: Criss-Cross Attention for Semantic Segmentation
CCNet introduces a recurrent criss-cross attention module to capture full-image contextual information efficiently, achieving state-of-the-art segmentation results with lower memory and computation than non-local approaches.
Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet has the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory. 2) High computational efficiency. The recurrent criss-cross attention reduces FLOPs by about 85% compared with the non-local block. 3) State-of-the-art performance. We conduct extensive experiments on the semantic segmentation benchmarks Cityscapes and ADE20K, the human parsing benchmark LIP, the instance segmentation benchmark COCO, and the video segmentation benchmark CamVid. In particular, CCNet achieves mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are new state-of-the-art results. The source code is available at https://github.com/speedinghzl/CCNet.
Motivation & Objective
- Model full-image contextual information for dense semantic segmentation.
- Design a lightweight attention module that aggregates context along criss-cross paths.
- Increase discriminative power with a category consistent loss.
- Extend the approach to 3D for video tasks and temporal context.
- Demonstrate state-of-the-art performance on multiple segmentation benchmarks.
Proposed method
- Propose a criss-cross attention module in which each pixel attends only to the pixels in its own row and column, reducing the attention weights per position from N to about 2√N (H + W − 1 for an H × W feature map with N = H × W).
- Apply recurrent criss-cross attention (RCCA) by stacking two criss-cross attention modules, so that information from every pixel can reach every other pixel.
- Share parameters across the RCCA loops and fuse the resulting dense context with local features for segmentation predictions.
- Introduce a category consistent loss to encourage intra-class feature compactness and inter-class separation.
- Extend RCCA to 3D for video data and temporal context integration.
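The criss-cross attention and RCCA steps listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes identity query/key/value projections (the paper learns these), a plain residual connection, and nested lists instead of tensors, and the function names `criss_cross_attention` and `rcca` are hypothetical. Calling the same function twice stands in for the shared-parameter recurrence.

```python
import math

def criss_cross_attention(feat):
    """One criss-cross pass over feat, an H x W grid of C-dim feature vectors.

    Assumption: query, key and value are the raw features themselves
    (identity projections), which keeps the sketch parameter-free.
    """
    H, W = len(feat), len(feat[0])
    C = len(feat[0][0])
    out = [[None] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Criss-cross path of (i, j): its full row plus its full column,
            # with (i, j) itself counted once, so H + W - 1 positions.
            path = [(i, v) for v in range(W)] + [(u, j) for u in range(H) if u != i]
            q = feat[i][j]
            # Affinity: dot product between the query and every key on the path.
            scores = [sum(a * b for a, b in zip(q, feat[u][v])) for u, v in path]
            # Numerically stable softmax over the path.
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            weights = [e / z for e in exps]
            # Weighted sum of the values on the path, plus a residual connection.
            ctx = [sum(w * feat[u][v][c] for w, (u, v) in zip(weights, path))
                   for c in range(C)]
            out[i][j] = [x + y for x, y in zip(q, ctx)]
    return out

def rcca(feat, loops=2):
    """Apply the same criss-cross module `loops` times (two by default)."""
    for _ in range(loops):
        feat = criss_cross_attention(feat)
    return feat

# A 3 x 3 map with one channel and a single non-zero pixel at (0, 0).
feat = [[[0.0] for _ in range(3)] for _ in range(3)]
feat[0][0] = [1.0]
one_pass = criss_cross_attention(feat)
two_pass = rcca(feat, loops=2)
print(one_pass[2][2][0])      # 0.0: (2, 2) shares no row or column with (0, 0)
print(two_pass[2][2][0] > 0)  # True: the second pass relays it via (0, 2) and (2, 0)
```

In this toy input, pixel (2, 2) receives nothing from (0, 0) after one pass because the two pixels share neither a row nor a column; after the second pass the signal arrives through the intermediate pixels (0, 2) and (2, 0), which is the intuition behind stacking two modules.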
Experimental results
Research questions
- RQ1: Can criss-cross attention efficiently capture full-image context for dense predictions?
- RQ2: Does recurrent stacking of criss-cross attention achieve full-image dependencies with reduced computation and memory?
- RQ3: Does a category consistent loss improve the discriminability of RCCA features?
- RQ4: How does CCNet's performance compare to non-local and other context-aggregation methods on major segmentation benchmarks?
- RQ5: Can the approach be extended to 3D to handle temporal context in video data?
Key findings
- CCNet achieves state-of-the-art results on Cityscapes test (mIoU 81.9%), ADE20K validation (mIoU 45.76%), and LIP validation (mIoU 55.47%).
- The recurrent criss-cross attention module requires about 11× less GPU memory and about 85% fewer FLOPs than the non-local block.
- RCCA enables dense contextual information gathering with two sequential criss-cross attention passes while sharing parameters.
- Category consistent loss improves feature discrimination and segmentation performance when combined with RCCA.
- 3D criss-cross attention extends the approach to temporal contexts for video segmentation tasks.
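As a rough check on the efficiency and coverage findings above, the sketch below counts attention weights for a non-local block versus one criss-cross pass, and verifies by set expansion that two passes connect a pixel to the whole map. The helper names and the 97 × 97 feature-map size are assumptions chosen for illustration. The counts cover attention weights only, so they are not expected to reproduce the measured 11× memory and 85% FLOPs figures, which include other costs.

```python
def criss_cross_path(i, j, H, W):
    """Positions in the same row or column as (i, j), including (i, j)."""
    return {(i, v) for v in range(W)} | {(u, j) for u in range(H)}

def reachable(i, j, H, W, loops):
    """Positions whose features can influence (i, j) after `loops` passes."""
    reached = {(i, j)}
    for _ in range(loops):
        reached = set().union(*(criss_cross_path(u, v, H, W) for u, v in reached))
    return reached

def affinity_counts(H, W):
    """Attention weights for a non-local block vs. one criss-cross pass."""
    N = H * W
    non_local = N * N              # every pixel attends to every pixel
    criss_cross = N * (H + W - 1)  # every pixel attends to its row and column
    return non_local, criss_cross

H, W = 5, 7
print(len(reachable(0, 0, H, W, loops=1)))  # 11, i.e. H + W - 1
print(len(reachable(0, 0, H, W, loops=2)))  # 35, i.e. H * W (the full map)

# Hypothetical 97 x 97 feature map; two RCCA loops double the criss-cross count.
non_local, criss_cross = affinity_counts(97, 97)
print(non_local, criss_cross)  # 88529281 1815937
```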
This review was created by AI and reviewed by human editors.