[Paper Review] Hybrid Task Cascade for Instance Segmentation
HTC interleaves detection and segmentation in a multi-stage cascade, adds mask feature flow and a semantic context branch, improving mask AP on COCO.
Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguish hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4 mask AP, a 1.5-point improvement over a strong Cascade Mask R-CNN baseline on the MS COCO dataset. Moreover, our overall system achieves 48.6 mask AP on the test-challenge split, ranking 1st in the COCO 2018 Challenge Object Detection Task. Code is available at: https://github.com/open-mmlab/mmdetection.
Motivation & Objective
- Improve instance segmentation by building a cascade with strong information flow between the detection and segmentation tasks.
- Propose Hybrid Task Cascade (HTC) that interweaves detection and segmentation at each stage.
- Investigate the benefits of mask information flow and spatial context from a semantic branch.
- Demonstrate end-to-end trainability and state-of-the-art performance on COCO test-dev/test-challenge.
Proposed method
- Introduce a three-stage cascade where bbox regression and mask prediction are progressively refined in a joint multi-task pipeline.
- Add direct connections between mask branches across stages to enable mask information flow.
- Incorporate a fully convolutional semantic segmentation branch to provide spatial context and fuse its features with box/mask branches.
- Pool the semantic feature map per RoI with RoIAlign and combine the result with the box and mask RoI features before prediction.
- Train end to end with a multi-task loss summed over stages and tasks: coefficients alpha_t weight the box and mask losses of stage t, and beta weights the semantic segmentation loss.
- Optionally extend with backbones and training tricks (DCN, SyncBN, multi-scale, ensemble) for further gains.
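The stage ordering and the weighted loss described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names `interleaved_schedule` and `total_loss` are hypothetical, the per-stage losses are plain numbers standing in for real box, mask, and semantic segmentation losses, and the values in `ALPHAS` and `BETA` are assumptions chosen for the example, since this review does not state them.

```python
from typing import List, Sequence

# Assumed stage weights and semantic-loss weight, for illustration only.
ALPHAS = (1.0, 0.5, 0.25)
BETA = 1.0


def interleaved_schedule(num_stages: int = 3) -> List[str]:
    """List the order of head evaluations in an interleaved cascade.

    At each stage the box head runs first. The mask head then works on the
    boxes that stage just refined and, from stage 2 onward, also receives
    the previous stage's mask features (mask information flow).
    """
    steps = []
    for t in range(1, num_stages + 1):
        steps.append(f"stage {t}: bbox head refines boxes")
        inputs = f"boxes from stage {t}"
        if t > 1:
            inputs += f", mask features from stage {t - 1}"
        steps.append(f"stage {t}: mask head predicts masks using {inputs}")
    return steps


def total_loss(
    bbox_losses: Sequence[float],
    mask_losses: Sequence[float],
    seg_loss: float,
    alphas: Sequence[float] = ALPHAS,
    beta: float = BETA,
) -> float:
    """Sum alpha_t * (bbox loss + mask loss) over stages, plus beta * seg loss."""
    if not (len(bbox_losses) == len(mask_losses) == len(alphas)):
        raise ValueError("expected one bbox loss, mask loss, and alpha per stage")
    staged = sum(a * (lb + lm) for a, lb, lm in zip(alphas, bbox_losses, mask_losses))
    return staged + beta * seg_loss


if __name__ == "__main__":
    for step in interleaved_schedule():
        print(step)
    print(total_loss([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], seg_loss=2.0))
```

With decreasing stage weights such as the assumed ones, later stages contribute less to the total even though they produce the final predictions; whether that trade-off is appropriate depends on the training setup.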
Experimental results
Research questions
- RQ1: Can a cascaded, multi-task architecture improve both bounding box and mask predictions in instance segmentation?
- RQ2: Does explicit mask information flow across stages enhance mask refinement?
- RQ3: Does adding a spatial context semantic segmentation branch improve foreground-background discrimination?
- RQ4: How do these design choices affect COCO mask AP and overall performance on test-dev/test-challenge?
Key findings
| Method | Backbone | box AP | mask AP | mask AP50 | mask AP75 | mask AP_S | mask AP_M | mask AP_L | runtime (fps) |
|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN | ResNet-50-FPN | 39.1 | 35.6 | 57.6 | 38.1 | 18.7 | 38.3 | 46.6 | 5.3 |
| Cascade Mask R-CNN | ResNet-50-FPN | 42.7 | 36.9 | 58.6 | 39.7 | 19.6 | 39.3 | 48.8 | 3.0 |
| HTC (ours) | ResNet-50-FPN | 43.6 | 38.4 | 60.0 | 41.5 | 20.4 | 40.7 | 51.2 | 2.5 |
| HTC (ours) | ResNet-101-FPN | 45.3 | 39.7 | 61.8 | 43.1 | 21.0 | 42.2 | 53.5 | 2.4 |
| HTC (ours) | ResNeXt-101-FPN | 47.1 | 41.2 | 63.9 | 44.7 | 22.8 | 43.9 | 54.6 | 2.1 |
- HTC yields higher mask AP than Mask R-CNN and Cascade Mask R-CNN baselines across backbones.
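- With ResNet-50-FPN, HTC improves mask AP by 1.5 points over Cascade Mask R-CNN (38.4 vs. 36.9) and by 2.8 points over Mask R-CNN (35.6), at the cost of a lower runtime (2.5 fps vs. 3.0 and 5.3 fps). ResNet-101-FPN and ResNeXt-101-FPN raise HTC further to 39.7 and 41.2 mask AP.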
- Interleaved execution provides modest gains; mask information flow provides further improvements (~0.6–1.5 AP).
- Semantic segmentation branch provides complementary context, contributing additional gains (~0.6 AP).
- On COCO test-dev, HTC with strong backbones and bells-and-whistles achieves 49.0 mask AP; on test-challenge, 48.6 mask AP.
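The ResNet-50-FPN gains can be rechecked directly from the table. The short script below is only an arithmetic check, assuming the table values are transcribed correctly; the `ROWS` dictionary and the `gain` helper are hypothetical names introduced for this example.

```python
# Box AP, mask AP, and runtime copied from the ResNet-50-FPN rows of the table.
ROWS = {
    "Mask R-CNN": {"box_ap": 39.1, "mask_ap": 35.6, "fps": 5.3},
    "Cascade Mask R-CNN": {"box_ap": 42.7, "mask_ap": 36.9, "fps": 3.0},
    "HTC": {"box_ap": 43.6, "mask_ap": 38.4, "fps": 2.5},
}


def gain(metric: str, method: str = "HTC", baseline: str = "Cascade Mask R-CNN") -> float:
    """Difference between two methods on one metric, rounded to one decimal."""
    return round(ROWS[method][metric] - ROWS[baseline][metric], 1)


if __name__ == "__main__":
    for baseline in ("Mask R-CNN", "Cascade Mask R-CNN"):
        print(
            f"HTC vs {baseline}: "
            f"mask AP {gain('mask_ap', baseline=baseline):+.1f}, "
            f"box AP {gain('box_ap', baseline=baseline):+.1f}, "
            f"fps {gain('fps', baseline=baseline):+.1f}"
        )
```

The rounding step is there only to hide floating-point noise in the subtraction; the underlying table values have one decimal place.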
This review was created by AI and reviewed by human editors.