[Paper Review] Fine-grained Visual Classification with High-temperature Refinement and Background Suppression
HERBS introduces a Background Suppression module and a High-temperature Refinement module to improve FGVC by pruning background noise and learning diverse multi-scale features, achieving state-of-the-art results on CUB-200-2011 and NABirds.
Fine-grained visual classification is a challenging task due to the high similarity between categories and distinct differences among data within one single category. To address these challenges, previous strategies have focused on localizing subtle discrepancies between categories and enhancing the discriminative features in them. However, the background also provides important information that can tell the model which features are unnecessary or even harmful for classification, and models that rely too heavily on subtle features may overlook global features and contextual information. In this paper, we propose a novel network called "High-temperaturE Refinement and Background Suppression" (HERBS), which consists of two modules, namely, the high-temperature refinement module and the background suppression module, for extracting discriminative features and suppressing background noise, respectively. The high-temperature refinement module allows the model to learn the appropriate feature scales by refining the feature map at different scales and improving the learning of diverse features. The background suppression module first splits the feature map into foreground and background using classification confidence scores, then suppresses feature values in low-confidence areas while enhancing discriminative features. The experimental results show that the proposed HERBS effectively fuses features of varying scales, suppresses background noise, and extracts discriminative features at appropriate scales for fine-grained visual classification. The proposed method achieves state-of-the-art performance on the CUB-200-2011 and NABirds benchmarks, surpassing 93% accuracy on both datasets. Thus, HERBS presents a promising solution for improving the performance of fine-grained visual classification tasks. Code: https://github.com/chou141253/FGVC-HERBS
Motivation & Objective
- Address the challenge of discriminating visually similar fine-grained categories while leveraging contextual/background information.
- Develop a modular framework that can integrate with CNN or transformer backbones in an end-to-end manner.
- Enhance feature learning via background suppression and high-temperature refinement to fuse multi-scale features.
Proposed method
- Introduce Background Suppression (BS) to score feature-map regions by classification confidence, merge the top-k features with a graph-convolution-based selector/combiner, and use a dropped loss to suppress background features.
- Apply high-temperature refinement to learn diverse, multi-scale features by training with high initial temperatures that decay over epochs.
- Combine BS and High-temperature Refinement into the HERBS module and integrate with backbones (CNN or Transformer) via top-down and bottom-up feature fusion modules.
- Use a combined loss (loss_bs = loss_m + lambda_d loss_d + lambda_l loss_l) to train BS, with a refinement loss guided by KL-divergence between multi-class outputs at different temperatures.
- Employ a temperature decay schedule T_e that starts from a high initial value and decreases over epochs, encouraging exploration early in training and finer discrimination later.
- Evaluate with CUB-200-2011 and NABirds benchmarks, using standard data augmentation and training settings; provide open-source code at the referenced GitHub repository.
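
The temperature mechanism described in the bullets above can be illustrated with a minimal sketch. It is not taken from the HERBS repository: the function names, the exponential form of the schedule, the default values (initial temperature 64, 80 epochs), and the direction of the KL term are all assumptions chosen for illustration. The sketch only shows the general idea that a high temperature flattens the softmax early in training and that a KL term compares two classifier outputs at the same temperature.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def decayed_temperature(epoch, initial=64.0, final=1.0, total_epochs=80):
    """Hypothetical exponential decay from `initial` to `final` (assumed schedule and values)."""
    if total_epochs < 2:
        return final
    clamped = min(max(epoch, 0), total_epochs - 1)
    fraction = clamped / (total_epochs - 1)
    return initial * (final / initial) ** fraction


def refinement_loss(logits_a, logits_b, temperature):
    """KL divergence between two classifier outputs softened by the same temperature.

    Which output plays the role of p and which of q is an assumption here.
    """
    p = softmax_with_temperature(logits_a, temperature)
    q = softmax_with_temperature(logits_b, temperature)
    return kl_divergence(p, q)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    for epoch in (0, 40, 79):
        t = decayed_temperature(epoch)
        probs = softmax_with_temperature(logits, t)
        print(epoch, round(t, 3), [round(p, 3) for p in probs])
```

Running the loop shows the distribution sharpening as the temperature falls, which matches the review's description of broad exploration early in training and finer discrimination later.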

Experimental results
Research questions
- RQ1: Can background suppression improve FGVC without discarding useful contextual information?
- RQ2: Does learning features at multiple scales with a high-temperature refinement strategy yield better discriminative representations for fine-grained categories?
- RQ3: Are BS and high-temperature refinement effective across backbone types (CNNs and Vision Transformers) in FGVC?
- RQ4: What is the impact of fusion strategies (top-down vs bottom-up paths) and multi-class classifiers on FGVC accuracy?
Key findings
| Dataset | Method | Top-1 Accuracy (%) |
|---|---|---|
| CUB-200-2011 | FFVT | 91.6 |
| CUB-200-2011 | ViT-NeT | 91.7 |
| CUB-200-2011 | TransFG | 91.7 |
| CUB-200-2011 | IELT | 91.8 |
| CUB-200-2011 | SIM-Trans | 91.8 |
| CUB-200-2011 | SAC | 91.8 |
| CUB-200-2011 | CAP | 91.9 |
| CUB-200-2011 | SR-GNN | 91.9 |
| CUB-200-2011 | DCAL | 92.0 |
| CUB-200-2011 | MetaFormer | 92.4 |
| CUB-200-2011 | HERBS | 93.1 |
| NABirds | FFVT | N/A |
| NABirds | CAP | 91.0 |
| NABirds | SR-GNN | 91.2 |
| NABirds | MetaFormer | 92.7 |
| NABirds | HERBS | 93.0 |
- Achieves state-of-the-art Top-1 accuracy on CUB-200-2011 (93.1%) and NABirds (93.0%).
- HERBS with full modules outperforms baseline backbones and various module combinations across Swin Transformer and ResNet-50 backbones.
- BS module reduces background noise and improves discriminative feature concentration, as shown by ablation studies and heat-map analyses.
- High-temperature refinement promotes learning of diverse and broader feature representations and improves accuracy across scales.
- Adding the full HERBS framework yields larger accuracy gains than any single module alone (e.g., +1.0 to +1.6 percentage points depending on backbone).
- Table IV indicates HERBS improves precision and reduces false positives on fine-grained subcategories within the CUB-200-2011 dataset.
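
The confidence-based split credited to the BS module, together with the combined loss loss_bs = loss_m + lambda_d loss_d + lambda_l loss_l from the method summary, can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the helper names, the use of the maximum softmax probability as the confidence score, and the example weights are hypothetical, and the three loss terms are treated as opaque scalars because the review does not define them further. The graph-convolution combiner is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_by_confidence(region_logits, top_k):
    """Split regions into top-k (kept) and remaining (dropped) by confidence.

    Confidence is assumed to be the maximum class probability of each region.
    Returns (selected_indices, dropped_indices, confidences).
    """
    confidences = [max(softmax(logits)) for logits in region_logits]
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    selected = sorted(order[:top_k])
    dropped = sorted(order[top_k:])
    return selected, dropped, confidences


def background_suppression_loss(loss_m, loss_d, loss_l, lambda_d=1.0, lambda_l=1.0):
    """Weighted sum loss_m + lambda_d * loss_d + lambda_l * loss_l (weights are placeholders)."""
    return loss_m + lambda_d * loss_d + lambda_l * loss_l


if __name__ == "__main__":
    # Four hypothetical regions with three class logits each.
    region_logits = [
        [4.0, 0.5, 0.1],  # confident region
        [0.2, 0.1, 0.3],  # near-uniform region
        [0.1, 3.0, 0.2],  # confident region
        [0.5, 0.4, 0.6],  # near-uniform region
    ]
    selected, dropped, confidences = split_by_confidence(region_logits, top_k=2)
    print("selected:", selected)  # selected: [0, 2]
    print("dropped:", dropped)    # dropped: [1, 3]
    print("loss_bs:", background_suppression_loss(1.0, 0.5, 0.25, lambda_d=2.0, lambda_l=4.0))
```

In this toy input the two regions with a dominant class logit are kept and the two near-uniform regions are treated as background, which is the behavior the heat-map finding attributes to the BS module.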

This review was created by AI and reviewed by human editors.