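[Paper Explained] Learning Deep Bilinear Transformation for Fine-grained Image Representation
Introduces the deep bilinear transformation (DBT) block, which learns intra-group bilinear interactions within semantically grouped feature channels. The block can be stacked deeply in a CNN at reduced computational cost and reaches state-of-the-art results on several fine-grained benchmarks.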
Bilinear feature transformation has shown the state-of-the-art performance in learning fine-grained image representations. However, the computational cost to learn pairwise interactions between deep feature channels is prohibitively expensive, which restricts this powerful transformation to be used in deep neural networks. In this paper, we propose a deep bilinear transformation (DBT) block, which can be deeply stacked in convolutional neural networks to learn fine-grained image representations. The DBT block can uniformly divide input channels into several semantic groups. As bilinear transformation can be represented by calculating pairwise interactions within each group, the computational cost can be heavily relieved. The output of each block is further obtained by aggregating intra-group bilinear features, with residuals from the entire input features. We found that the proposed network achieves new state-of-the-art in several fine-grained image recognition benchmarks, including CUB-Bird, Stanford-Car, and FGVC-Aircraft.
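The cost argument in the abstract can be made concrete by counting pairwise channel products per spatial position. The sketch below is illustrative only: the function name and the values C = 2048 and G = 16 are assumptions chosen for the example, not figures taken from the paper.

```python
def pairwise_interactions(channels, groups=1):
    """Count channel-pair products computed per spatial position.

    With one group this is the full bilinear case (channels ** 2).
    With `groups` equal-sized groups, products are only formed inside
    each group, giving groups * (channels / groups) ** 2.
    """
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    group_size = channels // groups
    return groups * group_size * group_size


# Hypothetical sizes, chosen only to illustrate the scaling.
full = pairwise_interactions(2048)               # 4194304 products
grouped = pairwise_interactions(2048, groups=16)  # 262144 products
reduction = full // grouped                       # 16, i.e. a factor of G
```

Under these assumptions, grouping cuts the number of pairwise products by a factor equal to the number of groups.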
Motivation and Objectives
- Address the high computational cost of traditional bilinear pooling for fine-grained recognition.
- Propose a deep bilinear transformation (DBT) block that learns bilinear interactions within semantic groups.
- Enable deep stacking of DBT blocks in CNNs without increasing feature dimensions.
- Demonstrate state-of-the-art performance on multiple fine-grained datasets by integrating DBT into CNN backbones.
Proposed Method
- Introduce semantic grouping to uniformly partition input channels into G groups based on semantic information.
- Apply intra-group bilinear transformation within each semantic group to capture discriminative pairwise interactions.
- Aggregate intra-group bilinear features across groups while preserving group order with group index encoding.
- Use a residual connection that fuses the original input features with the bilinear features, applying a tanh activation before fusion.
- Integrate the DBT block into ResNet-like architectures to form a DBTNet, with training losses including semantic grouping constraints.
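The steps above can be sketched for a single spatial position in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `dbt_position_sketch` and the `group_codes` argument are hypothetical, the group size is assumed to satisfy (C / G)² = C so that the output keeps C channels, the group index encoding is assumed to be an additive per-group offset, and tanh is assumed to be applied to the bilinear features before they are added to the input. The semantic grouping layer and its loss are omitted.

```python
import math


def dbt_position_sketch(x, num_groups, group_codes=None):
    """Intra-group bilinear transform of one channel vector `x`.

    x           : list of C floats (features at one spatial position)
    num_groups  : G, number of equal-sized channel groups
    group_codes : optional list of G lists (each of length C // G),
                  an assumed additive encoding of the group index
    Returns a list of C floats (input plus tanh of bilinear features).
    """
    c = len(x)
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = c // num_groups
    if size * size != c:
        raise ValueError("need (C / G) ** 2 == C to keep the dimension")

    # Sum of per-group outer products: a size x size matrix.
    acc = [[0.0] * size for _ in range(size)]
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        if group_codes is not None:
            group = [v + p for v, p in zip(group, group_codes[g])]
        for i in range(size):
            for j in range(size):
                acc[i][j] += group[i] * group[j]

    # Flatten the aggregated matrix back to C values.
    bilinear = [acc[i][j] for i in range(size) for j in range(size)]

    # Assumed fusion: residual input plus tanh-squashed bilinear features.
    return [xi + math.tanh(bi) for xi, bi in zip(x, bilinear)]
```

For example, `dbt_position_sketch([1.0] * 16, num_groups=4)` returns 16 values, so a block built this way can be stacked without growing the feature dimension.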
Experimental Results
Research Questions
- RQ1: Can semantic-guided grouping enable effective and efficient bilinear interactions within deep CNNs for fine-grained recognition?
- RQ2: Does deep stacking of DBT blocks yield gains over baseline CNNs and existing bilinear pooling methods on standard fine-grained datasets?
- RQ3: What is the impact of semantic grouping loss, group index encoding, and residual connections on performance and optimization?
- RQ4: How does DBTNet compare to state-of-the-art bilinear and second-order pooling methods on CUB-200-2011, Stanford-Car, FGVC-Aircraft, and large-scale iNaturalist?
Key Findings
| Method | Feature dimension | CUB-200-2011 (%) | Stanford-Car (%) | Aircraft (%) |
|---|---|---|---|---|
| Compact Bilinear | 14k | 81.6 | 88.6 | 81.6 |
| Kernel Pooling | 14k | 84.7 | 91.1 | 85.7 |
| iSQRT-COV | 8k | 87.3 | 91.7 | 89.5 |
| iSQRT-COV | 32k | 88.1 | 92.8 | 90.0 |
| DBTNet-50 (ours) | 2k | 87.5 | 94.1 | 91.2 |
| DBTNet-101 (ours) | 2k | 88.1 | 94.5 | 91.6 |
- DBTNet achieves new state-of-the-art results on CUB-200-2011, Stanford-Car, and FGVC-Aircraft when integrated into deep CNNs.
- DBTNet-50 (2k-dim last-layer bilinear feature) attains 87.5% (CUB-200-2011), 94.1% (Stanford-Car), and 91.2% (Aircraft).
- DBTNet-101 (2k-dim last-layer bilinear feature) attains 88.1% (CUB-200-2011), 94.5% (Stanford-Car), and 91.6% (Aircraft).
- With a 2k-dimensional feature, DBTNet-101 matches the best listed CUB-200-2011 result (88.1%, iSQRT-COV at 32k dimensions) and both DBTNet variants exceed Compact Bilinear, Kernel Pooling, and iSQRT-COV on Stanford-Car and Aircraft.
- On the large-scale iNaturalist-2017 dataset, DBTNet-50 outperforms ResNet-50 by 2.1%; the DBT approach also yields gains on ImageNet.
- The method remains efficient with modest FLOPs (example configurations report ~3.8B FLOPs for baseline and ~7.6B for larger DBT models).
This explainer was generated by AI and reviewed by a human editor.