QUICK REVIEW

[Paper Review] Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning

Chi-Sheng Chen, Guanying Chen|arXiv (Cornell University)|Feb 24, 2024

Nutritional Studies and Diet|Cited by 12

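One-Sentence Summary

This paper proposes Res-VMamba, a VMamba-based model that combines global residual learning with selective state space models for fine-grained food classification on CNFOOD-241, and reports state-of-the-art results among models trained without pretrained weights.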

ABSTRACT

Food classification is the foundation for developing food vision tasks and plays a key role in the burgeoning field of computational nutrition. Due to the complexity of food requiring fine-grained classification, recent academic research mainly modifies Convolutional Neural Networks (CNNs) and/or Vision Transformers (ViTs) to perform food category classification. However, to learn fine-grained features, the CNN backbone needs additional structural design, whereas ViT, containing the self-attention module, has increased computational complexity. In recent months, a new Sequence State Space (S4) model, through a Selection mechanism and computation with a Scan (S6), colloquially termed Mamba, has demonstrated superior performance and computation efficiency compared to the Transformer architecture. The VMamba model, which incorporates the Mamba mechanism into image tasks (such as classification), currently establishes the state-of-the-art (SOTA) on the ImageNet dataset. In this research, we introduce an academically underestimated food dataset CNFOOD-241, and pioneer the integration of a residual learning framework within the VMamba model to concurrently harness both global and local state features inherent in the original VMamba architectural design. The research results show that VMamba surpasses current SOTA models in fine-grained and food classification. The proposed Res-VMamba further improves the classification accuracy to 79.54% without pretrained weight. Our findings elucidate that our proposed methodology establishes a new benchmark for SOTA performance in food recognition on the CNFOOD-241 dataset. The code can be obtained on GitHub: https://github.com/ChiShengChen/ResVMamba.

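Motivation and Objectives

Frame fine-grained food classification as a challenging fine-grained visual classification (FGVC) task, marked by high intra-class variation and low inter-class variation.
Propose Res-VMamba, a VMamba model that adds a residual connection so that global and local state features are shared, with the aim of improving accuracy.
Evaluate on CNFOOD-241 to establish a new SOTA benchmark for food recognition without pretrained weights.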

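Proposed Method

Introduce a residual learning mechanism into VMamba, yielding Res-VMamba with a global residual path that fuses the original input with the features produced by the VSS blocks.
Detail the state space model (SSM) framework and its discretization for deep learning, including the A, B, C, and D matrices and the zero-order hold (ZOH) approximation.
Describe the 2D selective scan mechanism (S6) and the Cross-Scan Module (CSM), which order patches along multiple directions and integrate global and local features.
Explain the VMamba backbone, which is organized into four hierarchical stages and downsamples by patch merging to build multi-scale representations.
Present the Res-VMamba architecture, in which a global residual connection feeds into the four stages of VSS blocks so that global image features are shared with local processing.
Provide the training protocol, including AdamW, a cosine learning-rate schedule, label smoothing, EMA, and the data handling for CNFOOD-241.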
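
To make the SSM discretization concrete, here is a minimal sketch of zero-order hold (ZOH) for a continuous system h'(t) = A h(t) + B x(t), y(t) = C h(t) + D x(t). It assumes a diagonal A and a single scalar input channel, so ZOH reduces to the per-state formulas A_bar = exp(delta * a) and B_bar = (exp(delta * a) - 1) / a * b. The function names and the fixed step size `delta` are illustrative assumptions; in the selective (S6) variant, delta, B, and C depend on the input, which this sketch does not model.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order hold for a diagonal SSM.

    a, b: lists of per-state scalars (diagonal of A, entries of B).
    delta: step size (fixed here; an assumption of this sketch).
    """
    a_bar, b_bar = [], []
    for a_i, b_i in zip(a, b):
        a_bar.append(math.exp(delta * a_i))
        # (exp(delta*a) - 1) / a * b; the limit for a -> 0 is delta * b.
        scale = delta if a_i == 0.0 else math.expm1(delta * a_i) / a_i
        b_bar.append(scale * b_i)
    return a_bar, b_bar


def ssm_scan(x, a, b, c, d, delta):
    """Run the discretized recurrence over a 1D sequence x.

    h_t = A_bar * h_{t-1} + B_bar * x_t
    y_t = C . h_t + D * x_t
    """
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h = [0.0] * len(a)
    ys = []
    for x_t in x:
        h = [ab * h_i + bb * x_t for ab, bb, h_i in zip(a_bar, b_bar, h)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)) + d * x_t)
    return ys
```

With a single state, a = -1, b = 1, and delta = ln 2, both A_bar and B_bar come out to 0.5, so a constant input of 1 drives the state to 0.5 and then 0.75.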
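
The multi-directional patch ordering can be sketched as follows. This assumes four scan directions (row-major, column-major, and their reverses), which is how the cross-scan idea is commonly described for VMamba; the function names `cross_scan` and `cross_merge` are hypothetical. In the real model each sequence would pass through a selective scan before merging, whereas here the sequences are merged unchanged to show only the ordering and its inverse.

```python
def cross_scan(grid):
    """Unfold a 2D patch grid (list of rows) into four 1D orderings."""
    height, width = len(grid), len(grid[0])
    row_major = [grid[r][c] for r in range(height) for c in range(width)]
    col_major = [grid[r][c] for c in range(width) for r in range(height)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge(sequences, height, width):
    """Map the four orderings back onto the grid and sum them per patch."""
    row_fwd, col_fwd, row_bwd, col_bwd = sequences
    # Undo the reversal of the two backward scans.
    row_bwd, col_bwd = row_bwd[::-1], col_bwd[::-1]
    merged = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            merged[r][c] = (
                row_fwd[r * width + c]
                + row_bwd[r * width + c]
                + col_fwd[c * height + r]
                + col_bwd[c * height + r]
            )
    return merged
```

Because every patch appears at a different position in each of the four sequences, a causal 1D scan over all four lets each patch receive context from every other patch in the image.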
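
Three pieces of the training protocol (cosine learning-rate schedule, label smoothing, and EMA of the weights) are simple enough to sketch in plain Python. The summary does not state the hyperparameter values, so every argument below (`base_lr`, `eps`, `decay`, and so on) is a placeholder to be filled in from the paper or its repository, and the function names are illustrative. AdamW itself is omitted; in practice it would come from a deep learning framework.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def smooth_labels(target, num_classes, eps):
    """Label smoothing: (1 - eps) on the true class plus eps / K on every class."""
    uniform = eps / num_classes
    return [uniform + (1.0 - eps if k == target else 0.0) for k in range(num_classes)]


def ema_update(ema_weights, weights, decay):
    """Exponential moving average of model weights, applied after each step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

The EMA copy of the weights is typically the one used for evaluation, since it averages out step-to-step noise late in training.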

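Experimental Results

Research Questions

RQ1 Can a VMamba-based architecture reach state-of-the-art performance on a fine-grained food dataset without pretrained weights?
RQ2 Does adding a global residual mechanism to VMamba improve fine-grained food classification?
RQ3 How does Res-VMamba compare with other SOTA models on CNFOOD-241 in top-1 and top-5 accuracy?
RQ4 How do dataset characteristics (uniform image size, class imbalance) affect model performance on CNFOOD-241?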

Main Findings

Model	Year	Pretrained weights (PW)?	Top-1 val. accuracy (%)	Top-5 val. accuracy (%)	Top-1 test accuracy (%)	Top-5 test accuracy (%)
VGG16 Simonyan and Zisserman (2015)	2014	Y	66.98	90.10	65.06	89.60
ViT-B Dosovitskiy et al. (2021)	2020	Y	73.14	92.06	71.58	91.62
ResNet101 He et al. (2016)	2015	Y	74.42	93.62	72.59	93.16
DenseNet121 Huang et al. (2017)	2016	Y	76.46	94.57	74.77	94.29
Inceptionv4 Szegedy et al. (2016)	2016	Y	77.30	94.28	75.70	93.89
PRENet Min et al. (2023)	2017	Y*	77.47	94.86	76.02	94.61
SEnet154 Hu et al. (2018)	2017	Y	77.47	94.86	76.02	94.61
RepViT Wang et al. (2023)	2023	Y	78.08	95.41	76.86	95.02
ConvNeXT-B Liu et al. (2022)	2022	Y	78.30	94.36	76.76	93.90
EfficientNet-B6 Tan and Le (2019)	2019	Y	80.10	94.64	78.48	94.22
CMAL-Net Liu et al. (2023)	2023	Y†	80.16	95.99	78.56	95.40
VMamba-S Liu et al. (2024a)	2024	N	79.17	95.64	77.73	95.24
Res-VMamba (ours)	2024	N	79.54	95.72	78.26	95.31
VMamba-S (pretrained)	2024	Y‡	82.15	96.91	80.58	96.71
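
For readers unfamiliar with the two metrics in the table, the following sketch shows how top-1 and top-5 accuracy are usually computed from per-class scores: a sample counts as correct when its true label is among the k highest-scoring classes. The function name and the toy scores are illustrative and are not taken from the paper.

```python
def topk_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k top-scoring classes.

    scores: list of per-sample lists, one score per class.
    labels: list of true class indices.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)
```

Top-5 accuracy is always at least as high as top-1, which is why the top-5 columns sit in the 90s while the top-1 columns range from the mid-60s to the low 80s.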

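Without pretrained weights, Res-VMamba reaches 78.26% top-1 test accuracy on CNFOOD-241 (79.54% top-1 on the validation split).
With pretrained weights, VMamba-S reaches 80.58% top-1 test accuracy, so Res-VMamba without pretrained weights (78.26%) comes within 2.32 percentage points of it.
Compared with VMamba-S without pretrained weights, Res-VMamba improves top-1 test accuracy by 0.53 percentage points (77.73% to 78.26%); the validation gain is 0.37 points (79.17% to 79.54%).
With ImageNet-1K pretrained weights, VMamba-S posts the highest top-1 test accuracy in the table (80.58%), ahead of every baseline, including CMAL-Net (78.56%).
Res-VMamba is the strongest of the models trained without pretrained weights; on top-1 test accuracy it is outperformed only by pretrained models: EfficientNet-B6 (78.48%), CMAL-Net (78.56%), and pretrained VMamba-S (80.58%).
CNFOOD-241's high image resolution and class imbalance make it a challenging benchmark, and the paper positions Res-VMamba as a new performance reference for food recognition on it.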
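
The headline gains can be checked directly against the results table. The sketch below copies the three VMamba-family rows from the table and computes per-metric differences in percentage points; the dictionary layout and the `gain` helper are illustrative.

```python
# (top-1 val, top-5 val, top-1 test, top-5 test), copied from the results table.
RESULTS = {
    "VMamba-S": (79.17, 95.64, 77.73, 95.24),
    "Res-VMamba": (79.54, 95.72, 78.26, 95.31),
    "VMamba-S (pretrained)": (82.15, 96.91, 80.58, 96.71),
}
METRICS = ("top1_val", "top5_val", "top1_test", "top5_test")


def gain(model, baseline):
    """Per-metric difference (model minus baseline) in percentage points."""
    return {
        metric: round(m - b, 2)
        for metric, m, b in zip(METRICS, RESULTS[model], RESULTS[baseline])
    }


residual_gain = gain("Res-VMamba", "VMamba-S")
pretraining_gap = gain("VMamba-S (pretrained)", "Res-VMamba")
```

Here `residual_gain` isolates the effect of the global residual path under identical training conditions, while `pretraining_gap` shows how much accuracy pretrained weights still add on top of it.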

This review was generated by AI and reviewed by a human editor.