QUICK REVIEW

[Paper Review] Efficient N-Dimensional Convolutions via Higher-Order Factorization

Jean Kossaifi, Adrian Bulat|arXiv (Cornell University)|Jun 14, 2019

Tensor decomposition and applications | Cited by 3

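One-Sentence Summary

This paper proposes CP-Higher-Order Convolution (HO-CPConv), a tensor factorization framework that decomposes higher-order kernels into low-rank components to obtain efficient, separable N-dimensional convolutions. The approach unifies model compression with efficient architecture design, enables transfer from static 2D data to temporal 3D data (the paper's "higher-order transduction"), and reports state-of-the-art spatio-temporal facial emotion recognition results on AffectNet, SEWA, and AFEW-VA.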

ABSTRACT

Training deep neural networks with spatio-temporal (i.e., 3D) or multidimensional convolutions of higher-order is computationally challenging due to millions of unknown parameters across dozens of layers. To alleviate this, one approach is to apply low-rank tensor decompositions to convolution kernels in order to compress the network and reduce its number of parameters. Alternatively, new convolutional blocks, such as MobileNet, can be directly designed for efficiency. In this paper, we unify these two approaches by proposing a tensor factorization framework for efficient multidimensional (separable) convolutions of higher-order. Interestingly, the proposed framework enables a novel higher-order transduction, allowing to train a network on a given domain (e.g., 2D images or N-dimensional data in general) and using transduction to generalize to higher-order data such as videos (or (N+K)-dimensional data in general), capturing for instance temporal dynamics while preserving the learnt spatial information. We apply the proposed methodology, coined CP-Higher-Order Convolution (HO-CPConv), to spatio-temporal facial emotion analysis. Most existing facial affect models focus on static imagery and discard all temporal information. This is due to the above-mentioned burden of training 3D convolutional nets and the lack of large bodies of video data annotated by experts. We address both issues with our proposed framework. Initial training is first done on static imagery before using transduction to generalize to the temporal domain. We demonstrate superior performance on three challenging large scale affect estimation datasets, AffectNet, SEWA, and AFEW-VA.

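Motivation and Objectives

Reduce the computational burden of training higher-order (e.g., 3D) convolutions, which stems from millions of parameters spread across many layers.
Work around the scarcity of large, expert-annotated video datasets for spatio-temporal affect recognition.
Unify model compression and efficient network design through low-rank tensor decomposition, reducing both parameter count and training cost.
Transfer from static 2D image data to temporal 3D video data while preserving learned spatial features and learning temporal dynamics.
Reach state-of-the-art performance on large-scale affect estimation benchmarks without requiring large amounts of 3D video data during training.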

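Proposed Method

Propose a higher-order tensor factorization framework that uses the CANDECOMP/PARAFAC (CP) format to express an N-dimensional convolution kernel as a sum of rank-one tensors.
Apply the low-rank decomposition to reduce the number of parameters in multidimensional convolutions while retaining the model's representational capacity.
Design a transfer mechanism that initializes 3D convolution kernels from the factorized 2D weights, carrying knowledge from a pretrained 2D network over to a 3D network.
Exploit the factorized kernel structure for efficient inference and training on higher-order data such as video.
Integrate the factorized convolution layers into deep architectures suited to spatio-temporal modeling, sharing the spatial and temporal components.
Train the model end to end on static image data first, then fine-tune it for temporal data via transfer rather than training from scratch.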
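The factorization and the 2D-to-3D transfer described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cp_to_kernel` and `transduce_2d_to_3d`, the factor ordering (output channels, input channels, then kernel axes), and the choice to initialize the new temporal factor as a one-hot vector at the central tap are all hypothetical. The one-hot choice simply makes the 3D kernel reproduce the 2D kernel at its middle time slice; the paper may initialize the temporal factor differently.

```python
import numpy as np

def cp_to_kernel(factors):
    """Rebuild a dense kernel from CP factors, each of shape (dim_k, rank).

    The kernel is the sum over r of the outer product of the r-th columns.
    """
    rank = factors[0].shape[1]
    shape = tuple(f.shape[0] for f in factors)
    kernel = np.zeros(shape)
    for r in range(rank):
        component = factors[0][:, r]
        for f in factors[1:]:
            # Outer product appends one new axis per factor.
            component = np.multiply.outer(component, f[:, r])
        kernel += component
    return kernel

def transduce_2d_to_3d(factors_2d, k_t):
    """Add a temporal factor to 2D CP factors (hypothetical helper).

    Assumption: the temporal factor starts as one-hot at the centre tap,
    so the spatial factors learned in 2D are reused unchanged.
    """
    u_out, u_in, u_h, u_w = factors_2d
    rank = u_out.shape[1]
    u_t = np.zeros((k_t, rank))
    u_t[k_t // 2, :] = 1.0
    return [u_out, u_in, u_t, u_h, u_w]

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
rank, c_out, c_in, k = 4, 8, 3, 3
factors_2d = [rng.standard_normal((d, rank)) for d in (c_out, c_in, k, k)]

kernel_2d = cp_to_kernel(factors_2d)                  # shape (8, 3, 3, 3)
factors_3d = transduce_2d_to_3d(factors_2d, k_t=3)
kernel_3d = cp_to_kernel(factors_3d)                  # shape (8, 3, 3, 3, 3)
```

With this initialization the 3D layer initially behaves like the 2D layer applied to the central frame, and fine-tuning on video would then update the temporal factor to pick up dynamics. In practice the dense kernel would not be rebuilt; each factor would be applied as its own small convolution, which is where the efficiency comes from.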

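Experimental Results

Research Questions

RQ1: Can tensor decomposition effectively compress higher-order convolution kernels in N-dimensional networks?
RQ2: Can transfer via kernel factorization effectively generalize a model trained on 2D image data to 3D video data?
RQ3: Does the proposed CP-Higher-Order Convolution framework outperform existing methods on spatio-temporal facial emotion recognition?
RQ4: To what extent does the low-rank decomposition preserve spatial and temporal representations during transfer?
RQ5: Can the framework reduce model complexity while maintaining or improving performance on large-scale affect estimation datasets?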

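Key Findings

The proposed CP-Higher-Order Convolution (HO-CPConv) framework achieves state-of-the-art performance on three large-scale affect estimation datasets: AffectNet, SEWA, and AFEW-VA.
The method transfers effectively from 2D image data to 3D video data, letting the model learn temporal dynamics without large-scale 3D video annotation.
Applying low-rank tensor decomposition substantially reduces the number of parameters in 3D convolutions, improving computational efficiency.
The model retains high accuracy on static image data while also generalizing well to temporal data, indicating that the transfer mechanism is robust.
The framework outperforms existing methods in both parameter efficiency and performance, particularly when 3D video data is limited.
Ablation experiments confirm that both the factorization strategy and the transfer procedure are essential: performance drops markedly when either component is removed.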
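To make the parameter-reduction finding concrete, the sketch below compares the weight count of a dense 3D convolution with a CP-factorized one. The layer sizes and rank are illustrative assumptions, not figures from the paper, and the helper names are hypothetical. Biases are ignored.

```python
def dense_conv_params(c_out, c_in, kernel_sizes):
    """Weights in a dense N-D convolution: C_out * C_in * prod(kernel sizes)."""
    total = c_out * c_in
    for k in kernel_sizes:
        total *= k
    return total

def cp_conv_params(c_out, c_in, kernel_sizes, rank):
    """Weights in a rank-R CP factorization: one (dim, R) factor per mode."""
    return rank * (c_out + c_in + sum(kernel_sizes))

# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel, CP rank 64.
dense = dense_conv_params(64, 64, (3, 3, 3))
factorized = cp_conv_params(64, 64, (3, 3, 3), rank=64)
ratio = dense / factorized

print(dense, factorized, round(ratio, 1))  # prints: 110592 8768 12.6
```

The dense count grows with the product of the kernel sizes, while the factorized count grows with their sum, so adding a temporal axis of size 3 adds only 3 × R weights to the factorized layer.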

This review was generated by AI and checked by human editors.