QUICK REVIEW

[Paper Digest] Parameter Efficient Multimodal Transformers for Video Representation Learning

Sang-Ho Lee, Youngjae Yu|arXiv (Cornell University)|Dec 8, 2020

Music and Audio Processing | References: 82 | Cited by: 36

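One-Sentence Summary

This paper proposes an end-to-end trainable multimodal Transformer for audio-visual video representation learning. It uses aggressive parameter sharing and low-rank factorization to cut Transformer parameters by up to 97%, and it also introduces content-aware negative sampling and an analysis of fusion strategies.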

ABSTRACT

The recent success of Transformers in the language domain has motivated adapting it to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements from Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces parameters of the Transformers up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.

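Motivation and Objectives

Motivation: use Transformers to learn long-term audio-visual representations from unlabeled videos.
Enable end-to-end training by reducing memory requirements and parameter count.
Study parameter-sharing schemes across Transformers and across layers.
Propose an effective negative sampling method to improve self-supervised cross-modal learning.
Evaluate fusion strategies and demonstrate transfer to downstream tasks on both short and long videos.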

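Proposed Method

A three-part model: visual and audio CNNs for short-term features, unimodal Transformers for long-term context, and a multimodal Transformer for cross-modal context.
Parameters are reduced by sharing low-rank Transformer weights across modalities and across layers: each weight is factorized as W = UΣVᵀ, where U is shared and ΣVᵀ is private.
BOS tokens and temporal position embeddings preserve temporal order within each unimodal stream.
Modality-shared and time-shared embeddings in the multimodal Transformer enable cross-modal fusion.
Self-supervised pretraining uses two tasks: masked embedding prediction (MEP) with an InfoNCE loss, and correct pair prediction (CPP) for cross-modal correspondence.
Content-aware negative sampling (CANS) selects negatives within a mini-batch based on similarity in the CNN embedding space.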
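
The low-rank sharing scheme listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `SharedLowRankWeights`, the toy sizes (`d_model=16`, `rank=2`), the three Transformer names, and the choice to store the private factor ΣVᵀ as a single `rank x d_model` matrix are all hypothetical. It only shows how one shared factor U combined with small private factors yields far fewer parameters than separate dense weights.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng):
    """Small random matrix; the initialisation scale is an arbitrary choice."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

class SharedLowRankWeights:
    """One shared factor U plus a private (Sigma V^T) factor per Transformer.

    Assumption: Sigma is folded into V^T, so each private factor is a single
    rank x d_model matrix.
    """

    def __init__(self, d_model, rank, names, seed=0):
        rng = random.Random(seed)
        self.d_model, self.rank = d_model, rank
        self.U = rand_matrix(d_model, rank, rng)  # shared: d_model x rank
        self.private = {name: rand_matrix(rank, d_model, rng) for name in names}

    def weight(self, name):
        """Materialise the full d_model x d_model weight W = U (Sigma V^T)."""
        return matmul(self.U, self.private[name])

    def num_params(self):
        """Parameters in U plus parameters in every private factor."""
        return self.d_model * self.rank * (1 + len(self.private))

def dense_params(d_model, n_transformers):
    """Parameter count if every Transformer kept its own dense weight."""
    return n_transformers * d_model * d_model

weights = SharedLowRankWeights(d_model=16, rank=2,
                               names=["visual", "audio", "multimodal"])
w_audio = weights.weight("audio")
print(len(w_audio), len(w_audio[0]))  # 16 16
print(weights.num_params())           # 128
print(dense_params(16, 3))            # 768
```

With these toy sizes the factorized form stores 128 numbers instead of 768. The numbers are illustrative only and are not meant to reproduce the 97% reduction reported in the paper, which also depends on sharing across layers.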

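Experimental Results

Research Questions

RQ1: Can a parameter-efficient Transformer architecture learn audio-visual representations end-to-end from scratch?
RQ2: How do cross-modal fusion strategies affect multimodal representation learning and downstream performance?
RQ3: How does sharing Transformer weights across modalities and across layers affect model size and accuracy?
RQ4: Does content-aware negative sampling improve self-supervised learning of multimodal video representations?
RQ5: How well do the pretrained multimodal representations transfer to short and long video classification tasks?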

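Key Findings

Transformer parameters are reduced by up to 97% with no significant loss in performance (from 128M to 4M with the Part sharing scheme).
Mid fusion consistently delivers strong audio-visual performance and robustness to a missing modality. In the ablations, mid fusion reaches 65.7% top-1 and 89.9% top-5 on audio-visual classification, outperforming early and late fusion in some settings.
Content-aware negative sampling (CANS-Similar) improves MEP, reaching 67.5% top-1 and 92.3% top-5 in the multimodal results of Table 1.
Cross-layer weight sharing is effective: it does not degrade performance while making the model smaller and faster.
Pretraining on Kinetics-700 or AudioSet with mid fusion and CANS-Similar yields strong results on short and long audio-visual tasks, surpassing several baselines on multiple datasets.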
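
To make the content-aware negative sampling finding more concrete, here is a rough sketch of how similarity-based negative selection interacts with an InfoNCE loss. Everything in it is an assumption made for illustration: the function names, the use of cosine similarity, the deterministic top-k selection, the temperature of 0.1, and the tiny 2-D embeddings. The paper's actual sampling procedure may differ (for example, it may be stochastic), so this should be read as one plausible interpretation of "CANS-Similar", not as the authors' method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_negatives(anchor_idx, cnn_embeddings, k):
    """Indices of the k batch items most similar to the anchor (anchor excluded).

    Assumption: "similar" negatives are chosen by a deterministic top-k.
    """
    anchor = cnn_embeddings[anchor_idx]
    others = [i for i in range(len(cnn_embeddings)) if i != anchor_idx]
    others.sort(key=lambda i: cosine(anchor, cnn_embeddings[i]), reverse=True)
    return others[:k]

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-softmax of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy mini-batch of CNN embeddings (hypothetical values).
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neg_idx = similar_negatives(0, batch, k=2)
print(neg_idx)  # [1, 2]

query, positive = batch[0], [1.0, 0.05]
hard_loss = info_nce(query, positive, [batch[i] for i in neg_idx])
easy_loss = info_nce(query, positive, [batch[2], batch[3]])
print(hard_loss > easy_loss)  # True
```

Negatives that lie close to the anchor make the contrastive task harder, which shows up as a larger loss for the same query and positive. That is the intuition behind preferring similar negatives in this sketch.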

This digest was generated by AI and reviewed by human editors.