QUICK REVIEW

[Paper Review] Audio Captioning Transformer

Xinhao Mei, Xubo Liu|arXiv (Cornell University)|Jul 21, 2021

Music and Audio Processing|References: 23|Cited by: 32

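One-Sentence Summary

This paper proposes the Audio Captioning Transformer (ACT), a convolution-free Transformer encoder-decoder for audio captioning whose encoder is pretrained on AudioSet and which achieves competitive results on AudioCaps.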

ABSTRACT

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.

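Motivation and Objectives

Advance audio captioning by applying a pure Transformer encoder-decoder with no convolutions.
Model both global and fine-grained temporal information in audio with self-attention over time patches.
Improve generalization by pretraining the encoder on AudioSet as an audio tagging task, starting from a DeiT initialization.
Compare ACT with state-of-the-art methods on AudioCaps, and analyze data efficiency and the effect of hyperparameters.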

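Proposed Method

Split the log mel spectrogram into non-overlapping time patches and embed them, together with a class token, as the input sequence of the Transformer encoder, which captures global audio information.
Use a standard Transformer encoder built from multi-head self-attention and feed-forward layers, with layer normalization and residual connections.
In the decoder, use masked self-attention plus an additional cross-attention layer that attends to the encoder output; a linear layer followed by softmax produces the word predictions.
Pretrain the encoder on AudioSet as an audio tagging task to learn general audio patterns, with the class token feeding the tagging output.
Initialize the decoder with Word2Vec embeddings and compare three decoder variants that differ in depth and number of attention heads.
Train end to end with cross-entropy loss and teacher forcing; at inference, decode with beam search (beam size up to 5).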
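The patch-and-class-token input described in the method list can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the clip length (1000 frames), the number of mel bins (64), the patch length (4 frames), and the embedding size (768) are all assumptions chosen for the example, and positional embeddings are left out for brevity.

```python
import numpy as np


def split_time_patches(log_mel, patch_frames):
    """Split a (frames, mel_bins) log mel spectrogram into non-overlapping time patches.

    Each patch spans `patch_frames` consecutive frames and all mel bins, and is
    flattened into one vector. Trailing frames that do not fill a whole patch
    are dropped (an assumption made for this sketch).
    """
    frames, mel_bins = log_mel.shape
    num_patches = frames // patch_frames
    trimmed = log_mel[: num_patches * patch_frames]
    return trimmed.reshape(num_patches, patch_frames * mel_bins)


def embed_with_class_token(patches, projection, class_token):
    """Linearly project each patch and prepend a class token to the sequence."""
    tokens = patches @ projection  # (num_patches, embed_dim)
    return np.concatenate([class_token[None, :], tokens], axis=0)


# Hypothetical sizes: 1000 frames x 64 mel bins, 4-frame patches, 768-dim embeddings.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((1000, 64))
patches = split_time_patches(log_mel, patch_frames=4)

projection = rng.standard_normal((4 * 64, 768)) * 0.02  # stand-in for a learned weight
class_token = np.zeros(768)                             # stand-in for a learned vector
sequence = embed_with_class_token(patches, projection, class_token)

print(patches.shape, sequence.shape)  # (250, 256) (251, 768)
```

The resulting sequence (one class token followed by one token per time patch) is what a standard Transformer encoder would consume; in a trained model the projection and class token would be learned parameters rather than random or zero values.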

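Experimental Results

Research Questions

RQ1: Can a convolution-free Transformer encoder-decoder (ACT) effectively capture the global and temporal information needed for captioning?
RQ2: How does pretraining the encoder on AudioSet, a large audio tagging dataset, affect captioning performance?
RQ3: How do decoder depth and the number of attention heads affect caption quality and the evaluation metrics?
RQ4: On AudioCaps, how does ACT compare with CNN-based and Transformer-based baselines in accuracy and efficiency?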

Key Findings

Model	BLEU 1	BLEU 2	BLEU 3	BLEU 4	ROUGE L	METEOR	CIDEr	SPICE	SPIDEr
ACT_s_DeiT_AudioSet	0.643	0.483	0.352	0.249	0.469	0.218	0.669	0.160	0.415
ACT_m_DeiT_AudioSet	0.653	0.495	0.363	0.259	0.471	0.222	0.663	0.163	0.413
ACT_l_DeiT_AudioSet	0.647	0.488	0.356	0.252	0.468	0.222	0.679	0.160	0.420
ACT_m_scratch	0.567	0.411	0.285	0.191	0.417	0.187	0.501	0.127	0.314
ACT_m_DeiT	0.606	0.445	0.319	0.224	0.445	0.207	0.586	0.147	0.367
RNN+RNN [3]	0.614	0.446	0.317	0.219	0.450	0.203	0.593	0.144	0.369
CNN+RNN [6]	0.655	0.476	0.335	0.231	0.467	0.229	0.660	0.168	0.414
CNN+Transformer [9]	0.641	0.479	0.344	0.236	0.469	0.221	0.693	0.159	0.426
CNN+Transformer_scratch [9]	0.610	0.461	0.334	0.234	0.455	0.206	0.629	0.144	0.386
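As a reading aid for the last column: SPIDEr is conventionally defined as the average of CIDEr and SPICE. The review itself does not state this definition, so treat it as background knowledge. The short check below recomputes SPIDEr from the CIDEr and SPICE values in the table; the helper name `spider` and the 0.001 tolerance (to allow for rounding of the three-decimal inputs) are choices made for this sketch.

```python
# (CIDEr, SPICE, reported SPIDEr) copied from the results table.
ROWS = {
    "ACT_s_DeiT_AudioSet": (0.669, 0.160, 0.415),
    "ACT_m_DeiT_AudioSet": (0.663, 0.163, 0.413),
    "ACT_l_DeiT_AudioSet": (0.679, 0.160, 0.420),
    "ACT_m_scratch": (0.501, 0.127, 0.314),
    "ACT_m_DeiT": (0.586, 0.147, 0.367),
    "RNN+RNN": (0.593, 0.144, 0.369),
    "CNN+RNN": (0.660, 0.168, 0.414),
    "CNN+Transformer": (0.693, 0.159, 0.426),
    "CNN+Transformer_scratch": (0.629, 0.144, 0.386),
}


def spider(cider, spice):
    """Standard SPIDEr definition: the mean of CIDEr and SPICE."""
    return (cider + spice) / 2


for name, (cider, spice, reported) in ROWS.items():
    recomputed = spider(cider, spice)
    print(f"{name:26s} recomputed={recomputed:.4f} reported={reported:.3f}")

# Largest gap between the recomputed and the reported SPIDEr values.
max_gap = max(abs(spider(c, s) - r) for c, s, r in ROWS.values())
assert max_gap <= 0.001
```

Every row agrees with the reported SPIDEr to within rounding, which also makes clear why ACT_l, with the highest CIDEr among the ACT variants, has the highest SPIDEr among them.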

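ACT is competitive with state-of-the-art methods on AudioCaps: its best SPIDEr is 0.420 (ACT_l), against 0.426 for CNN+Transformer.
Pretraining the encoder on AudioSet improves performance markedly, and DeiT initialization alone also gives a clear gain (SPIDEr for ACT_m: 0.314 from scratch, 0.367 with DeiT, 0.413 with DeiT and AudioSet).
Encoder pretraining is essential for Transformer-based audio captioning: trained from scratch, ACT falls short of CNN+Transformer, even its from-scratch version (SPIDEr 0.314 vs. 0.386).
ACT_m (4-layer decoder) scores highest among the ACT variants on the machine translation metrics BLEU 1 to 4 and ROUGE L, while ACT_l has the highest CIDEr (0.679) and SPIDEr (0.420) among them.
ACT trains faster than CNN+Transformer (under five minutes per epoch versus seven minutes).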

This review was generated by AI and checked by a human editor.