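[Paper Summary] DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm
DALI is a large multimodal dataset of 5358 songs with time-aligned vocal notes and lyrics. It is created automatically through an iterative teacher-student learning framework, which in turn improves Singing Voice Detection and audio-to-lyrics alignment.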
The goal of this paper is twofold. First, we introduce DALI, a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity. The second goal is to explain our methodology where dataset creation and learning models interact using a teacher-student machine learning paradigm that benefits each other. We start with a set of manual annotations of draft time-aligned lyrics and notes made by non-expert users of Karaoke games. This set comes without audio. Therefore, we need to find the corresponding audio and adapt the annotations to it. To that end, we retrieve audio candidates from the Web. Each candidate is then turned into a singing-voice probability over time using a teacher, a deep convolutional neural network singing-voice detection system (SVD), trained on cleaned data. Comparing the time-aligned lyrics and the singing-voice probability, we detect matches and update the time-alignment lyrics accordingly. From this, we obtain new audio sets. They are then used to train new SVD students used to perform again the above comparison. The process could be repeated iteratively. We show that this allows to progressively improve the performances of our SVD and get better audio-matching and alignment.
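Motivation and goals
- Provide a large, publicly available multimodal dataset of synchronized audio, lyrics, and vocal melody annotations, with lyrics at four levels of granularity.
- Describe an iterative teacher-student learning framework that improves singing voice detection (SVD) and the alignment between audio and annotations.
- Show that teacher-student learning on imperfect but larger training data can yield better cross-dataset generalization.
- Show how automatic retrieval and alignment of audio candidates can scale up datasets for music information retrieval (MIR) research.
Proposed method
- Collect manual, karaoke-based annotations (timing, notes, text) that come without the exact audio version they were made for.
- Retrieve candidate audio tracks using WASABI-linked song information and YouTube; pick the best match by using normalized cross-correlation (NCC) to compare each candidate's singing-voice probability sequence with the annotation voice sequence.
- Train a convolutional-network SVD system on labeled data and use it to compute the singing-voice probability p̂(t); align the annotations by an exhaustive search over the offset o and frame rate fr that maximizes the NCC.
- Iteratively train a "student" SVD on the larger matched set to improve p̂(t) and expand the dataset again, forming a teacher-student loop that improves alignment quality.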
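The offset and frame-rate search described in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(start, end)` segment format, the `hop` parameter (seconds per SVD frame), the time mapping `t_sec = offset + t_annotation / frame_rate`, and the cosine-style NCC formula are all assumptions made for the example.

```python
import numpy as np

def build_avs(segments, offset, frame_rate, n_frames, hop=0.1):
    """Binary annotation voice sequence (avs) on the audio time grid.

    segments   : list of (start, end) in annotation time units (assumed format)
    offset     : candidate offset o, in seconds
    frame_rate : candidate frame rate fr, in annotation units per second
    hop        : seconds per frame of the SVD output (assumed value)
    """
    avs = np.zeros(n_frames)
    for start, end in segments:
        # Assumed mapping from annotation time to seconds, then to frame index.
        a = int(round((offset + start / frame_rate) / hop))
        b = int(round((offset + end / frame_rate) / hop))
        a, b = max(a, 0), min(b, n_frames)
        if b > a:
            avs[a:b] = 1.0
    return avs

def ncc(avs, p_hat):
    """Normalized cross-correlation at zero lag between avs and p_hat."""
    denom = np.linalg.norm(avs) * np.linalg.norm(p_hat)
    return float(avs @ p_hat / denom) if denom > 0 else 0.0

def align(segments, p_hat, offsets, frame_rates, hop=0.1):
    """Exhaustive search: return (best_ncc, best_offset, best_frame_rate)."""
    best = (-1.0, None, None)
    for fr in frame_rates:
        for o in offsets:
            score = ncc(build_avs(segments, o, fr, len(p_hat), hop), p_hat)
            if score > best[0]:
                best = (score, o, fr)
    return best
```

The same score could plausibly be used to rank audio candidates: run `align` on each candidate's p̂(t) and keep the one with the highest NCC, possibly subject to a minimum threshold. The grid resolution for `offsets` and `frame_rates` trades search cost against alignment precision.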
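Experimental results
Research questions
- RQ1: Is it feasible to build a large multimodal dataset of synchronized audio, lyrics, and notes from karaoke annotations and audio candidates retrieved from the Web?
- RQ2: Does the teacher-student paradigm beat a single-pass system on singing voice detection and alignment quality in cross-dataset tests?
- RQ3: How does the cross-dataset generalization of SVD models trained on imperfect but larger data compare with that of models trained on small, high-quality annotated data?
- RQ4: How does an improved SVD affect the size and quality of the DALI dataset?
Key findings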
| SVD system (training tracks) | J_test (16 tracks) | M_test (36 tracks) |
|---|---|---|
| Teacher_J_train (61) | 87% | 82% |
| Student (Teacher_J_train) (2673) | 82% | 82% |
| Teacher_M_train (98) | 76% | 85% |
| Student (Teacher_M_train) (1596) | 80% | 84% |
| Teacher_J+M_train (159) | 82% | 82% |
| Student (Teacher_J+M_train) (2440) | 86% | 87% |
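- DALI contains 5358 songs with full audio, time-aligned vocal notes, and lyrics at four levels of granularity.
- The convolutional singing voice detector (the teacher) selects audio candidates by maximizing the cross-correlation between its output and the annotation voice sequence (avs).
- In the teacher-student experiments, students generally match or outperform their teachers on the dataset the teacher was not trained on (J = Jamendo, M = MedleyDB).
- For example, the student of Teacher_M_train reaches 80% frame-level accuracy on J_test versus 76% for its teacher, and the student of Teacher_J+M_train reaches 86% on J_test and 87% on M_test (about 86.5% on average).
- The dataset-creation method thus benefits from a common deep-learning trade-off: a larger, imperfect training set can serve better than a smaller, clean one.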
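The percentages above are frame-level accuracies. As a rough illustration of how such a score can be computed from an SVD output, here is a minimal sketch; the function name and the 0.5 decision threshold are assumptions, and the paper's exact evaluation protocol is not given in this summary.

```python
import numpy as np

def frame_accuracy(p_hat, reference, threshold=0.5):
    """Fraction of frames whose thresholded prediction matches the reference.

    p_hat     : singing-voice probability per frame
    reference : ground-truth voice activity per frame (0/1 or bool)
    threshold : decision threshold (assumed value, not taken from the paper)
    """
    p_hat = np.asarray(p_hat, dtype=float)
    reference = np.asarray(reference, dtype=bool)
    if p_hat.shape != reference.shape:
        raise ValueError("p_hat and reference must have the same shape")
    predicted = p_hat >= threshold
    return float(np.mean(predicted == reference))
```

For instance, `frame_accuracy([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 0])` counts three of four frames as correct.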
This summary was generated by AI and reviewed by a human editor.