QUICK REVIEW

[Paper Review] Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Curtis Hawthorne, Andriy Stasyuk|arXiv (Cornell University)|Oct 29, 2018

Music and Audio Processing | Cited by 149

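One-line summary

This paper introduces Wave2Midi2Wave, a factorized pipeline for piano music modeling built on discrete note events. Enabled by the MAESTRO dataset, it supports transcription, generation, and synthesis of audio with long-term musical structure. The paper also releases MAESTRO, a large-scale dataset of aligned audio and MIDI for training and evaluation.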

ABSTRACT

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
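The "six orders of magnitude" figure in the abstract follows from the two endpoints it quotes. A quick check, using only the approximate values stated above (~0.1 ms and ~100 s):

```python
import math

# Endpoints quoted in the abstract (approximate values).
shortest_s = 0.1e-3   # ~0.1 ms, expressed in seconds
longest_s = 100.0     # ~100 s

# Number of orders of magnitude separating the two timescales.
orders = math.log10(longest_s / shortest_s)
print(round(orders))
```

The ratio is 10^6, so a model covering both ends has to stay coherent from individual waveform samples up to phrase-level structure.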

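Motivation and objectives

Motivate the use of discrete note events as an intermediate representation for modeling piano music across multiple timescales.
Propose a factorized architecture (Wave2Midi2Wave) comprising transcription, language modeling, and conditional audio synthesis.
Release MAESTRO, a large-scale, finely aligned dataset that enables supervised training for the transcription, modeling, and synthesis tasks.
Demonstrate state-of-the-art piano transcription on MAPS and show coherent piano generation and synthesis guided by MIDI data.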

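Proposed method

The system is defined as three components: (i) an encoder that maps audio to symbolic MIDI notes (Onsets and Frames transcription); (ii) a prior that models MIDI note sequences with a self-attention-based music language model; (iii) a decoder that renders audio from MIDI using a MIDI-conditioned WaveNet.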
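The three-stage factorization can be sketched as three functions that communicate only through note events. This is a toy illustration, not the authors' implementation: `NoteEvent`, `encode`, `prior_continue`, `decode`, and the 16 kHz sample rate are all hypothetical names and values chosen for this sketch, and each function body is a trivial placeholder standing in for the real model (transcription network, music language model, conditioned WaveNet).

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class NoteEvent:
    """One discrete note: MIDI pitch, onset and duration in seconds, velocity."""
    pitch: int
    onset: float
    duration: float
    velocity: int


def encode(audio: List[float], sample_rate: int) -> List[NoteEvent]:
    """Stand-in for the transcription encoder (audio -> notes).

    Placeholder only: returns one fixed note spanning the clip instead of
    running a real transcription model.
    """
    return [NoteEvent(pitch=60, onset=0.0,
                      duration=len(audio) / sample_rate, velocity=80)]


def prior_continue(notes: List[NoteEvent], n_new: int) -> List[NoteEvent]:
    """Stand-in for the note-level language model (notes -> more notes).

    Placeholder only: repeats the last note a whole tone higher each step.
    """
    out = list(notes)
    for _ in range(n_new):
        last = out[-1]
        out.append(NoteEvent(last.pitch + 2, last.onset + last.duration,
                             last.duration, last.velocity))
    return out


def decode(notes: List[NoteEvent], sample_rate: int) -> List[float]:
    """Stand-in for the MIDI-conditioned decoder (notes -> audio).

    Placeholder only: renders each note as a plain sine wave.
    """
    total = max(n.onset + n.duration for n in notes)
    audio = [0.0] * int(round(total * sample_rate))
    for n in notes:
        freq = 440.0 * 2 ** ((n.pitch - 69) / 12)  # MIDI pitch -> Hz
        start = int(round(n.onset * sample_rate))
        end = min(len(audio), start + int(round(n.duration * sample_rate)))
        for i in range(start, end):
            t = (i - start) / sample_rate
            audio[i] += (n.velocity / 127) * math.sin(2 * math.pi * freq * t)
    return audio


# Wave -> MIDI -> Wave: the stages are connected only by note events.
SAMPLE_RATE = 16000                             # assumed value for this sketch
input_audio = [0.0] * (SAMPLE_RATE // 2)        # 0.5 s of silence as dummy input
notes = encode(input_audio, SAMPLE_RATE)        # (i) encoder
extended = prior_continue(notes, n_new=3)       # (ii) prior
output_audio = decode(extended, SAMPLE_RATE)    # (iii) decoder
print(len(extended), len(output_audio))
```

Because the interface between stages is just a list of notes, any one placeholder could in principle be swapped for a trained model without touching the other two, which is the practical appeal of the factorized design.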

Experimental results

Research questions

RQ1) Can a factorized pipeline using MIDI as an intermediate representation reproduce coherent piano music across very long timescales?
RQ2) Does releasing a large, well-aligned MAESTRO dataset enable state-of-the-art transcription and effective training of language and synthesis models?
RQ3) How does transcribed or ground-truth MIDI conditioned WaveNet compare in audio quality to end-to-end approaches?
RQ4) Can the framework scale to longer musical structures (up to ~1 minute) and generalize to unseen performances?
RQ5) Can the approach be extended to other instruments or multi-instrument setups?

Key findings

The system combines transcription, a language model, and a MIDI-conditioned WaveNet to produce about one minute of coherent piano music.
MAESTRO contains over 172 hours of aligned audio and MIDI, with approximately 3 ms alignment accuracy.
The modified Onsets and Frames transcription model achieves state-of-the-art results on a piano transcription benchmark (MAPS) under configured settings.
Music Transformer models trained on MAESTRO and MAESTRO-T achieve competitive validation negative log-likelihoods.
WaveNet conditioned on MIDI can reproduce timbral and room characteristics and yields perceptually realistic outputs in listening tests.
Listening tests show significant differences between sources, with real recordings comparable to some WaveNet conditioned outputs in perceived realism.
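The validation negative log-likelihood mentioned for the language models can be read as the average "surprise" per event token: lower means the model assigned higher probability to what was actually played. A minimal sketch of that calculation, assuming a hypothetical list of probabilities the model gave to each correct next token (the numbers are made up for illustration and are not results from the paper; the unit here is nats, which is a choice of this sketch):

```python
import math


def mean_nll(true_token_probs):
    """Average negative log-likelihood per token, in nats."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


# Illustrative probabilities assigned to the correct token at three steps.
probs = [0.5, 0.25, 0.125]
nll = mean_nll(probs)
print(nll)
```

A model that always put probability 1 on the correct token would score 0; the more probability mass it spreads onto wrong tokens, the higher the value.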


This review was generated by AI and checked by a human editor.