QUICK REVIEW

[Paper Review] Music Transformer

Cheng-Zhi Anna Huang, Ashish Vaswani|arXiv (Cornell University)|Sep 12, 2018

Music and Audio Processing|Cited by 48

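One-line summary

A Transformer with memory-efficient relative attention captures long-term musical structure, makes longer sequences feasible, and improves sample quality. It is evaluated on JSB Chorales and Piano-e-Competition.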

ABSTRACT

Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.

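Motivation and objectives

Show that a Transformer can generate music with long-range, repeating structure that spans multiple timescales.
Extend the Transformer with relative timing information (and optionally pitch information) to better model musical relationships.
Reduce the memory cost of relative attention so that training on long sequences becomes possible.
Introduce sequence representations of music (as score-like data and as performance-like data) and evaluate both unconditional generation and melody-conditioned accompaniment.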

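Proposed method

Introduce a memory-efficient relative self-attention mechanism that reduces intermediate memory from O(L^2 D) to O(LD), together with a skewing procedure that aligns the relative logits.
Represent music as token sequences with an encoding suited to each dataset: JSB Chorales on a sixteenth-note grid, Piano-e-Competition as MIDI-like event-based tokens.
Include timing relations (and optionally pitch) in relative attention as relations between positions, and integrate the S^rel logits into the attention mechanism.
Run experiments with global and local relative attention on JSB Chorales and Piano-e-Competition, comparing against baselines and existing models.
Conduct human listening tests and analyze melody priming/conditioning to assess coherence and musicality.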
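
The skewing step in the method above can be sketched in plain Python. This is a minimal single-head illustration, not the authors' code: it assumes `e_r` holds one embedding per relative distance from -(L-1) to 0, that a causal mask hides positions j > i, and all function names are hypothetical. The naive version looks up an embedding for every (i, j) pair, which stands in for the O(L^2 D) intermediate; the skewed version only ever keeps an (L, L) matrix of logits.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def skew(qe):
    """Realign qe[i][r] (r encodes relative distance r - (L - 1))
    so that the result is indexed by absolute positions [i][j]."""
    L = len(qe)
    # 1. Pad one dummy column on the left: shape (L, L + 1).
    padded = [[0.0] + list(row) for row in qe]
    # 2. Flatten row-major and reshape to (L + 1, L).
    flat = [x for row in padded for x in row]
    reshaped = [flat[k * L:(k + 1) * L] for k in range(L + 1)]
    # 3. Drop the first row. Entries with j > i are leftovers
    #    and must be hidden by the causal mask.
    return reshaped[1:]


def relative_logits_skewed(q, e_r):
    """S^rel via Q E_r^T followed by skewing (assumed layout of e_r above)."""
    L = len(q)
    qe = [[dot(q[i], e_r[r]) for r in range(L)] for i in range(L)]
    s = skew(qe)
    return [[s[i][j] if j <= i else 0.0 for j in range(L)] for i in range(L)]


def relative_logits_naive(q, e_r):
    """Reference version: one embedding lookup per (i, j) pair."""
    L = len(q)
    # r_tensor[i][j] is the embedding for distance j - i (an L x L x D object).
    r_tensor = [[e_r[L - 1 - (i - j)] if j <= i else None for j in range(L)]
                for i in range(L)]
    return [[dot(q[i], r_tensor[i][j]) if j <= i else 0.0 for j in range(L)]
            for i in range(L)]


if __name__ == "__main__":
    q = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]      # L = 3 queries, D = 2
    e_r = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]    # distances -2, -1, 0
    print(relative_logits_skewed(q, e_r) == relative_logits_naive(q, e_r))
```

The two functions agree on the causal (lower-triangular) part, which is the only part attention uses in this setting. In a full model the result would be added to the content logits before the softmax; that step is omitted here.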

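Experimental results

Research questions

RQ1: Does relative self-attention improve perplexity and sample quality over the baseline Transformer on symbolic music datasets?
RQ2: Can the model generate coherent long-term musical structure and continuations that extend beyond the training sequence length?
RQ3: Does incorporating timing (and pitch) relations improve performance and generalization?
RQ4: How does the model perform in a seq2seq setting that generates accompaniment conditioned on a melody?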

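Key findings

Relative attention improves negative log-likelihood and sample coherence over the baseline Transformer on the JSB Chorales dataset.
On Piano-e-Competition, the Transformer with relative attention achieves state-of-the-art perplexity and outperforms the baseline models.
The memory-efficient implementation reduces intermediate memory usage, which enables training on long sequences (thousands of steps) and yields longer compositions.
Models with relative attention handle motif priming and continuation better, preserving long-term structure and rhythmic phrasing.
In the melody-conditioned setting, the relative Transformer achieves better conditional NLL than the baseline.
Human evaluation indicates that the relative attention model is perceived as more musical than the baseline.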

This review was generated by AI and checked by a human editor.