QUICK REVIEW

[Paper Review] Music Transformer

Cheng-Zhi Anna Huang, Ashish Vaswani | arXiv (Cornell University) | 2018-09-12

Music and Audio Processing | Citations: 48

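One-Line Summary

TL;DR: A Transformer with memory-efficient relative attention captures long-term musical structure, makes much longer sequences tractable, and improves sample quality. It is evaluated on JSB Chorales and Piano-e-Competition.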

ABSTRACT

Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.

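Research Motivation and Goals

Show that a Transformer can generate music with long-range repetition structure across multiple timescales.
Extend the Transformer to include relative timing (and optionally pitch) information, improving how musical relationships are modeled.
Reduce the memory cost of relative attention so that training on long sequences becomes feasible.
Present sequence representations of music for both score-like and performance-like data, and evaluate both unconditional generation and melody-conditioned accompaniment.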

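Proposed Method

Introduce a memory-efficient relative self-attention mechanism that reduces intermediate memory from O(L^2 D) to O(LD), together with a skewing procedure that aligns the relative logits.
Represent music as token sequences using an encoding suited to each dataset (a sixteenth-note grid for JSB Chorales; MIDI-like event-based tokens for Piano-e-Competition).
Extend relative attention to capture timing (and optionally pitch) relations between positions, and incorporate the resulting S^rel logits into the attention mechanism.
Experiment with global and local relative attention on JSB Chorales and Piano-e-Competition, and compare against the baseline Transformer and prior models.
Run a human listening test to assess coherence and musicality, and analyze priming and conditioning on melodies.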
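
The skewing step mentioned above can be sketched in a few lines of NumPy. This is an illustrative, single-head sketch and not the authors' implementation: the function names, the `e_rel` layout (row `L - 1` holds relative distance 0, row 0 holds distance `-(L - 1)`), and the causal-only setting are assumptions made for the example. It compares a naive version, which materializes an `L x L x D` tensor of relative embeddings, with the skewed version, which multiplies `Q` by the `L x D` embedding table and then realigns the result by padding, reshaping, and slicing.

```python
import numpy as np


def relative_logits_naive(q, e_rel):
    """Reference version: gathers an (L, L, D) tensor of relative embeddings.

    q:     (L, D) queries for one attention head.
    e_rel: (L, D) relative embeddings; row L-1 is distance 0,
           row 0 is distance -(L-1). (Assumed layout for this sketch.)
    """
    L, _ = q.shape
    i = np.arange(L)[:, None]          # query positions
    j = np.arange(L)[None, :]          # key positions
    # Embedding row for distance j - i. Entries with j > i fall outside
    # the table; they are clipped here because a causal mask hides them.
    idx = np.clip(j - i + L - 1, 0, L - 1)
    r = e_rel[idx]                     # (L, L, D): the quadratic intermediate
    return np.einsum("id,ijd->ij", q, r)


def relative_logits_skew(q, e_rel):
    """Skewed version: no (L, L, D) intermediate is ever built."""
    L, _ = q.shape
    qe = q @ e_rel.T                           # (L, L), indexed by (query, distance)
    padded = np.pad(qe, ((0, 0), (1, 0)))      # dummy zero column on the left -> (L, L+1)
    reshaped = padded.reshape(L + 1, L)        # row-major reshape shifts each row
    return reshaped[1:]                        # (L, L), indexed by (query, key)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, D = 6, 4
    q = rng.normal(size=(L, D))
    e_rel = rng.normal(size=(L, D))
    causal = np.tril(np.ones((L, L), dtype=bool))
    naive = relative_logits_naive(q, e_rel)
    skewed = relative_logits_skew(q, e_rel)
    # Only the lower triangle (key <= query) is meaningful under causal masking.
    print(np.allclose(naive[causal], skewed[causal]))
```

Entries above the diagonal of the skewed output are leftovers from the reshape and must be removed by the causal mask before the softmax. The `L x L` logit matrix itself is still quadratic in `L`; what the skew avoids is the extra factor of `D` in the intermediate tensor.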

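Experimental Results

Research Questions

RQ1: Does relative self-attention improve perplexity and sample quality over the baseline Transformer on symbolic music datasets?
RQ2: Can the model generate coherent long-term musical structure and continuations that are longer than its training sequences?
RQ3: Does including timing (and pitch) relations bring further gains in performance and generalization?
RQ4: How well does the model perform in a sequence-to-sequence setup that generates accompaniment conditioned on a melody?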

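Key Results

On JSB Chorales, relative attention improves negative log-likelihood (NLL) and sample coherence over the baseline Transformer.
On Piano-e-Competition, the Transformer with relative attention achieves state-of-the-art perplexity and outperforms the baseline model.
The memory-efficient implementation reduces intermediate memory use, making it possible to train on long sequences (thousands of steps) and therefore to generate longer compositions.
The relative attention model handles motif priming and continuation better, preserving long-term structure and regular phrasing.
In the melody-conditioned setup, the relative Transformer achieves a better conditional NLL than the baseline.
In the human evaluation, the relative attention model shows a perceptual improvement in musicality over the baseline.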
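
The results above are stated in terms of both NLL and perplexity. As a reminder of how the two relate, here is a minimal sketch that assumes NLL is measured in nats per token, so that perplexity is the exponential of the mean NLL. The function names and the toy probabilities are hypothetical and are not figures from the paper.

```python
import math


def mean_nll(token_probs):
    """Mean negative log-likelihood (nats per token).

    token_probs: probabilities the model assigned to each true token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity as exp(mean NLL), assuming NLL is in nats."""
    return math.exp(mean_nll(token_probs))


if __name__ == "__main__":
    # Toy input: a model that spreads probability evenly over 4 choices.
    uniform_over_four = [0.25] * 8
    print(mean_nll(uniform_over_four), perplexity(uniform_over_four))
```

Lower is better for both quantities, and because the exponential is monotonic, a model that wins on mean NLL also wins on perplexity.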

This review was generated by AI and checked by a human editor.