QUICK REVIEW

[Paper Review] Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation

Botao Yu, Peiling Lu | arXiv (Cornell University) | 2022-10-19

Music Technology and Sound Studies | Citations: 20

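One-Line Summary

Museformer introduces fine- and coarse-grained attention into Transformer-based symbolic music generation, enabling long-sequence modeling and better musical structure at lower computational complexity. It applies fine-grained attention to the structure-related bars it selects and attends only to summaries of the other bars, balancing efficiency and quality.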

ABSTRACT

Symbolic music generation aims to generate music scores automatically. A recent trend is to use Transformer or its variants in music generation, which is, however, suboptimal, because the full attention cannot efficiently model the typically long music sequences (e.g., over 10,000 tokens), and the existing models have shortcomings in generating musical repetition structures. In this paper, we propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token only attends to the summarization of the other bars rather than each token of them so as to reduce the computational cost. The advantages are two-fold. First, it can capture both music structure-related correlations via the fine-grained attention, and other contextual information via the coarse-grained attention. Second, it is efficient and can model over 3X longer music sequences compared to its full-attention counterpart. Both objective and subjective experimental results demonstrate its ability to generate long music sequences with high quality and better structures.

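Motivation and Goals

Address long-sequence modeling in symbolic music generation, where full self-attention reaches its limits.
Model musical structure, such as repetition and long-range dependencies, more effectively.
Reduce computational and memory complexity while preserving the information that generation depends on.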

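Proposed Method

Proposes FC-Attention: fine-grained attention over structure-related bars and coarse-grained attention over summaries of the other bars.
Inserts a summary token after each bar to aggregate that bar's local information.
Selects structure-related bars using bar-to-bar similarity statistics computed on human-composed music.
Updates token representations inside FC-Attention through a two-step process of summarization and aggregation.
Represents bar and beat positions with bar-beat embeddings and uses a Transformer-like architecture equipped with FC-Attention.
Evaluates on the Lakh MIDI dataset using perplexity, similarity error, and subjective listening tests.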
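
The fine- and coarse-grained pattern described above can be sketched as a boolean attention mask. This is a minimal illustration, not the authors' implementation. The `<sum>` symbol, the function names, and the default offsets (taken from the abstract's example of the previous 1st, 2nd, 4th, and 8th bars) are assumptions. So are two modeling choices: a token attends causally within its own bar, and the summaries of fine-grained bars are skipped.

```python
from typing import List, Sequence, Tuple

SUMMARY = "<sum>"  # hypothetical summary-token symbol, not from the paper


def add_summary_tokens(bars: List[List[str]]) -> Tuple[List[str], List[int], List[bool]]:
    """Flatten bars into one sequence, appending a summary token after each bar."""
    tokens, bar_ids, is_summary = [], [], []
    for b, bar in enumerate(bars):
        for tok in bar:
            tokens.append(tok)
            bar_ids.append(b)
            is_summary.append(False)
        tokens.append(SUMMARY)
        bar_ids.append(b)
        is_summary.append(True)
    return tokens, bar_ids, is_summary


def fc_attention_mask(
    bar_ids: Sequence[int],
    is_summary: Sequence[bool],
    offsets: Sequence[int] = (1, 2, 4, 8),  # assumed structure-related bar offsets
) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    n = len(bar_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        qb = bar_ids[q]
        fine_bars = {qb - d for d in offsets if qb - d >= 0}
        for k in range(n):
            kb = bar_ids[k]
            if is_summary[q]:
                # Summarization: a summary token sees its own bar (and itself).
                mask[q][k] = kb == qb
            elif kb == qb:
                # Own bar: ordinary causal attention.
                mask[q][k] = k <= q
            elif kb in fine_bars:
                # Fine-grained: every music token of a structure-related bar.
                mask[q][k] = not is_summary[k]
            elif kb < qb:
                # Coarse-grained: only the summary token of other earlier bars.
                mask[q][k] = is_summary[k]
    return mask


if __name__ == "__main__":
    bars = [[f"n{b}a", f"n{b}b"] for b in range(10)]
    tokens, bar_ids, is_summary = add_summary_tokens(bars)
    mask = fc_attention_mask(bar_ids, is_summary)
    first_token_of_last_bar = 27
    print([tokens[k] for k in range(len(tokens)) if mask[first_token_of_last_bar][k]])
```

For a token in the tenth bar, this mask admits the music tokens of bars 9, 8, 6, and 2 (counting from 1) plus one summary token for each remaining earlier bar.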

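Experimental Results

Research Questions

RQ1: Can the two attention schemes (fine- and coarse-grained) model long music sequences better than full attention or other long-sequence Transformers?
RQ2: Do structure-related bars selected via similarity statistics improve musical structure and perplexity?
RQ3: How does Museformer scale to full-song length in terms of memory, speed, and quality?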

Key Results

Model	PPL (1024)	PPL (5120)	PPL (10240)	SE (%)
Music Transformer	1.66	1.77	2.55	2.49
Transformer-XL	1.64	1.45	1.43	15.66
Longformer	1.65	1.46	1.45	5.25
Linear Transformer	1.86	1.67	1.64	1.97
Museformer (ours)	1.64	1.41	1.35	0.95
w/o coarse-grained	1.65	1.42	1.38	1.08
w/o bar selection	1.65	1.43	1.39	6.39
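
As a reading aid for the PPL columns, the snippet below computes perplexity in its standard form, the exponential of the mean negative log-likelihood per token. It assumes natural-log probabilities and per-token averaging. The function name is illustrative, and the review does not state the paper's exact evaluation protocol.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities the model assigned to each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

Under this definition, a model that assigns probability 0.5 to every target token has a perplexity of 2, so lower values mean the model is less "surprised" by held-out music.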

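Museformer achieves the lowest perplexity among the compared models at 5120 and 10240 tokens (1.41 and 1.35) and ties Transformer-XL for the lowest at 1024 tokens (1.64).
It has the lowest similarity error (0.95%), suggesting that the structure of its generated music is closest to that of human-made music.
In the subjective evaluation, Museformer received the highest scores for musicality, short-term structure, long-term structure, and overall preference.
Coarse-grained attention and bar selection both contribute to performance: removing either raises perplexity and similarity error, and structure-related bar selection matters more as sequences grow longer.
Compared with the full-attention baseline, Museformer's efficiency gains (near-linear memory growth and more than 3x longer sequences) make it possible to compose full-song-length music.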
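
To give a rough sense of where the savings come from, the sketch below counts query-key pairs under full causal attention and under a fine/coarse pattern. It is a back-of-the-envelope model, not the paper's complexity analysis. It assumes every bar has the same number of tokens, one summary token per bar, and offsets of 1, 2, 4, and 8 bars. The function names are hypothetical, and pair counts are only a proxy for real memory and speed.

```python
from typing import Sequence


def full_causal_pairs(num_bars: int, tokens_per_bar: int) -> int:
    """Query-key pairs when every token attends to itself and all earlier tokens."""
    n = num_bars * tokens_per_bar
    return n * (n + 1) // 2


def fc_pairs(
    num_bars: int,
    tokens_per_bar: int,
    offsets: Sequence[int] = (1, 2, 4, 8),  # assumed structure-related bar offsets
) -> int:
    """Query-key pairs under a simplified fine- and coarse-grained pattern."""
    total = 0
    for b in range(num_bars):
        fine = sum(1 for d in offsets if b - d >= 0)  # bars attended token by token
        coarse = b - fine                             # bars seen via one summary each
        own = tokens_per_bar * (tokens_per_bar + 1) // 2  # causal attention in own bar
        total += own + tokens_per_bar * (fine * tokens_per_bar + coarse)
        total += tokens_per_bar + 1  # summary token attends to its bar and itself
    return total


if __name__ == "__main__":
    for bars in (10, 50, 100):
        print(bars, full_causal_pairs(bars, 100), fc_pairs(bars, 100))
```

In this toy model the fine-grained cost per bar is bounded by the number of offsets, while each coarse-grained bar costs one key per query instead of a whole bar of keys. This is why the gap to full attention widens as the piece gets longer.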

This review was generated by AI and reviewed by a human editor.