QUICK REVIEW

[Paper Review] TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

Sicong Huang, Qiyang Li | arXiv (Cornell University) | Nov 22, 2018

Music and Audio Processing | Cited by 67

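One-line summary

TimbreTron uses CycleGAN to apply image-style style transfer to log-CQT spectrograms, then reconstructs high-quality audio with a conditional WaveNet. It shows that CQT-based timbre transfer outperforms STFT-based methods at transferring timbre while preserving musical content.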

ABSTRACT

In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.

Motivation and objectives

Frame musical timbre transfer as an image-style transfer problem on time-frequency representations.
Explore the Constant Q Transform (CQT) as a space that supports pitch-equivariant convolutions for timbre manipulation.
Develop a three-stage TimbreTron pipeline: CQT extraction, CycleGAN-based timbre transfer in log-CQT domain, and WaveNet-based waveform reconstruction.
Show that CQT-based TimbreTron yields perceptually better timbre transfer than STFT-based variants through human studies.
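
The pitch-equivariance argument behind the CQT objective above can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names (`cqt_bin`, `stft_bin`), the lowest CQT frequency, the sample rate, and the FFT size are all illustrative assumptions. It suggests that raising a harmonic tone by three semitones moves every partial by the same number of bins on a geometric (CQT-style) axis, while on a linear (STFT-style) axis each partial moves by a different amount.

```python
import math

def cqt_bin(freq_hz, f_min=32.70, bins_per_octave=12):
    """Fractional bin index on a geometric axis: f_k = f_min * 2**(k / B).

    f_min (about C1) and bins_per_octave are assumed values for illustration.
    """
    return bins_per_octave * math.log2(freq_hz / f_min)

def stft_bin(freq_hz, sample_rate=16000, n_fft=2048):
    """Fractional bin index on a linear axis (assumed sample rate and FFT size)."""
    return freq_hz * n_fft / sample_rate

def partials(f0, count=4):
    """Frequencies of the first `count` harmonics of a tone with fundamental f0."""
    return [h * f0 for h in range(1, count + 1)]

def shift_semitones(f0, semitones):
    """Fundamental frequency after an equal-tempered pitch shift."""
    return f0 * 2 ** (semitones / 12)

original = partials(220.0)
shifted = partials(shift_semitones(220.0, 3))

# How far each partial moves along each frequency axis, in bins.
cqt_offsets = [cqt_bin(s) - cqt_bin(o) for o, s in zip(original, shifted)]
stft_offsets = [stft_bin(s) - stft_bin(o) for o, s in zip(original, shifted)]

print("CQT offsets :", [round(v, 6) for v in cqt_offsets])
print("STFT offsets:", [round(v, 3) for v in stft_offsets])
```

Because the whole harmonic pattern translates rigidly along the geometric axis, a convolutional filter that responds to the pattern at one pitch can respond to it at another. This is the intuition behind the approximate pitch equivariance mentioned in the abstract.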

Proposed method

Compute log-magnitude CQT spectrograms from audio and treat them as images for style transfer.
Apply CycleGAN with full-spectrogram discriminator, gradient penalty, and identity loss to transfer timbre in the log-CQT domain.
Train a 40-layer conditional WaveNet to reconstruct waveform from generated log-CQT with nearest-neighbor upsampling and mu-law quantization.
Use autoregressive WaveNet with beam search to better match the target CQT while generating audio.
Optionally generate waveforms in reverse order to mitigate onset-related artifacts during forward generation.
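
Two of the WaveNet-side steps listed above, mu-law quantization and nearest-neighbor upsampling of the conditioning spectrogram, can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and the 8-bit setting (`MU = 255`, 256 classes) follows the original WaveNet convention and is an assumption here, since the review does not state the bit depth.

```python
import math

MU = 255  # assumption: 8-bit mu-law (256 classes); not stated in the review

def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1) / 2 * mu + 0.5)

def mu_law_decode(q, mu=MU):
    """Approximate inverse of mu_law_encode: class index back to [-1, 1]."""
    compressed = 2 * q / mu - 1
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)

def upsample_nearest(frames, hop):
    """Repeat each spectrogram frame `hop` times so that the conditioning
    signal has one frame per audio sample (nearest-neighbor upsampling)."""
    return [frame for frame in frames for _ in range(hop)]

# Round trip of a single sample through the quantizer.
sample = 0.5
restored = mu_law_decode(mu_law_encode(sample))
print(mu_law_encode(sample), round(restored, 4))

# Two toy "CQT frames" stretched to six conditioning steps (hop of 3).
conditioning = upsample_nearest([[0.1, 0.2], [0.3, 0.4]], hop=3)
print(len(conditioning))
```

Mu-law companding spends more quantization levels on quiet samples, which keeps the categorical output of an autoregressive model small without audibly coarsening low-amplitude passages. The hop size of 3 is purely for display; a real pipeline would use the hop length of its CQT.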

Experimental results

Research questions

RQ1: Can CQT-based representations facilitate accurate timbre transfer across instruments while preserving pitch, rhythm, and loudness?
RQ2: Does CycleGAN-based timbre transfer on log-CQT spectrograms outperform STFT-based approaches in perceptual quality?
RQ3: How well can a WaveNet vocoder reconstruct high-quality audio from generated log-CQT representations?
RQ4: Does the TimbreTron pipeline generalize across instrument pairs and from MIDI to real-world audio?
RQ5: What ablations of CycleGAN components affect timbre transfer quality and musical content preservation?

Key findings

TimbreTron achieves recognizably transferred timbre while preserving musical content in both monophonic and polyphonic cases.
CQT-based TimbreTron shows qualitatively better timbre transfer than STFT-based variants in human studies.
Ablation studies indicate improvements from full-spectrogram discriminators, gradient penalty, and identity loss.
CQT representations enable more reliable pitch transfer and timbre manipulation than STFT, with fewer pitch permutation artifacts.
Generalization experiments demonstrate plausible transfer when training on MIDI data and testing on real-world audio.
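
Two of the CycleGAN terms named in the ablation finding above, cycle consistency and identity loss, can be written down compactly. The sketch below is a toy illustration under stated assumptions: the generators are hypothetical stand-in functions rather than learned networks, spectrograms are flattened to plain lists, and the L1 form of each loss follows the standard CycleGAN formulation rather than anything specified in this review. The gradient penalty is omitted because it requires automatic differentiation.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_loss(x, g_forward, g_backward):
    """Cycle consistency: mapping to the other domain and back
    should return the original input."""
    return l1(g_backward(g_forward(x)), x)

def identity_loss(y, g_forward):
    """Identity: an input already in the target domain
    should pass through the generator unchanged."""
    return l1(g_forward(y), y)

# Hypothetical stand-ins for the two generators (not learned networks).
def to_target(spec):
    return [v + 1.0 for v in spec]

def to_source(spec):
    return [v - 1.0 for v in spec]

spectrogram = [0.0, 0.5, 1.0]

# The toy generators are exact inverses, so the round trip is lossless.
cycle_value = cycle_loss(spectrogram, to_target, to_source)

# The toy forward generator always adds 1.0, so it is penalized for
# altering an input that is treated as already being in the target domain.
identity_value = identity_loss(spectrogram, to_target)

print(cycle_value, identity_value)
```

In a timbre-transfer setting, the identity term discourages the generator from altering content it has no reason to change, which is consistent with the review's report that it helps preserve musical content.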


This review was generated by AI and checked by a human editor.