In this post, we introduce MelGAN, a new generative model of raw audio waveforms created by the Lyrebird team that is capable of generating natural sounding speech at a rate of more than 2,500,000 audio samples per second — more than 100x faster than real time, and 10x faster than alternative methods on similar hardware.
We believe that MelGAN paves the way for bringing many real-time speech applications to smaller devices. Imagine, for example, having real-time text-to-speech translation on your mobile device without an internet connection in the not-too-distant future. And its application to music translation brings us one step closer to AI-assisted music composing.
We’ve open sourced MelGAN and we encourage interested machine learning developers and researchers to check out our code base.
MelGAN in Sum
What is MelGAN?
MelGAN is a non-autoregressive feed-forward convolutional neural network architecture to perform audio waveform generation in a GAN setup. It’s the first of its kind for spectrogram inversion — and it’s a lot faster, and a lot smaller, without being any less effective.
Mel stands for mel spectrogram, a spectrogram whose frequency axis uses the mel scale. The mel scale maps frequencies in Hz to values whose spacing matches how far apart the tones sound to the human ear. It is a “perceptual” scale, where each tone in Hz has a perceptual pitch on the mel scale.
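As a rough illustration of what "perceptual" means here, the sketch below uses one widely used Hz-to-mel conversion, 2595 * log10(1 + f / 700). Several variants of the mel formula exist, and this post does not say which one MelGAN's spectrograms use, so treat the formula and the function names as assumptions.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mels with one common variant of the mel formula (an assumption)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 500 Hz gap covers far more mels at low frequencies than at high ones,
# mirroring the ear's finer pitch resolution at low frequencies.
low_step = hz_to_mel(1000) - hz_to_mel(500)
high_step = hz_to_mel(8000) - hz_to_mel(7500)
print(f"500 -> 1000 Hz spans {low_step:.1f} mels")
print(f"7500 -> 8000 Hz spans {high_step:.1f} mels")
```

A mel spectrogram applies this kind of warping to the frequency axis of an ordinary spectrogram, so more of its bins are spent on the low frequencies where listeners hear fine differences.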
GAN stands for generative adversarial networks (GANs). A typical GAN model consists of a generator and a discriminator. The generator synthesizes fake data samples while the discriminator distinguishes between generated data and real data. It’s like a counterfeiter and a detective: the counterfeiter produces fake notes while the detective tries to detect them. As the discriminator is trained to minimize its errors in detecting fake samples, the generator is simultaneously trained to fool the discriminator with improved synthesized data quality. Both the counterfeiter and the detective improve at their tasks, and the fake data samples improve over time.
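The counterfeiter and detective game can be written as two opposing loss functions. The sketch below uses the textbook cross-entropy formulation of a GAN, with a non-saturating generator loss. It is only meant to show the opposing objectives; it is not necessarily the loss MelGAN is trained with, and the function names are made up for illustration.

```python
import math


def discriminator_loss(p_real: float, p_fake: float) -> float:
    """Loss for the detective.

    p_real: probability the discriminator assigns to "real" for a real sample.
    p_fake: probability it assigns to "real" for a generated sample.
    The loss is low when p_real is near 1 and p_fake is near 0.
    """
    return -(math.log(p_real) + math.log(1.0 - p_fake))


def generator_loss(p_fake: float) -> float:
    """Loss for the counterfeiter: low when the discriminator is fooled (p_fake near 1)."""
    return -math.log(p_fake)


# A confident, correct discriminator has a low loss while the generator's loss is high.
print(discriminator_loss(p_real=0.9, p_fake=0.1), generator_loss(p_fake=0.1))
# Once the generator improves, the picture flips.
print(discriminator_loss(p_real=0.6, p_fake=0.8), generator_loss(p_fake=0.8))
```

Training alternates between lowering each loss in turn, which is what drives both players to improve.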
GANs have made steady progress in unconditional image generation, image-to-image translation, and video-to-video synthesis. Here’s an example of a GAN generating Homer Simpson. GANs have not made as much progress in audio modelling. Many researchers have attempted to train GANs to generate speech waveforms, but these models haven’t been able to perform nearly as well as the best existing alternatives.
MelGAN is first, faster, smaller, still effective
- First: To the best of our knowledge, MelGAN is the first work that successfully trains GANs to convert spectrograms to raw audio without additional distillation or perceptual loss functions while still yielding a high quality text-to-speech model.
- Faster: MelGAN is 10 times faster than the fastest available spectrogram inversion model to date when compared on similar hardware.
- Smaller: Since MelGAN has far fewer parameters than competing models, it is one of the few models that can afford to run real-time processing on a CPU.
- Effective: We achieve this speed without considerable degradation in audio quality.
The Text-to-Speech Challenge: Model 16,000 Samples per Second
Impressive developments in text-to-speech
In recent years, the technology of generating natural sounding speech with machines, a process usually known as speech synthesis or text-to-speech (TTS), has evolved tremendously. Notably, in 2016, the quality of machine-generated audio improved significantly when researchers at DeepMind successfully applied deep neural networks to model raw audio waveforms directly. We have witnessed the emergence of a new wave of audio applications built on deep learning, such as voice assistants and our very own speech correction tool.
However, for real-time applications like live message readers or online speech translation, these TTS systems need to get even better. To reconstruct sufficient details in speech signals, TTS systems typically generate at a rate of at least 16,000 samples per second — a sample rate that has imposed extreme hardware constraints and prohibited many applications from being run on device.
MelGAN solves this hardware constraint. We built it with the goal of enabling raw audio waveform generation at high speed with minimal compute requirements, and it works. Without any specialized hardware for parallel processing, MelGAN is capable of generating speech samples faster than real time. The model achieves all of this while preserving very high output speech quality, and has the potential to extend to more structured data types, like music. We share examples of each in this post.
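To put "faster than real time" into numbers, the short calculation below divides the generation rate quoted at the top of this post (2,500,000 samples per second on a GPU) by an audio sample rate. The 16,000 Hz figure is the minimum rate mentioned above and is used here as an assumption; a model running at a higher sample rate would see a proportionally smaller factor.

```python
def real_time_factor(generated_samples_per_second: float, sample_rate_hz: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return generated_samples_per_second / sample_rate_hz


def seconds_to_generate(audio_seconds: float, generated_samples_per_second: float,
                        sample_rate_hz: float) -> float:
    """Compute time needed to synthesize a clip of the given duration."""
    return audio_seconds * sample_rate_hz / generated_samples_per_second


GENERATION_RATE = 2_500_000  # samples per second, as quoted in this post
SAMPLE_RATE_HZ = 16_000      # assumed audio sample rate

factor = real_time_factor(GENERATION_RATE, SAMPLE_RATE_HZ)
clip_time = seconds_to_generate(10.0, GENERATION_RATE, SAMPLE_RATE_HZ)
print(f"{factor:.2f}x real time")                     # 156.25x real time
print(f"10 s of audio takes about {clip_time:.3f} s")  # about 0.064 s
```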
How MelGAN works
Modelling raw audio in speech is a challenging problem not just because of the high temporal resolution of the data (at least 16,000 samples per second) — the model must also contend with the structures that exist at different timescales such as speech reverberation, pitch, and noise profile. Thus, instead of modelling the raw temporal audio directly based on text, many modern TTS systems (Char2Wav, Tacotron2) simplify the problem by breaking it into a two-step process:
1. Modelling a lower-resolution representation such as a mel-spectrogram sequence conditional on text
2. Modelling raw audio waveforms conditional on that mel-spectrogram sequence (or another intermediate representation)
The MelGAN generator is designed to enable an efficient audio sampling process in that second step of the TTS pipeline. Many state-of-the-art audio models are autoregressive in sampling: each audio sample has a direct causal dependency on a previous block of audio samples. This dependency gives the model the information it needs to produce a coherent output waveform, but it typically requires audio samples to be generated one at a time. As the WaveNet authors point out, “Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.”
With MelGAN, it’s no longer essential to create these samples one at a time. Instead of the direct causal dependency in WaveNet, the relationship between each sample pair in MelGAN is implicit. Because the generator uses a large number of dilated convolution blocks, even samples that are far apart in time share a significant number of overlapping input mel-spectrogram frames and hidden layer nodes, yet the samples do not depend directly on one another. As a result, audio sample generation with MelGAN is completely parallelizable.
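The overlap argument can be made concrete by computing the receptive field of a stack of dilated convolutions, that is, how many input positions feed into one output position. The kernel size and dilation pattern below are illustrative assumptions, not values stated in this post.

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of stacked stride-1 dilated 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


def shared_inputs(field: int, distance: int) -> int:
    """Number of input positions shared by two outputs `distance` steps apart."""
    return max(0, field - distance)


plain = receptive_field(kernel_size=3, dilations=[1, 1, 1])    # 7
dilated = receptive_field(kernel_size=3, dilations=[1, 3, 9])  # 27
print(plain, dilated)

# Two outputs 10 steps apart share no inputs in the plain stack,
# but share 17 input positions in the dilated one.
print(shared_inputs(plain, 10), shared_inputs(dilated, 10))
```

Since every output is a function of the inputs alone and never of another output, all positions can be computed at once, which is where the parallelism comes from.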
10x faster and small enough to run real-time processing on a CPU
The MelGAN generator is capable of sampling at a rate of 2,500 kHz (2.5 million samples per second) on a GTX 1080 Ti GPU in full precision, which is more than 10x faster than the fastest competing model on the same hardware and more than 100x faster than real time. Even more importantly, MelGAN is one of the few models, among those with comparable output quality, that can afford real-time processing on a typical CPU. The model is also significantly smaller than competing models in terms of number of parameters:
Despite having far fewer parameters (more than 5 million fewer than the next smallest model), MelGAN is still capable of yielding speech samples with very realistic timbre and a clean sound profile. Below you can find a few completely synthetic examples generated with a TTS system using MelGAN. We trained the whole pipeline with 15 hours of audio recordings from each speaker.
MelGAN’s three discriminators
In MelGAN, we use three discriminators operating at different time scales. One discriminator operates on the scale of raw audio, whereas the other two operate on raw audio downsampled by a factor of 2 and 4, respectively. This structure has an inductive bias: each discriminator learns features for a different frequency range of the audio. For example, the discriminator operating on audio downsampled by a factor of 4 does not have access to the high-frequency components, so it is biased to learn discriminative features based on low-frequency components only. This ensures that the error signals received by the generator from the discriminators are more balanced across multiple hierarchies of audio modelling.
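A minimal sketch of the three input scales, assuming the downsampling is done by averaging neighbouring samples (this post does not say which method is used) and a 16,000 Hz sample rate. It shows how each scale shortens the signal and lowers the highest frequency it can represent.

```python
import math


def average_pool(signal: list[float], factor: int) -> list[float]:
    """Downsample by averaging non-overlapping groups of `factor` samples."""
    usable = len(signal) - len(signal) % factor
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]


def nyquist_hz(sample_rate_hz: float, factor: int) -> float:
    """Highest frequency representable after downsampling by `factor`."""
    return sample_rate_hz / (2 * factor)


SAMPLE_RATE_HZ = 16_000  # assumed sample rate
# One second of a 440 Hz tone standing in for real speech.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ) for n in range(SAMPLE_RATE_HZ)]

scales = {factor: average_pool(tone, factor) for factor in (1, 2, 4)}
for factor, samples in scales.items():
    print(f"x{factor}: {len(samples)} samples, content up to {nyquist_hz(SAMPLE_RATE_HZ, factor):.0f} Hz")

# The fastest possible oscillation (alternating +1, -1) vanishes after pooling by 2,
# so a discriminator fed this scale never sees that high-frequency detail.
fastest_oscillation = [1.0, -1.0] * 8
print(average_pool(fastest_oscillation, 2))
```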
Each individual discriminator is a window-based discriminator. This is analogous to recognizing a picture by image patches. While a standard GAN discriminator learns to classify between distributions of entire audio sequences, a window-based discriminator learns to classify between distributions of small audio chunks. We chose window-based discriminators since they have been shown to capture essential high frequency structures, run faster, and can be applied to variable-length audio sequences.
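The window-based idea can be sketched as follows: rather than one verdict for the whole clip, the discriminator emits one score per short chunk, so the number of scores simply grows with the length of the audio. The scoring function here is a hypothetical stand-in (mean energy) for a learned network, and the window and hop sizes are arbitrary.

```python
def split_into_windows(signal: list[float], window_size: int, hop: int) -> list[list[float]]:
    """Cut a signal into overlapping chunks of `window_size` samples, `hop` samples apart."""
    last_start = len(signal) - window_size
    return [signal[start:start + window_size] for start in range(0, last_start + 1, hop)]


def toy_window_score(window: list[float]) -> float:
    """Hypothetical stand-in for a learned discriminator: mean energy of the chunk."""
    return sum(x * x for x in window) / len(window)


def window_scores(signal: list[float], window_size: int = 4, hop: int = 2) -> list[float]:
    """One score per chunk instead of one score per clip."""
    return [toy_window_score(w) for w in split_into_windows(signal, window_size, hop)]


short_clip = [0.1 * n for n in range(10)]
long_clip = [0.1 * n for n in range(16)]
print(len(window_scores(short_clip)))  # 4 scores
print(len(window_scores(long_clip)))   # 7 scores
```

Because the same scoring function is reused on every chunk, nothing in it depends on the total clip length, which is why the approach extends to variable-length audio.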
Easily adapted to music translation and beyond
Besides modelling speech waveforms, we’re excited to share that the MelGAN architecture can be easily adapted for musical applications such as translating musical instrument performance from one genre to another.
To create the samples below, we trained a MelGAN model with audio recordings from violin accompanied musical performances. We then encoded audio recordings of a cello performance using the music encoder from Universal Music Translation Network. When the encoding is decoded back with the MelGAN generator, we can easily pick up the change in instrument sound characteristics:
This is certainly a cool application in itself, but what’s even more exciting is that this takes us one step closer to AI-assisted music composing.
Open source code
To contribute further to the growth of deep learning research in the audio domain, we’ve decided to open source the MelGAN model. We encourage interested machine learning developers and researchers to check out our code base and we’re excited to see what people will build with it!
MelGAN was born out of a research project at Lyrebird led by Kundan Kumar and Rithesh Kumar, with contributions from Thibault de Boissière, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo and Alexandre de Brébisson. This work was done in collaboration with Aaron Courville and Yoshua Bengio, professors in deep learning at Mila.