Ultra Fast Audio Synthesis with MelGAN

In this post, we introduce MelGAN, a new generative model of raw audio waveforms created by the Lyrebird team. It is capable of generating natural-sounding speech at a rate of more than 2,500,000 audio samples per second: more than 100x faster than real time, and 10x faster than alternative methods on similar hardware.

We believe that MelGAN paves the way for taking many real-time speech applications onto smaller devices. Imagine, for example, in the not-too-distant future, having real-time text-to-speech translation on your mobile device without an internet connection. And its application to music translation brings us one step closer to AI-assisted music composing.

We’ve open sourced MelGAN and we encourage interested machine learning developers and researchers to check out our code base.

MelGAN in Sum

What is MelGAN?

MelGAN is a non-autoregressive, feed-forward convolutional neural network architecture that performs audio waveform generation in a GAN setup. It's the first of its kind for spectrogram inversion, and it's a lot faster and a lot smaller without being any less effective.

Mel stands for mel spectrogram, a way of visualizing sound as a spectrogram on the mel scale. The mel scale is a "perceptual" scale: it maps each tone's frequency in Hz to a pitch value, spaced so that equal distances between values match equal distances as they register to the human ear.
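As a concrete illustration, here is how one might compute a mel spectrogram in Python with the librosa library. This is a minimal sketch; the file name and the spectrogram parameters (n_fft, hop_length, n_mels) are illustrative, not the exact configuration MelGAN was trained with:

```python
import librosa
import numpy as np

# Load an audio file at a 22,050 Hz sample rate (librosa's default).
y, sr = librosa.load("speech.wav", sr=22050)

# Compute a mel spectrogram: an STFT followed by a mel filter bank.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)

# Convert power to decibels, which better matches perceived loudness.
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # (80, n_frames): 80 mel bands over time
```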

GAN stands for generative adversarial networks (GANs). A typical GAN model consists of a generator and a discriminator. The generator synthesizes fake data samples while the discriminator distinguishes between generated data and real data. It’s like a counterfeiter and a detective: the counterfeiter produces fake notes while the detective tries to detect them. As the discriminator is trained to minimize its errors in detecting fake samples, the generator is simultaneously trained to fool the discriminator with improved synthesized data quality. Both the counterfeiter and the detective improve at their tasks, and the fake data samples improve over time.
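This adversarial setup is straightforward to express in code. Below is a minimal PyTorch sketch of one training step using a hinge loss (the objective used in the MelGAN paper); the generator, discriminator, optimizers, and data shapes are all placeholders rather than MelGAN's actual architecture:

```python
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, cond):
    # Discriminator step: push scores for real audio up, fake audio down.
    fake = generator(cond).detach()  # detach: no generator gradients here
    d_loss = (F.relu(1.0 - discriminator(real)).mean()
              + F.relu(1.0 + discriminator(fake)).mean())
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: produce samples that fool the discriminator.
    g_loss = -discriminator(generator(cond)).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```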

GANs have made steady progress in unconditional image generation, image-to-image translation, and video-to-video synthesis. Here's an example of a GAN generating Homer Simpson. GANs have not made as much progress in audio modelling, however. Many researchers have attempted to train GANs to generate speech waveforms, but these models haven't been able to perform nearly as well as the best existing alternatives.

MelGAN is first, faster, smaller, still effective

  • First: To the best of our knowledge, MelGAN is the first model that successfully trains GANs to convert spectrograms to raw audio without additional distillation or perceptual loss functions, while still yielding a high-quality text-to-speech model.
  • Faster: On similar hardware, MelGAN is 10x faster than the fastest available spectrogram inversion model to date.
  • Smaller: With far fewer parameters than competing models, MelGAN is one of the few models that can afford real-time processing on a CPU.
  • Effective: We achieve this speed without considerable degradation in audio quality.

The Text-to-Speech Challenge: Model 16,000 Samples per Second

Impressive developments in text-to-speech

In recent years, the technology for generating natural-sounding speech with machines, a process usually known as speech synthesis or text-to-speech (TTS), has evolved tremendously. Notably, in 2016, the quality of machine-generated audio improved significantly when researchers at DeepMind successfully applied deep neural networks to model raw audio waveforms directly. Since then, we have witnessed the emergence of a new wave of audio applications built on deep learning, such as voice assistants and our very own speech correction tool.

However, for real-time applications like live message readers or online speech translation, these TTS systems need to get even better. To reconstruct sufficient detail in speech signals, TTS systems typically generate at a rate of at least 16,000 samples per second, a sample rate that has imposed extreme hardware constraints and prevented many applications from running on-device.

MelGAN solves this hardware constraint. We built it with the goal of enabling raw audio waveform generation at high speed with minimal compute requirements, and it works. Without any specialized hardware for parallel processing, MelGAN is capable of generating speech samples faster than real time. The model achieves all of this while preserving very high output speech quality, and has the potential to extend to more structured data types, like music. We share examples of each in this post.

How MelGAN works

Modelling raw audio in speech is a challenging problem not just because of the high temporal resolution of the data (at least 16,000 samples per second) — the model must also contend with the structures that exist at different timescales such as speech reverberation, pitch, and noise profile. Thus, instead of modelling the raw temporal audio directly based on text, many modern TTS systems (Char2Wav, Tacotron2) simplify the problem by breaking it into a two-step process:

1. Modelling a lower-resolution representation such as a mel-spectrogram sequence conditional on text

2. Modelling raw audio waveforms conditional on that mel-spectrogram sequence (or another intermediate representation)

The two-step TTS pipeline, with the MelGAN generator used to transform a mel-spectrogram into raw audio samples.
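In code, the two-step pipeline looks roughly like the sketch below. Here text_to_mel stands in for a first-stage model like Tacotron 2 or Char2Wav, and melgan for the MelGAN generator; both names and the tensor shapes are placeholders for illustration:

```python
import torch

def synthesize(text: str, text_to_mel, melgan) -> torch.Tensor:
    # Step 1: text -> mel spectrogram (e.g., Tacotron 2 or Char2Wav).
    mel = text_to_mel(text)          # shape: (1, n_mels, n_frames)

    # Step 2: mel spectrogram -> raw waveform (MelGAN generator).
    with torch.no_grad():            # inference only, no gradients needed
        audio = melgan(mel)          # shape: (1, 1, n_samples)
    return audio.squeeze()
```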

The MelGAN generator is designed to enable an efficient audio sampling process in that second level of the TTS pipeline. Many state-of-the-art audio models are autoregressive in sampling: each audio sample has a direct causal dependency on a previous block of audio samples. This process provides the information the model needs to produce a coherent output waveform, but it typically requires audio samples to be generated one at a time. As the WaveNet authors point out, "Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio."

With MelGAN, it's no longer essential to create these samples one at a time. Instead of the direct causal dependency in WaveNet, the relationship between samples in MelGAN is implicit. Thanks to a large number of dilated convolution blocks, even samples that are temporally far apart share a significant number of overlapping input mel-spectrogram frames and hidden-layer nodes, yet the samples do not depend directly on one another. As a result, audio sample generation with MelGAN is completely parallelizable.
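The difference is easy to see in pseudocode. An autoregressive model must loop sample by sample, while MelGAN emits the whole waveform in a single forward pass. Both model calls below are placeholders, and the tensor shapes are illustrative:

```python
import torch

def sample_autoregressive(model, cond, n_samples):
    # Each new sample depends on previously generated samples,
    # so generation is an inherently sequential loop.
    audio = torch.zeros(1, 1, 0)  # start from an empty waveform
    for _ in range(n_samples):
        next_sample = model(audio, cond)  # predict one sample: (1, 1, 1)
        audio = torch.cat([audio, next_sample], dim=-1)
    return audio

def sample_parallel(generator, mel):
    # No sample-to-sample dependency: the full waveform comes out of
    # one fully parallelizable forward pass over the mel-spectrogram.
    return generator(mel)
```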

The MelGAN Generator architecture is fully convolutional. At each layer of the model, an input sequence can be processed in parallel.
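Here is a minimal PyTorch sketch of the kind of building block such a fully convolutional generator can stack: a transposed convolution upsamples the sequence from the mel-frame rate toward the audio sample rate, and dilated residual convolutions widen the receptive field. The channel counts, kernel sizes, and dilation pattern are illustrative assumptions, not the exact MelGAN configuration:

```python
import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    """Dilated 1-D convolutions with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=dilation, padding=dilation),  # length-preserving
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)  # residual: output stays aligned with input

class UpsampleStage(nn.Module):
    """Upsample the sequence, then widen the receptive field."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        # With kernel_size = 2 * stride and padding = stride // 2, an even
        # stride upsamples the input length by exactly `stride`.
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2 * stride,
                                     stride=stride, padding=stride // 2)
        self.res = nn.Sequential(*[ResidualDilatedBlock(out_ch, 3 ** i)
                                   for i in range(3)])  # dilations 1, 3, 9

    def forward(self, x):
        return self.res(self.up(x))
```

Stacking a few such stages (for example with strides 8, 8, 2, 2) would take 80-channel mel frames up to the raw audio rate in a handful of layers; the exact schedule in our released code may differ.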

10x faster and small enough to run real-time processing on a CPU

The MelGAN generator is capable of sampling at 2,500 kHz (2.5 million samples per second) on a GTX 1080 Ti GPU in full precision, which is more than 10x faster than the fastest competing model on the same hardware, and 100x faster than real time. Even more importantly, among models with comparable output quality, MelGAN is one of the few that can afford real-time processing on a typical CPU. The model is also significantly smaller than competing models in terms of the number of parameters.
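The real-time claim follows from simple arithmetic. A quick back-of-the-envelope check, assuming output audio at 22,050 Hz (a common TTS sample rate; the post does not state the exact rate used):

```python
generation_rate = 2_500_000  # samples generated per second on a GTX 1080 Ti
sample_rate = 22_050         # assumed playback sample rate in Hz

real_time_factor = generation_rate / sample_rate
print(f"{real_time_factor:.0f}x faster than real time")  # ~113x
```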

Despite having far fewer parameters (more than 5 million fewer than the next smallest model), MelGAN is still capable of yielding speech samples with very realistic timbre and a clean sound profile. Below you can find a few completely synthetic examples generated with a TTS system using MelGAN. We trained the whole pipeline on 15 hours of audio recordings from each speaker.

MelGAN’s three discriminators

In MelGAN, we use three discriminators operating at different time scales. One discriminator operates on the raw audio, whereas the other two operate on the raw audio downsampled by factors of 2 and 4, respectively. This structure has an inductive bias that each discriminator learns features for a different frequency range of the audio. For example, the discriminator operating on the 4x-downsampled audio does not have access to the high-frequency components, so it is biased to learn discriminative features based on low-frequency components only. This ensures that the error signals the generator receives from the discriminators are balanced across multiple hierarchies of audio modelling.
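A minimal sketch of this multi-scale setup: the same waveform is scored at full resolution, then after one and two rounds of strided average pooling. The pooling parameters and the discriminator factory are assumptions for illustration:

```python
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, make_discriminator):
        super().__init__()
        # Three identical discriminators, each seeing a different scale.
        self.discriminators = nn.ModuleList(
            [make_discriminator() for _ in range(3)])
        # Strided average pooling halves the temporal resolution, removing
        # the highest frequencies from the 2x- and 4x-downsampled copies.
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, audio):          # audio: (batch, 1, n_samples)
        scores = []
        x = audio
        for d in self.discriminators:
            scores.append(d(x))        # raw, then /2, then /4 audio
            x = self.pool(x)
        return scores
```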

The MelGAN discriminator architecture has 3 discriminator blocks that operate on different time scales to capture discriminative features across different levels. Each block is fully convolutional to ensure the training operations are parallelizable across an audio sequence.

Each individual discriminator is a window-based discriminator. This is analogous to recognizing a picture by image patches. While a standard GAN discriminator learns to classify between distributions of entire audio sequences, a window-based discriminator learns to classify between distributions of small audio chunks. We chose window-based discriminators because they have been shown to capture essential high-frequency structure, they run faster, and they can be applied to variable-length audio sequences.
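"Window-based" here means the discriminator is fully convolutional and outputs one real/fake score per overlapping chunk of audio rather than a single score per sequence, much like a PatchGAN does for image patches. A minimal sketch, with illustrative channel counts, kernel sizes, and strides:

```python
import torch.nn as nn

class WindowDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Strided convolutions: each output unit sees only a finite
        # window of the input waveform.
        self.layers = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=1, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(16, 64, kernel_size=41, stride=4, padding=20, groups=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 256, kernel_size=41, stride=4, padding=20, groups=16),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, 1, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, audio):          # audio: (batch, 1, n_samples)
        # One score per window rather than per sequence, so the same
        # discriminator works on variable-length inputs.
        return self.layers(audio)      # (batch, 1, n_windows)
```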

Easily adapted to music translation and beyond

Besides modelling speech waveforms, we're excited to share that the MelGAN architecture can be easily adapted for musical applications, such as translating a musical performance from one instrument to another.

To create the samples below, we trained a MelGAN model with audio recordings from violin-accompanied musical performances. We then encoded audio recordings of a cello performance using the music encoder from the Universal Music Translation Network. When the encoding is decoded back with the MelGAN generator, we can easily pick up the change in instrument sound characteristics:

This is certainly a cool application in itself, but what’s even more exciting is that this takes us one step closer to AI-assisted music composing.

Open source code

To contribute further to the growth of deep learning research in the audio domain, we've decided to open source the MelGAN model. We encourage interested machine learning developers and researchers to check out our code base, and we're excited to see what people will build with it!

Acknowledgement

MelGAN was born out of a research project at Lyrebird led by Kundan Kumar and Rithesh Kumar, with contributions from Thibault de Boissière, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo and Alexandre de Brébisson. This work was done in collaboration with Aaron Courville and Yoshua Bengio, professors in deep learning at Mila.
