February 17, 2021

How Descript Uses AI To Match Room Tone For Seamless Edits

Chris Zaldúa

Few recordings of human speech are captured “perfectly”: inside an acoustically treated studio, with a high-quality microphone, and without external noise or interference. Most are imperfect, with varying degrees of background noise and unwanted sound, such as an air conditioner humming in the background or line-level fuzz. This ambient sound is known as room tone, and dealing with it when editing audio can be challenging and costly, requiring the use of dedicated plugins and software.

Today, Descript is proud to announce the release of a feature we’re simply calling Room Tone. It uses AI to automatically detect the room tone in any given recording of speech, then generates matching tone whenever you change the timing of that recording. Need to add a gap clip between two sentences or paper over an edit without jarring dead air? Room Tone allows you to do that without a second thought (or even a second click).

Listen for yourself. This is a recording of Descript CEO Andrew Mason, edited with a gap clip of several seconds to adjust the timing.

The edit is dead silent — and painfully obvious. Now listen to this version, with Room Tone enabled.

That sounds much better!

And the more we started playing around with Room Tone, the more we realized it could do. Here’s a completely unedited clip of David Attenborough narrating as a pride of lions prepares to go on the hunt — followed by the exact same clip, after some editing and processing with Room Tone. Watch (and listen) for yourself!

We think Room Tone is pretty cool, and wanted to share with you some of the behind-the-scenes details that made it possible. Interested in working with a dedicated team to solve problems like this yourself? We’re hiring.

The inspiration

Our inspiration for Room Tone came from a particularly elegant research paper, which is a regular occurrence at Descript. Titled “Sound Texture Perception via Statistics of the Auditory Periphery: Evidence From Sound Synthesis” (Neuron 71, 926-940, 2011), it was authored by Josh H. McDermott and Eero P. Simoncelli. “It’s a beautiful paper. My favorite research paper of all time, actually,” says Prem Seetharaman, Research Scientist at Descript, who designed our Room Tone feature.

In their paper, McDermott and Simoncelli present a statistical model for interpreting “sound textures,” which they define as “the collective result of many similar acoustic events”: rainstorms, insect swarms, or galloping horses, for example.

With these statistical models in hand, McDermott and Simoncelli generated synthetic sound textures by contouring white noise frequencies to match the models of natural sound textures. The contoured results were practically indistinguishable to the human ear from the natural phenomena they were modeled on, indicating that “sound texture perception is mediated by relatively simple statistics of early auditory representations,” which the brain then interprets as a particular phenomenon.

By shaping white noise frequencies like sculptors carving marble, Descript makes room tone one less thing to think about when editing audio.

In other words, you can think of McDermott and Simoncelli as sculptors working with a block of marble, except their raw material is white noise. They chip away at the block (or the white noise frequencies) until the recognizable structure they want emerges.

The idea

When conceiving the Room Tone feature, we hypothesized that room tone in audio recordings could be considered a “sound texture” analogous to those described in McDermott and Simoncelli’s research, like crackling fire or buzzing bees.

McDermott and Simoncelli’s results indicate that auditory statistics determine how we perceive sound textures. So if we could build a statistical model of the room tone in a given recording, we could filter white noise according to that model to create a synthetic sound texture indistinguishable from the natural room tone, even if the waveforms were different.
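To make the idea of filtering white noise according to a model more concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not Descript's implementation: the "model" here is simply the average magnitude spectrum of a room tone excerpt, and the function names `estimate_room_tone_model` and `synthesize_room_tone`, along with the `frame_len` and `seed` parameters, are hypothetical.

```python
import numpy as np


def estimate_room_tone_model(room_tone, frame_len=1024):
    """Average magnitude spectrum over non-overlapping, Hann-windowed frames.

    Assumption: the average spectrum stands in for the statistical model.
    """
    samples = np.asarray(room_tone, dtype=float)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        raise ValueError("room tone excerpt is shorter than one frame")
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)
    magnitudes = np.abs(np.fft.rfft(frames * window, axis=1))
    return magnitudes.mean(axis=0)


def synthesize_room_tone(model, n_samples, seed=0):
    """Shape white noise so that its spectrum follows the model."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(noise)
    # The model has frame resolution; stretch it to the resolution of the noise.
    model_axis = np.linspace(0.0, 1.0, len(model))
    noise_axis = np.linspace(0.0, 1.0, len(spectrum))
    gain = np.interp(noise_axis, model_axis, model)
    # Scaling each frequency bin is the "sculpting" step. The noise keeps its
    # own random phases, so the waveform differs from the original recording.
    # The absolute level is not calibrated in this sketch.
    return np.fft.irfft(spectrum * gain, n=n_samples)
```

Calling `synthesize_room_tone(estimate_room_tone_model(excerpt), n_samples)` returns `n_samples` of noise whose frequency balance follows the excerpt. A fuller model in the spirit of the paper would capture more statistics than the average spectrum alone.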


With this hypothesis in hand, we conducted further research and prototyped several models before settling on one. Given a transcribed audio asset, we first automatically identify sections where only room tone is present. We then analyze those sections to create a room tone model. Finally, we shape white noise according to the first-order statistics of that model, and the result sounds perceptually identical to the room tone in the original recording.
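The first step above, finding sections where only room tone is present, can be sketched using transcript timings. The article does not say how Descript detects these sections, so everything here is an assumption: the `Word` type, the `find_room_tone_spans` function, and the `min_gap` and `margin` thresholds are hypothetical, and a stretch with no transcribed words may still contain breaths or other non-speech sounds.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def find_room_tone_spans(words, total_duration, min_gap=0.5, margin=0.1):
    """Return (start, end) spans, in seconds, that contain no transcribed word.

    Each span is shrunk by `margin` on both sides to stay clear of word
    onsets and tails, and must still be at least `min_gap` long.
    """
    spans = []
    cursor = 0.0  # end of the latest word seen so far
    for word in sorted(words, key=lambda w: w.start):
        if word.start - cursor >= min_gap + 2 * margin:
            spans.append((cursor + margin, word.start - margin))
        cursor = max(cursor, word.end)
    # Check the tail of the recording after the last word.
    if total_duration - cursor >= min_gap + 2 * margin:
        spans.append((cursor + margin, total_duration - margin))
    return spans
```

The returned spans could then be cut from the recording and passed to whatever analysis builds the room tone model.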

This all happens on the back end, calculated whenever a user uploads an audio recording into a Descript Project.

The implementation

After developing an effective model, we worked collaboratively to build room tone generation into the Descript app and surface the functionality to users.

Engineers from the editor and backend teams joined researchers and designers to consider how best to allow users to access Room Tone. Iterating quickly, they landed on a simple and elegant solution: It should trigger automatically whenever a gap clip is inserted into a Descript Project.

Whenever we build products, our goal is to make our users’ lives easier and their media editing workflows simpler — and even fun. Accordingly, user feedback is intrinsic to our development process.

After an early release of the feature to our beta user group, users told us that the gain of the generated room tone was an issue. We ran further experiments based on this feedback and adjusted the algorithm to fix it. The result is a seamless, friction-free experience, which means room tone is one less thing you need to think about when editing media in Descript.
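The article does not describe what the gain adjustment was. As one plausible illustration only, a common way to bring generated noise to the level of a reference is to match their root-mean-square (RMS) amplitudes. The sketch below assumes that approach; `rms` and `match_gain` are hypothetical helper names, not part of Descript.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_gain(generated, reference):
    """Scale `generated` so that its RMS level equals that of `reference`.

    Assumption: RMS matching is a stand-in for whatever adjustment was
    actually made; the article only says that the gain was fixed.
    """
    generated_level = rms(generated)
    if generated_level == 0.0:
        return list(generated)  # nothing to scale
    scale = rms(reference) / generated_level
    return [s * scale for s in generated]
```

Here `reference` would be a stretch of the recording's real room tone and `generated` the synthetic tone destined for a gap clip.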

Beginning today, you can try Room Tone yourself. To confirm the feature is enabled, click the drop-down arrow next to the name of your Project in the Project Browser and choose Project Settings.

To add a gap clip, simply click and drag the wordbar in the timeline wherever you’d like to adjust timing. Room Tone will take care of the rest, so you can say goodbye to dead-silent edits.


Want to help us solve problems like this one?

If you’re interested in creating elegant, simple solutions to tricky problems like this one, we want to work with you! Descript just closed a Series B round and we’re growing our team, looking for smart, creative people to help us make powerful media editing tools accessible to everyone.

We hope you enjoyed this glimpse behind the scenes — go download Descript if you haven’t already and give the Room Tone feature a try for yourself!

Chris Zaldúa
Former marketing writer at Descript. Covers interesting customer stories, product releases, and new ways to utilize Descript to create podcast and video content.