In recent years there’s been a surge in automatic speech-to-text solutions, most of them offered by tech giants like Google, Amazon, and Microsoft. But which is the best? We couldn’t find many good comparisons, and we needed to do one for ourselves, so we decided to share it.
We have good reason to do this: Descript is API-agnostic, which means we can work with whichever ASR provider gives us and our customers the best combination of speed, accuracy, and affordability. To date that mantle has been held by Google Speech’s ASR engine, which provides remarkable accuracy across a wide range of audio scenarios (this is important, for reasons we’ll get into shortly).
In other words, we like to keep tabs on which service is the best, so that we can give our customers the best service.
To that end, here’s how our test was run, what you should know about it — and the results. The industry is moving quickly, so we’ll continue to test these services (and any new ones that join the fray) in the future.
As we mention in our article dedicated to the challenges of testing speech recognition accuracy, speech models are typically measured using audio that’s not representative of how our customers use Descript. So for our test, we pulled samples of audio from YouTube videos that reflect the type of audio transcribed by our customers — primarily:
We also sought to ensure each category of audio had samples with a variety of different speakers (including some with accents) and topics discussed.
It is perhaps worth noting that the process of sample selection included decision-making at several stages, which inevitably introduces bias. That said, Descript doesn’t have much to gain from running a test that yields inaccurate results (after all, we want to figure out which ASR engine is the best so that we can use it!). Our main bias is probably toward audio that is reasonably intelligible to our ears, but we’ve included plenty of samples recorded in sub-optimal conditions.
We’ve embedded the individual audio files and results in this post so you can judge how well our findings apply to your own audio.
Once we compiled our corpus of test audio — which totaled about 50 clips, 3–5 minutes of footage apiece — we had each one professionally transcribed; we then conducted a follow-up QA pass to verify accuracy. These professional transcriptions serve as the “Ground Truth”, or the reference transcripts that the automated services are tested against.
Next, we ran each audio sample through each ASR engine using their respective default parameters. Some ASR engines support customization for specific audio scenarios (if you always know you’re dealing with phone calls, you can optimize for that). But Descript users are working with all kinds of audio, so we aim for the most generally-accurate ASR engine.
Finally, we applied the accuracy test widely used by the ASR field, Word Error Rate, to compare the accuracy of the automatically-generated transcript to its professionally-transcribed counterpart. This included a normalization process (typical for this sort of testing) that allowed for certain stylistic discrepancies between the ASR-generated and ‘ground truth’ transcripts — e.g. telling the test that “will not” and “won’t” are functionally equivalent, and that 3 and “three” mean the same thing.
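To make the metric concrete, here is a minimal sketch of a Word Error Rate calculation with a normalization step. It is an illustration, not the code used for this test: the function names and the tiny `CONTRACTIONS` and `NUMBERS` tables are hypothetical stand-ins for the much larger rule sets a real evaluation would need. WER is the word-level edit distance (substitutions, deletions, and insertions) between the ASR output and the reference, divided by the number of words in the reference.

```python
import re

# Hypothetical, deliberately tiny normalization tables (assumptions for
# illustration only; a real test would cover far more cases).
CONTRACTIONS = {"won't": "will not", "can't": "cannot"}
NUMBERS = {"1": "one", "2": "two", "3": "three"}


def normalize(text):
    """Lowercase, strip punctuation, and expand contractions and digits."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    words = []
    for word in re.findall(r"[a-z0-9']+", text):
        word = CONTRACTIONS.get(word, word)
        word = NUMBERS.get(word, word)
        words.extend(word.split())  # "will not" becomes two words
    return words


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript contains no words")

    # Standard dynamic-programming edit distance, computed over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    ground_truth = "I won\u2019t buy three apples"
    asr_output = "i will not buy 3 apples"
    print(word_error_rate(ground_truth, asr_output))
```

With these rules, the sample pair above is scored as a perfect match, because the stylistic differences are normalized away before the words are compared. Note that WER can exceed 100% when the ASR output contains many inserted words.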
With an average WER of 16%, Google Speech (Video) is the most accurate ASR engine in our testing. For many audio samples, Google’s engine scored a WER well under 10% — as low as 2% for some high-quality audiobook samples.
Also, congratulations to Google on securing… Last place?
Let’s dig into this a little. Google offers several versions of its ASR engine, including one tailored for phone calls and another for brief commands; we tested the two best suited for broad use-cases: Google’s Standard and a newer Video model. Remarkably, in our testing these represent both the worst-performing and best-performing ASR engines.
Google says the Standard model is best for general use, like single-speaker long-form audio; Video is supposed to be better at transcribing multiple speakers (and videos). But in our testing we found that Video is much better at everything — even audiobook samples with a single speaker (in theory, the sort of content Standard should excel at).
This isn’t entirely surprising: the Video model is newer, and Google charges twice as much for it. But it goes to show just how quickly the field is moving.
Another intriguing result: second place in our test goes to Temi, the automated transcription engine made by Rev — a company best known for its human transcription services. This is a laudable achievement, given that Rev is a much smaller company than competitors like Google, Microsoft, Amazon, and IBM. And it’s a positive sign that ASR won’t necessarily be dominated by the tech giants, despite their expansive datasets and resources.
There’s only a 2% difference in average WER between Google Speech (Video) and Temi, which may not be statistically significant, so we took a closer look at how the two compare. Among our findings:
Some additional trends we found, overall:
With all of that said, congratulations to Google Speech on scoring top marks in our tests. We’ve been using the Google Speech (Video) model to power Descript since we launched last December (with great feedback, mostly), and we plan on sticking with it.
If you’re interested in seeing the actual data we used to conduct these tests, or in filtering down to the results for a particular ASR engine or the audio that’s most similar to yours, the embedded Airtable below has everything you need.
We’ve mentioned most of these already, but for the sake of clarity:
Tim Bunce has written a series of posts about his own testing of ASR engines. His approach differs in some important ways and his insights are interesting.
Also see our other posts about automatic speech recognition, including:
Special thanks to Arlo Faria and the team at Remeeting. Arlo was kind enough to give us substantial guidance on running our tests.
Also many thanks to Scott Stephenson of Deepgram for sharing his insights.
Have thoughts or feedback regarding our tests? Please leave a note in the comments or send us an email.