Right before jumping on the phone Friday afternoon, Andrew Mason, who runs a walking tour startup called Detour and previously ran Groupon, was hand-correcting a transcription of a speech by John F. Kennedy, one produced by new software he and his team built in-house.
But Descript, Mason’s new startup that’s been spun out of Detour, isn’t designed to just transcribe audio (even bad audio, like a recording of JFK’s speech). Instead, the goal for Descript is to take that transcription, put it into a Word document, and allow an editor or producer to edit the sound file in much the same way a writer would edit a Word document. When you cut out a word in the transcription, it cuts it out in the sound file. And if all goes well, when you add a word, it’ll end up in the sound file, too. To do all this, Mason and his team have raised $5 million in funding from Andreessen Horowitz to start it off on its own.
“We see ourselves as partly pressing the reset button on how media gets produced to enable a new era of AI-driven media production, where AI is kind of a companion in the process,” Mason said. “By having that coupling of that two forms of information, it lets you do natural language processing and understand the intent of the audio, which just opens up all kinds of possibilities when you think of AI-driven media synthesis. Imagine underscoring something with music generated by an AI. All that stuff is coming, and we see Descript as the foundation for it.”
The Descript editor is a pretty straightforward product: it’s a Word document that corresponds to a sound file. Rather than making users dive into software designed for editing sound products like podcasts, Descript aims to build a simple what-you-see-is-what-you-get interface like the one you would expect when you pop open Google Docs or something similar. It’s designed to be simple by mimicking a text document, which makes sense, given that decades of refinement, development, and testing landed us with a blank document in a browser for all writing purposes.
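To make the idea of a document that corresponds to a sound file more concrete, here is a minimal sketch of how cutting a word from a transcript could cut the matching stretch of audio. It is an illustration, not Descript’s actual implementation: the `Word` class, the `cut_words` function, and the assumption that each transcribed word carries start and end timestamps are all hypothetical, and the audio is represented as a plain list of samples.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its (assumed) position in the audio."""
    text: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def cut_words(samples, sample_rate, words, remove):
    """Remove the words at the indices in `remove` from both text and audio.

    Assumes `words` is sorted by time and the words do not overlap.
    Returns (new_samples, new_words); timestamps of the surviving words
    are shifted earlier by the total duration removed before them.
    """
    new_samples = []
    new_words = []
    cursor = 0            # index of the first sample not yet copied
    removed_time = 0.0    # seconds of audio cut so far

    for i, word in enumerate(words):
        if i in remove:
            first = round(word.start * sample_rate)
            last = round(word.end * sample_rate)
            new_samples.extend(samples[cursor:first])  # keep audio before the cut
            cursor = last                              # skip the cut word's audio
            removed_time += word.end - word.start
        else:
            new_words.append(
                Word(word.text, word.start - removed_time, word.end - removed_time)
            )

    new_samples.extend(samples[cursor:])  # keep whatever follows the last cut
    return new_samples, new_words
```

In this sketch, deleting an “um” from the text would call `cut_words` with that word’s index, and the returned samples would no longer contain it. Adding a word, which the article describes as the harder goal, is not covered here.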
Descript’s origins are within Detour: session recordings were short, but editing could take hours or even days to end up with a high-quality product for Detour. And that’s also assuming they didn’t have to bring someone back into a recording studio. Instead of finding ways to cut and copy sound files, Descript was designed for those annoying little changes you might need to make something sound cleaner. It’s priced similarly to some transcription services today on a per-minute basis, charging 7 cents per minute (or 99 cents per minute to have someone deal with it by hand).
“The word processor is the ultimate craftsman tool, you learn it early on and you’re done,” Mason said. “It’s not that way if you’re on audio or video. You’re on a constant journey of keeping up with technology. If you’re writing an article and there’s a sentence you don’t like you rewrite it, you don’t think twice about it.”
Descript, too, should be an easier sell as a product, or even a business. Rather than convincing someone to literally take a detour, Mason and his team just have to walk into a producer’s office and offer a quick demo. Should it work on the spot, the implications of technology like that are pretty clear, whether they work with podcasts or radio or any other kind of spoken media. And there are plenty of implications that could come down the line, too, like voice acting. There are some other interesting projects in the area around voice mimicking, like Lyrebird, though the story hasn’t fully played out here just yet.
Though it’s geared toward publishers and other media organizations, the natural endpoint of a product like Descript seems to be one where you could write up a document and have it end up in someone’s voice. And as this technology continues to improve, there will certainly be challenges in ensuring that people aren’t using it for malicious purposes (though Mason says that won’t happen through Descript). In the end, though, it’s not unlike previous major shifts in the way media is produced and edited.
“We’re quickly heading toward a future where audio and video content, their credibility comes down to the source in the same way that it is for photos and print,” Mason said. “It’s been that way for print for a very long time, it’s been that way for photos for the last 10 to 20 years. It’ll soon be that way for audio and video, and just as society did before it’ll once again recalibrate around how to verify what’s real. This use case is really for people to produce their own content. There are controls we can put in place to do that.”