OpenAI upgrades its transcription and voice-generating AI models | TechCrunch

OpenAI is bringing new transcription and voice-generating AI models to its API that the company claims improve upon its previous releases.

For OpenAI, the models fit into its broader “agentic” vision: building automated systems that can independently accomplish tasks on behalf of users. The definition of “agent” might be in dispute, but OpenAI Head of Product Olivier Godement described one interpretation as a chatbot that can speak with a business’s customers.

“We’re going to see more and more agents pop up in the coming months,” Godement told TechCrunch during a briefing. “And so the general theme is helping customers and developers leverage agents that are useful, available, and accurate.”

OpenAI claims that its new text-to-speech model, “gpt-4o-mini-tts,” not only delivers more nuanced and realistic-sounding speech but is also more “steerable” than its previous-gen speech-synthesizing models. Developers can tell gpt-4o-mini-tts how to say things in natural language, for example, “speak like a mad scientist” or “use a serene voice, like a mindfulness teacher.”
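
In practice, that steering happens through the speech endpoint of OpenAI’s API. Here is a minimal sketch using the standard OpenAI Python SDK and its “instructions” parameter; the voice name, prompt text, and output filename are illustrative, not from the article:

```python
# Sketch: steering gpt-4o-mini-tts with a natural-language instruction.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the API's built-in voices
    input="The results are in, and they are... alive!",
    instructions="Speak like a mad scientist: manic, fast, and gleeful.",
) as response:
    # Stream the synthesized audio straight to disk.
    response.stream_to_file("mad_scientist.mp3")
```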

Here’s a “true crime-style,” weathered voice: [audio sample embedded in the original article]

And here’s a sample of a female “professional” voice: [audio sample embedded in the original article]

Jeff Harris, a member of the product staff at OpenAI, told TechCrunch that the goal is to let developers tailor both the voice “experience” and “context.”

“In different contexts, you don’t just want a flat, monotonous voice,” Harris said. “If you’re in a customer support experience and you want the voice to be apologetic because it’s made a mistake, you can actually have the voice have that emotion in it … Our big belief, here, is that developers and users want to really control not just what is spoken, but how things are spoken.”

As for OpenAI’s new speech-to-text models, “gpt-4o-transcribe” and “gpt-4o-mini-transcribe,” they effectively replace the company’s long-in-the-tooth Whisper transcription model. Trained on “diverse, high-quality audio datasets,” the new models can better capture accented and varied speech, OpenAI claims, even in chaotic environments.
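
Because the new models sit behind the same transcriptions endpoint Whisper uses, migrating is largely a one-line model-name change in the SDK. A minimal sketch, assuming the standard OpenAI Python SDK (the audio filename is illustrative):

```python
# Sketch: transcribing an audio file with the new speech-to-text models.
from openai import OpenAI

client = OpenAI()

with open("support_call.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" / "whisper-1"
        file=audio_file,
    )

print(transcript.text)  # the transcribed text
```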

They’re also less likely to hallucinate, Harris added. Whisper notoriously tended to fabricate words, and even whole passages, in conversations, introducing everything from racial commentary to imagined medical treatments into transcripts.

“[T]hese models are much improved versus Whisper on that front,” Harris said. “Making sure the models are accurate is completely essential to getting a reliable voice experience, and accurate [in this context] means that the models are hearing the words precisely [and] aren’t filling in details that they didn’t hear.”

Your mileage may vary depending on the language being transcribed, however.

According to OpenAI’s internal benchmarks, gpt-4o-transcribe, the more accurate of the two transcription models, has a “word error rate” approaching 30% for Indic and Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada. That means roughly three out of every 10 words from the model will differ from a human transcription in these languages.
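
Word error rate is a standard speech-recognition metric, not something specific to OpenAI: it’s the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the human reference transcript, which is why it can even exceed 100% when a model inserts extra words. A quick sketch of the computation, with invented example sentences, shows how 30% translates to three mismatches per 10 reference words:

```python
# Word error rate (WER): word-level edit distance divided by the number
# of words in the reference transcript. Example strings are made up.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions and one deletion against a 10-word reference: WER = 0.3
print(wer("the cat sat on the mat today at noon here",
          "the cat sat in the mat today at dawn"))
```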

The results from OpenAI’s transcription benchmarking. Image Credits: OpenAI

In a break from tradition, OpenAI doesn’t plan to make its new transcription models openly available. The company has historically released new versions of Whisper for commercial use under an MIT license.

Harris said that gpt-4o-transcribe and gpt-4o-mini-transcribe are “much bigger than Whisper” and thus not good candidates for an open release.

“[T]hey’re not the kind of model that you can just run locally on your laptop, like Whisper,” he continued. “[W]e want to make sure that if we’re releasing things in open source, we’re doing it thoughtfully, and we have a model that’s really honed for that specific need. And we think that end-user devices are one of the most interesting cases for open-source models.”

Updated March 20, 2025, 11:54 a.m. PT to clarify the language around word error rate and update the benchmark results chart with a more recent version.
