ElevenLabs is launching its own speech-to-text model | TechCrunch

ElevenLabs, an AI startup that just raised a $180 million mega funding round, has been known primarily for its audio generation prowess. The company has now taken a step in another technological direction by launching its first standalone speech-to-text model, called Scribe.

The startup, valued at $3.3 billion, has helped many other companies provide text-to-speech services through its vast library of voices. However, the company is now looking to get into speech detection and compete with the likes of Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models.

ElevenLabs’ Scribe model supports over 99 languages at launch. The company places over 25 languages in the model's excellent accuracy category, where the word error rate is less than 5%. This list includes English (claimed accuracy rate of 97%), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. Other languages are ranked in different categories: high (5 to 10% word error rate), good (10 to 20% word error rate), and moderate (25 to 50% word error rate).
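For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between a reference transcript and the model's output (substitutions, deletions, and insertions), divided by the number of words in the reference. The Python sketch below is illustrative only. The function names are hypothetical, the tier boundaries simply mirror the ranges quoted above, and nothing here is ElevenLabs' actual evaluation method.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a WER (as a fraction) to the tiers quoted in the article."""
    if wer < 0.05:
        return "excellent"
    if wer <= 0.10:
        return "high"
    if wer <= 0.20:
        return "good"
    if 0.25 <= wer <= 0.50:
        return "moderate"
    # The quoted ranges leave 20-25% and anything above 50% unassigned.
    return "unclassified"
```

Real evaluations usually normalize casing and punctuation before scoring, which this sketch skips.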

The company said that the model outperformed Google Gemini 2.0 Flash and Whisper Large V3 across multiple languages in FLEURS and Common Voice benchmark tests.

ElevenLabs had developed the speech-to-text component for its AI conversational agent platform, which was launched last year. However, this is the first time the company is releasing a standalone speech detection model. In a conversation with TechCrunch last month, CEO Mati Staniszewski talked about improving speech detection models.

“We want to understand what’s being said by you in a conversation better. We are working on ways to move away from only generating content and understanding and transcribing speech,” Staniszewski said at the time. “Many people say that speech-to-text is a solved problem. But for many languages, it is pretty bad. We think we can build better speech detection models because we have in-house teams to annotate data and give us quick feedback.”

The model also has good speaker diarization to tell you who’s speaking, word-level timestamps for accurate subtitles, and auto-tagging of sound events such as audience laughter. The startup is also providing a way for customers to directly transcribe video content to add subtitles or captions in its studio.
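To illustrate why word-level timestamps matter for subtitles, here is a minimal Python sketch that groups timed words into SubRip (SRT) cues. The input shape (a list of dicts with `text`, `start`, and `end` in seconds) is an assumption made for this example, not Scribe's documented response format, and the function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group timed words into fixed-size cues and render them as SRT text."""
    cues = []
    for index, first in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[first:first + max_words]
        text = " ".join(w["text"] for w in chunk)
        # Each cue spans from its first word's start to its last word's end.
        span = f"{srt_timestamp(chunk[0]['start'])} --> {srt_timestamp(chunk[-1]['end'])}"
        cues.append(f"{index}\n{span}\n{text}")
    return ("\n\n".join(cues) + "\n") if cues else ""


# Hypothetical transcript fragment for demonstration.
example_words = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 1.0},
]
print(words_to_srt(example_words))
```

A production captioner would more likely break cues on pauses, punctuation, or speaker changes than on a fixed word count.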

Scribe currently works only with pre-recorded audio, which means it isn’t yet effective for meeting transcriptions or voice note-taking. The company said it will launch a low-latency real-time version of the model soon.

ElevenLabs is pricing Scribe at $0.40 for an hour of transcribed audio. While the rate is competitive, some of its rivals currently offer a lower price for audio transcription, with some feature differentiation.
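As a rough back-of-the-envelope aid, the quoted rate can be turned into a cost estimate. The Python sketch below assumes billing is strictly proportional to audio duration, with no minimum charge or rounding to billing increments. The article does not state these details, and the function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_HOUR = Decimal("0.40")  # USD per hour of audio, as quoted in the article


def transcription_cost(audio_seconds: int) -> Decimal:
    """Estimate cost in USD, assuming purely pro-rata billing (an assumption)."""
    hours = Decimal(audio_seconds) / Decimal(3600)
    return (hours * PRICE_PER_HOUR).quantize(Decimal("0.01"))


print(transcription_cost(5400))     # 90 minutes of audio
print(transcription_cost(360_000))  # 100 hours of audio
```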
