Scale AI launches Voice Showdown, the first real-world benchmark for voice AI, and the results are humbling for some top models

Voice AI is moving faster than the tools we use to measure it. Every major AI lab (OpenAI, Google DeepMind, Anthropic, xAI) is racing to ship voice models capable of natural, real-time conversation.

But the benchmarks used to evaluate these models are largely still running on synthetic speech, English-only prompts, and scripted test sets that bear little resemblance to how people actually talk.

Scale AI, the large data annotation startup whose founder was poached by Meta last year to lead its Superintelligence Lab, is nonetheless going strong and tackling the problem head on: today it launches Voice Showdown, which it calls the first global preference-based arena designed to benchmark voice AI through the lens of real human interaction.

The product offers users a distinctive strategic value: free access to the world’s leading frontier models. Through Scale’s ChatLab platform, users can interact with top-tier models, which typically require multiple $20-per-month subscriptions, at no cost. In exchange, users participate in occasional blind, head-to-head “battles” to pick which of two anonymized leading voice models offers a better experience, providing data for the industry’s most authentic human-preference leaderboard of voice AI models.

“Voice AI is actually the fastest-moving frontier in AI right now,” said Janie Gu, product manager for Showdown at Scale AI. “But the way that we evaluate voice models hasn’t kept up.”

The results, drawn from hundreds of spontaneous voice conversations across more than 60 languages, reveal capability gaps that other benchmarks have consistently missed.

How Scale’s Voice Showdown works

Voice Showdown is built on ChatLab, Scale’s model-agnostic chat platform where users can freely interact with whichever frontier AI model they choose, at no cost, within a single app. The platform has been available to Scale’s global community of over 500,000 annotators, with roughly 300,000 having submitted at least one prompt. Scale is opening the platform to a public waitlist today.

The evaluation mechanism is elegant in its simplicity: while a user is having a natural voice conversation with a model, the system occasionally (on fewer than 5% of all voice prompts) surfaces a blind side-by-side comparison. The same prompt is sent to a second, anonymous model, and the user picks which response they prefer.
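
The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not Scale’s implementation: the function names, the 4% `BATTLE_RATE` (the article says only “fewer than 5%”), and the placeholder model IDs are all assumptions.

```python
import random

# Assumption: battles appear on "fewer than 5%" of voice prompts;
# 4% is an illustrative value, not a published figure.
BATTLE_RATE = 0.04


def should_trigger_battle(rng: random.Random, rate: float = BATTLE_RATE) -> bool:
    """Decide whether this voice prompt surfaces a blind comparison."""
    return rng.random() < rate


def pick_challenger(rng: random.Random, current_model: str, pool: list[str]) -> str:
    """Pick a second model, different from the one the user is talking to."""
    candidates = [m for m in pool if m != current_model]
    if not candidates:
        raise ValueError("need at least one other model in the pool")
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    pool = ["model-a", "model-b", "model-c"]  # hypothetical model IDs
    battles = sum(should_trigger_battle(rng) for _ in range(100_000))
    print(f"battle share: {battles / 100_000:.2%}")
    print("challenger:", pick_challenger(rng, "model-a", pool))
```

Keeping the rate low means most of a user’s conversation is uninterrupted, which is what lets the prompts stay organic rather than written for the benchmark.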

This design solves three problems that plague existing voice benchmarks.

First, every prompt comes from real human speech, with accents, background noise, half-finished sentences, and conversational filler, rather than synthesized audio generated from text.

Second, the platform spans more than 60 languages across 6 continents, with over a third of battles occurring in non-English languages including Spanish, Arabic, Japanese, Portuguese, Hindi, and French.

Third, because battles occur within users’ actual daily conversations, 81% of prompts are conversational or open-ended: questions without a single correct answer. That rules out automated scoring and makes human preference the only credible signal.

Voice Showdown currently runs two evaluation modes: Dictate (users speak, models respond with text) and Speech-to-Speech, or S2S (users speak, models talk back). A third mode, Full Duplex, which captures real-time, interruptible conversation, is in development.

Incentive-aligned voting

One design detail sets Voice Showdown apart from Chatbot Arena (LM Arena), the text benchmark it most closely resembles. In LM Arena, critics have noted that users often cast throwaway votes with little stake in the outcome. Voice Showdown addresses this directly: after a user votes for the model they preferred, the app switches them to that model for the rest of their conversation. If you voted for GPT-4o Audio over Gemini, you are now talking to GPT-4o Audio. That alignment of consequence with preference discourages casual or dishonest voting.
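
The vote-then-switch rule is simple enough to express as a small state update. The sketch below is hypothetical: `Session` and `resolve_battle` are illustrative names, not part of any published Scale API.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical per-user conversation state."""
    active_model: str
    votes: list[tuple[str, str]] = field(default_factory=list)  # (winner, loser)


def resolve_battle(session: Session, challenger: str, preferred: str) -> Session:
    """Record the vote and continue the conversation with the preferred model."""
    if challenger == session.active_model:
        raise ValueError("challenger must differ from the active model")
    contenders = {session.active_model, challenger}
    if preferred not in contenders:
        raise ValueError(f"{preferred!r} was not part of this battle")
    loser = (contenders - {preferred}).pop()
    session.votes.append((preferred, loser))
    session.active_model = preferred  # the vote has a consequence for the voter
    return session


if __name__ == "__main__":
    session = Session(active_model="Gemini")
    resolve_battle(session, challenger="GPT-4o Audio", preferred="GPT-4o Audio")
    print(session.active_model)
```

Because the voter keeps talking to whichever model they picked, a careless vote costs them the better response for the rest of the conversation.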

The system also controls for confounds that could corrupt comparisons: both model responses begin streaming simultaneously (eliminating speed bias), voice gender is matched across both options (eliminating gender preference bias), and neither model is identified by name during voting.

The new voice AI leaderboard every enterprise decision-maker should pay attention to

Voice Showdown launches with 11 frontier models evaluated across 52 model-voice pairs as of March 18, 2026. Not all models support both evaluation modes: the Dictate leaderboard includes 8 models, while S2S includes 6.

Dictate Leaderboard (Speech-In, Text-Out)

In this mode, users provide a spoken prompt and evaluate two side-by-side text responses. Here are the baseline scores:

Gemini 3 Pro (1073)
Gemini 3 Flash (1068)
GPT-4o Audio (1019)
Qwen 3 Omni (1000)
Voxtral Small (925)
Gemma 3n (918)
GPT Realtime (875)
Phi-4 Multimodal (729)

Note: Gemini 3 Pro and Gemini 3 Flash are statistically tied for the top rank.

Speech-to-Speech (S2S) Leaderboard

In this mode, users speak to the model and evaluate two competing audio responses. These are also baseline scores:

Gemini 2.5 Flash Audio (1060)
GPT-4o Audio (1059)
Grok Voice (1024)
Qwen 3 Omni (1000)
GPT Realtime (962)
GPT Realtime 1.5 (920)

Note: Gemini 2.5 Flash Audio and GPT-4o Audio are statistically tied for the top rank in baseline evaluations.
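
For readers unfamiliar with Elo-style scores, the gaps in these lists can be translated into rough head-to-head expectations. The sketch below uses the conventional Elo logistic curve with a 400-point scale. The article does not say which rating model or scale Scale uses, so treat the outputs as illustrative only.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Conventional Elo expectation that A is preferred over B.

    Assumption: standard base-10 logistic with a 400-point scale; the
    leaderboard's actual rating model is not described in the article.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Baseline scores quoted in the leaderboards above.
DICTATE_TOP, DICTATE_BOTTOM = 1073, 729  # first and last Dictate entries
S2S_FIRST, S2S_SECOND = 1060, 1059       # the two S2S leaders

if __name__ == "__main__":
    print(f"Dictate top vs. bottom: {expected_win_rate(DICTATE_TOP, DICTATE_BOTTOM):.1%}")
    print(f"S2S first vs. second:   {expected_win_rate(S2S_FIRST, S2S_SECOND):.1%}")
```

Under that assumption, the 344-point spread across the Dictate list corresponds to roughly an 88% expected preference rate, while the 1-point gap at the top of S2S is effectively a coin flip, which fits the “statistically tied” note.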

Dictate rankings are led by Google’s Gemini 3 Pro and Gemini 3 Flash, which are statistically tied at #1 with Elo scores around 1,043-1,044 after style controls.

GPT-4o Audio holds a clear third place. Open-weight models including Gemma 3n, Voxtral Small, and Phi-4 Multimodal trail significantly.

Speech-to-Speech (S2S) rankings show a tighter race at the top, with Gemini 2.5 Flash Audio and GPT-4o Audio statistically tied at #1 in the baseline rankings.

After adjusting for response length and formatting, factors that can inflate perceived quality, GPT-4o Audio pulls ahead (1,102 Elo vs. 1,075 for Gemini 2.5 Flash Audio).

Grok Voice jumps to a close second at 1,093 under style controls, suggesting its raw #3 ranking undersells its actual performance quality.

Qwen 3 Omni, the open-weight model from Alibaba’s Qwen team, performs better on pure preference than its popularity would suggest, ranking fourth in both modes, ahead of several higher-profile names.

“When people come in, they go for the big names,” Gu noted. “But for preference, lesser-known models like Qwen actually pull ahead.”

Surprises revealed by real-world preference data

Beyond rankings, Voice Showdown’s real value is in the failure diagnostics, and those paint a more complicated picture of voice AI than most leaderboards reveal.

The multilingual gap is worse than you think

Language robustness is the starkest differentiator across models. In Dictate, Gemini 3 models lead across essentially every language tested.

In S2S, the winner depends heavily on which language is being spoken: GPT-4o Audio leads in Arabic and Turkish; Gemini 2.5 Flash Audio is strongest in French; Grok Voice is competitive in Japanese and Portuguese.

But the more alarming finding is how frequently some models simply stop responding in the user’s language at all.

GPT Realtime 1.5, OpenAI’s newer real-time voice model, responds in English to non-English prompts roughly 20% of the time, even on high-resource, officially supported languages like Hindi, Spanish, and Turkish.

Its predecessor, GPT Realtime, mismatches at about half that rate (~10%). Gemini 2.5 Flash Audio and GPT-4o Audio sit at ~7%.

The phenomenon runs in both directions: some models carry non-English context from earlier in a conversation into an English turn, or simply mishear a prompt and generate an unrelated response in the wrong language entirely.

User verbatims from the platform capture the frustration bluntly: “I said I have an interview today with Quest Management and instead of answering, it gave me information about ‘Risk Management.’”

“GPT Realtime 1.5 thought I was speaking incoherently and recommended mental health help, while Qwen 3 Omni correctly identified I was speaking a Nigerian local language.”

The reason existing benchmarks miss this: they’re built on synthetic speech optimized for clean acoustic conditions, and they’re rarely multilingual. Real speakers in real environments, with background noise, short utterances, and regional accents, break speech understanding in ways lab conditions do not anticipate.

Voice selection is more than aesthetics

Voice Showdown evaluates models not just at the model level but at the individual voice level, and the variance within a single model’s voice catalog is striking.

For one unnamed model in the study, the best-performing voice won 30 percentage points more often than the worst-performing voice from the same underlying model. Both voices share the same reasoning and generation backend. The difference is purely in audio presentation.

The top-performing voices tend to win or lose on audio understanding and content completeness: whether the model heard you correctly and answered fully. But speech quality remains a deciding factor at the voice selection level, particularly when models are otherwise comparable. “Voice directly shapes how users evaluate the interaction,” Gu said.

Models degrade in conversation

Most benchmarks test a single turn. Voice Showdown tests how models hold up across extended conversations, and the results aren’t flattering.

On Turn 1, content quality accounts for 23% of model failures. By Turn 11 and beyond, it becomes the primary failure mode at 43%. Most models see their win rates decline as conversations lengthen, struggling to maintain coherence across multiple exchanges.

GPT Realtime variants are an exception, marginally improving on later turns, consistent with their known strengths on longer contexts and their documented weakness on the brief, noisy utterances that dominate early interactions.

Prompt length shows a complementary pattern: short prompts (under 10 seconds) are dominated by audio understanding failures (38%), while long prompts (over 40 seconds) shift the primary failure toward content quality (31%). Shorter audio gives models less acoustic context to parse; longer requests are understood but harder to answer well.

Why some voice AI models lose

After every S2S comparison, users tag why they preferred one response over the other across three axes: audio understanding, content quality, and speech output. The failure signatures differ meaningfully by model.

Qwen 3 Omni’s losses cluster around speech generation: its reasoning is competitive, but users are put off by how it sounds. GPT Realtime 1.5’s losses are dominated by audio understanding failures (51%), consistent with its language-switching behavior on challenging prompts. Grok Voice’s failures are more balanced across all three axes, indicating no single dominant weakness but no particular strength either.
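
A per-model failure signature of this kind is, at its simplest, a tally of tags on the losing side. The sketch below assumes each battle yields one `(losing_model, axis)` tag. That record shape, the axis identifiers, and the sample data are all made up for illustration and are not drawn from Scale’s dataset.

```python
from collections import Counter, defaultdict

# Hypothetical identifiers for the three axes named in the article.
AXES = ("audio_understanding", "content_quality", "speech_output")


def failure_shares(tags: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Turn (losing_model, axis) tags into per-model percentage shares."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for model, axis in tags:
        if axis not in AXES:
            raise ValueError(f"unknown axis: {axis!r}")
        counts[model][axis] += 1
    return {
        model: {axis: 100.0 * c[axis] / sum(c.values()) for axis in AXES}
        for model, c in counts.items()
    }


if __name__ == "__main__":
    # Made-up tags for two placeholder models, purely to show the shape.
    sample = [
        ("model-x", "audio_understanding"),
        ("model-x", "audio_understanding"),
        ("model-x", "content_quality"),
        ("model-x", "speech_output"),
        ("model-y", "speech_output"),
        ("model-y", "speech_output"),
    ]
    for model, shares in failure_shares(sample).items():
        print(model, {axis: f"{pct:.0f}%" for axis, pct in shares.items()})
```

A model whose shares are roughly even across the three axes would look like the “balanced” profile described for Grok Voice, while a single dominant axis points to one specific weakness.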

What’s next

The current leaderboard covers turn-based interaction: you speak, the model responds, repeat. But real voice conversations do not work that way. People interrupt, change direction mid-sentence, and talk over each other.

Scale says Full Duplex evaluation, designed to capture these real-time dynamics through human preference rather than scripted scenarios or automated metrics, is coming to Showdown next. No existing benchmark captures full-duplex interaction through organic human preference data.

The leaderboard is live at scale.com/showdown. A public waitlist to join ChatLab and vote on comparisons is open today, with users receiving free access to frontier voice models including GPT-4o, Gemini, and Grok in exchange for occasional preference votes.
