
Corti’s Speech Model Is the Case for Healthcare AI Specialists

Anatoliy Kolodkin

20 May 2026 • 5 min read

The most interesting AI model release today is not another general-purpose system trying to win every leaderboard at once. It is a speech model arguing that some domains are too specific, too regulated, and too failure-sensitive for “good enough” transcription.

Corti launched Symphony for Speech-to-Text, a clinical speech recognition model and API for real-time dictation, conversational transcription, and batch audio processing. VentureBeat’s headline number is hard to ignore: Corti says Symphony reached 1.4% word error rate on English medical terminology, compared with 17.7% for OpenAI, 18.1% for ElevenLabs, 17.4% for Whisper, and 18.9% for Parakeet. On real-world English medical dictation, Corti reports 4.6% WER versus 5.7% for Dragon Medical One. In German medical speech, Corti claims 2.4% WER versus 13.0% for the next-best system; in French, 3.9% versus 10.6%.

Those are vendor-reported numbers, so nobody should swap production infrastructure on a press quote. But the direction is more important than the vanity metric: in healthcare, the error distribution matters more than the average. A generic model can look fine across normal words while failing exactly where the application is least allowed to fail — medications, dosages, measurements, symptoms, negations, acronyms, and clinical shorthand.

The transcript is becoming infrastructure

Corti’s arXiv paper makes the strongest version of the argument: medical speech recognition is not “turn audio into plausible text.” It is structured inference over clinical language. The system has to recognize specialized terminology, render measurements and abbreviations in clinically useful forms, correct context-sensitive ambiguity, support low-latency interfaces, and surface audio-quality problems before the encounter is over. That is a much higher bar than meeting-note transcription.

The API design reflects that. Symphony exposes three surfaces: /transcribe for stateless real-time dictation over secure WebSockets, /streams for stateful real-time conversational transcription, and /transcripts for asynchronous batch processing over REST. Under the hood, Corti says the system decomposes transcription into recognition, formatting, and contextual correction. It also supports speaker diarization, keyterm biasing, structured transcript generation, command-and-control workflows, and real-time audio-quality events.

That decomposition is the point. A normal speech model hears words. A clinical speech layer has to produce text that downstream software can safely reason over. If an ambient scribe, coding tool, EHR assistant, prior-authorization workflow, or triage agent consumes the transcript, the transcript stops being a document and becomes the input database. Corrupt the database and every “AI assistant” downstream inherits the mistake with more confidence and better formatting.

Andreas Cleve, Corti’s co-founder and CEO, told VentureBeat that “speech recognition requires more than simply producing a transcript” in the agentic era because systems need “accurate clinical facts to reason from.” That is not marketing fluff. It is the architecture constraint. If a system mishears “hyperthyroidism” as “hypothyroidism,” drops a negation, or turns “five milligrams” into “fifty milligrams,” the failure is not cosmetic. It changes clinical meaning.

Entity recall beats a prettier transcript

The most useful number in Corti’s launch is not the 1.4% WER. It is the claim that Symphony reached 98.3% recall on formatted clinical entities such as dosages, measurements, and dates, while the strongest general-purpose baseline reached 44.3%. That metric is closer to how practitioners should evaluate healthcare speech systems. Aggregate WER tells you how many words were wrong. Entity recall tells you whether the system captured the pieces that carry operational and safety risk.
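To make the distinction concrete, here is a minimal Python sketch of both metrics. It is not Corti's evaluation code: the function names, the sample sentence, and the exact-token matching rule for entities are assumptions made for illustration. A real harness would normalize numbers, units, and punctuation before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic Levenshtein dynamic programme, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0


def _contains_phrase(tokens: list[str], phrase: list[str]) -> bool:
    """True if `phrase` occurs as a contiguous run of whole tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def entity_recall(expected_entities: list[str], hypothesis: str) -> float:
    """Share of expected entities that appear verbatim in the hypothesis."""
    if not expected_entities:
        return 1.0
    hyp_tokens = hypothesis.lower().split()
    found = sum(
        _contains_phrase(hyp_tokens, entity.lower().split())
        for entity in expected_entities
    )
    return found / len(expected_entities)


# Illustrative example (invented, not from any benchmark).
reference = "start metoprolol 5 mg twice daily and recheck in 2 weeks"
hypothesis = "start metoprolol 50 mg twice daily and recheck in 2 weeks"
entities = ["metoprolol", "5 mg", "2 weeks"]

print(f"WER: {word_error_rate(reference, hypothesis):.3f}")          # 0.091
print(f"Entity recall: {entity_recall(entities, hypothesis):.3f}")   # 0.667
```

One wrong word out of eleven looks like a respectable 9% WER, yet a third of the clinically important entities are gone, and the one that went missing is the dose.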

For developers, this should change the procurement checklist. Do not ask only whether a speech API “sounds accurate” in a demo. Build a domain test set with the terms your clinicians actually use. Include medication names, brand/generic variants, abbreviations, units, ranges, laterality, acronyms, dates, allergies, family-history phrases, ruled-out diagnoses, and specialty-specific shorthand. Test the audio you actually have: bad microphones, emergency department noise, accents, interruptions, crosstalk, masks, Bluetooth compression, and clinicians dictating too quickly because they are clinicians.
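One way to keep that test set honest is to tag every clip with its recording condition and report scores per condition, so a strong quiet-room average cannot hide a weak emergency-department result. The sketch below is an assumed layout, not a standard: the `DictationCase` fields, the condition labels, the file paths, and the scores are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DictationCase:
    """One test clip. Field names are illustrative, not a standard schema."""
    audio_path: str
    reference: str
    audio_condition: str            # e.g. "quiet office", "emergency department"
    entities: tuple[str, ...] = ()  # terms that must survive transcription


def mean_score_by_condition(
    cases: list[DictationCase], scores: list[float]
) -> dict[str, float]:
    """Average a per-clip score (WER, entity recall, ...) per audio condition."""
    if len(cases) != len(scores):
        raise ValueError("need exactly one score per test case")
    grouped: dict[str, list[float]] = defaultdict(list)
    for case, score in zip(cases, scores):
        grouped[case.audio_condition].append(score)
    return {condition: mean(values) for condition, values in grouped.items()}


# Hypothetical clips and per-clip WER values, for illustration only.
cases = [
    DictationCase("clips/001.wav", "no known drug allergies", "quiet office"),
    DictationCase("clips/002.wav", "left knee effusion", "quiet office"),
    DictationCase("clips/003.wav", "heparin 5000 units", "emergency department"),
]
per_clip_wer = [0.02, 0.04, 0.20]

for condition, wer in mean_score_by_condition(cases, per_clip_wer).items():
    print(f"{condition}: {wer:.2f}")
```

With these made-up numbers the overall mean WER is under 9%, while the emergency-department clip sits at 20%. The per-condition view is the one that tells you where the system will fail in production.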

Then measure the failure modes that matter. Word error rate is one line item. Add medical-term recall, dosage and unit accuracy, negation preservation, speaker attribution, formatting correctness, latency to interim and final results, hallucinated insertions, audio-quality alerts, and correction burden per note. If your downstream product uses the transcript to generate billing codes, clinical summaries, orders, or patient instructions, evaluate the transcript as safety-critical infrastructure, not a convenience feature.
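Two of those line items, negation preservation and dosage accuracy, can be approximated with very little code. The following is a deliberately crude sketch: the cue-word list, the unit list, and the function names are assumptions, and a production check would need clinical negation-scope handling and proper unit normalization rather than keyword counting.

```python
import re
from collections import Counter

# Assumed, incomplete cue list; real clinical negation detection is harder.
NEGATION_CUES = frozenset({"no", "not", "denies", "without", "negative"})

# Matches a number followed by one of a few assumed dose units.
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|units)\b", re.IGNORECASE)


def negation_cues(text: str) -> Counter:
    """Count negation cue words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word in NEGATION_CUES)


def negation_preserved(reference: str, hypothesis: str) -> bool:
    """True if the hypothesis has the same negation cues, with the same counts."""
    return negation_cues(reference) == negation_cues(hypothesis)


def doses(text: str) -> Counter:
    """Count (value, unit) dose mentions such as (5.0, 'mg')."""
    return Counter(
        (float(value), unit.lower()) for value, unit in DOSE_PATTERN.findall(text)
    )


def dose_recall(reference: str, hypothesis: str) -> float:
    """Share of reference dose mentions reproduced exactly in the hypothesis."""
    expected = doses(reference)
    if not expected:
        return 1.0
    matched = sum((expected & doses(hypothesis)).values())
    return matched / sum(expected.values())


# Illustrative example (invented sentences).
reference = "Patient denies chest pain. No fever. Continue lisinopril 5 mg daily."
hypothesis = "Patient has chest pain. No fever. Continue lisinopril 50 mg daily."

print(negation_preserved(reference, hypothesis))  # False
print(dose_recall(reference, hypothesis))         # 0.0
```

Both checks fail on a hypothesis that differs from the reference by only two words, which is the kind of miss an aggregate score tends to bury.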

This is where vertical AI has a real argument. The “bigger generalist model will absorb everything” story is appealing because it simplifies buying decisions. One model, one vendor, one integration, one dashboard. Healthcare does not reward that simplicity if the model’s training objective and evaluation set are misaligned with clinical risk. Corti’s paper says Symphony uses public speech and text-to-speech data plus a large proprietary Corti corpus spanning consultations, dictations, conversational speech, and everyday interactions, with synthetic examples for rare medical terminology and measurements. That kind of domain data is not a nice garnish. It is the product moat.

Specialization still needs independent proof

The caveat is obvious and important: Corti is the vendor, the benchmark publisher, and the company with the proprietary corpus. That does not make the results wrong. It does mean buyers should reproduce a smaller evaluation on their own workflows before making Symphony or any other clinical speech model a foundation dependency. Healthcare AI has enough leaderboard theater already; the answer is not to replace generic-model hype with vertical-model hype in a lab coat.

Corti is at least pointing in the right direction by releasing a benchmark dataset, MedDictate, on Hugging Face under corti/med-dictate. The paper says it contains nearly two hours of English, French, and German dictations by medical professionals, spanning domains such as radiology and psychology. Two hours is not enough to settle clinical speech recognition, but public, domain-specific evaluation data is exactly what the market needs more of. Without shared tests, every buyer gets trapped in bespoke demos where the model sounds great until it meets the hospital’s actual microphones.

The broader Symphony context is also worth reading with care. Corti recently promoted Symphony for Medical Coding, a multi-agent coding system that it says uses codified rules and evidence-backed reasoning rather than pure pattern matching. It also claims its broader clinical model outperformed OpenAI and Anthropic systems on certain healthcare benchmarks. Those claims may be true, partly true, or overfit to Corti’s framing. But taken together, they show the company’s strategic bet: healthcare AI will be won by vendors that own domain-specific data layers and workflow primitives, not by wrappers around generalist APIs.

That is the right bet more often than the market wants to admit. In regulated domains, “close enough” is not close enough if the misses cluster around liability. A generalist model can still be useful in healthcare — summarization, drafting, search, coding assistance, internal tooling — but it should not automatically own the ingestion layer for clinical facts. Garbage in, beautifully formatted garbage out.

The practical move for builders is not “buy Corti.” It is “stop treating speech-to-text as interchangeable plumbing.” If voice is feeding an agent, an EHR workflow, a billing process, or a patient-facing output, the speech layer deserves the same evaluation discipline as the reasoning model. Build the test set. Measure entity recall. Inspect failure modes. Add audio-health UX. Keep humans in the loop where mistakes carry patient, billing, or compliance consequences.

Corti’s launch is useful because it pushes the conversation away from model size and toward model fit. The best AI system in healthcare may not be the one with the broadest benchmark résumé. It may be the one whose errors were optimized around the domain’s actual danger zone. That is less glamorous than another frontier-model leaderboard. It is also how production software gets trusted.

Sources: VentureBeat, Corti arXiv paper, Corti Medical Coding background
