#3189: Drawing the Melody: SSML's Hidden Power

How SSML gives developers narrative control over AI voices — and why ElevenLabs became its center of gravity.

Episode Details

Episode ID: MWP-3359
Published: Jun 1
Duration: 28:30
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: text-to-speech audio-engineering conversational-ai

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Speech Synthesis Markup Language (SSML) is an XML-based W3C standard that gives developers explicit control over how text-to-speech engines render audio. While modern neural TTS models produce remarkably natural prosody from plain text, they can't read your mind — they don't know which word to emphasize, where to pause for dramatic effect, or when a character should whisper. SSML fills that gap.

The prosody tag is the workhorse, controlling rate, pitch, volume, and contour. Contour is the most misunderstood and powerful feature: it lets you define a pitch curve over a phrase by specifying time-position and pitch pairs, essentially drawing the melody of a sentence. The model treats these as strong suggestions, interpolating between your specified points while still applying its own learned prosody model.

ElevenLabs has emerged as the center of gravity for SSML adoption, supporting about 28 tags as of June 2026. Its custom emotion tag — not part of the W3C spec — allows developers to apply emotional labels like excited, sad, whisper, or narration to any voice, including cloned voices. Under the hood, these activate different regions of the model's latent space, adjusting spectral qualities like breathiness and tension without requiring manual acoustic parameter specification.

The biggest pitfall? Over-tagging. Neural networks already have strong internal prosody models. Replacing every pause and pitch change produces technically correct but emotionally dead output. The best SSML applies the lightest touch, using markup to inject domain knowledge the model can't infer from text alone.

#3189: Drawing the Melody: SSML's Hidden Power

Daniel sent us this one about SSML — Speech Synthesis Markup Language — and how ElevenLabs has kind of become the gravitational center for the standard. The core question is: what can you actually mark up to control prosody and expression in TTS workflows? There's a lot to dig into here, because SSML is one of those things that sits between raw text and genuinely human-sounding speech, and most developers never touch it.

I mean, here's the thing — you can feed the same sentence into ElevenLabs with plain text and with SSML, and the difference is night and day. Take something like "I didn't say she stole the money." Without markup, the model guesses where to place emphasis. With SSML, you can direct the stress to any word in that sentence and get seven completely different meanings. That's not a party trick — that's narrative control.

Seven different accusations in one sentence.

And that's what SSML gives you — the ability to tell the neural network "no, emphasize this word, slow down here, pause for effect there." Without it, you're just hoping the model reads your mind.
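The seven-readings example can be sketched in a few lines of Python. This is a minimal illustration that uses the W3C emphasis tag; whether a given provider honors emphasis is an assumption to check against its documentation, and the function name is made up for this sketch.

```python
from xml.sax.saxutils import escape


def emphasis_variants(sentence: str, level: str = "strong") -> list[str]:
    """Return one SSML document per word, each stressing a different word."""
    words = sentence.split()
    variants = []
    for stressed in range(len(words)):
        parts = []
        for index, word in enumerate(words):
            if index == stressed:
                # Wrap only the stressed word in the W3C <emphasis> tag.
                parts.append(f'<emphasis level="{level}">{escape(word)}</emphasis>')
            else:
                parts.append(escape(word))
        variants.append("<speak>" + " ".join(parts) + "</speak>")
    return variants
```

Calling emphasis_variants("I didn't say she stole the money.") yields seven documents, one per word, which can be rendered and compared by ear.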

Before we get into the weeds of specific tags, let's level-set on what SSML actually is and why it matters more now than ever.

SSML is an XML-based markup language standardized by the W3C — the World Wide Web Consortium — back in 2004, with version 1.1 published in September 2010. It was designed to give authors explicit control over speech synthesis beyond just feeding in plain text. Think of it as a declarative annotation layer that sits between your text and the TTS engine's neural network. It's not a programming language — you're not writing logic, you're annotating how things should sound.

It's more like stage directions than a script.

And the landscape is fragmented. Amazon Polly, Google Cloud TTS, Microsoft Azure, IBM Watson, and ElevenLabs all implement overlapping but non-identical subsets of the W3C spec. Azure supports around forty tags total. ElevenLabs supports about twenty-eight as of June 2026. But here's the thing — ElevenLabs has emerged as the center of gravity because their expressive voice models are just better, and their documentation is developer-friendly.

Their parser is more forgiving than Azure's. Azure will throw errors on unsupported tags. ElevenLabs silently ignores them, which is both a blessing and a curse — great for prototyping, but you can ship bad markup and never know it.

Which is a real pitfall in production. We'll come back to that. But the bigger point is: why should a developer care about SSML when modern TTS models already sound good out of the box?

Because "good" is the enemy of "right."

Modern neural TTS is remarkably good at general prosody — it knows how to read a declarative sentence with natural intonation. But it doesn't know your content. It doesn't know that this particular brand name is pronounced differently, or that this paragraph is supposed to sound urgent, or that this character in your audiobook is whispering. SSML is how you inject that domain knowledge.

Long-form content is where the seams really show. A model might nail individual sentences but lose coherence over paragraphs — pacing drifts, emotional tone wanders. SSML gives you guardrails.

Which brings us to the actual tags. So with the landscape mapped, let's start with the tag that gives you the most control per character typed: prosody.

The prosody tag controls rate, pitch, volume, and contour. ElevenLabs supports all four attributes. Rate accepts both relative values — like plus twenty percent — and absolute values like one-point-five-x. Pitch can be specified in hertz, semitones, or relative percentages. Volume is straightforward — relative or absolute decibel adjustments.

Contour is where it gets interesting.

Contour is the most misunderstood feature in SSML, and it's incredibly powerful. It lets you define a pitch curve over the duration of a phrase by specifying a list of time-position and pitch pairs. So you'd write something like contour="(0%,+0Hz) (50%,+100Hz) (100%,-50Hz)": each pair sits in parentheses with a comma inside, and in the W3C spec the pairs themselves are separated by spaces. The engine interpolates between those points, creating a rising-then-falling intonation curve.

You're essentially drawing the melody of the sentence.

You're drawing the melody. And this matters because different languages have different intonation patterns. English tends to use rising intonation for questions and falling for statements, but within a single complex sentence, you might want a rise in the middle to signal that more is coming, then a fall at the end for closure. Contour gives you that control.

How does that actually map to the neural network internally? Is it directly manipulating the pitch contour of the output waveform, or is it more like a hint that the model interprets?

It's more of a hint. The contour values are embedded as conditioning vectors that get fed into the decoder alongside the text embeddings. The model isn't forced to hit those exact pitch values — it's nudged in that direction while still applying its own learned prosody model. So if you set a contour that's wildly unnatural, the model will try to compromise between your instruction and what sounds human.

It's a suggestion, not a command.

A very strong suggestion. And that's actually good design — it prevents you from producing completely broken audio. But it also means you need to test your contours empirically rather than assuming they'll be rendered exactly as specified.
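As a rough sketch of the contour idea, the helper below formats position and pitch pairs into a W3C-style contour value and estimates the target pitch offset between points. Linear interpolation is an assumption made for illustration; the episode only says the engine interpolates and treats the result as a hint, so the real output may differ. Both function names are hypothetical.

```python
def contour_attr(points):
    """Format (position_percent, offset_hz) pairs as an SSML contour value."""
    # W3C syntax: space-separated "(position%,offsetHz)" targets.
    return " ".join(f"({pos:g}%,{hz:+g}Hz)" for pos, hz in sorted(points))


def pitch_offset_at(points, position):
    """Linearly interpolate the pitch offset in Hz at a position in percent."""
    pts = sorted(points)
    if position <= pts[0][0]:
        return float(pts[0][1])
    for (p0, h0), (p1, h1) in zip(pts, pts[1:]):
        if position <= p1:
            return h0 + (h1 - h0) * (position - p0) / (p1 - p0)
    # Past the last target: hold the final offset.
    return float(pts[-1][1])
```

With points = [(0, 0), (50, 100), (100, -50)], contour_attr(points) gives the attribute value to place inside a prosody tag, and pitch_offset_at(points, 25) gives the assumed midpoint of the rise.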

What about stacking? If I put an emotion tag and a prosody tag on the same phrase, do they compound or conflict?

They compound, but the behavior is nuanced. ElevenLabs processes tags in document order, and later tags can override earlier ones for overlapping attributes. But emotion operates on a different axis than prosody. Emotion adjusts the voice's spectral qualities — breathiness, tension, resonance — while prosody adjusts the acoustic parameters. So an emotion equals excited plus a prosody rate equals slow creates an interesting tension. The model will sound excited in timbre but slow in pace, which can be useful for things like a character who's breathless with excitement but trying to speak deliberately.

Like someone giving a wedding toast after running up a flight of stairs.

That's oddly specific, but yes. And that combination is something you simply cannot achieve with plain text input. The model left to its own devices would either sound excited and fast, or calm and slow. The mixed signal requires explicit markup.

Let's talk about breaks and pauses. This seems like the simplest tag but also the easiest to overuse.

The break tag is deceptively powerful. It takes either a time attribute — like two hundred fifty milliseconds, one point five seconds — or a strength attribute with values like none, x-weak, weak, medium, strong, and x-strong. In ElevenLabs, x-strong produces a full one-second pause. By comparison, Amazon Polly's x-strong is about eight hundred milliseconds.

ElevenLabs gives you an extra two hundred milliseconds of dramatic silence.

Which doesn't sound like much until you're listening to an audiobook and the narrator pauses for exactly the right beat before a reveal. That extra two hundred milliseconds is the difference between "I need to tell you something..." feeling natural versus feeling stilted.

The danger is that people start peppering x-strong breaks everywhere and the output sounds like William Shatner reading a grocery list.

Over-tagging is the single biggest mistake people make with SSML. The neural network already has a good internal prosody model. If you override every pause and every pitch change, you get something that sounds technically correct but emotionally dead — what audio engineers call "uncanny valley" speech. The best SSML is often the lightest touch.
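To make the break discussion concrete, here is a small sketch that builds break tags from either a time or a W3C strength value, plus a crude over-tagging check that reports breaks per spoken word. The function names and the idea of a density metric are illustrative assumptions, not part of any provider's tooling.

```python
import xml.etree.ElementTree as ET

# Strength values defined by the W3C SSML spec.
W3C_BREAK_STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")


def break_tag(*, time_ms=None, strength=None):
    """Build a <break/> tag from exactly one of time_ms or strength."""
    if (time_ms is None) == (strength is None):
        raise ValueError("pass exactly one of time_ms or strength")
    if time_ms is not None:
        if time_ms < 0:
            raise ValueError("time_ms must be non-negative")
        return f'<break time="{int(time_ms)}ms"/>'
    if strength not in W3C_BREAK_STRENGTHS:
        raise ValueError(f"unknown strength: {strength!r}")
    return f'<break strength="{strength}"/>'


def break_density(ssml):
    """Return the number of <break/> tags per spoken word in the document."""
    root = ET.fromstring(ssml)
    n_breaks = sum(1 for _ in root.iter("break"))
    n_words = len("".join(root.itertext()).split())
    return n_breaks / max(n_words, 1)
```

A density that creeps toward one break every few words is a reasonable cue to listen again and remove some; the right threshold is a judgment call.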

Let's move to the tag that really sets ElevenLabs apart — the emotion tag.

This is the big differentiator. The emotion tag is not in the W3C spec at all — it's a custom ElevenLabs extension added in API version one-point-three-point-oh, released March fifteenth of this year. It accepts attributes like excited, sad, angry, whisper, and narration. No other major provider offers this level of granular emotion control at the tag level. Azure has mstts colon express-as, which is their equivalent, but it's tied to specific voice styles rather than being a general-purpose emotion layer.

Azure's approach is "this voice can do these styles," while ElevenLabs says "apply this emotion to any voice."

And that's philosophically different. ElevenLabs' emotion tag is voice-agnostic. You can apply whisper to Adam or Rachel or any custom voice clone, and the model adjusts the spectral envelope accordingly. It's much more flexible.

What's actually happening under the hood when you apply whisper?

The model is being conditioned on a different region of its latent space. During training, ElevenLabs presumably annotated portions of their training data with emotional labels, and the model learned to associate certain acoustic profiles with those labels. So whisper activates a combination of reduced subglottal pressure simulation, increased aspiration noise, and lowered amplitude — all the physical correlates of actual whispering — without you having to specify any of those parameters manually.

It's a high-level semantic label that unpacks into a whole set of acoustic modifications.

And that's the direction SSML is heading — away from low-level acoustic parameters and toward semantic, intent-based markup. Instead of specifying pitch in hertz, you specify "this should sound urgent," and the model figures out the acoustic realization.

Which raises an interesting question about whether SSML eventually becomes obsolete if models get good enough at inferring intent from context alone.

I think we'll get to that. But first, let's talk about voice selection, because this is another area where ElevenLabs shines. The voice tag with a name attribute lets you switch voices mid-document. So you can have a single SSML document that contains a dialogue between Adam and Rachel, with each line tagged to the appropriate voice, and it all renders in a single API call.

Which is huge for things like customer service bots or audiobook production. You're not making separate calls and stitching audio together.

The voice tag is part of the core W3C spec, so it's portable across providers, but ElevenLabs' implementation is particularly smooth because their voice cloning quality means you can switch between highly realistic custom voices without any degradation. Here's a concrete example — imagine a customer service bot scenario. You'd have something like: voice name equals Adam — "Thank you for calling. How can I help you today?" Then voice name equals Rachel — "Hi, I'm calling about my order." And the output is a seamless conversation with two distinct, natural-sounding voices, generated in one shot.

The voices maintain their identity across the switch — Rachel doesn't suddenly sound like Adam doing an impression of Rachel.

Voice consistency is preserved because each voice has its own embedding that conditions the decoder independently for each tagged segment.
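The two-voice customer service example could be assembled like this. It is a sketch using the W3C voice tag and Python's standard ElementTree; the voice names come from the episode, and whether a provider accepts a bare speak root without version and namespace attributes is an assumption to verify.

```python
import xml.etree.ElementTree as ET


def dialogue_ssml(turns):
    """Build one SSML document from (voice_name, line) pairs."""
    speak = ET.Element("speak")
    for voice_name, line in turns:
        # One <voice> element per speaker turn; text is escaped on output.
        voice = ET.SubElement(speak, "voice", {"name": voice_name})
        voice.text = line
    return ET.tostring(speak, encoding="unicode")


CALL = [
    ("Adam", "Thank you for calling. How can I help you today?"),
    ("Rachel", "Hi, I'm calling about my order."),
]
```

dialogue_ssml(CALL) returns a single document that could be sent in one request, which is the single-call workflow described above.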

Prosody gets you eighty percent of the way there, but the remaining twenty percent — pronunciation, interpretation, and audio embedding — is where the real polish happens.

Let's start with pronunciation control, because this is where things get technical but also where the most common production failures happen. The phoneme tag lets you specify exact pronunciation using either the International Phonetic Alphabet or X-SAMPA. ElevenLabs supports both, but IPA is more reliable in practice.

When would you use phoneme versus the sub tag for pronunciation fixes?

The sub tag replaces text with an alternate string before synthesis — so you'd write sub alias equals World Health Organization, then the text WHO, and the engine reads "World Health Organization" but the output text shows WHO. The phoneme tag is for cases where no text substitution will produce the right pronunciation. So for medical terminology, proper nouns, or foreign words, phoneme is the right tool.

Give me an example.

Take the word "myelin" — the sheath around nerve fibers. A medical app generating neurology reports needs this pronounced correctly. You'd write: phoneme alphabet equals ipa ph equals forward slash maɪ dot əl dot ɪn forward slash, then the word myelin, close tag. Without this, some TTS engines will say "my-lin" rhyming with "violin," which is wrong. The IPA forces the correct three-syllable pronunciation: my-uh-lin.

If you're building a medical education app, getting that wrong isn't just annoying — it's misleading.

It undermines credibility. And this applies to brand names too. If your app is generating audio that mentions "Nike" and the engine says it to rhyme with "bike," you've got a problem. The sub tag handles that — sub alias equals nikey, then Nike — but for anything more phonetically complex, you need phoneme.
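A pronunciation lexicon along the lines described here might look like the sketch below. The sub and phoneme tags are standard W3C SSML; the IPA string is the one read out in the episode, and the helper names and the whole-word matching strategy are assumptions for illustration.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def sub(text, alias):
    """Speak `alias` while keeping `text` as the written form."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def phoneme(text, ipa):
    """Force an IPA pronunciation for `text`."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def apply_lexicon(text, lexicon):
    """Replace whole-word matches with their pre-built SSML tags."""
    if not lexicon:
        return escape(text)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, lexicon)) + r")\b")
    # With one capture group, split() puts matches at the odd indices.
    pieces = pattern.split(text)
    return "".join(
        lexicon[piece] if index % 2 else escape(piece)
        for index, piece in enumerate(pieces)
    )


LEXICON = {
    "WHO": sub("WHO", "World Health Organization"),
    "myelin": phoneme("myelin", "maɪ.əl.ɪn"),
}
```

Keeping the lexicon in one place means a corrected pronunciation is fixed everywhere the term appears, rather than tag by tag.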

Let's talk about say-as. This is the interpretation tag that tells the engine how to handle dates, times, numbers, and so on.

The say-as tag uses the interpret-as attribute with values like cardinal, ordinal, date, time, telephone, characters, and expletive. Each provider handles these slightly differently, and the differences matter. Let me give you a comparison. If you feed the date June first, twenty twenty-six — written as 2026-06-01 — to Azure with say-as interpret-as equals date format equals ymd, Azure reads it as "June first, twenty twenty-six." ElevenLabs' default behavior for the same input without explicit format specification will read it as "two thousand twenty-six, June first" — year first, then month, then day.

If you're building an app that serves both Azure and ElevenLabs backends, you can't assume identical date formatting behavior.

You absolutely cannot. And that's the portability trap. Only the core W3C tags are portable. Extended tags like ElevenLabs' emotion or Azure's mstts colon express-as are vendor-specific and break silently on other platforms.
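One way to sidestep the date-reading differences is to be explicit about format, or to spell the date out in application code before it reaches any engine. The sketch below shows both options; the function names are hypothetical, and how each engine actually reads the resulting text still needs to be checked by ear.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def say_as_date(iso, fmt="ymd"):
    """Wrap an ISO date in say-as with an explicit format attribute."""
    date.fromisoformat(iso)  # raises ValueError on malformed input
    return f'<say-as interpret-as="date" format="{fmt}">{iso}</say-as>'


def spoken_date(iso):
    """Spell the date out as plain text so every backend gets the same words."""
    d = date.fromisoformat(iso)
    return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"
```

The plain-text route gives up the tag entirely, which also makes it portable to backends with no say-as support.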

Which is one of those things that works perfectly in development and then explodes when you try to migrate providers.

Or when your client decides to switch TTS vendors six months after launch. I've seen production pipelines where the SSML was tuned for Polly, and when they moved to ElevenLabs, all the core prosody tags worked fine but the Polly-specific extension tags were silently dropped because ElevenLabs doesn't support them. The audio generated without errors; it just sounded flat, and they couldn't figure out why for two weeks.

The silent failure mode. That's the dark side of ElevenLabs being forgiving about unsupported tags.

It's a real production concern. If Azure encounters a tag it doesn't understand, it throws an error and refuses to generate. That's annoying during development but valuable in production because you catch problems immediately. ElevenLabs' approach of silently ignoring unknown tags means you need a separate validation step in your pipeline.

You're saying the forgiving parser is actually a footgun in disguise.

It's a footgun wrapped in developer-friendly packaging. The solution is to validate your SSML against ElevenLabs' documented tag list before deployment. There are open-source validators that do this — you feed them your SSML and they flag any tags that won't be processed.
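A pre-deployment check of this kind can be quite small. The sketch below parses the document and reports any tag that is not on an allowlist. The allowlist here is a placeholder, not a provider's real tag list, and should be filled in from the documentation of whichever engine you target.

```python
import xml.etree.ElementTree as ET

# Placeholder allowlist for illustration only; replace it with the tag
# list from your provider's documentation.
SUPPORTED_TAGS = frozenset(
    {"speak", "prosody", "break", "emphasis", "voice", "phoneme", "sub", "say-as"}
)


def unsupported_tags(ssml, supported=SUPPORTED_TAGS):
    """Return a sorted list of tags in the document that are not allowlisted."""
    root = ET.fromstring(ssml)  # raises ET.ParseError on malformed XML
    return sorted({el.tag for el in root.iter() if el.tag not in supported})
```

Running this in CI and failing the build on a non-empty result turns a silent drop into a loud error before the audio is ever generated.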

What about the expletive attribute in say-as? That seems oddly specific.

It's niche but useful for content moderation pipelines. If you set say-as interpret-as equals expletive on a word, ElevenLabs inserts a bleep sound instead of speaking the word. This is useful for generating clean versions of content that might contain profanity — you don't need to pre-process the text, you just tag the offending words and the engine handles the bleep.

It's built-in broadcast compliance.

In a single tag. And the bleep duration automatically matches the syllable count of the original word, so the pacing stays natural. It's a small feature, but it saves you from having to maintain a separate audio bleep asset and stitch it into the output manually.

What about the audio tag for embedding pre-recorded clips?

The audio tag lets you insert external audio files into the TTS output. ElevenLabs supports this for sound effects, brand jingles, or pre-recorded segments. The tag takes a src attribute pointing to a publicly accessible audio file URL. The engine generates the TTS audio, then splices in the external clip at the specified point.

What does that do to latency?

It increases latency by the duration of the embedded clip plus the fetch time for the external asset. If you're embedding a three-second jingle, your API call takes at least three seconds longer. For real-time applications like voice assistants, this can be a problem. For batch generation like audiobook production, it's negligible.

It's fine for offline workflows and borderline unusable for anything interactive.

Unless you pre-cache the audio assets on ElevenLabs' side, which they're reportedly working on. But as of now, yes — audio embedding is a batch-processing feature, not a real-time feature.

Let's talk about the cost implications of heavy SSML usage. Tags count toward character limits, right?

They do, and this catches people off guard. ElevenLabs' pricing model counts every character in the SSML document — including the tags themselves — toward your character quota. So a sentence that's one hundred characters of text might become two hundred characters with SSML markup. If you're generating at scale, those tags double your cost.

You're paying for the markup literally.

And it's not just ElevenLabs. Some other providers bill markup characters too, though the rules vary, so check each provider's pricing documentation. Either way, there's an economic incentive to be judicious with your tagging. Every emotion tag and break tag and phoneme tag is costing you money. You want to use exactly the markup you need and no more.
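To see how much of a document is markup, you can compare the raw character count with the spoken text. This sketch assumes billing is by raw document length, as described above; actual billing rules vary by provider, and the function name is made up.

```python
import xml.etree.ElementTree as ET


def markup_overhead(ssml):
    """Compare raw document length with the length of the spoken text."""
    root = ET.fromstring(ssml)
    spoken = "".join(root.itertext())  # text content with all tags removed
    total = len(ssml)
    return {
        "total_chars": total,
        "spoken_chars": len(spoken),
        "markup_chars": total - len(spoken),
        "multiplier": total / len(spoken) if spoken else float("inf"),
    }
```

A multiplier near 2.0 matches the "one hundred characters becomes two hundred" case from the conversation.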

Which circles back to your point about over-tagging being the biggest mistake. It's not just an aesthetic problem — it's a cost problem.

And a latency problem. Each tag adds processing overhead. The SSML parser has to parse the XML, validate it, extract the annotations, and condition the model accordingly. A heavily tagged document can add hundreds of milliseconds to your generation time compared to plain text.

There's a triangle of tradeoffs — expressiveness versus cost versus latency. You can have two but not all three.

That's the TTS developer's dilemma in a nutshell. And SSML is the control surface where you make those tradeoffs explicit. If you're building a real-time voice agent, you might strip SSML down to just the essential break and phoneme tags. If you're producing a premium audiobook, you go all in on emotion and contour because latency doesn't matter and the quality improvement justifies the cost.
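For the real-time end of that tradeoff, one option is to keep a richly tagged master document and strip it down on the way out. The sketch below unwraps every tag that is not in a keep set while preserving the spoken text; the keep set and function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

ESSENTIAL = frozenset({"speak", "break", "phoneme"})


def _append_text(parent, text):
    """Append text after the last child of parent, or to parent.text."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _copy_filtered(src, dst, keep):
    """Copy src's content into dst, unwrapping tags that are not kept."""
    _append_text(dst, src.text)
    for child in src:
        if child.tag in keep:
            kept = ET.SubElement(dst, child.tag, dict(child.attrib))
            _copy_filtered(child, kept, keep)
        else:
            # Drop the tag itself but keep the text and children inside it.
            _copy_filtered(child, dst, keep)
        _append_text(dst, child.tail)


def strip_to_essentials(ssml, keep=ESSENTIAL):
    """Return the document with only the root and the tags in `keep`."""
    src = ET.fromstring(ssml)
    dst = ET.Element(src.tag, dict(src.attrib))
    _copy_filtered(src, dst, keep)
    return ET.tostring(dst, encoding="unicode")
```

The same master document can then feed both the audiobook pipeline unchanged and the voice agent through strip_to_essentials.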

Let me ask you something about the future. ElevenLabs has supposedly been floating an SSML two-point-oh proposal in developer forums — leaked in April. What's actually in it?

The leaked proposal is fascinating. It suggests adding conditional branching and variable binding to SSML. Conditional branching would let you write logic like "if the text contains a question mark, apply rising intonation" directly in the markup. Variable binding would let you define reusable prosody profiles, like "define this contour as 'dramatic pause' and apply it by reference."

They're turning SSML from a declarative annotation layer into a lightweight scripting language.

That's the direction. And it makes sense for interactive voice applications. Imagine a voice agent that adjusts its speaking style based on user sentiment — if the user sounds frustrated, the agent switches to a calmer, slower prosody profile. Right now, you'd need application logic to swap SSML templates. With conditional SSML, the markup itself could respond to input variables.

That's either brilliant or a terrible idea, and I'm not sure which.

It's both. The power is obvious — dynamic, context-aware speech generation without complex application code. The risk is that you're embedding logic in a markup language, which is exactly the kind of thing that made XSLT such a nightmare to debug. Markup languages aren't designed for control flow, and once you add it, the complexity explodes.

SSML could become the JavaScript of speech synthesis — incredibly capable, incredibly easy to abuse.

That's a perfect analogy. And just like JavaScript, the key will be restraint. The features exist for the cases where you need them, not as an invitation to script every intonation curve.
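The application-logic approach that works today, swapping prosody profiles based on user sentiment, can be sketched as follows. The profile names and values are hypothetical, the rate and pitch labels are standard W3C prosody values, and the sentiment label is assumed to come from some upstream classifier that is out of scope here.

```python
from xml.sax.saxutils import escape

# Hypothetical profiles; tune the values by listening to real output.
PROSODY_PROFILES = {
    "frustrated": {"rate": "slow", "pitch": "low"},
    "neutral": {"rate": "medium", "pitch": "medium"},
}


def render_reply(text, user_sentiment):
    """Wrap a reply in the prosody profile chosen for the user's sentiment."""
    # Unknown sentiment labels fall back to the neutral profile.
    profile = PROSODY_PROFILES.get(user_sentiment, PROSODY_PROFILES["neutral"])
    attrs = " ".join(f'{name}="{value}"' for name, value in profile.items())
    return f"<speak><prosody {attrs}>{escape(text)}</prosody></speak>"
```

Keeping the branching in ordinary code like this leaves the markup declarative, which avoids the debugging concerns raised about embedding control flow in SSML itself.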

All of this is powerful, but only if you use it correctly. Let's distill this into three actionable rules of thumb.

Rule one: start with prosody rate and break. These two tags deliver about eighty percent of the expressiveness improvement with minimal complexity. Rate lets you control pacing: slow for emphasis, fast for excitement. Break lets you insert meaningful pauses. Master these before touching emotion or contour.

And "master" means listening to the output, not just assuming the tags worked.

Which brings us to rule two: use phoneme for domain-specific vocabulary and sub for brand names. Never rely on the TTS engine's default pronunciation for critical terms. If your app mentions a drug name, a medical condition, a technical acronym, or a brand — tag it. The default pronunciation will be wrong often enough to matter.

Rule three: validate your SSML before deploying. ElevenLabs' silent error handling means malformed tags produce no warning, just bad audio. Run your SSML through a validator that checks against ElevenLabs' documented tag list. Catch the silent failures before your users do.

I'd add a practical step: download both the W3C SSML one-point-one spec and ElevenLabs' SSML documentation, then create a test script that exercises every supported tag. Not just the ones you think you'll use — all of them. Understanding the behavior boundaries of each tag will save you hours of debugging later.

The W3C spec is surprisingly readable for a standards document. It's about sixty pages, and the core tags are explained with examples.

ElevenLabs' documentation is good — they provide copy-paste-ready SSML snippets for common use cases. Between those two resources, you can go from zero to production-ready SSML in an afternoon.

Let's address a couple of misconceptions before we wrap. The first one is that SSML is only for robotic, old-school TTS engines and modern neural models don't need it.

That's completely wrong, and I hear it all the time. Even the best neural models benefit from explicit prosody cues, especially for long-form content. The model doesn't know your intent — it's making statistical guesses based on training data. SSML is how you communicate intent. For a single sentence, the difference might be subtle. For a thirty-minute narration, the difference between tagged and untagged is the difference between professional and amateur.

The second misconception is that more tags always mean better output.

Over-tagging creates unnatural artifacts. The neural network's own prosody model is often better than manual overrides for subtle expression. If you tag every pause and every pitch change, you get something that sounds technically precise but emotionally flat — like a musician who's so focused on hitting the right notes that they forget to play music.

The third: SSML is portable across providers.

Only the core W3C tags are portable. The emotion tag, Azure's express-as, and other vendor extensions will break silently or throw errors on other platforms. If portability matters to your application, stick to the W3C subset and test on every provider you might use.

Which brings us to the open question: as TTS models become more expressive natively, does SSML become obsolete, or does it evolve into something higher-level?

I think it evolves. The direction we're already seeing — from acoustic parameters like pitch and rate toward semantic labels like emotion and intent — suggests that SSML is becoming less about micromanaging sound waves and more about describing the communicative goal. The leaked SSML two-point-oh proposal with conditional branching points toward a future where SSML is a lightweight scripting language for emotional arcs, not just a set of acoustic annotations.

Instead of specifying "pitch plus twenty percent," you specify "this is the tense moment before the reveal," and the model handles the acoustic realization.

And that's a much more natural interface for creators. Most authors don't think in hertz and milliseconds — they think in scenes and emotions and pacing. SSML's evolution is about bridging that gap between creative intent and acoustic output.

Here's my challenge to listeners: build a single SSML document that uses at least five different tags — prosody, break, emotion, phoneme or sub, and say-as. Generate the output, then compare it to the same text with no markup. The difference will change how you think about TTS.

If you want a concrete starting point, take a paragraph from a book you like — something with dialogue and description — and mark it up for audio. Tag the character voices, add pauses for dramatic effect, adjust the pacing for action versus reflection. It's about thirty minutes of work, and what you learn will apply to every TTS project you do afterward.

Now: Hilbert's daily fun fact.

Hilbert: In nineteen twelve, a Portuguese newspaper in the Azores reported that a local fisherman caught a giant squid measuring exactly fifty-two feet in length — a record that stood for decades, though the original photograph mysteriously vanished from the archives two weeks after publication, and the newspaper itself folded within the year.

The squid, the photo, and the newspaper all disappeared.

The Azores: where giant squid go to retire and evidence goes to die.

Thanks to our producer Hilbert Flumingtop for that deeply suspicious fun fact. This has been My Weird Prompts. Find us at myweirdprompts dot com, or search for us on Spotify. Try the five-tag SSML challenge — we'd love to hear what you build.

Until next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
