#2326: Voice Control Simplified: Home Assistant’s Local Stack

Discover how to build a reliable, vendor-agnostic voice control system for Home Assistant without relying on Amazon or Google.

Episode Details

Episode ID: MWP-2484
Published:
Duration: 24:10
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: Claude Sonnet 4.6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Building Reliable Voice Control with Home Assistant

Voice control in Home Assistant has long been a challenge for users who want a seamless experience without relying on Amazon or Google ecosystems. Recent advancements, however, have made it easier than ever to build a reliable, vendor-agnostic system that works locally. Here’s how to do it.

The Four Layers of Voice Control

Voice control isn’t a single technology—it’s a stack of four interconnected layers: wake word detection, speech-to-text (STT), intent recognition, and text-to-speech (TTS). Each layer has its own hardware and software requirements, and Home Assistant’s Wyoming protocol connects them all locally, eliminating the need for cloud dependencies.
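The four layers above can be sketched as a simple chain. This is a conceptual illustration only: none of the function names below are Home Assistant APIs, and each stub stands in for a real component (a wake word engine, Whisper or Speech-to-Phrase, the conversation agent, and Piper, respectively).

```python
# Toy sketch of the four-layer voice pipeline. Every name here is
# illustrative; the real components are separate services connected
# by the Wyoming protocol.

def detect_wake_word(audio_frame: bytes) -> bool:
    """Always-on listener: fires when the trigger phrase is heard."""
    return audio_frame.startswith(b"WAKE")  # placeholder heuristic

def speech_to_text(audio: bytes) -> str:
    """STT layer: converts captured speech to text."""
    return audio.decode("utf-8", errors="ignore")

def recognize_intent(text: str) -> dict:
    """Intent layer: maps text to a structured action."""
    if "lights" in text and "off" in text:
        return {"action": "light.turn_off", "area": "kitchen"}
    return {"action": "unknown"}

def text_to_speech(response: str) -> bytes:
    """TTS layer: renders the confirmation as audio."""
    return response.encode("utf-8")

def pipeline(audio_frame: bytes, utterance: bytes) -> bytes:
    if not detect_wake_word(audio_frame):
        return b""  # stay silent until the wake word fires
    intent = recognize_intent(speech_to_text(utterance))
    if intent["action"] == "light.turn_off":
        return text_to_speech(f"Turning off the {intent['area']} lights")
    return text_to_speech("Sorry, I didn't catch that")
```

The point of the sketch is the failure surface: any one of the four stages can reject the input, which is why a voice stack breaks in places a sealed commercial device never lets you see.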

Solving the Alias Problem

In the past, Home Assistant users had to manually create aliases for every possible phrasing of a command, which was time-consuming and frustrating. The introduction of an AI semantic layer has streamlined this process. Now, commands like “kill the lights in the kitchen” are automatically mapped to the correct entities without requiring extensive alias lists.
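The difference between the old alias approach and semantic mapping can be made concrete with a toy example. The real AI layer uses a language model; the synonym table and entity names below are invented purely to illustrate the idea.

```python
# Toy contrast with exact-alias matching: instead of pre-registering
# every phrasing, score entities by keyword overlap. Entity IDs and
# the verb table are made up for illustration.

ENTITIES = {
    "light.kitchen_ceiling": {"kitchen", "ceiling", "light"},
    "light.kitchen_cabinet": {"kitchen", "cabinet", "strip", "light"},
    "light.office_ceiling": {"office", "ceiling", "light"},
}

OFF_VERBS = {"off", "kill", "out"}  # phrasings that all mean "turn off"

def semantic_match(command: str) -> list[str]:
    """Return every entity whose location keywords appear in the command."""
    words = set(command.lower().split())
    if not (words & OFF_VERBS):
        return []
    return sorted(
        eid for eid, keys in ENTITIES.items()
        if (words & keys) - {"light"}  # require a room/location match
    )
```

With exact aliases, "kill the lights in the kitchen" fails unless someone registered that exact string; here it resolves to both kitchen lights with no alias list at all.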

Hardware Considerations

The Raspberry Pi 5 is the current recommendation for running a local voice control stack. Its improved processing power reduces latency to under a second, making the system feel responsive rather than sluggish. For wake word detection and voice capture, the reSpeaker 2-Mics Pi HAT from Seeed Studio is ideal, offering dual microphones and beamforming for far-field pickup. A basic USB speaker suffices for TTS output.

Choosing the Right Software

For STT, Speech-to-Phrase is recommended for predictable commands like turning lights on or off. It’s faster and lighter than Whisper, which is better suited for open-ended natural language. On the TTS side, Piper provides neutral, non-robotic voice output that’s easy to understand.

Staying Vendor-Agnostic

One of the biggest advantages of this setup is its independence from big tech ecosystems. If Amazon or Google discontinues a device or changes an API, your system remains unaffected. This local approach ensures long-term reliability and privacy.

By following this guide, you can build a voice control system for Home Assistant that’s both reliable and customizable, without the compromises of cloud-based solutions.


#2326: Voice Control Simplified: Home Assistant’s Local Stack

Corn
Daniel sent us this one, and honestly it's a question a lot of people hit eventually. He's been trying to get voice control working in Home Assistant, specifically now that Ezra's old enough to be moving around the house and hands-free control actually matters. He used Assist, it was frustrating, commands needed repeating, the alias approach was a time sink. Home Assistant has since added an AI semantic layer which helps, but the underlying stack still needs hardware: a microphone, a speaker, wake word detection, speech-to-text, text-to-speech. He wants to know the simplest vendor-agnostic path, whether a Raspberry Pi with a generic speaker and mic is viable, and whether there are pre-made combos worth considering that don't lock you into someone's ecosystem.
Herman
That word "ecosystem" is doing a lot of work there, because the reason people end up frustrated with voice control in Home Assistant isn't usually the software, it's that they've inherited assumptions from the Amazon and Google model, where the whole thing is a sealed box. You say a word, a server somewhere processes it, a response comes back. The box handles everything and you never see the seams. Home Assistant is the opposite of that. Every seam is visible, which is great when you want to customize and terrible when you just want it to work.
Corn
Which is exactly Daniel's situation. He doesn't want to spend a weekend soldering. He has a toddler. He wants to say "turn off the kitchen lights" once, not four times.
Herman
Right, and the good news is that the gap between "this requires a weekend of tinkering" and "this mostly just works" has closed significantly. Home Assistant rolled out proper AI integration for the voice pipeline at the start of this year, which changed the calculus on the software side considerably. The semantic layer means you're no longer hand-writing aliases for every entity variation. You say "kill the lights in the kitchen" and the system figures out you mean the kitchen ceiling light, the under-cabinet strip, whatever you've got in there.
Corn
By the way, today's episode is powered by Claude Sonnet four point six.
Herman
Which, if you think about it, is not entirely unrelated to what we're about to discuss. Language models doing useful work in the background.
Corn
So the alias problem was real. I remember when that was the recommended approach and it was genuinely painful. You'd have to anticipate every possible phrasing a person might use and pre-register it. Which is not how humans talk, especially not how a child talks.
Herman
It's not how anyone talks. And the failure mode, to be blunt about it, was that the system would just reject anything it didn't recognize. No graceful degradation. You'd get silence, or a "device not found" response, and then you'd say it again slightly differently, and again, and by the third attempt you've just walked over and hit the switch. Which defeats the entire point.
Corn
I actually went through a version of this myself. I had an entity called "office ceiling light" and I kept saying "office light" and it would just fail silently. So I added "office light" as an alias. Then I said "the light in the office" and that failed too. I think I ended up with eleven aliases for one bulb before I gave up and just used the app. That's the tax the old system imposed on you.
Herman
It scales terribly. One bulb with eleven aliases is annoying. A house with thirty entities is completely unmanageable. You'd spend more time maintaining the alias list than you'd ever save in convenience. The AI layer solves that at the root rather than asking you to patch around it.
Corn
The AI layer fixes the recognition problem. But Daniel's question is really about the hardware side of things, because even with perfect intent recognition, you still need something in the room that can hear you and talk back.
Herman
Which is where it gets interesting, because there are more options now than people realize, and the Raspberry Pi path is more viable than it used to be.
Corn
Let's actually name the stack, because I think that's where a lot of people get lost. They think of voice control as one thing, when it's actually four separate problems chained together.
Herman
Which is exactly why it breaks in unpredictable places. You've got wake word detection, that's the always-on listener waiting for the trigger phrase. Then speech-to-text converts what you actually said into text. Then a conversation engine interprets that text and figures out what action to take. Then text-to-speech turns the response back into audio. Four layers, four potential points of failure, and each one has its own hardware and software requirements.
Corn
They all have to talk to each other. Which in Home Assistant's case is handled by the Wyoming protocol, their open standard for connecting voice satellites. Worth knowing that name because you'll see it everywhere in the documentation.
Herman
Right, Wyoming is basically the plumbing. It's what lets a Raspberry Pi sitting in your kitchen communicate with the Home Assistant instance running on your server, or your NUC, or wherever you've got it. Local, no cloud dependency, which is the whole point if you're trying to avoid the Amazon and Google model.
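The "plumbing" Herman describes is a simple framing scheme: each event is a line of JSON, optionally followed by a binary payload such as raw audio. The sketch below shows the idea only; the event names and fields are simplified, so consult the Wyoming protocol documentation for the real schema.

```python
# Schematic sketch of Wyoming-style event framing: a JSON header line,
# then an optional binary payload. Simplified for illustration; not
# the exact wire format.
import json

def encode_event(event_type: str, data: dict, payload: bytes = b"") -> bytes:
    header = {"type": event_type, "data": data}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode() + b"\n" + payload

def decode_event(raw: bytes) -> tuple[dict, bytes]:
    line, _, rest = raw.partition(b"\n")
    header = json.loads(line)
    return header, rest[: header.get("payload_length", 0)]
```

Because the framing is this plain, any device on your network that speaks it can act as a voice satellite, which is what makes the satellites interchangeable commodity hardware.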
Corn
The vendor-agnostic angle matters here beyond just privacy. It's about not being held hostage to someone else's product decisions. Amazon discontinues a device, changes an API, decides your Echo Dot is no longer supported, and suddenly your smart home has a hole in it.
Herman
Three years ago the Alexa integration for Home Assistant broke twice in one quarter because Amazon changed something on their end. That's the tax you pay for convenience. The local stack doesn't have that problem. Nobody can reach into your network and deprecate your microphone.
Corn
It's not hypothetical either. Google killed Google Home routines integration with a bunch of third-party platforms with about six weeks notice. People had automations they'd built over years that just stopped working. If your whole voice layer is sitting on top of someone else's API, that's the risk you're carrying.
Herman
The local stack doesn't eliminate all risk—hardware fails, software has bugs—but the failure modes are ones you can see and fix yourself. You're not waiting for a company to decide your use case is worth their engineering time.
Corn
The hardware question is really the remaining piece. And that's where Daniel's asking: is a Raspberry Pi with something generic actually enough to make this reliable?
Herman
It is—with some caveats. The Raspberry Pi 5 is meaningfully better than the 4 for this use case, partly because of raw processing headroom, but more importantly because wake word detection has gotten lighter and faster, and the Pi 5 handles the full pipeline without the latency spikes you used to see.
Corn
What does that latency difference actually look like in practice? Because I think that's where people gave up on the Pi path a few years ago.
Herman
On a Pi 4 running a heavier speech-to-text model, you could be looking at two to four seconds between finishing your sentence and getting a response. Which is long enough to feel broken. On a Pi 5 with Speech-to-Phrase, which is the lightweight STT option Home Assistant now recommends for lower-powered hardware, you're down to under a second. That's the threshold where it stops feeling like a system and starts feeling like a conversation.
Corn
There's actually a useful analogy here. Human conversational response latency—the gap before someone replies in a normal exchange—is around two hundred milliseconds. Anything under about eight hundred milliseconds still feels responsive. Once you cross a second, the brain starts registering it as a delay rather than a pause. Two to four seconds is firmly in "something is wrong" territory. So that Pi 5 improvement isn't just a spec sheet number, it's crossing a perceptual threshold.
Herman
Which is exactly why the older Pi path got a bad reputation. The hardware was technically capable, but it was operating in the range where the latency itself became the user experience. People weren't wrong to give up on it. They were right to give up on it at the time.
Corn
Speech-to-Phrase versus Whisper. That's the core tradeoff on the STT side, right?
Herman
Whisper, specifically faster-whisper which is the accelerated version, handles open-ended natural language. You can say anything and it tries to transcribe it. Speech-to-Phrase is the opposite. It only recognizes phrases that map to known Home Assistant intents. So it's faster and lighter, but if you say something outside its vocabulary it just fails. For a household where the commands are pretty predictable, turn on the lights, set the thermostat, lock the front door, Speech-to-Phrase is probably the right call.
Corn
Which actually suits the use case Daniel described. He's not trying to have a philosophical conversation with his ceiling. He wants lights off in the kitchen.
Herman
For text-to-speech on the output side, Piper is the current standard recommendation. It runs locally, it's reasonably fast, and the voice quality has improved to the point where it doesn't sound like a telephone menu from the early two thousands.
Corn
That's actually an underrated part of the experience. If the response voice sounds robotic and grating, you start dreading the confirmation and eventually you stop using the system. Piper is at the point where it's just a neutral voice that tells you what happened. It's not going to win any awards but it doesn't make you wince either.
Herman
The voice quality bar for this use case is pretty low, honestly. You're not asking it to read audiobooks. You need it to say "turning off the kitchen lights" in a way that doesn't make you feel like you're interacting with a 2003 GPS unit.
Corn
Now the microphone question. Because a generic USB microphone and the reSpeaker HAT are not the same thing, and I think people underestimate how much the capture hardware matters.
Herman
The reSpeaker 2-Mics Pi HAT from Seeed Studio is the most commonly recommended purpose-built option. It sits directly on the Pi's GPIO pins, it's designed for voice capture specifically, dual microphones, and it handles far-field pickup reasonably well. The alternative is a USB microphone, which works, but you're doing more work in software to compensate for what the hardware isn't doing.
Corn
How far out are we actually talking for far-field? Because "far-field" can mean anything from three feet to across the room.
Herman
Practically, the reSpeaker handles five to six feet reliably in a normal room. Across a large open kitchen with an extractor fan running, you're going to have a harder time regardless of what microphone you use. But the dual-mic array does beamforming, which means it's actively trying to isolate the direction the voice is coming from and suppress noise from other directions. A single USB microphone is omnidirectional, it picks up everything equally, and then you're relying entirely on the wake word detection software to filter out the refrigerator hum.
Corn
The speaker side?
Herman
A basic powered USB speaker is fine for Piper's output. The Home Assistant Voice Preview Edition, which runs about sixty dollars, bundles the speaker and feedback LED into one unit, but that's their own hardware. If you want fully generic, a small powered speaker over USB audio does the job; note that the Pi 5 dropped the 3.5mm headphone jack, so you'd use a USB audio adapter or the HAT's own speaker output.
Corn
The practical build is: Pi 5, reSpeaker HAT, a small powered speaker, Speech-to-Phrase for STT, Piper for TTS, Wyoming protocol connecting it back to your Home Assistant instance.
Herman
That's the stack. And none of those components are owned by Amazon or Google. If Seeed Studio disappears tomorrow, your reSpeaker still works. That's the vendor-agnostic property Daniel was asking for. But honestly, the bigger challenge isn't building it—it's living with it.
Corn
The part that doesn't get enough attention is what happens after you've built it. Because the build is one afternoon, but the thing lives in your house for years.
Herman
That's where the vendor-agnostic choice starts paying compounding dividends. With an Amazon Echo, you get a smooth setup and then you're on Amazon's schedule for everything. Feature updates, deprecations, price increases on the skills you rely on. With the local stack, the device you built in April is functionally identical in three years unless you choose to change something.
Corn
There's a security dimension to this that I don't think gets discussed enough in the Home Assistant community. When you're running Whisper or Speech-to-Phrase locally, your audio never leaves your network. With Alexa, every utterance after the wake word is transmitted to Amazon's servers for processing. That's not a conspiracy theory, that's just how the architecture works.
Herman
For a household with a young child, that's not a trivial consideration. You're not just capturing "turn off the kitchen lights." Over time you're capturing the ambient texture of your home life. What time people wake up, what rooms they're in, conversational fragments. Amazon's privacy policy covers how they handle that data, but the simplest privacy guarantee is data that never leaves your house in the first place.
Corn
Which is something you actually can't buy from the vendor-locked side. You can pay more, you can buy the premium tier, but you can't purchase local processing from Amazon. It's not on offer.
Herman
The energy angle is smaller but worth mentioning. A Pi 5 draws somewhere around five to eight watts under typical load. An Amazon Echo Dot draws about three watts idle. So the power consumption is comparable, but the Pi is doing everything locally. If you scale that to a house with four or five voice satellites, you're still talking about under fifty watts total for the whole voice infrastructure. That's not nothing, but it's also not a meaningful line item on your electricity bill.
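The back-of-envelope math on that is worth seeing. The 8 W figure is the high end of the draw Herman quotes, and the $0.15/kWh electricity price is an assumption for illustration.

```python
# Annual energy cost for a fleet of Pi-based voice satellites.
# 8 W per satellite is the high end of the quoted draw; the
# $0.15/kWh rate is an assumed electricity price.

watts_per_satellite = 8
satellites = 5
hours_per_year = 24 * 365  # 8760

kwh_per_year = watts_per_satellite * satellites * hours_per_year / 1000
cost_per_year = kwh_per_year * 0.15

print(round(kwh_per_year, 1))   # 350.4 kWh
print(round(cost_per_year, 2))  # about $52.56 per year
```

So even the worst-case draw for five satellites lands at roughly a dollar a week, which is the sense in which it's "not a meaningful line item."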
Corn
The bigger energy win is actually what the voice control enables downstream. Lights that reliably respond to voice commands get turned off more consistently. Heating zones that you can adjust without walking to a panel get adjusted more often. The voice interface lowers the friction on the energy-saving behaviors.
Herman
There's a study I keep thinking about on this, from the smart building literature, where the single strongest predictor of whether occupants actually use energy-saving features isn't the feature itself, it's the number of steps required to activate it. You go from three steps to one step and compliance roughly doubles. Voice is the extreme version of that reduction.
Corn
Which for Daniel specifically, with Ezra in the picture, has a very practical shape. A two-year-old cannot operate a light switch reliably. A two-year-old can be in a room and you can say "lights off" without putting down whatever you're holding.
Herman
The system doesn't care that you said it while holding a child and your voice was slightly muffled. That's where the reSpeaker's far-field pickup earns its place. A single USB microphone pointed at a wall is going to struggle with that scenario.
Corn
How does this compare to just buying an Echo and pointing it at the Home Assistant integration? Because that path exists and some people do take it.
Herman
It works, and I don't want to be dismissive of it, but you're accepting a dependency that can break on Amazon's schedule. The Home Assistant Alexa integration has broken historically when Amazon updates their API. You're also accepting that Alexa's natural language processing is doing the intent parsing, which means the behavior of your smart home is partly determined by how Amazon trains their models. That's a subtle form of lock-in that people don't notice until something changes.
Corn
There's something slightly odd about the architecture of it, right? Your voice goes to Amazon's servers, gets processed, comes back to your house, and then Home Assistant acts on it. The command is traveling thousands of miles to flip a light switch ten feet away.
Herman
Which also means if your internet is down, your voice control is down. The local stack doesn't have that dependency. The Wyoming protocol is communicating entirely within your network. You could pull the ethernet cable from your router and your voice satellites would still work.
Corn
The local stack, once it's running, has no external dependencies by definition.
Herman
Which is the thing that actually makes it more reliable long-term, even if the initial setup is more involved. The ceiling on reliability for a cloud-dependent system is set by the cloud provider. The ceiling on a local system is set by your hardware and your configuration, both of which you control.
Corn
For someone like Daniel who has already been through one frustrating cycle with Assist and aliases, the pitch is: the frustration you experienced was real, it was a legitimate limitation of the older approach, and the current stack is meaningfully different in ways that address the specific failure modes you hit.
Herman
Specifically the recognition failures. The combination of the AI semantic layer and Speech-to-Phrase means the system is much more tolerant of natural phrasing variation. You're not writing aliases anymore. You're describing your intent in ordinary language and the system is doing the work of mapping that to an action.
Corn
The hardware is finally at a price point where the vendor-agnostic path doesn't require a significant premium over just buying an Echo. A Pi 5 with the reSpeaker HAT and a small speaker is in the same ballpark as a mid-range Echo device, and you own all of it outright.
Herman
The sixty-dollar Home Assistant Voice Preview Edition is actually the interesting middle option there. It's their own hardware, so there's some degree of Home Assistant ecosystem dependency, but it's not Amazon or Google, it's an open-source project, and the underlying protocols are all open. If the Home Assistant project ceased to exist tomorrow, which is not a realistic scenario, but if it did, you'd still have a device running Wyoming protocol that you could point at a fork.
Corn
That's a meaningfully different kind of lock-in than being tied to Amazon's commercial interests.
Herman
Entirely different in kind. One is a community project with open protocols. The other is a revenue line for a trillion-dollar company. Those are not equivalent risks—and that's exactly why figuring out where to start matters.
Corn
If someone is sitting with this and thinking, where do I actually start, what's the honest answer?
Herman
Raspberry Pi 5 and the reSpeaker 2-Mics HAT. That's the foundation. You're looking at roughly eighty to ninety dollars for those two components combined, and they give you a platform that handles the hardest part of the problem, which is clean voice capture in a real room with ambient noise.
Corn
Then Speech-to-Phrase for STT if the command vocabulary is predictable, Piper for TTS, Wyoming protocol to connect it back to Home Assistant. That's the stack.
Herman
The other thing to do immediately is go into Home Assistant's settings, under Voice Assistants, and make sure the AI conversation agent is active rather than the old rules-based one. That's the change that removes the alias problem Daniel was running into. The system stops requiring exact phrase matches and starts understanding intent.
Corn
Which is a software configuration step, not a hardware purchase. You can do that today on whatever setup you already have and see immediate improvement.
Herman
The AI semantic layer doesn't require new hardware. It requires enabling the right conversation backend. For anyone already running Home Assistant who has been frustrated by recognition failures, that's the first thing to try before buying anything.
Corn
The vendor-agnostic principle throughout this. Nothing in that stack requires a relationship with Amazon or Google. The reSpeaker is a Seeed Studio product, Piper and Speech-to-Phrase are open source, Wyoming is Home Assistant's own open protocol.
Herman
The practical implication of that is you're not exposed to someone else's business decisions. The components you assemble today will still be working in the same way in three years because nothing external can change their behavior.
Corn
Leaves you in charge of your own ceiling.
Herman
Which is, honestly, the whole point of running Home Assistant in the first place—efficiency and accessibility.
Corn
Speaking of efficiency, the question that stays with me is where this goes when the models get smaller. Because right now, running Whisper at full capacity still wants a GPU. Speech-to-Phrase is fast precisely because it's constrained. But the gap between those two options has been closing steadily, and at some point you get open-ended natural language recognition running comfortably on a Pi without the tradeoff.
Herman
That's the trajectory that makes local voice exciting rather than just principled. Right now you're making a choice between speed and flexibility. In a few years that choice probably disappears, and you get both on commodity hardware. The models are shrinking faster than the hardware is improving, which means the crossover point arrives sooner than most people expect.
Corn
When that happens, the vendor-agnostic path doesn't just become viable, it becomes obviously superior. Because you'll have the same capability as the cloud systems with none of the dependency.
Herman
The other open question is multimodal. Voice is one input channel. The homes being built now have cameras, presence sensors, context about who's in which room. A system that combines all of that with voice could handle ambiguous commands in ways that pure voice recognition can't. "Turn off the lights" means something different at two in the afternoon than it does at midnight with a sleeping child in the next room.
Corn
Which is either useful or mildly unsettling depending on how you feel about your home knowing what time you go to bed.
Herman
The local processing argument applies even more strongly there. That level of contextual data absolutely should not be leaving your network.
Corn
There's a version of that which is already partially possible in Home Assistant today, right? You can set up presence detection per room with Bluetooth or mmWave sensors, and combine that with voice so that "turn off the lights" without a room qualifier defaults to wherever you're standing. It's not fully context-aware in the multimodal sense, but it's a step in that direction that doesn't require waiting for the models to catch up.
Herman
Which is a good example of the local stack's composability. You're assembling capabilities from open components and they can talk to each other because they're all running on the same platform. Adding a presence sensor to the voice logic is a Home Assistant automation. You're not waiting for Amazon to ship a feature. You just build it.
Corn
Big thanks to Hilbert Flumingtop for producing the show, and to Modal for keeping our infrastructure running without making us think about it. That's the best thing you can say about infrastructure.
Herman
If you want to dig into this further, everything we mentioned is documented in the Home Assistant Wyoming integration pages. And you can find all two thousand two hundred and forty episodes at myweirdprompts.
Corn
This has been My Weird Prompts. Leave us a review if the show has been useful to you, and we'll see you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.