Building a Voice Assistant That Never Phones Home
I wanted a voice assistant that actually worked — and that didn't send every word I said to a cloud server. So I built one from scratch on my homelab, and I gave it GLaDOS's voice.
The Stack
The entire pipeline runs across my homelab:
- Whisper (small-int8) for speech-to-text on sentinel
- Piper TTS (lessac-medium) for standard text-to-speech
- GLaDOS TTS for the voice I actually use
- OpenWakeWord listening for the wake word
- Home Assistant as the conversation router and home control layer
- Gemma 4 (9.6GB, running on forge via Ollama) as the language model with native function calling
When I say "turn on the kitchen lights," the audio flows microphone → wake word detection → Whisper → Home Assistant's conversation agent → Gemma 4 (intent parsing) → Home Assistant service call. The lights come on. All local. End-to-end latency is under 2 seconds.
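To make the hand-offs concrete, here is a minimal stdlib-only sketch of that stage chain. Every function name and stub body is mine, invented for illustration; the real stages are separate services talking over the network, not Python functions in one process.

```python
# Illustrative sketch of the local pipeline stages, each stubbed out.
# Function names and stub return values are hypothetical.

def detect_wake_word(audio: bytes) -> bool:
    """OpenWakeWord: gate the pipeline until the wake word fires."""
    return True  # stub

def transcribe(audio: bytes) -> str:
    """Whisper (small-int8) on sentinel: speech-to-text."""
    return "turn on the kitchen lights"  # stub

def parse_intent(text: str) -> dict:
    """Gemma on forge via Ollama: free text -> structured service call."""
    return {"service": "light.turn_on", "entity_id": "light.kitchen"}  # stub

def call_service(intent: dict) -> None:
    """Home Assistant executes the service call."""
    print(f"calling {intent['service']} on {intent['entity_id']}")

def handle_utterance(audio: bytes) -> None:
    if detect_wake_word(audio):
        call_service(parse_intent(transcribe(audio)))

handle_utterance(b"...")  # prints: calling light.turn_on on light.kitchen
```

The shape of the chain is the point: each stage consumes the previous stage's output, so any one component can be swapped (a different STT model, a different LLM) without touching the rest.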
Why Local Matters
Every commercial voice assistant sends your audio to cloud servers. My voice assistant fails when I unplug my Mac Mini. That's it. No service dependencies. No API deprecations. No forced firmware updates.
And I can give it whatever personality I want. GLaDOS delivering my morning briefing is objectively delightful.
The Function Calling Breakthrough
Gemma 4's native function calling made this work. Previous local models couldn't reliably translate "turn on the kitchen lights" into a structured Home Assistant service call. Gemma 4 handles it natively — no fine-tuning, no prompt hacking. Just a well-defined tool schema.
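The post doesn't show the actual schema, but a tool definition in the common JSON function-calling shape (the style Ollama's tools interface accepts) might look like this. The tool name, description, and parameters here are hypothetical:

```python
# Hypothetical tool schema exposing a Home Assistant light service
# to the model. Field layout follows the common JSON-Schema-based
# function-calling format; the specific tool is my invention.
turn_on_light_tool = {
    "type": "function",
    "function": {
        "name": "light_turn_on",
        "description": "Turn on a light in the house.",
        "parameters": {
            "type": "object",
            "properties": {
                "entity_id": {
                    "type": "string",
                    "description": "Home Assistant entity, e.g. light.kitchen",
                },
                "brightness_pct": {
                    "type": "integer",
                    "description": "Optional brightness, 0-100",
                },
            },
            "required": ["entity_id"],
        },
    },
}
```

When the model decides to call the tool, it emits something like `{"name": "light_turn_on", "arguments": {"entity_id": "light.kitchen"}}`, which maps one-to-one onto a Home Assistant service call. That structured output, rather than free text you have to regex apart, is what makes the approach reliable.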
At 27 tokens per second on Apple Silicon with the MLX backend, inference is fast enough for interactive use.
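A quick back-of-envelope check shows why that rate is enough. The response length is my assumption, not a measured figure:

```python
# Back-of-envelope latency for the generation step alone.
tokens_per_second = 27
typical_response_tokens = 30  # assumption: a short spoken confirmation

generation_seconds = typical_response_tokens / tokens_per_second
print(f"{generation_seconds:.1f}s")  # ~1.1s for generation
```

That leaves most of the 2-second budget for wake word detection, transcription, and speech synthesis.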
What's Next
Custom "ok GLaDOS" wake word training via microWakeWord. ESP32-S3-BOX-3 satellites for always-on listening in every room. And connecting the voice layer to Cortex — so when I ask "what did I work on yesterday?" it queries the memory graph and answers from actual context.
The voice assistant isn't the product — it's proof that local-first AI works for real-time interactive use. If the pipeline can handle voice commands at 2-second latency with no cloud, it can handle any agent workload. The hard part isn't the models. It's the infrastructure that connects them. That's what Grain Studios builds.