Building a Voice Assistant That Never Phones Home
I wanted a voice assistant that actually worked — and that didn't send every word I said to a cloud server. So I built one from scratch on my homelab, and I gave it GLaDOS's voice.
The Stack
The entire pipeline runs across my homelab:
- Whisper (small-int8) for speech-to-text on sentinel
- Piper TTS (lessac-medium) for standard text-to-speech
- GLaDOS TTS for the voice I actually use
- OpenWakeWord listening for the wake word
- Home Assistant as the conversation router and home control layer
- Gemma 4 (9.6GB, running on forge via Ollama) as the language model with native function calling
When I say "turn on the kitchen lights," the audio flows microphone → wake word detection → Whisper → Home Assistant's conversation agent → Gemma 4 (intent parsing) → Home Assistant service call. The lights come on. All local. End-to-end latency is under 2 seconds.
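To make the hand-offs concrete, here is a minimal stdlib-only sketch of that stage chain. Every function name and stub body is mine, invented for illustration; the real stages are separate services talking over the network, not Python functions in one process.

```python
# Illustrative sketch of the local pipeline stages, each stubbed out.
# Function names and stub return values are hypothetical.

def detect_wake_word(audio: bytes) -> bool:
    """OpenWakeWord: gate the pipeline until the wake word fires."""
    return True  # stub

def transcribe(audio: bytes) -> str:
    """Whisper (small-int8) on sentinel: speech-to-text."""
    return "turn on the kitchen lights"  # stub

def parse_intent(text: str) -> dict:
    """Gemma on forge via Ollama: free text -> structured service call."""
    return {"service": "light.turn_on", "entity_id": "light.kitchen"}  # stub

def call_service(intent: dict) -> None:
    """Home Assistant executes the service call."""
    print(f"calling {intent['service']} on {intent['entity_id']}")

def handle_utterance(audio: bytes) -> None:
    if detect_wake_word(audio):
        call_service(parse_intent(transcribe(audio)))

handle_utterance(b"...")  # prints: calling light.turn_on on light.kitchen
```

The shape of the chain is the point: each stage consumes the previous stage's output, so any one component can be swapped (a different STT model, a different LLM) without touching the rest.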
Why Local Matters
Every commercial voice assistant sends your audio to cloud servers. My voice assistant fails when I unplug my Mac Mini. That's it. No service dependencies. No API deprecations. No forced firmware updates.
And I can give it whatever personality I want. GLaDOS delivering my morning briefing is objectively delightful.
The Function Calling Breakthrough
Gemma 4's native function calling made this work. Previous local models couldn't reliably translate "turn on the kitchen lights" into a structured Home Assistant service call. Gemma 4 handles it natively — no fine-tuning, no prompt hacking. Just a well-defined tool schema.
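The post doesn't show the actual schema, but a tool definition in the common JSON function-calling shape (the style Ollama's tools interface accepts) might look like this. The tool name, description, and parameters here are hypothetical:

```python
# Hypothetical tool schema exposing a Home Assistant light service
# to the model. Field layout follows the common JSON-Schema-based
# function-calling format; the specific tool is my invention.
turn_on_light_tool = {
    "type": "function",
    "function": {
        "name": "light_turn_on",
        "description": "Turn on a light in the house.",
        "parameters": {
            "type": "object",
            "properties": {
                "entity_id": {
                    "type": "string",
                    "description": "Home Assistant entity, e.g. light.kitchen",
                },
                "brightness_pct": {
                    "type": "integer",
                    "description": "Optional brightness, 0-100",
                },
            },
            "required": ["entity_id"],
        },
    },
}
```

When the model decides to call the tool, it emits something like `{"name": "light_turn_on", "arguments": {"entity_id": "light.kitchen"}}`, which maps one-to-one onto a Home Assistant service call. That structured output, rather than free text you have to regex apart, is what makes the approach reliable.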
At 27 tokens per second on Apple Silicon with the MLX backend, inference is fast enough for interactive use.
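A quick back-of-envelope check shows why that rate is enough. The response length is my assumption, not a measured figure:

```python
# Back-of-envelope latency for the generation step alone.
tokens_per_second = 27
typical_response_tokens = 30  # assumption: a short spoken confirmation

generation_seconds = typical_response_tokens / tokens_per_second
print(f"{generation_seconds:.1f}s")  # ~1.1s for generation
```

That leaves most of the 2-second budget for wake word detection, transcription, and speech synthesis.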
What's Next
Custom "ok GLaDOS" wake word training via microWakeWord. ESP32-S3-BOX-3 satellites for always-on listening in every room. And connecting the voice layer to Cortex — so when I ask "what did I work on yesterday?" it queries the memory graph and answers from actual context.
The voice assistant isn't the product — it's proof that local-first AI works for real-time interactive use. If the pipeline can handle voice commands at 2-second latency with no cloud, it can handle any agent workload. The hard part isn't the models. It's the infrastructure that connects them. That's what Grain Studios builds.