Case Study
AI/Infrastructure

Grain Studios: Forge Agent Runtime

Queue-Based Autonomous Worker System

Role: Architect & Developer
Platform: Python, Redis, macOS
Industry: AI Infrastructure
Focus: Task Orchestration

Project Overview

I spent every summer, every other winter, and every spring break in Washington State. The Mt. St. Helens blast zone was my motorcycle playground — miles of volcanic terrain where nothing was supposed to grow back, and everything did. That's the energy here: build in harsh conditions, make it self-sustaining, don't look back.

Forge is an event-driven worker system for orchestrating AI tasks on an Apple Silicon (M4) Mac. Redis-backed queues coordinate GPU access, health monitoring, code reviews, research scanning, and overnight chain orchestration. Everything runs autonomously as 41 macOS LaunchAgents with self-healing capabilities.

The nightly chain runs a 20-step DAG, from GPU warmup through security review, code review, research scanning, ecosystem monitoring, knowledge consolidation, and morning briefing. Forge currently runs 27 specialized workers, and every one of them runs while I sleep.

Key Features

  • GPU Locking: A Redis SET NX EX mutex prevents concurrent GPU access, and Ollama models are unloaded automatically to free VRAM before training jobs.
  • Deterministic Triage Gate: Pre-LLM scoring on the Amygdala worker short-circuits to GREEN when all health checks pass, skipping expensive LLM inference entirely.
  • Adapter Versioning: Semantic versioning with symlinked directories enables instant rollback of LoRA adapters to any previous version.
  • Self-Healing Daemon: A proactive scanner runs every 60 seconds, checks all 41 LaunchAgents, and kickstarts any that have crashed. A reactive listener handles service-level recovery (Ollama restart, SSD remount, queue flush), with a maximum of 3 attempts per signal and cooldown timers.
  • Overnight Chain Orchestration: A 20-step DAG runs during off-hours, from GPU warmup through security review, code review, research scanning, ecosystem monitoring, knowledge consolidation, and morning briefing. A weekly consolidation compresses knowledge on Sundays.
  • 27 Specialized Workers: These include health checks, morning briefings, Amygdala threat assessment, GPU warmup, article scanning, arXiv research scout, nightly code review, PR digest, site analytics, ecosystem watch, evolution monitoring, weekly consolidation, atlas compilation, an autonomous Claude agent, and a sandbox builder (generates complete iOS apps via Ollama + xcodegen + xcodebuild). Each runs as a macOS LaunchAgent.
  • Coquina Integration: All workers authenticate with Coquina API keys and store reports and findings as persistent memories, building institutional knowledge automatically.
  • Slack Notifications: Real-time alerts for task completion, failures, and GPU contention are pushed to Slack channels with Block Kit formatting.
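
To make the GPU Locking feature concrete, here is a minimal sketch of a SET NX EX mutex. It is not Forge's actual code: the key name, the 30-minute TTL, and the function names are assumptions for illustration, and `client` is assumed to be a redis-py `redis.Redis` instance (or anything exposing the same `set` and `eval` methods).

```python
import uuid

# Lua compare-and-delete: a worker may only release a lock it still owns.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

GPU_LOCK_KEY = "forge:gpu:lock"  # hypothetical key name


def acquire_gpu_lock(client, ttl_seconds=1800):
    """Try to take the GPU lock; return an ownership token, or None if busy."""
    token = uuid.uuid4().hex
    # SET key token NX EX ttl: succeeds only when the key does not exist yet,
    # and expires on its own if the holder crashes before releasing.
    if client.set(GPU_LOCK_KEY, token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_gpu_lock(client, token):
    """Release the lock only if this token still holds it."""
    return client.eval(RELEASE_SCRIPT, 1, GPU_LOCK_KEY, token) == 1
```

The random token matters: without it, a worker whose lock already expired could delete a lock that now belongs to a different job.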

Technical Approach

I designed this around Redis lists used as queues: producers RPUSH tasks onto the tail and workers block on BLPOP at the head, which gives simple FIFO ordering with no polling. Each worker runs as a LaunchAgent with its own heartbeat, retry logic, and error handling. The GPU coordinator prevents resource contention between inference and training.
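
The BLPOP/RPUSH pattern can be sketched roughly as follows. The queue name, the JSON task encoding, and the function names are assumptions for illustration, and `client` is assumed to be a redis-py `redis.Redis` instance.

```python
import json

QUEUE_KEY = "forge:tasks"  # hypothetical queue name


def enqueue(client, task):
    """Producer side: append a JSON-encoded task to the tail of the list."""
    client.rpush(QUEUE_KEY, json.dumps(task))


def work_once(client, handler, timeout=5):
    """Consumer side: wait up to `timeout` seconds for one task and handle it.

    Returns True if a task was processed, False if the wait timed out.
    """
    item = client.blpop(QUEUE_KEY, timeout=timeout)
    if item is None:
        return False
    _key, payload = item  # BLPOP returns a (key, value) pair
    handler(json.loads(payload))
    return True
```

One trade-off worth knowing: BLPOP removes the task before the handler runs, so a crash in between drops it. Moving the task into a per-worker processing list with BLMOVE (Redis 6.2+) is a common way to close that gap.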

The overnight chain runs a 20-step DAG without human intervention, leveraging off-peak hours when the GPU is idle. I wake up to a morning briefing that summarizes everything the system did overnight.
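
As an illustration of how a chain like this can be ordered, the sketch below runs a small DAG with the standard library's `graphlib`. The step names are taken from the description above, but the dependency edges and the `run_chain` helper are assumptions, and only a handful of the 20 steps are shown.

```python
from graphlib import TopologicalSorter

# step -> set of steps that must finish first (edges are illustrative)
CHAIN = {
    "gpu_warmup": set(),
    "security_review": {"gpu_warmup"},
    "code_review": {"gpu_warmup"},
    "research_scan": {"gpu_warmup"},
    "knowledge_consolidation": {"security_review", "code_review", "research_scan"},
    "morning_briefing": {"knowledge_consolidation"},
}


def run_chain(graph, run_step):
    """Run every step after its dependencies; return the execution order."""
    order = []
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    for step in TopologicalSorter(graph).static_order():
        run_step(step)
        order.append(step)
    return order
```

Calling `run_chain(CHAIN, print)` prints the steps in a valid order, starting with `gpu_warmup` and ending with `morning_briefing`.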

The deterministic triage gate on Amygdala eliminates unnecessary LLM calls by applying rule-based scoring first — only anomalies that survive filtering reach the language model. Don't burn compute on problems you can solve with math.
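
A rule-based gate of this kind might look like the sketch below. The `(passed, severity)` check format, the severity threshold, and the ESCALATE status name are assumptions for illustration; only the GREEN short-circuit comes from the description above.

```python
def triage(checks, min_severity=1):
    """Score health checks with plain rules before any LLM is involved.

    `checks` maps a check name to a (passed, severity) pair.
    Returns (status, anomalies). "GREEN" means the LLM call can be skipped.
    """
    failed = {name: sev for name, (passed, sev) in checks.items() if not passed}
    # Filter out failures too minor to be worth an inference call.
    surviving = {name: sev for name, sev in failed.items() if sev >= min_severity}
    if not surviving:
        return "GREEN", {}
    return "ESCALATE", surviving
```

Only the `surviving` dictionary would be passed on to the language model, so the prompt stays small and the common all-healthy case costs no inference at all.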

Outcome

Forge runs 27 autonomous workers — health checks, article scanning, arXiv research, threat assessment, code review, ecosystem monitoring, knowledge consolidation, and autonomous app generation. The 20-step overnight chain runs every night without me touching it.

Self-healing means the system recovers from failures on its own. Workers restart, tasks retry, GPU locks release cleanly. The deterministic triage gate cut unnecessary LLM inference on routine health checks.
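
The "max 3 attempts per signal with cooldown" rule can be sketched as a small bookkeeping class. The class name, the 5-minute default cooldown, and the injectable clock are assumptions for illustration, not Forge's actual implementation.

```python
import time


class RecoveryBudget:
    """Limit recovery attempts per signal: a capped count, spaced by a cooldown."""

    def __init__(self, max_attempts=3, cooldown_seconds=300, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._attempts = {}      # signal -> attempts used so far
        self._last_attempt = {}  # signal -> clock reading at the last attempt

    def try_acquire(self, signal):
        """Return True and record an attempt if recovery may run now."""
        now = self.clock()
        if self._attempts.get(signal, 0) >= self.max_attempts:
            return False  # budget exhausted; stop retrying
        last = self._last_attempt.get(signal)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down
        self._attempts[signal] = self._attempts.get(signal, 0) + 1
        self._last_attempt[signal] = now
        return True

    def reset(self, signal):
        """Clear the counters once the service is healthy again."""
        self._attempts.pop(signal, None)
        self._last_attempt.pop(signal, None)
```

A healer would call `try_acquire("ollama")` before each restart and `reset("ollama")` after a passing health check, so a service that keeps crashing stops consuming restarts after the third try.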

Every worker's findings persist as searchable institutional memory via Coquina. The system remembers what it learned. So do I.
