The Auto-Data Lake
In 2010, James Dixon coined the term data lake while CTO at Pentaho. The insight was simple and mildly heretical at the time: stop trying to structure data before you store it. Dump it raw. Impose structure at read time. A generation of analytics infrastructure got built on that one shift.
Fifteen years later, a new workload is hitting the same wall. AI agents read and write context at a rate that makes traditional memory architectures absurd. Every conversation is a write. Every tool call is a read. The schemas change weekly because the agents change weekly.
What agents need is a data lake. But not a passive one. An active one — a lake that does work between the time a memory lands and the time another agent asks for it.
That's what I've been building. It's called Coquina, and I think it's the shape of agent memory infrastructure going forward.
The Agent Memory Problem
Run more than one AI tool and you'll feel it within a week. Claude Code makes an architecture decision on Monday. Your Slack bot doesn't know about it on Tuesday. Your overnight worker repeats a debugging path that was already ruled out. Next morning, Claude Code forgets the decision it made yesterday.
Every tool starts from zero. There is no shared memory layer. Memory is treated as a feature bolted onto individual tools instead of infrastructure shared across them.
The usual workarounds don't scale. CLAUDE.md files. Copy-pasting decisions into prompts. Wiki pages that go stale in a week. Knowledge graphs that demand a team of curators. Vector databases that return plausible-looking noise.
Every approach forces a choice between "too little structure to be useful" and "too much structure to maintain."
From Data Lake to Auto-Data Lake
A traditional data lake is passive storage with smart reads. An auto-data lake is active storage — three things happen automatically when a memory enters Coquina.
Auto-embedding. Every memory gets vectorized on ingest via a local Ollama nomic-embed-text model. Semantic search finds memories by meaning, not just keywords. No cloud API. No data leaves the network.
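A minimal sketch of what embed-on-ingest could look like against a local Ollama server. It assumes Ollama's default local port (11434) and its `/api/embeddings` endpoint. The function names are hypothetical and this is not Coquina's actual code.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def build_embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build the HTTP request that asks the local model to embed `text`."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, timeout: float = 10.0) -> list[float]:
    """Return the embedding vector; raises on network or server errors."""
    with urllib.request.urlopen(build_embed_request(text), timeout=timeout) as resp:
        return json.loads(resp.read())["embedding"]
```

Because the request only ever targets localhost, nothing in this path sends memory content off the machine.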
Auto-linking. Every write triggers a neighbor search across the vector space. Related memories get connected with typed edges — similar_to for semantically related content, superseded_by when a newer decision replaces an older one — each scored with a strength from 0 to 1 based on vector distance. Over 15,000 relationship edges exist right now. Zero were created manually.
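The link-on-write step can be sketched roughly as follows. This is an illustration under stated assumptions, not the real linker: it uses a brute-force cosine-similarity scan over an in-memory dict where the real system would query the vector store, the `k` and `min_strength` values are made up, and it only emits similar_to edges (detecting superseded_by needs to understand the content, which is out of scope here).

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    source: str
    target: str
    kind: str        # e.g. "similar_to"
    strength: float  # 0..1, higher means closer in vector space


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_on_write(new_id, new_vec, store, k=5, min_strength=0.6):
    """Link a new memory to its k nearest stored neighbors above a threshold."""
    scored = [
        # Clamp into [0, 1] so the value can be used directly as a strength.
        (min(1.0, max(0.0, cosine_similarity(new_vec, vec))), mem_id)
        for mem_id, vec in store.items()
        if mem_id != new_id
    ]
    scored.sort(reverse=True)
    return [
        Edge(new_id, mem_id, "similar_to", round(strength, 3))
        for strength, mem_id in scored[:k]
        if strength >= min_strength
    ]
```

Calling `link_on_write("new", [1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]})` yields edges to `a` and `c` and skips the orthogonal `b`.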
Auto-clustering. Memories group by project and topic based on content similarity. Emergent structure, not predefined schemas.
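One simple way to get groups without a predefined schema is to treat strong edges as connections and take connected components. That choice is an assumption made for illustration; the post does not say which clustering method Coquina uses, and the threshold below is arbitrary.

```python
from collections import defaultdict


def cluster_by_edges(memory_ids, edges, min_strength=0.7):
    """Group memories into connected components over strong edges.

    `edges` is an iterable of (source, target, strength) tuples.
    """
    parent = {m: m for m in memory_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for src, dst, strength in edges:
        if strength >= min_strength and src in parent and dst in parent:
            parent[find(src)] = find(dst)

    groups = defaultdict(list)
    for m in memory_ids:
        groups[find(m)].append(m)
    return sorted(sorted(g) for g in groups.values())
```

With ids `a` through `d` and edges `a-b` (0.9), `b-c` (0.8), `c-d` (0.3), the weak last edge is ignored, so `d` ends up in its own group.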
By read time, the structure has already emerged.
Three Retrieval Paths, One Query
A single query_memory call fires three retrieval mechanisms in parallel and merges the results:
- Full-text search via PostgreSQL tsvector: sub-2ms.
- Semantic search via ChromaDB embeddings: meaning-based, catches what full-text misses.
- Graph traversal over auto-linked edges: surfaces memories neither search path would find, by following similar_to and superseded_by chains from the top hits.
A recency bonus weights fresh memories higher. An A-MAC admission control gate quality-checks every write before it enters the store. If a memory fails to embed, search degrades gracefully to full-text instead of erroring.
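Here is a rough sketch of the fan-out-and-merge idea. The post does not describe how the merge is scored, so this swaps in reciprocal rank fusion as a stand-in, with a small additive recency bonus. The function name, the constants, and the shape of `paths` are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_results(paths, query, age_days, k=60, recency_weight=0.01):
    """Run retrieval paths in parallel and fuse their rankings.

    `paths` maps a path name to a callable returning memory ids, best first.
    `age_days` maps memory id -> age in days (missing ids get no bonus).
    """
    def safe_run(fn):
        try:
            return fn(query)
        except Exception:
            return []  # a failed path contributes nothing instead of erroring

    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        rankings = list(pool.map(safe_run, paths.values()))

    # Reciprocal rank fusion: each path votes 1 / (k + rank) for its hits.
    scores = {}
    for ranking in rankings:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank)

    # Recency bonus: fresher memories get a slightly higher score.
    for mem_id in scores:
        if mem_id in age_days:
            scores[mem_id] += recency_weight / (1.0 + age_days[mem_id])

    return sorted(scores, key=lambda m: (-scores[m], m))
```

If the semantic path raises (for example because embedding is unavailable), the other paths still return results, which is the graceful degradation described above.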
The Cerebellum: A Graph That Learns
On top of auto-embedding, auto-linking, and auto-clustering, there is a fourth automatic mechanism, and it's the one I'm most excited about.
Cerebellum is a nightly worker in the Forge overnight chain that observes task outcomes and adjusts the knowledge graph based on what actually worked. Edges associated with successful task paths get reinforced. Edges associated with dead ends cool down. Over time, the graph learns which memories are genuinely useful versus which are just semantically nearby.
This is reinforcement learning over the memory layer itself. The lake doesn't just store and retrieve. It gets smarter the longer it runs.
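The reinforce-and-cool idea can be illustrated with a very small update rule. This sketch assumes each edge carries a strength in [0, 1] and that each finished task records which edges it used. The update rule and the `boost` and `decay` values are hypothetical, not Cerebellum's actual logic.

```python
def adjust_edges(strengths, outcomes, boost=0.1, decay=0.05):
    """Return new edge strengths after a batch of task outcomes.

    `strengths` maps edge id -> strength in [0, 1].
    `outcomes` is a list of (edge_ids_used, succeeded) pairs.
    """
    updated = dict(strengths)
    for edge_ids, succeeded in outcomes:
        for edge_id in edge_ids:
            if edge_id not in updated:
                continue  # ignore edges that no longer exist
            s = updated[edge_id]
            if succeeded:
                s += boost * (1.0 - s)  # move toward 1, never past it
            else:
                s -= decay * s          # cool toward 0, never below it
            updated[edge_id] = s
    return updated
```

Starting from 0.5, an edge on a successful path moves to 0.55, one on a dead end drops to 0.475, and an unused edge stays where it was.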
What This Looks Like in Practice
Store a memory from Claude Code:
store_memory(
    content="Sentinel Docker needs full path: /usr/local/bin/docker",
    project="homelab", type="fact"
)
Within a couple hundred milliseconds, Postgres stores it, the local embedding model vectorizes it and ChromaDB indexes the vector, and the auto-linker finds neighbors and creates typed edges. It's immediately queryable from any connected agent.
Later, from a different agent:
query_memory(query="docker deployment issues on sentinel")
Full-text finds docker and sentinel. Semantic search surfaces related infrastructure memories. Graph traversal follows similar_to edges to adjacent facts the keywords never matched. The agent gets comprehensive context without knowing the exact words used when the memory was written.
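The graph step in that query can be pictured as a bounded breadth-first walk from the top hits. This is a minimal sketch: the adjacency-dict shape, the hop limit, and the strength cutoff are assumptions for illustration.

```python
from collections import deque


def expand_from_hits(top_hits, edges, max_hops=2, min_strength=0.5):
    """Breadth-first walk from the top hits over strong typed edges.

    `edges` maps memory id -> list of (neighbor_id, kind, strength).
    Returns ids reachable within `max_hops` that were not already hits.
    """
    seen = set(top_hits)
    queue = deque((hit, 0) for hit in top_hits)
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbor, kind, strength in edges.get(node, []):
            if kind not in ("similar_to", "superseded_by"):
                continue
            if strength < min_strength or neighbor in seen:
                continue
            seen.add(neighbor)
            found.append(neighbor)
            queue.append((neighbor, hops + 1))
    return found
```

Capping the hop count keeps the traversal from pulling in the whole graph once edges become dense.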
The name "Coquina" comes from the shell-aggregate stone used to build the Castillo de San Marcos in St. Augustine — it absorbs impacts instead of shattering. The metaphor: memories hit the system and it gets denser. Time makes it stronger.
The Numbers
| Metric | Value |
|---|---|
| Active memories | 1,500+ |
| Projects tracked | 30+ |
| Relationship edges | 15,000+ (auto-generated) |
| Full-text search p50 | 1.37ms |
| Unified search p50 | 1.27ms |
| Connected agents | Claude Code, Slack, Home Assistant (voice), Forge workers, CLI |
| Overnight chain | 20 steps, ~7-15 minutes end-to-end |
| Workers | 27 autonomous |
| Infrastructure | 3 machines and a NAS in my house, no cloud dependencies |
These are live numbers from /api/public/stats, not projections. You can see them updating on the systems page.
MCP: The Protocol Layer
Coquina speaks MCP — the Model Context Protocol Anthropic introduced for tool interoperability. Any agent that speaks MCP can connect without custom integration code. Sixteen MCP tools exposed, from store_memory and query_memory to link_memories and run_hygiene.
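For a sense of what a client actually sends, MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below only builds the request body. The helper name is hypothetical, and the `query` argument mirrors the query_memory example earlier in this post rather than a published schema.

```python
import json


def mcp_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request as used by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = mcp_tool_call(
    1,
    "query_memory",
    {"query": "docker deployment issues on sentinel"},
)
print(json.dumps(request, indent=2))
```

Transport (stdio or HTTP) and the initialization handshake are left out here; any MCP client library handles those.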
The interop story is consolidating faster than I expected. Google adopted MCP for Colab GPU runtimes. Anthropic shipped AutoDream — memory consolidation inside Claude Code — which validates the problem but keeps solving it per-tool. LangChain launched Deep Agents. The market is forming.
Coquina is deliberately infrastructure-shaped: raw memory, auto-linking, reinforcement learning, schema-on-read, with an orchestration layer already running autonomously. An open protocol beneath the agents, not a SaaS on top of them.
Why I'm Writing This Now
A fair question. The repo isn't public yet. There's no docker compose up experience. Grain Studios is real but early. Most of the documentation lives in my head and in Coquina itself.
Announcing a thesis and shipping a product are different things, and I think the thesis is ready even if the product isn't. So this post is the thesis, stated plainly:
AI agents need an active memory infrastructure layer. No one is building it as infrastructure. Everyone is building it as a feature bolted to a single tool. The auto-data lake is the category. Coquina is my attempt at the category-defining implementation.
I'd rather put it in writing now and iterate publicly than ship a polished artifact to silence. If the thesis is wrong I want to know before I spend another six months on packaging. If it's right I want the conversation to start early.
What's Next
The short-term roadmap is unglamorous on purpose:
- docker compose up from a clean clone: no host dependencies, no workarounds
- A real config file: cortex.yaml replacing every env var and hardcoded path
- A source-available license on the repo
- A landing page that isn't this blog
- An early-access form for managed hosting
After that: multi-tenancy, then public launch.
The data lake changed how enterprises think about storage. The auto-data lake is the same shift for AI agents — stop trying to organize memory upfront, let the lake build its own structure, and make it learn from what works.
If you're running agents in production and the memory problem is real for you, I'd love to hear what's breaking in your stack. Email's fastest: reed@grainlabs.io.
This is part of a series on building autonomous AI infrastructure on consumer hardware. Start with Building a Brain for My Homelab for the origin story, or read From Side Project to Product for how a homelab experiment turned into a company.