The Auto-Data Lake
In 2010, James Dixon coined the term "data lake" while CTO at Pentaho. The core insight: stop trying to structure data before you store it. Dump everything raw. Impose structure at read time.
That paradigm shift changed how enterprises handle data. It also describes exactly what I needed for AI agent memory — except the data lake needs to do more work.
The Agent Memory Problem
Here's what happens when you run multiple AI tools: Claude Code makes an architecture decision. Your Slack bot has no idea. Your overnight worker repeats a debugging approach that was already ruled out. The next morning, Claude Code has forgotten the decision it made yesterday.
Every tool starts from zero. There's no shared memory layer.
From Data Lake to Auto-Data Lake
A traditional data lake is passive storage with smart reads. An auto-data lake is active storage. Three things happen automatically when a memory enters Cortex:
Auto-embedding. Every memory gets vectorized on ingest via ChromaDB, enabling semantic search that finds memories by meaning, not just keywords.
Auto-linking. Every write triggers a neighbor analysis across the vector space. Related memories get connected with typed edges (similar_to, superseded_by) scored by strength. Over 14,000 relationship edges exist — none created manually.
Auto-clustering. Memories naturally group by project and topic based on content similarity. Emergent structure, not predefined schemas.
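The auto-linking step described above fits in a few lines. This is an illustrative sketch, not Cortex's actual code: the `auto_link` function, the 0.80 threshold, and the edge fields are all assumptions.

```python
import math

SIMILARITY_THRESHOLD = 0.80  # assumed cutoff for creating an edge

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def auto_link(new_id, new_vec, vectors):
    """Compare a new memory's embedding against existing ones and
    return typed, strength-scored edges for each close neighbor."""
    edges = []
    for mem_id, vec in vectors.items():
        if mem_id == new_id:
            continue
        strength = cosine(new_vec, vec)
        if strength >= SIMILARITY_THRESHOLD:
            edges.append({"from": new_id, "to": mem_id,
                          "type": "similar_to",
                          "strength": round(strength, 3)})
    return edges

# Toy 3-dimensional vectors standing in for real embeddings.
vectors = {"m1": [1.0, 0.0, 0.2], "m2": [0.0, 1.0, 0.0]}
edges = auto_link("m3", [0.9, 0.1, 0.25], vectors)
```

Run on every write, this is how a graph accumulates edges without anyone curating it.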
Three retrieval mechanisms exploit this structure:
- Full-text search via PostgreSQL tsvector — under 3ms
- Semantic search via ChromaDB embeddings — meaning-based
- Graph traversal via auto-linked edges — discovering connections neither search path would surface
And now, a fourth mechanism: temperature-weighted retrieval. The Cerebellum worker reinforces edges nightly based on task outcomes. Warm edges surface higher. Cold edges sink. The graph learns which memories are actually useful.
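A minimal sketch of how that reinforcement pass could work, assuming a per-edge temperature, a nightly decay, and a boost for edges that contributed to successful tasks; the names and rates are illustrative, not the Cerebellum's actual parameters.

```python
REINFORCE = 0.1   # assumed boost for edges that helped a successful task
DECAY = 0.95      # assumed nightly cooling applied to every edge

def nightly_pass(edges, useful_edge_ids):
    """Decay all edge temperatures, then warm the edges whose
    memories contributed to successful task outcomes."""
    for edge_id, temp in edges.items():
        temp *= DECAY
        if edge_id in useful_edge_ids:
            temp = min(1.0, temp + REINFORCE)
        edges[edge_id] = temp
    return edges

def ranked(edges):
    """Warm edges surface first; cold edges sink."""
    return sorted(edges, key=edges.get, reverse=True)

edges = {"e1": 0.5, "e2": 0.5, "e3": 0.5}
nightly_pass(edges, {"e2"})  # only e2 was useful today
```

After a few nights of this, retrieval order reflects usage, not just similarity.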
What This Looks Like in Practice
Store a memory from Claude Code:
```python
store_memory(
    content="Sentinel Docker needs full path: /usr/local/bin/docker",
    project="homelab", type="fact"
)
```
Within 200 milliseconds: PostgreSQL stores it, ChromaDB creates a vector embedding, and the auto-linker finds neighbors and creates typed edges. The memory is immediately queryable from any connected agent.
Later, a different agent queries:
```python
query_memory(query="docker deployment issues on sentinel")
```
Three retrieval paths fire. Full-text finds "docker" and "sentinel." Semantic search surfaces related memories such as SSH configs. Graph traversal follows edges to related infrastructure facts. The worker gets comprehensive context without knowing the exact keywords used when storing.
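One way to merge the three ranked result lists is reciprocal rank fusion; Cortex's actual scoring isn't described here, so treat this sketch and its `k` parameter as assumptions.

```python
def fuse(result_lists, k=60):
    """Reciprocal rank fusion: items ranked highly by any path rise,
    and items found by multiple paths rise further."""
    scores = {}
    for results in result_lists:
        for rank, mem_id in enumerate(results):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical memory IDs returned by each retrieval path.
full_text = ["m1", "m4"]   # keyword hits on "docker", "sentinel"
semantic = ["m2", "m1"]    # meaning-based neighbors
graph = ["m3", "m1"]       # edges followed from the top hits
merged = fuse([full_text, semantic, graph])
```

A memory found by all three paths (here `m1`) outranks any single-path hit, which is the behavior you want when the paths disagree.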
The Numbers
| Metric | Value |
|---|---|
| Active memories | 1,400+ |
| Projects tracked | 30 |
| Relationship edges | 14,100+ (auto-generated) |
| Connected agents | Claude Code, Slack, Forge workers, CLI |
| FTS latency | Under 3ms |
| Infrastructure | 3 machines + NAS, no cloud |
MCP: The Protocol Layer
Cortex speaks MCP, the Model Context Protocol that Anthropic created for tool interoperability. Any agent that speaks MCP can connect without custom integration code; Cortex exposes 16 MCP tools. Google recently adopted MCP for Colab GPU runtimes. The standard is consolidating.
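Concretely, an MCP client such as Claude Code or Claude Desktop connects through an entry in its MCP server config. The shape below is the standard `mcpServers` format; the `cortex.mcp_server` module path is a placeholder, not Cortex's real entry point.

```json
{
  "mcpServers": {
    "cortex": {
      "command": "python",
      "args": ["-m", "cortex.mcp_server"]
    }
  }
}
```

That config is the whole integration: no per-agent glue code.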
What's Next
The auto-data lake concept has a natural extension: auto-reasoning. The Cerebellum is the first step — reinforcement learning over the knowledge graph. The next step is detecting contradictions, identifying knowledge gaps, and synthesizing insights from graph patterns.
The data lake changed how enterprises think about storage. The auto-data lake is the same shift for AI agents — stop trying to organize memory upfront, let the lake build its own structure, and make it learn from what works. If you're building agents that need to remember things across sessions and tools, this is the architecture I'd bet on. I'd love to hear how you're approaching it: reed@grainlabs.io.