Reversibility as a Virtue
The first version of every infrastructure change I shipped this month had a quiet failure mode: I could roll it forward, but rolling it back would have hurt.
Not catastrophically. Just enough to make me hesitate before applying it. Just enough to convert "let's try this" into "let's commit to this." That hesitation is expensive. It slows iteration. It encourages overthinking. It makes "ship and see" feel like a bigger decision than it actually is.
So I changed how I write infrastructure changes. Every one of them now ships with its own escape hatch.
What Reversibility Looks Like in Practice
I shipped a new sibling Postgres database last week — coquina_lake — for raw session telemetry. It's a meaningful change to the persistence layer: adding a database, creating a tablespace on an external volume, adding new application code paths, registering new endpoints, and updating the dashboard.
Five concrete things would have to be undone, in the right order, if I wanted to remove it cleanly:
- The schema (the session_telemetry table and its indexes and views).
- The database itself (coquina_lake).
- The tablespace (node02_lake).
- The application connection pool (the second psycopg2 pool keyed on LAKE_DATABASE_URL).
- The federated search code path that merges lake results into the unified search.
Each of those has a forward direction and a reverse direction. In the version of this work that I'd have written six months ago, I'd have shipped only the forward direction. The reverse would be "if anything goes wrong, figure it out."
Instead, every step has its undo encoded into the source.
```
migrations/005_create_lake_db.up.sql
migrations/005_create_lake_db.down.sql
migrations/006_session_telemetry.up.sql
migrations/006_session_telemetry.down.sql
```
The down migrations aren't placeholders — they're tested. 005_create_lake_db.down.sql uses Postgres 13+'s DROP DATABASE coquina_lake WITH (FORCE) to atomically terminate active sessions and drop the database in one statement, no race window. 006_session_telemetry.down.sql drops the table, the FTS trigger, the function, the view. Both run cleanly against the database they manage.
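In psycopg2 terms, the database-level teardown looks roughly like this (a sketch, not the actual runner; the connection string is illustrative):

```python
import psycopg2

# Sketch of the database-level teardown. Connect to a maintenance DB
# (you can't drop the database you're connected to) and use autocommit:
# DROP DATABASE refuses to run inside a transaction block, and psycopg2
# opens one implicitly on the first statement otherwise.
conn = psycopg2.connect("postgresql://localhost/postgres")
conn.autocommit = True
try:
    with conn.cursor() as cur:
        # Postgres 13+: terminate active sessions and drop, one statement.
        cur.execute("DROP DATABASE IF EXISTS coquina_lake WITH (FORCE)")
        # Dropping the database empties the tablespace, so this succeeds.
        cur.execute("DROP TABLESPACE IF EXISTS node02_lake")
finally:
    conn.close()
```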
The application side is even simpler: the new pool is gated on a single environment variable. If LAKE_DATABASE_URL is unset, the federation no-ops, the lake search returns an empty list, and the codebase behaves exactly as it did before. To turn the whole thing off without touching code, I unset the variable and restart the service.
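A sketch of that gate, with illustrative names (the real module, pool sizing, and query differ):

```python
import os
from psycopg2.pool import SimpleConnectionPool

LAKE_DATABASE_URL = os.environ.get("LAKE_DATABASE_URL")

# The pool exists only when the lake is configured.
lake_pool = (
    SimpleConnectionPool(minconn=1, maxconn=4, dsn=LAKE_DATABASE_URL)
    if LAKE_DATABASE_URL
    else None
)

def search_lake(query: str) -> list[dict]:
    """Lake side of the federated search. No-ops when unconfigured."""
    if lake_pool is None:
        return []  # unset var: empty results, prior behavior everywhere else
    conn = lake_pool.getconn()
    try:
        with conn.cursor() as cur:
            # Illustrative query; the real path goes through the FTS view.
            cur.execute(
                "SELECT id, content FROM session_telemetry "
                "WHERE content ILIKE %s LIMIT 20",
                (f"%{query}%",),
            )
            return [{"id": i, "content": c} for i, c in cur.fetchall()]
    finally:
        lake_pool.putconn(conn)
```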
That's three different layers of escape hatch for one change. By design.
Why This Matters More Than I'd Expected
I want to be honest about why I started doing this.
It wasn't because I had a theory of clean infrastructure design. It was because I'd shipped a few changes that were technically fine but emotionally hard to revisit. A schema migration that wasn't paired with a down. A worker daemon that took ten steps to set up and would take fifteen to undo. A LaunchAgent whose tear-down was "well, you'd have to remember to do A, then B, but only if C..."
When that pattern accumulates, you stop trying things. The cost of trying anything new becomes the implicit cost of having to permanently maintain it. So you overthink before you ship, and you ship less, and what you ship is more cautious. That's not a quality outcome. That's just slow.
The fix turned out to be structural. If every change ships with a known, tested undo, the question "should I try this?" reduces to "what's the bounded downside?" — and the bounded downside is one statement to revert.
That changes how I think about everything.
The Patterns
A few specific things I do now, in roughly the order I learned them.
Pair every up migration with a down migration. Not as a placeholder. Test the down by applying both in a transaction and rolling back, or by running the up against a throwaway database and then the down. If the down fails, the up isn't ready.
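A minimal version of that test with psycopg2 (a sketch; it assumes the .sql files are plain statements with no psql meta-commands):

```python
import psycopg2
from pathlib import Path

def test_migration_pair(up_path: str, down_path: str, dsn: str) -> None:
    # Postgres DDL is transactional, so up + down can run in one
    # transaction and be rolled back, leaving no trace. This only works
    # for schema-level migrations like 006; database/tablespace-level
    # ones like 005 can't run in a transaction and need a throwaway DB.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(Path(up_path).read_text())    # raises if the up is broken
            cur.execute(Path(down_path).read_text())  # raises if the down is broken
        conn.rollback()
    finally:
        conn.close()

test_migration_pair(
    "migrations/006_session_telemetry.up.sql",
    "migrations/006_session_telemetry.down.sql",
    "postgresql://localhost/coquina_lake",
)
```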
Gate new code paths on environment variables. The new behavior runs only when the variable is set. Unset it and the system reverts to the prior behavior, no code change required. This is especially powerful for cutover changes — the env var is the switch.
Keep storage layers separate when their lifecycles differ. The lake lives in a different database than curated memory specifically because the lake might want to be torn down, replaced, migrated to a different machine, or wiped without affecting the curated graph. Mixing them would have entangled their lifecycles.
Use forced semantics where they exist. DROP DATABASE ... WITH (FORCE) is better than "terminate connections, then drop" because there's no race window between the two. Same with IF NOT EXISTS — let the database tell you whether the operation is needed instead of writing your own conditional.
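The contrast, sketched with an illustrative column list and DSN:

```python
import psycopg2

conn = psycopg2.connect("postgresql://localhost/coquina_lake")
with conn, conn.cursor() as cur:
    # Hand-rolled conditional: racy, since another session can create
    # the table between the check and the CREATE.
    cur.execute("SELECT 1 FROM pg_tables WHERE tablename = %s",
                ("session_telemetry",))
    if cur.fetchone() is None:
        cur.execute("CREATE TABLE session_telemetry (id bigserial PRIMARY KEY)")

    # Atomic: the database owns the check.
    cur.execute("CREATE TABLE IF NOT EXISTS session_telemetry "
                "(id bigserial PRIMARY KEY)")
```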
Avoid one-way doors. ChromaDB collections, in particular, are immutable in their current form — once you create one with a name, you can't rename it. Renaming it requires creating a new collection, copying everything, deleting the old one, and updating every reference. That's a one-way door dressed up as a forward step. When I see those, I either route around them or build the migration plan first.
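When I do walk through that door, the whole dance is copy, verify, delete (a sketch against the chromadb client API; collection names are illustrative, and a large collection would need the .get() batched):

```python
import chromadb

def rename_collection(client, old: str, new: str) -> None:
    # ChromaDB has no rename, so "rename" means copy, verify, delete.
    src = client.get_collection(old)
    dst = client.create_collection(new)
    data = src.get(include=["embeddings", "documents", "metadatas"])
    dst.add(
        ids=data["ids"],
        embeddings=data["embeddings"],
        documents=data["documents"],
        metadatas=data["metadatas"],
    )
    assert dst.count() == src.count()  # verify before the irreversible step
    client.delete_collection(old)  # ...and every reference must be updated too

rename_collection(chromadb.PersistentClient(path="./chroma"),
                  "transcripts_v1", "transcripts_v2")
```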
Take backups before the migration, not after. A pg_dump saved before the change is a guaranteed restore point. A pg_dump saved after the change includes whatever you just broke. The order matters.
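In script form the ordering is explicit: snapshot, verify the snapshot succeeded, then migrate (a sketch; paths and DSN are illustrative):

```python
import subprocess
from datetime import datetime, timezone

# Snapshot first: a timestamped restore point taken *before* the change.
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
backup = f"backups/coquina_lake.{stamp}.dump"

# -Fc: custom format, selectively restorable with pg_restore.
# check=True aborts the script if the backup fails: no backup, no migration.
subprocess.run(
    ["pg_dump", "-Fc", "-f", backup, "postgresql://localhost/coquina_lake"],
    check=True,
)

# Only now, with a restore point on disk, apply the migration.
subprocess.run(
    ["psql", "-v", "ON_ERROR_STOP=1",
     "-f", "migrations/006_session_telemetry.up.sql",
     "postgresql://localhost/coquina_lake"],
    check=True,
)
```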
The Inverse Trap
Reversibility isn't free. There's a real temptation to chase it past the point of usefulness — to build elaborate undo mechanisms for things that are never going to be undone, or to add config knobs that nobody will ever flip.
I think the right test is: would I be willing to actually run this undo against production right now, with no notice, and would the system survive it? If the answer is yes, the reversibility is real. If the answer is "in theory" or "after some preparation" or "if nothing else is going on," it's theatrical.
For the lake: I would, right now, run the down migrations and re-run them later if I wanted to. I'd lose the indexed transcript data, but the curated memory layer would be untouched, the dashboard would degrade gracefully (the empty-state I built knows how to render), and the API would return {configured: false}. Twenty minutes of re-backfill and we'd be back. That's real reversibility.
For older changes I shipped without down migrations: I'd be fine in 95% of cases, but in 5% I'd have to write the undo on the fly, under stress, against a production system. That's not a reasonable position to be in.
What This Doesn't Solve
A few things reversibility won't save you from. I want to name them so I'm not pretending otherwise.
Data loss is real, even with backups. Reverting schema is easy. Reverting deleted data is restoring from a backup, and that backup will be older than the moment of the deletion. Reversibility minimizes how much you lose, but it doesn't make the loss zero. Take backups more often than you think you need them.
Undo isn't always desirable. Some changes you want to be hard to revert — they're decisions, not experiments. Brand renames are like this; once Coquina is the name, "let's go back to Cortex" is a separate, costly future decision, and that's correct. Reversibility is for things that should remain reversible, not for everything.
External state escapes you. A migration that adds a column is reversible. A migration that publishes a new API endpoint to consumers, who start integrating against it, is technically reversible but socially expensive. The data layer alone doesn't capture all the ways a change becomes load-bearing.
The Discipline
What I've ended up with, after a few months of trying this, is a discipline I can describe in a sentence:
Every change ships with the operation that undoes it, tested to a level where I'd run it in production right now.
That's it. Not a framework, not a tool, not a process — just a habit. The down migration goes in the same PR as the up migration. The env var gate is in the first commit, not a follow-up. The backup runs before the change, not after.
It's a small habit. It's also the difference between a system I'm afraid of and a system I can actually evolve.
This is part 10 of a series on building autonomous AI infrastructure on consumer hardware. If you've been hesitating before infra changes lately, the hesitation is signal. Build the escape hatch first.