Postgres Is the Filesystem: Persistent Bash Across Replicas

June 2, 2026

"Put the filesystem in Postgres" sounds like a weekend project. Spin up a second replica and the picture changes: warm caches go stale, writers need a lock that works across replicas, and you are sketching version counters long before it feels safe to ship. It is the kind of rabbit hole anyone who enjoys distributed systems signs up for.

A filesystem agents can share across replicas

AI agents do not need another Linux VM per turn. They need a tree: ingest files, grep and transform them, write notes and scratch-pad analysis, come back later on the same sandbox id, maybe from a different machine. That is a filesystem problem: durable, remotely addressable, and not scary when traffic spikes and replicas multiply.

sql-fs is our answer: just-bash on a Postgres-backed IFileSystem with an adjacency-list schema and content-addressed blobs (src/sql-fs). The API layer is deliberately stateless: any replica can serve any sandbox. Strong consistency: at the start of every bash.exec, the replica compares vfs:ver to its last-seen version and reload()s if needed — agents always see the latest committed tree, not a stale snapshot from another replica’s turn. Postgres stays truth; Redis coordinates writers, warms cold paths, and lets read-only work fan out. You scale replicas on concurrent bash volume (we run on Azure Container Apps), not by pinning agents to sticky containers or replaying setup on every cold start.

Standing on just-bash

We did not build a bash interpreter. just-bash runs real bash in Node against a pluggable IFileSystem: no containers, no host exec(). sql-fs is one backend where every readFile and mkdir becomes SQL, wrapped in caches and locks so it can scale horizontally without lying to agents.

Bash stays thin on purpose. Durability and distributed semantics live in sql-fs and SessionManager.

Postgres as `IFileSystem`

The interface bash sees is unchanged: IFileSystem — readFile, writeFile, stat, readdir, mkdir, mv, and the rest. sql-fs (src/sql-fs) satisfies that contract and persists everything through SQL.

We split the design in two layers. sql-fs is the object bash actually calls: path normalization, in-process pathCache and contentCache, script transactions, read-only scopes, and the “patch cache after commit” bookkeeping. SqlDialect is the SQL side (Postgres today): recursive CTEs, inode/dirent writes, blob upserts, RLS via SET LOCAL app.sandbox_id, and advisory locks inside transactions. Swap the dialect and you keep the same IFileSystem surface; the HTTP layer does not care which database backs the sandbox.

Interactive figure (sql-fs design): three views of the same design: what bash calls, how rows relate, and which paths hit RAM vs SQL.

The adjacency-list schema

The naive mapping is one row per file with the full path as the primary key. Simple, until you mv a directory with thousands of children and rewrite every path string. Nobody wants that.

An inode is a node (file, directory, or symlink) with metadata but no stored path. A dirent is a named edge: this directory has a child called foo pointing at that inode. /home/user/project/main.py is not a column anywhere. You reconstruct it by walking dirents from the sandbox root. mv /project/src /project/lib is one dirent update; inodes and blob bytes stay put.
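To make the inode/dirent split concrete, here is a minimal in-memory sketch. It is not the sql-fs schema: the `Tree` class, its field names, and the `rename` helper are hypothetical, and symlinks are ignored. It only illustrates that a path is reconstructed by walking dirents from the root, and that renaming a directory changes one edge.

```typescript
// Hypothetical in-memory model of the inode and dirent tables; all names are illustrative.
type InodeKind = "file" | "dir" | "symlink";

interface Inode {
  id: number;
  kind: InodeKind; // metadata only, no stored path
}

// A dirent is a named edge: (parent inode, name) -> child inode.
interface Dirent {
  parent: number;
  name: string;
  child: number;
}

class Tree {
  readonly inodes = new Map<number, Inode>();
  readonly dirents: Dirent[] = [];
  readonly root: number;
  private nextId = 1;

  constructor() {
    this.root = this.addInode("dir");
  }

  addInode(kind: InodeKind): number {
    const id = this.nextId++;
    this.inodes.set(id, { id, kind });
    return id;
  }

  link(parent: number, name: string, child: number): void {
    this.dirents.push({ parent, name, child });
  }

  // Resolve an absolute path by following one dirent per segment from the root.
  resolve(path: string): number | undefined {
    let current = this.root;
    for (const segment of path.split("/").filter((s) => s.length > 0)) {
      const edge = this.dirents.find((d) => d.parent === current && d.name === segment);
      if (edge === undefined) return undefined;
      current = edge.child;
    }
    return current;
  }

  // Rename within one parent: exactly one dirent changes, no child rows are touched.
  rename(parentPath: string, oldName: string, newName: string): void {
    const parent = this.resolve(parentPath);
    const edge = this.dirents.find((d) => d.parent === parent && d.name === oldName);
    if (edge === undefined) throw new Error(`ENOENT: ${parentPath}/${oldName}`);
    edge.name = newName;
  }
}

const tree = new Tree();
const project = tree.addInode("dir");
const src = tree.addInode("dir");
const mainPy = tree.addInode("file");
tree.link(tree.root, "project", project);
tree.link(project, "src", src);
tree.link(src, "main.py", mainPy);

tree.rename("/project", "src", "lib"); // mv /project/src /project/lib
```

After the rename, `/project/lib/main.py` resolves to the same inode as before and the old path no longer resolves, even though nothing below `src` was rewritten.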

File bytes live in blobs, keyed by SHA-256. Two files with identical content share one row. Agents that copy the same config twelve times only pay storage once. Deletes drop inode references; orphaned blobs get garbage-collected later.
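The dedupe and deferred garbage collection described above can be sketched as follows. This is an illustration under stated assumptions, not the sql-fs code: `BlobStore` and its methods are hypothetical, and `toyHash` is a 32-bit FNV-1a hash swapped in for SHA-256 purely so the example has no dependencies. A real content-addressed store needs a cryptographic hash.

```typescript
// Toy 32-bit FNV-1a hash: a dependency-free STAND-IN for SHA-256, not collision resistant.
function toyHash(content: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    h ^= content.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Hypothetical model: `blobs` plays the blobs table, `inodeToHash` the inode reference.
class BlobStore {
  private readonly blobs = new Map<string, string>(); // hash -> bytes
  private readonly inodeToHash = new Map<number, string>(); // inode -> hash

  writeFile(inode: number, content: string): void {
    const hash = toyHash(content);
    // Upsert: identical content maps to the same key, so it is stored once.
    if (!this.blobs.has(hash)) this.blobs.set(hash, content);
    this.inodeToHash.set(inode, hash);
  }

  readFile(inode: number): string | undefined {
    const hash = this.inodeToHash.get(inode);
    return hash === undefined ? undefined : this.blobs.get(hash);
  }

  // Deleting a file only drops the inode's reference to its blob.
  deleteFile(inode: number): void {
    this.inodeToHash.delete(inode);
  }

  // Later sweep: remove blobs that no inode references any more.
  collectGarbage(): number {
    const live = new Set(Array.from(this.inodeToHash.values()));
    let removed = 0;
    for (const hash of Array.from(this.blobs.keys())) {
      if (!live.has(hash)) {
        this.blobs.delete(hash);
        removed++;
      }
    }
    return removed;
  }

  get blobCount(): number {
    return this.blobs.size;
  }
}

const store = new BlobStore();
// Twelve copies of the same config share one blob row.
for (let inode = 1; inode <= 12; inode++) store.writeFile(inode, "debug=true\n");
store.writeFile(13, "debug=false\n");
const blobsAfterWrites = store.blobCount;

store.deleteFile(13); // the blob is now orphaned but still present
const removedBySweep = store.collectGarbage();
```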

How `sql-fs` makes bash believe it is a normal filesystem

On sandbox open, createPostgresSandboxFs inserts a sandboxes row, boots a root inode plus the usual /home, /tmp, /bin, and /home/user, then calls fs.ready(). That runs one recursive CTE (loadAllPaths) and fills pathCache: a Map from absolute path to inode metadata. From then on, hot-path metadata is RAM-only: stat, exists, and readdir are map lookups, not SQL round-trips.

There is a catch baked into just-bash: getAllPaths() must be synchronous. You cannot await Postgres inside it. So a SQL-backed IFileSystem has to front-load the tree into memory. That is not an optimization garnish; it is how the plugin contract works. File bytes are lazier: readFile tries contentCache, then Redis blob cache, then Postgres.
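The shape of that contract can be sketched like this. The class below is a hypothetical stand-in, not the sql-fs implementation: `CachedFs`, `PathMeta`, and the two `Map` arguments that play the Redis blob cache and the Postgres blobs table are all assumptions. It shows a synchronous `getAllPaths()` served from a preloaded map, and a lazy `readFile` that falls through the tiers in the order given above.

```typescript
// Hypothetical shapes for illustration; the real sql-fs types are not shown in the post.
interface PathMeta {
  inode: number;
  kind: "file" | "dir";
  hash?: string; // content hash, present for files only
}

class CachedFs {
  readonly trace: string[] = []; // which tier served each read
  private readonly pathCache = new Map<string, PathMeta>();
  private readonly contentCache = new Map<number, string>(); // in-process, keyed by inode
  private readonly sharedBlobs: Map<string, string>; // stand-in for the Redis blob cache
  private readonly database: Map<string, string>; // stand-in for the Postgres blobs table

  constructor(sharedBlobs: Map<string, string>, database: Map<string, string>) {
    this.sharedBlobs = sharedBlobs;
    this.database = database;
  }

  // Analogue of ready(): one bulk load fills the path map before bash runs.
  async ready(loadAllPaths: () => Promise<Array<[string, PathMeta]>>): Promise<void> {
    for (const [path, meta] of await loadAllPaths()) this.pathCache.set(path, meta);
  }

  // Synchronous by contract, so it can only return what is already in memory.
  getAllPaths(): string[] {
    return Array.from(this.pathCache.keys());
  }

  // File bytes are lazy: try each tier in order and backfill the faster ones.
  async readFile(path: string): Promise<string> {
    const meta = this.pathCache.get(path);
    if (meta === undefined || meta.hash === undefined) {
      throw new Error(`not a readable file: ${path}`);
    }
    const local = this.contentCache.get(meta.inode);
    if (local !== undefined) {
      this.trace.push("contentCache");
      return local;
    }
    let bytes = this.sharedBlobs.get(meta.hash);
    if (bytes !== undefined) {
      this.trace.push("redis");
    } else {
      bytes = this.database.get(meta.hash);
      if (bytes === undefined) throw new Error(`missing blob ${meta.hash}`);
      this.trace.push("postgres");
      this.sharedBlobs.set(meta.hash, bytes);
    }
    this.contentCache.set(meta.inode, bytes);
    return bytes;
  }
}

async function demoTieredRead(): Promise<{ paths: string[]; trace: string[] }> {
  const rows: Array<[string, PathMeta]> = [
    ["/home/user", { inode: 1, kind: "dir" }],
    ["/home/user/main.py", { inode: 2, kind: "file", hash: "h1" }],
  ];
  const database = new Map<string, string>([["h1", "print('hi')\n"]]);
  const fs = new CachedFs(new Map<string, string>(), database);
  await fs.ready(async () => rows);
  await fs.readFile("/home/user/main.py"); // cold: falls through to the database
  await fs.readFile("/home/user/main.py"); // warm: served from contentCache
  return { paths: fs.getAllPaths(), trace: fs.trace };
}
```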

Writes go the other way: commit in Postgres first (inside #withTx, with RLS and pg_advisory_xact_lock on the sandbox), then update pathCache / contentCache and mark the FS dirty so the session can publish vfs:ver for other replicas. Bash still thinks it called writeFile on a plain filesystem. Under the hood it was inode rows, a dirent edge, and maybe a new blob row. The distributed sections below are what happens when more than one replica shares that story.

A distributed filesystem, not a distributed bash

sql-fs is a pool of stateless HTTP replicas (Azure Container Apps scales on concurrent bash volume). Each replica keeps a warm Session per active sandbox: a Bash instance wired to a sql-fs with in-process caches. Postgres holds the tree; Redis coordinates replicas. Rule: Postgres is truth; everything else is cache or lock. Per sandbox, agents get strong consistency: one serial history of mutating execs, and every exec observes the latest committed tree at entry — not POSIX-everywhere linearizability across a single long-lived shell session.

The hard questions show up when fifty greps run on replica A while replica B appends to the same log directory: who may run bash at the same time, how replicas coordinate writers, and how we guarantee that no agent turn answers from RAM that Postgres has already superseded. Preferably without putting Raft on every stat().

Interactive figure (concrete write exec): append a line to a log file on sandbox sb-7a2f.

POST /sandboxes/sb-7a2f/exec
{ "script": "echo run-2 >> /workspace/runs.log" }

Initial state: /workspace/runs.log contains run-1 and vfs:ver is 7. The diagram highlights where the request is in the stack at each gate.

Write path: SessionManager.withSession, then cross-replica exclusive lock → version check → local exclusive lock → bash → Postgres → publish version. Read path uses a shared cross-replica RW lock and shared session.lock instead.

Three caches, one write path

L1 (in-process, per sandbox session) is what makes bash feel fast. sql-fs loads the full path tree into pathCache once, via a recursive CTE or a Redis path snapshot on cold start when the embedded version matches vfs:ver. After that, stat, readdir, and exists never touch SQL. File bytes live in a 50 MB LRU contentCache keyed by inode, filled on first readFile. With the synchronous path-tree contract from above, L1 is not optional — it is how SQL backends plug in at all.
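A byte-bounded LRU like contentCache can be sketched in a few lines by leaning on the insertion order of `Map`. This is an illustration, not the sql-fs cache: the `ByteLru` name is hypothetical, and string length is used as a rough proxy for size in bytes.

```typescript
// Hypothetical byte-bounded LRU keyed by inode. String length approximates byte size.
class ByteLru {
  private readonly entries = new Map<number, string>(); // insertion order: oldest first
  private readonly maxBytes: number;
  private usedBytes = 0;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  get(inode: number): string | undefined {
    const value = this.entries.get(inode);
    if (value === undefined) return undefined;
    // Re-insert so this entry becomes the most recently used.
    this.entries.delete(inode);
    this.entries.set(inode, value);
    return value;
  }

  set(inode: number, value: string): void {
    const previous = this.entries.get(inode);
    if (previous !== undefined) {
      this.usedBytes -= previous.length;
      this.entries.delete(inode);
    }
    if (value.length > this.maxBytes) return; // larger than the whole budget: do not cache
    this.entries.set(inode, value);
    this.usedBytes += value.length;
    // Evict least recently used entries until the cache fits the budget again.
    while (this.usedBytes > this.maxBytes) {
      const oldest = this.entries.keys().next();
      if (oldest.done) break;
      const evicted = this.entries.get(oldest.value);
      if (evicted !== undefined) this.usedBytes -= evicted.length;
      this.entries.delete(oldest.value);
    }
  }

  has(inode: number): boolean {
    return this.entries.has(inode);
  }

  get size(): number {
    return this.usedBytes;
  }
}

// A 10-unit budget stands in for the 50 MB budget mentioned above.
const lru = new ByteLru(10);
lru.set(1, "aaaa");
lru.set(2, "bbbb");
lru.get(1); // touch inode 1, so inode 2 is now the oldest
lru.set(3, "cccc"); // 12 units would exceed the budget, so inode 2 is evicted
```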

L2 (Redis, shared) does three jobs. Blob cache stores immutable sha256 → bytes so repeat reads skip Postgres. The version key vfs:ver:{sandboxId} is the cross-replica freshness signal: after a dirty exec, we INCR it while still holding the write lock. Optional path snapshots write the whole L1 tree to Redis with the version stamped inside, so cold replicas can rebuild pathCache from one GET instead of a full CTE walk.

L3 (Postgres) is the adjacency list: inodes, dirents, content-addressed blobs, RLS scoped by sandbox_id. On a write, we always commit here first inside #withTx, then patch L1, then fire-and-forget L2 blob populate. Reads try L1 → L2 blob → L3. If you deleted every cache tonight, correctness would survive. Agents would just hate you, briefly.

RAM may lag Postgres between execs on a warm replica — that is fine because nothing reads L1 until someone starts a turn. At every exec entry, ensureFreshCache compares session.lastSeenVersion to Redis; on mismatch, reload() clears L1 and rebuilds from L2 snapshot or Postgres before the first syscall. That version gate is what gives us strong consistency across replicas without pub/sub on every inode — warm L1 on an idle replica never reaches bash without a version check first. The Redis exec lock still guarantees only one writer at a time, so the integer checked at boundaries is enough.
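The version gate can be sketched as one function. The function name `ensureFreshCache` and the field `lastSeenVersion` come from the description above; everything else is an assumption for illustration, including the `FreshnessDeps` interface that stands in for the Redis and Postgres clients and the shape of the path snapshot.

```typescript
type PathRows = Array<[string, number]>; // [absolute path, inode id]

interface PathSnapshot {
  version: number; // vfs:ver value stamped in when the snapshot was written
  paths: PathRows;
}

// Hypothetical dependencies standing in for the Redis and Postgres clients.
interface FreshnessDeps {
  getVersion(): Promise<number>; // read vfs:ver:{sandboxId}
  getSnapshot(): Promise<PathSnapshot | undefined>; // optional path snapshot
  loadAllPaths(): Promise<PathRows>; // full tree walk in the database
}

interface Session {
  lastSeenVersion: number;
  pathCache: Map<string, number>;
}

type Source = "warm" | "snapshot" | "postgres";

async function ensureFreshCache(session: Session, deps: FreshnessDeps): Promise<Source> {
  const current = await deps.getVersion();
  if (current === session.lastSeenVersion) return "warm"; // in-process cache is still valid

  let rows: PathRows;
  let source: Source;
  const snapshot = await deps.getSnapshot();
  if (snapshot !== undefined && snapshot.version === current) {
    rows = snapshot.paths; // snapshot matches the published version, so it is safe to use
    source = "snapshot";
  } else {
    rows = await deps.loadAllPaths(); // stale or missing snapshot: rebuild from the database
    source = "postgres";
  }
  session.pathCache = new Map(rows);
  session.lastSeenVersion = current;
  return source;
}

async function demoVersionGate(): Promise<Source[]> {
  let version = 7;
  const tree: PathRows = [["/workspace/runs.log", 42]];
  const deps: FreshnessDeps = {
    getVersion: async () => version,
    getSnapshot: async () => ({ version: 7, paths: tree }),
    loadAllPaths: async () => tree,
  };
  const replicaB: Session = { lastSeenVersion: 7, pathCache: new Map(tree) };
  const results: Source[] = [];
  results.push(await ensureFreshCache(replicaB, deps)); // nothing changed
  version = 8; // another replica published a write
  results.push(await ensureFreshCache(replicaB, deps)); // snapshot is stamped 7, so it is rejected
  results.push(await ensureFreshCache(replicaB, deps)); // warm again at version 8
  return results;
}
```

The sketch relies on the property stated above: the check runs while the exec lock is held, so the version cannot move between the comparison and the reload.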

Interactive figure (caches and locks map): L1 is per-replica RAM; L2 coordinates replicas; L3 is truth. Locks sit across replicas (Redis), on one replica (RW), and per transaction (Postgres advisory).

Locks, RW grants, and script transactions

Locks answer who may run bash at the same time. The script transaction answers whether a single script’s writes are all-or-nothing. Related problems. Not the same problem.

Redis exec lock (SET NX PX, heartbeat, Lua release) serializes mutating execs for a sandbox across every replica. It is held for the whole withSession callback (version check, bash, publish), not per syscall. If Redis is down we return 503. We do not write without it.
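The lease pattern behind that lock can be followed step by step against a fake. The sketch below does not talk to Redis: `FakeRedis` is a hypothetical in-memory stand-in with a manual clock, its three methods mirror what `SET key value NX PX ttl` and owner-checked heartbeat and release scripts would do atomically, and the key name and lease length are made up for the example.

```typescript
interface Entry {
  value: string;
  expiresAt: number;
}

// In-memory stand-in for Redis with a manual clock, so the lease logic is deterministic.
class FakeRedis {
  now = 0; // milliseconds
  private readonly data = new Map<string, Entry>();

  private live(key: string): Entry | undefined {
    const entry = this.data.get(key);
    if (entry !== undefined && entry.expiresAt <= this.now) {
      this.data.delete(key); // lease expired
      return undefined;
    }
    return entry;
  }

  // Equivalent of: SET key value NX PX ttlMs
  setNxPx(key: string, value: string, ttlMs: number): boolean {
    if (this.live(key) !== undefined) return false;
    this.data.set(key, { value, expiresAt: this.now + ttlMs });
    return true;
  }

  // Heartbeat: extend the lease only if the caller still owns the key.
  extendIfOwner(key: string, value: string, ttlMs: number): boolean {
    const entry = this.live(key);
    if (entry === undefined || entry.value !== value) return false;
    entry.expiresAt = this.now + ttlMs;
    return true;
  }

  // Release: delete the key only if the caller still owns it.
  deleteIfOwner(key: string, value: string): boolean {
    const entry = this.live(key);
    if (entry === undefined || entry.value !== value) return false;
    this.data.delete(key);
    return true;
  }
}

const LEASE_MS = 5000; // hypothetical lease length
const lockKey = "lock:exec:sb-7a2f"; // hypothetical key name
const fakeRedis = new FakeRedis();

const aAcquired = fakeRedis.setNxPx(lockKey, "replica-A-token", LEASE_MS);
const bBlocked = fakeRedis.setNxPx(lockKey, "replica-B-token", LEASE_MS); // A holds the lease

fakeRedis.now += 3000;
const aHeartbeat = fakeRedis.extendIfOwner(lockKey, "replica-A-token", LEASE_MS); // lease now ends at 8000

fakeRedis.now += 6000; // A stalls and misses its heartbeat; the clock is now 9000
const bAcquired = fakeRedis.setNxPx(lockKey, "replica-B-token", LEASE_MS); // the lease expired, so B wins
const aRelease = fakeRedis.deleteIfOwner(lockKey, "replica-A-token"); // A must not delete B's lock
```

The last two steps show why a lease alone is not a hard guarantee: once A stalls past its lease, B can acquire while A still believes it is the writer. The owner check on release keeps A from removing B's lock, but it cannot stop A's in-flight work.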

pg_advisory_xact_lock is the backstop inside each write transaction. If a Node GC pause outlasts the Redis lease, two replicas could overlap; we hit exactly that in load tests, before the advisory path became non-negotiable. Even when the lease fails, the advisory lock still serializes inode/dirent writes at the database. We use the transaction-scoped variant so it survives under PgBouncer transaction pooling. Session-scoped advisory locks quietly break there.

Neither the Redis exclusive path nor the advisory lock helps you run ten greps in parallel on one replica. session.lock is a hand-rolled async RW lock: exclusive for default exec, shared for read_only. Writers get priority. A queued writer blocks new readers so exploration cannot starve mutation. When a writer finishes and readers are waiting, the lock wakes the entire reader cohort in one batch.
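One way to build such a lock is sketched below. It is an interpretation of the behavior described above, not the sql-fs source: the `AsyncRwLock` class and its method names are hypothetical, and it assumes that a finishing writer wakes waiting readers before the next queued writer, which then runs once that cohort drains.

```typescript
type Release = () => void;

// Hypothetical async RW lock. Each Release must be called exactly once.
class AsyncRwLock {
  private activeReaders = 0;
  private writerActive = false;
  private waitingReaders: Array<() => void> = [];
  private readonly waitingWriters: Array<() => void> = [];

  acquireShared(): Promise<Release> {
    return new Promise<Release>((resolve) => {
      const grant = (): void => {
        this.activeReaders++;
        resolve(() => this.releaseShared());
      };
      // A queued writer blocks new readers, so readers cannot starve it.
      if (this.writerActive || this.waitingWriters.length > 0) this.waitingReaders.push(grant);
      else grant();
    });
  }

  acquireExclusive(): Promise<Release> {
    return new Promise<Release>((resolve) => {
      const grant = (): void => {
        this.writerActive = true;
        resolve(() => this.releaseExclusive());
      };
      if (this.writerActive || this.activeReaders > 0) this.waitingWriters.push(grant);
      else grant();
    });
  }

  private releaseShared(): void {
    this.activeReaders--;
    if (this.activeReaders === 0) this.wakeWriter();
  }

  private releaseExclusive(): void {
    this.writerActive = false;
    if (this.waitingReaders.length > 0) {
      // Wake the whole waiting reader cohort in one batch.
      const cohort = this.waitingReaders;
      this.waitingReaders = [];
      for (const grant of cohort) grant();
    } else {
      this.wakeWriter();
    }
  }

  private wakeWriter(): void {
    const next = this.waitingWriters.shift();
    if (next !== undefined) next();
  }
}

async function demoRwLock(): Promise<string[]> {
  const lock = new AsyncRwLock();
  const order: string[] = [];
  const releaseReader1 = await lock.acquireShared(); // granted immediately
  const writer = lock.acquireExclusive().then((release) => { order.push("writer"); return release; });
  // These readers arrive after the writer queued, so they wait behind it.
  const reader2 = lock.acquireShared().then((release) => { order.push("reader2"); return release; });
  const reader3 = lock.acquireShared().then((release) => { order.push("reader3"); return release; });
  releaseReader1(); // the last active reader leaves, so the writer runs
  const releaseWriter = await writer;
  order.push("writer-released");
  releaseWriter(); // wakes reader2 and reader3 together
  const releases = await Promise.all([reader2, reader3]);
  for (const release of releases) release();
  return order;
}
```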

While those locks are held on the write path, every bash.exec wraps filesystem mutations in a script transaction: a lazy Postgres transaction opens on the first write inside the script, commits on success, rolls back and reload()s L1 on failure. A script that creates five files and dies on the sixth leaves no orphaned inodes for the next agent turn. That atomicity is why we bother with a per-script DB tx, not why we take the distributed lock. Read-only exec disables script transactions entirely (concurrent readers would share one tx handle).
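The all-or-nothing behavior can be sketched with two maps. This is a simplified model rather than the real transaction handling: `ScriptTransaction` and `runScript` are hypothetical names, `db` stands in for Postgres and `cache` for L1, and writes are buffered in memory instead of being sent to a database transaction.

```typescript
type Files = Map<string, string>;

// Hypothetical model: `db` stands in for Postgres, `cache` for the in-process L1.
class ScriptTransaction {
  txOpened = false;
  private readonly db: Files;
  private readonly cache: Files;
  private readonly pending: Files = new Map<string, string>();

  constructor(db: Files, cache: Files) {
    this.db = db;
    this.cache = cache;
  }

  // The transaction opens lazily, on the first write inside the script.
  writeFile(path: string, content: string): void {
    this.txOpened = true;
    this.pending.set(path, content);
  }

  // Commit to the database first, then patch the cache.
  commit(): void {
    this.pending.forEach((content, path) => { this.db.set(path, content); });
    this.pending.forEach((content, path) => { this.cache.set(path, content); });
    this.pending.clear();
  }

  // Discard the writes and rebuild the cache from committed state.
  rollback(): void {
    this.pending.clear();
    this.cache.clear();
    this.db.forEach((content, path) => { this.cache.set(path, content); });
  }
}

// Runs one script; returns true if its writes were committed.
function runScript(db: Files, cache: Files, script: (tx: ScriptTransaction) => void): boolean {
  const tx = new ScriptTransaction(db, cache);
  try {
    script(tx);
    if (tx.txOpened) tx.commit();
    return true;
  } catch {
    if (tx.txOpened) tx.rollback();
    return false;
  }
}

const committed: Files = new Map<string, string>([["/notes.md", "v1"]]);
const l1: Files = new Map<string, string>(committed);

// Five files are written, then the script dies: nothing should be visible afterwards.
const failedRun = runScript(committed, l1, (tx) => {
  for (let i = 1; i <= 5; i++) tx.writeFile(`/out/file-${i}.txt`, "data");
  throw new Error("sixth write failed");
});
const sizeAfterFailure = committed.size;

const successfulRun = runScript(committed, l1, (tx) => tx.writeFile("/notes.md", "v2"));
```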

read_only=True is the parallel read path through the same stack. Across replicas, a shared Redis RW lock lets multiple replicas serve read-only execs at once while writers hold an exclusive flag. On the replica, session.lock in shared mode runs many greps concurrently; beginReadOnlyScope() makes every mutating sql-fs syscall throw EREADONLY before SQL, and AsyncLocalStorage attributes violations per caller so one bad shell redirection does not poison siblings. exec_batch(..., read_only=True) collapses exploration into one HTTP round-trip and fans out up to 16 scripts under a single shared grant.
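A reduced sketch of the read-only batch follows. It does not use AsyncLocalStorage; instead each script gets its own read-only view, which is a simpler stand-in for per-caller attribution. `readOnlyView`, `execBatchReadOnly`, and the result shape are hypothetical; only the EREADONLY error name and the limit of 16 scripts come from the text above.

```typescript
const MAX_BATCH = 16; // batch limit quoted above

interface FsView {
  readFile(path: string): string;
  writeFile(path: string, content: string): void;
}

// Every mutating call throws before it reaches the backing store.
function readOnlyView(files: Map<string, string>): FsView {
  return {
    readFile(path: string): string {
      const content = files.get(path);
      if (content === undefined) throw new Error(`ENOENT: ${path}`);
      return content;
    },
    writeFile(path: string): void {
      throw Object.assign(new Error(`EREADONLY: ${path}`), { code: "EREADONLY" });
    },
  };
}

type Script = (fs: FsView) => Promise<string>;
type BatchResult = { ok: true; output: string } | { ok: false; error: string };

// Runs all scripts concurrently; a failure in one script is reported for that script only.
async function execBatchReadOnly(files: Map<string, string>, scripts: Script[]): Promise<BatchResult[]> {
  if (scripts.length > MAX_BATCH) throw new Error(`batch is limited to ${MAX_BATCH} scripts`);
  return Promise.all(
    scripts.map(async (script): Promise<BatchResult> => {
      try {
        return { ok: true, output: await script(readOnlyView(files)) };
      } catch (err) {
        return { ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    }),
  );
}

async function demoBatch(): Promise<{ results: BatchResult[]; fileAfter: string | undefined }> {
  const files = new Map<string, string>([["/workspace/runs.log", "run-1\nrun-2\n"]]);
  const countLines: Script = async (fs) => {
    const lines = fs.readFile("/workspace/runs.log").split("\n").filter((line) => line.length > 0);
    return String(lines.length);
  };
  const strayWrite: Script = async (fs) => {
    fs.writeFile("/workspace/runs.log", "oops"); // a bad redirection inside a read-only batch
    return "unreachable";
  };
  const results = await execBatchReadOnly(files, [countLines, strayWrite]);
  return { results, fileAfter: files.get("/workspace/runs.log") };
}
```

In the demo, the first script still returns its line count while the second reports EREADONLY, and the file is unchanged.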

Interactive figure (serialized vs parallel read-only): 10 sequential execs under the exclusive lock compared with one read_only batch under the shared lock. In production, 10 parallel read-only scripts took ~54 ms vs ~7 s one-by-one.

Strong consistency across replicas

This is a distributed-system pitfall, and agents should never hit it. Replica A finishes a mutating exec: Postgres is updated, A’s in-process pathCache is patched, and the walkthrough above ends with INCR vfs:ver:{sandboxId}. Replica B was idle. It still holds the old tree in RAM from the last time it served this sandbox. The load balancer sends the next agent turn to B. Without a guard, B would answer from stale L1 even though Postgres already moved on. Oops-shaped.

Every exec runs ensureFreshCache first: if vfs:ver moved, reload() rebuilds L1 from Postgres (or a version-matched path snapshot) before bash runs — latest committed state at exec entry, not warm RAM left over from another replica’s turn. Mutating execs are linear cluster-wide too: at most one writer per sandbox (Redis exec lock, pg_advisory_xact_lock as backstop), so committed history is serial with no lost updates or torn trees. We do not pub/sub every inode change; the version integer checked at boundaries is enough.

trigger:  write finished on replica A → INCR vfs:ver
later:    LB routes next exec to replica B
on entry: GET vfs:ver → if ≠ lastSeen → sql-fs.reload() → then bash.exec

The interactive below walks two cases: that cross-replica stale-cache recovery, and mixed read_only / write on the same or different replicas. Use Stop on the diagram to freeze at any step.

Interactive figure (coherence scenarios): starting from a warm state (Postgres at v1, vfs:ver 1, no locks held, /data.txt at v1 on both replica A and replica B), the diagram plays through strong consistency across replicas (the exec-boundary vfs:ver gate) and mixed read/write races (same-replica RW lock plus version check).

A lightweight alternative to full sandboxes

Container sandboxes are the right tool when an agent needs a real OS: npm install, git clone, outbound network, arbitrary binaries. Most agent turns are not that. They are file work: ingest a tree, grep and awk across it, leave notes and analysis on disk, come back later on the same sandbox id. sql-fs is a lightweight layer on Postgres and just-bash: no micro-VM per sandbox, no kernel to keep warm, but still the bash surface agents already know.

You still get what agents need for file-heavy sandboxes: create/list/delete sandboxes, bulk fs_ingest / fs_export, sync and streaming exec, read_only and exec_batch for parallel exploration, optional Python/JS runtimes inside just-bash, MCP tools, and the distributed semantics earlier in this post (locks, vfs:ver, script transactions). What you do not get is a full Linux VM. That is the trade, on purpose.

Production latency

On a warm sandbox (133 ts files, Azure Container Apps Australia East), wall-clock stays in the tens to low hundreds of milliseconds for typical agent scripts. Latency tracks real work, not a fixed container-exec floor.

3 lifecycle runs + 5 measured exec runs per case (1 warmup discarded). Reproduce with scripts/benchmark_remote_bash.py in the sql-fs repo.

Interactive figure (average wall-clock ms, sql-fs): ingest is one HTTP round-trip for all files; exec latency scales with script work (grep and writes cost more than echo).

Two scaling surfaces, not one VM per sandbox

The bash/API layer is stateless HTTP: replicas scale with how many execs are in flight. We run this on Azure Container Apps; aca.yaml is a deployable template you can adapt. A session stays warm in RAM on whichever replica served it last, but any replica can pick up a sandbox after a version check. You scale for bash concurrency, not one container per sandbox id.

Postgres. The tree, blobs, and coherence metadata live in the database. Scaling here is scaling database compute and I/O too: bigger instances, poolers, read replicas for cold path-tree loads and cache misses (writes still go to the primary). Content-addressed blobs dedupe bytes; sandboxes are rows you can partition by tenant. Heavy load may mean more API replicas and a larger Postgres. Idle sandboxes do not need warm VMs.

Cost shape

Hosted sandboxes typically bill vCPU and memory per second for as long as the environment exists, idle or busy. Mintlify’s ChromaFs write-up walks through the same economics for doc assistants at high conversation volume: the meter runs on reserved compute, not on how many greps you needed. sql-fs flips the model. Sandboxes are durable rows and blobs in your database. You pay for API time while requests run and for storage and I/O as the tree grows, not for keeping a VM open between agent turns. For fleets of file-heavy agents, that difference in unit economics usually matters more than shaving milliseconds off a single exec.

When you need full isolation and package managers, use a container sandbox. When the job is durable file manipulation at scale, Postgres-backed sql-fs is the lighter default.

MCP and operator skills

Agents should not have to reverse-engineer OpenAPI to grep a tree. sql-fs exposes the same capabilities over MCP (streamable HTTP at /mcp, MCP 2025-03-26) with short tool names so context stays cheap: sandbox_create, sandbox_list, sandbox_delete, bash_exec, bash_exec_batch, fs_ingest, fs_export. The tool descriptions spell out what bash can and cannot do, when to bundle read-compute-write into one script, and that readOnly: true on bash_exec_batch fans out parallel exploration in a single round-trip.

Point agents at your deployment using the same Bearer JWT as the REST API. In the repo we ship a plugin (Claude, Codex, Cursor) under plugins/sql-fs with two operator skills: api (curl, auth bootstrap, ingest, exec patterns) and py-sdk (Python client workflows). They are reference material for humans and for agents: setup steps, endpoint shapes, error codes, and copy-paste examples so a session can go from zero to “ingest this tree and grep it” without rereading the whole spec each time.

Bring your own infra

sql-fs is not a hosted black box you have to trust. The service is a container image plus config you run on infrastructure you control. The repo includes aca.yaml, an Azure Container Apps manifest we use in production, as a concrete recipe rather than lock-in: stateless API replicas; secrets for DATABASE_URL, DATABASE_DIRECT_URL, REDIS_URL, and AUTH_SECRET; HTTP scaling on concurrent requests; and health probes on /healthz and /readyz.

Bring your own Postgres and Redis. Fill in subscription, environment, and connection strings, then deploy with your preferred method using the provided manifest. You own the data plane, the auth secret, and the scaling policy, so the filesystem and sandbox rows live in your database, not ours.

Code: https://github.com/Hazzng/sql-fs