Saturday, May 16, 2026 · 4 min read
POSIX Filesystems on Object Storage: The Good, the Bad, the Fast
"POSIX filesystem on S3" is one of those phrases that can mean five different things depending on who's saying it. A FUSE daemon, a distributed filesystem, a protocol gateway, a lift-and-shift, or an outright lie. This post sorts them out.
What POSIX Actually Requires
POSIX is the Unix-y filesystem contract that almost every tool assumes. The parts that matter for real workloads:
- Hierarchical namespace with inodes, permissions, owners
- open()/read()/write()/close() with byte-level offsets
- Atomic rename() within a directory
- mmap() — mapping a file into memory
- Advisory locks (flock, fcntl)
- Directory listing with consistent results
- Hard and symbolic links
- Sparse files
- fsync() guarantees
Most code doesn't use all of these. Most code assumes they exist.
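Two of those contract points can be exercised directly from Python against a local filesystem (a minimal sketch; paths are temporary and illustrative):

```python
import os
import tempfile

# Byte-level offsets: POSIX files are addressable at any byte position.
fd, path = tempfile.mkstemp()
os.pwrite(fd, b"hello world", 0)   # write 11 bytes at offset 0
chunk = os.pread(fd, 5, 6)         # read 5 bytes starting at offset 6
assert chunk == b"world"

# Atomic rename within a directory: the new name appears all at once;
# readers never observe a half-renamed file.
new_path = path + ".renamed"
os.rename(path, new_path)
assert os.path.exists(new_path)

os.close(fd)
os.remove(new_path)
```

Object storage offers no native equivalent for either operation, which is where the trouble starts.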
What Object Storage Gives You
S3 and friends expose a completely different model:
- Flat key-value namespace
- PUT an entire object, or multipart upload in parts
- GET with optional byte range
- LIST with prefix and pagination
- HEAD for metadata
- No rename, no mmap, no links, no locks
Everything above the raw object API has to be emulated if you want POSIX on top.
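The gap is easiest to see in miniature. Here's a toy model of the object-store contract (a dict standing in for a bucket — illustrative, not an SDK): whole-object PUT, ranged GET, prefix LIST with pagination, and notably no rename, no partial in-place write, no locks.

```python
# Toy flat key-value store standing in for a bucket.
store = {}

def put(key, data):                  # PUT: whole object only
    store[key] = bytes(data)

def get(key, start=None, end=None):  # GET with optional byte range
    data = store[key]
    return data if start is None else data[start:end]

def list_prefix(prefix, page_size=2, token=0):  # LIST: prefix + pagination
    keys = sorted(k for k in store if k.startswith(prefix))
    page = keys[token:token + page_size]
    next_token = token + page_size if token + page_size < len(keys) else None
    return page, next_token

put("data/a.txt", b"alpha")
put("data/b.txt", b"bravo")
put("logs/c.txt", b"charlie")

assert get("data/a.txt", 0, 3) == b"alp"
page, tok = list_prefix("data/")
assert page == ["data/a.txt", "data/b.txt"] and tok is None
```

Everything a filesystem needs beyond these four verbs must be built in a layer above.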
The Approaches
Option A: FUSE Emulation
Daemons like s3fs-fuse pretend to be a filesystem in userspace and translate every call to S3 API calls. Covered in depth in our FUSE limitations post. Short version: fine for dev, bad for production.
Option B: Fully Rewritten Filesystems (JuiceFS, Alluxio, CephFS)
Projects like JuiceFS, Alluxio, and Ceph-backed filesystems decouple metadata from data. Metadata lives in a fast database (Redis, etcd, TiKV), data blocks live in object storage. You get real POSIX. You also get:
- A metadata service you have to run and scale
- A consistency model that's all-new to your team
- Client libraries instead of (or alongside) kernel mounts
- Complex failure modes when the metadata server is slow
These are legitimate architectures, but they're not "mount S3 and go."
Option C: NFS Gateway
A gateway server implements NFSv4 and translates to object storage behind the scenes. Clients mount standard NFS with zero custom code. The gateway handles:
- Filename to object-key mapping
- Metadata caching
- Data caching
- Atomic rename (via a short-lived lock + multi-step copy)
- Protocol translation
This is the architecture Training Pipes uses. The gateway sees NFS operations, the object store sees S3 operations, clients see a filesystem.
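The first job on that list, path-to-key mapping, can be sketched in a few lines (a hypothetical simplification — a real gateway also tracks inode metadata separately from the key namespace):

```python
# Hypothetical path -> object-key mapping a gateway might use:
# strip the mount-relative leading slash; "directories" are key prefixes.
def path_to_key(mount_path):
    return mount_path.lstrip("/")

def listing_prefix(dir_path):
    # A directory listing becomes a prefix LIST, delimited at "/".
    p = dir_path.strip("/")
    return p + "/" if p else ""

assert path_to_key("/ckpts/step_100.pt") == "ckpts/step_100.pt"
assert listing_prefix("/ckpts") == "ckpts/"
```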
Option D: Managed Cloud Filesystems (EFS, FSx)
Not actually object-backed — they're their own storage systems, with their own capacity and throughput billing. If you want "POSIX on S3" the answer here is "they're separate products; sync between them."
What Breaks, and How Each Approach Handles It
Atomic Rename
POSIX: rename("a", "b") is atomic.
S3: no rename exists. Copy + delete is not atomic.
- FUSE: often broken under concurrency. Don't checkpoint over FUSE mounts.
- Gateway: can be made atomic via metadata-level rename (cheap) with a backing key rewrite deferred.
- JuiceFS/Alluxio: atomic, because metadata is the source of truth.
- EFS: atomic, it's real NFS.
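The metadata-level trick is the same in the gateway and JuiceFS cases, and it's worth seeing why it's atomic. In this simplified sketch, names map to immutable data-block keys, and rename only moves the pointer inside one critical section — the object store never performs a copy+delete:

```python
import threading

# Names map to backing object keys; rename swaps the pointer under a
# lock. There is no copy+delete window for a reader to fall into.
namespace = {"a": "blk-0001"}
meta_lock = threading.Lock()

def atomic_rename(src, dst):
    with meta_lock:
        namespace[dst] = namespace.pop(src)

atomic_rename("a", "b")
assert namespace == {"b": "blk-0001"}
```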
mmap
POSIX: map a file into virtual memory, page on demand.
S3: no concept of it.
- FUSE: unreliable. Some daemons cache the whole file on first touch.
- Gateway: if the cached copy is on local NVMe, mmap works against the cache.
- EFS: works, but slow over network.
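The gateway case works because mmap only needs a real file descriptor backed by real pages. Against a locally cached copy it behaves exactly as POSIX promises (sketched here with a temp file standing in for the NVMe cache):

```python
import mmap
import os
import tempfile

# mmap against a local cache file, as a gateway with NVMe caching allows.
# The kernel pages data in on demand; no object-store round trip per access.
fd, path = tempfile.mkstemp()
os.pwrite(fd, b"shard-bytes" * 1000, 0)

with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
    header = m[:11]   # touching a page faults it in from the cache file
assert header == b"shard-bytes"

os.close(fd)
os.remove(path)
```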
Directory Listings
POSIX: readdir() returns a consistent snapshot.
S3: LIST is eventually consistent, paginated, and priced per call.
- FUSE: slow for big directories, sometimes inconsistent.
- Gateway: caches listings locally for correctness and speed.
- JuiceFS/Alluxio: instant, metadata is separate.
Locks
POSIX: flock and fcntl let processes coordinate.
S3: no lock primitive.
- FUSE: usually no-op.
- Gateway: can implement NFSv4 state locking server-side.
- EFS: full lock support.
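What "usually no-op" costs you is visible in a small test. On Linux, flock actually excludes a second holder; on a mount that silently no-ops it, the second acquisition would succeed and two workers would both think they own the file:

```python
import fcntl
import os
import tempfile

# Two descriptors on the same file contend for an advisory flock lock.
fd1, path = tempfile.mkstemp()
fd2 = os.open(path, os.O_RDWR)

fcntl.flock(fd1, fcntl.LOCK_EX)            # first holder takes the lock
try:
    fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    blocked = False
except BlockingIOError:
    blocked = True                         # second holder is refused
assert blocked

fcntl.flock(fd1, fcntl.LOCK_UN)
os.close(fd1)
os.close(fd2)
os.remove(path)
```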
Consistency
POSIX: strict within a host, well-defined across NFS mounts.
S3: strongly consistent read-after-write since December 2020, including overwrites — but consistency guarantees vary across S3-compatible implementations.
- All non-trivial architectures above layer in their own consistency logic on top.
What ML Workloads Actually Need
ML training is (mercifully) forgiving of a subset of POSIX:
Required:
- Read files by name
- Stat (for size/exists checks)
- Directory listing
- mmap on large shards (for some frameworks)
- Atomic write-then-rename for checkpoints
Nice to have:
- Hard links (for efficient "copy" of snapshots)
- Locking (multi-worker coordination)
Usually unneeded:
- POSIX ACLs beyond owner/group/other
- Sparse files
- Named pipes, device nodes
A well-designed NFS gateway over object storage hits the "required" list cleanly, and usually the "nice to have" list too. FUSE-based tools often miss parts of the required list silently.
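The checkpoint item on the "required" list is the standard write-then-rename pattern — a minimal sketch of the discipline training code relies on (hypothetical helper; any real checkpoint writer follows the same shape):

```python
import os
import tempfile

# Write to a temp name, fsync, then rename. On an honest filesystem a
# reader sees either the old checkpoint or the new one, never a torn file.
def save_checkpoint(path, payload: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # data durable before the rename publishes it
    os.rename(tmp, path)       # atomic publish (POSIX guarantees this)

ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
save_checkpoint(ckpt, b"step-100-weights")
assert open(ckpt, "rb").read() == b"step-100-weights"
```

This is exactly the pattern that quietly corrupts data on a FUSE mount where rename is copy+delete.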
The Training Pipes Approach
We pick the NFS-gateway model because it's the one that:
- Gives training code an honest filesystem (not a polite lie)
- Doesn't require a metadata service you have to babysit
- Works with unmodified clients (standard mount -t nfs4)
- Lets you use any object storage backend (managed by us, or BYO)
What you get:
- NFSv4.0 and NFSv4.1
- Atomic rename
- Real directory semantics
- POSIX permissions (mapped to per-mount identities)
- Full cache coherency across clients of the same mount
- S3-compatible API to the same data when you want it
You don't get a perfect POSIX experience because nobody does on top of object storage. You get the subset that matters for real workloads, with honest semantics about the rest.