Saturday, May 16, 2026·4 min read

POSIX Filesystems on Object Storage: The Good, the Bad, the Fast

Training Pipes Team

"POSIX filesystem on S3" is one of those phrases that can mean five different things depending on who's saying it. A FUSE daemon, a distributed filesystem, a protocol gateway, a lift-and-shift, or an outright lie. This post sorts them out.

What POSIX Actually Requires

POSIX is the Unix-y filesystem contract that almost every tool assumes. The parts that matter for real workloads:

  • Hierarchical namespace with inodes, permissions, owners
  • open() / read() / write() / close() with byte-level offsets
  • Atomic rename() within a directory
  • mmap() — mapping a file into memory
  • Advisory locks (flock, fcntl)
  • Directory listing with consistent results
  • Hard and symbolic links
  • Sparse files
  • fsync() guarantees

Most code doesn't use all of these. Most code assumes they exist.
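
Concretely, here's the kind of call pattern nearly all of that code assumes will just work: positional byte reads against an open file descriptor, with no ceremony. A minimal Python sketch (the path and file layout are illustrative):

```python
import os
import tempfile

# Write a file, then read an arbitrary byte range without touching
# anything before it -- the core POSIX assumption that object stores
# only approximate with ranged GETs.
path = os.path.join(tempfile.mkdtemp(), "shard.bin")
with open(path, "wb") as f:
    f.write(b"header" + b"\x00" * 1024 + b"payload")

fd = os.open(path, os.O_RDONLY)
try:
    # pread: positional read, no shared file offset to juggle
    chunk = os.pread(fd, 7, 1030)  # 7 bytes starting at offset 1030
finally:
    os.close(fd)

print(chunk)  # b"payload"
```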

What Object Storage Gives You

S3 and friends expose a completely different model:

  • Flat key-value namespace
  • PUT an entire object, or upload large objects in parts (multipart upload)
  • GET with optional byte range
  • LIST with prefix and pagination
  • HEAD for metadata
  • No rename, no mmap, no links, no locks

Everything above the raw object API has to be emulated if you want POSIX on top.
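
The gap is easiest to see in miniature. In this pure-Python toy (a dict standing in for a bucket, not a real S3 client), "directories" are just key prefixes, and renaming one means touching every key under it:

```python
# Object store as a flat dict: keys are full strings, nothing is nested.
store = {
    "runs/exp1/ckpt-0001.pt": b"...",
    "runs/exp1/ckpt-0002.pt": b"...",
    "runs/exp2/ckpt-0001.pt": b"...",
}

def list_prefix(store, prefix):
    """LIST with a prefix -- the only 'directory' operation you get."""
    return sorted(k for k in store if k.startswith(prefix))

def rename_prefix(store, old, new):
    """'mv runs/exp1 runs/exp1-old' = copy + delete per key, not atomic:
    a concurrent LIST can observe both prefixes half-populated."""
    for key in list_prefix(store, old):
        store[new + key[len(old):]] = store.pop(key)

rename_prefix(store, "runs/exp1/", "runs/exp1-old/")
print(list_prefix(store, "runs/"))
```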

The Approaches

Option A: FUSE Emulation

Daemons like s3fs-fuse pretend to be a filesystem in userspace and translate every call to S3 API calls. Covered in depth in our FUSE limitations post. Short version: fine for dev, bad for production.

Option B: Fully Rewritten Filesystems (JuiceFS, Alluxio, CephFS)

Projects like JuiceFS, Alluxio, and Ceph-backed filesystems decouple metadata from data. Metadata lives in a fast database (Redis, etcd, TiKV), data blocks live in object storage. You get real POSIX. You also get:

  • A metadata service you have to run and scale
  • A consistency model that's all-new to your team
  • Client libraries instead of (or alongside) kernel mounts
  • Complex failure modes when the metadata server is slow

These are legitimate architectures, but they're not "mount S3 and go."

Option C: NFS Gateway

A gateway server implements NFSv4 and translates to object storage behind the scenes. Clients mount standard NFS with zero custom code. The gateway handles:

  • Filename to object-key mapping
  • Metadata caching
  • Data caching
  • Atomic rename (via a short-lived lock + multi-step copy)
  • Protocol translation
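
The first item is simpler than it sounds in the happy path: strip the mount root, normalize, prepend a prefix. A hedged sketch of the idea -- `path_to_key`, `mount_root`, and the prefix scheme are all hypothetical here; a production gateway typically keeps an explicit metadata table so a rename doesn't imply rewriting keys:

```python
import posixpath

def path_to_key(mount_root, path, prefix=""):
    """Map an NFS path under mount_root to an object key.
    Purely illustrative, not Training Pipes' actual scheme."""
    rel = posixpath.relpath(posixpath.normpath(path), mount_root)
    if rel.startswith(".."):
        # A path outside the mount must never map to a key.
        raise ValueError(f"{path!r} escapes {mount_root!r}")
    return prefix + rel

print(path_to_key("/mnt/data", "/mnt/data/runs/./exp1/ckpt.pt", "bucket-prefix/"))
# -> bucket-prefix/runs/exp1/ckpt.pt
```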

This is the architecture Training Pipes uses. The gateway sees NFS operations, the object store sees S3 operations, clients see a filesystem.

Option D: Managed Cloud Filesystems (EFS, FSx)

Not actually object-backed — they're their own storage systems, with their own capacity and throughput billing. If you want "POSIX on S3" the answer here is "they're separate products; sync between them."

What Breaks, and How Each Approach Handles It

Atomic Rename

POSIX: rename("a", "b") is atomic.

S3: no rename exists. Copy + delete is not atomic.

  • FUSE: often broken under concurrency. Don't checkpoint over FUSE mounts.
  • Gateway: can be made atomic via metadata-level rename (cheap) with a backing key rewrite deferred.
  • JuiceFS/Alluxio: atomic, because metadata is the source of truth.
  • EFS: atomic; it's real NFS.
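
This is the pattern those bullets are judging. The checkpoint-safe write, sketched in plain Python: on a mount that honors atomic rename, readers see either the old file or the new one, never a torn write:

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Write-then-rename: durable, then atomically visible --
    exactly the guarantee FUSE mounts often can't make."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data durable before the rename
        os.replace(tmp, path)     # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise

ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
atomic_write(ckpt, b"weights-v1")
atomic_write(ckpt, b"weights-v2")
print(open(ckpt, "rb").read())  # b"weights-v2"
```

The temp file is created in the destination directory on purpose: rename is only atomic within one filesystem.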

mmap

POSIX: map a file into virtual memory, page on demand.

S3: no concept of it.

  • FUSE: unreliable. Some daemons cache the whole file on first touch.
  • Gateway: if the cached copy is on local NVMe, mmap works against the cache.
  • EFS: works, but slow over network.
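
What "mmap works against the cache" means in practice, as a minimal Python sketch on a local file (in the gateway case it's the same call, just backed by the cached copy on NVMe):

```python
import mmap
import os
import tempfile

# mmap needs a real file descriptor backed by seekable local bytes --
# which is why it only works over object storage once a cached copy
# exists to back the pages.
path = os.path.join(tempfile.mkdtemp(), "shard.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)  # 4 KiB shard

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Random access pages in on demand; no read() calls in user code.
        sample = m[1024:1028]

print(sample)  # b"\x00\x01\x02\x03"
```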

Directory Listings

POSIX: readdir() walks a real directory; any entry present for the whole scan appears exactly once.

S3: LIST is paginated and priced per call; it's strongly consistent on modern AWS S3, but still eventually consistent on many S3-compatible stores.

  • FUSE: slow for big directories, sometimes inconsistent.
  • Gateway: caches listings locally for correctness and speed.
  • JuiceFS/Alluxio: instant, metadata is separate.

Locks

POSIX: flock and fcntl let processes coordinate.

S3: no lock primitive.

  • FUSE: usually no-op.
  • Gateway: can implement NFSv4 state locking server-side.
  • EFS: full lock support.
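
Advisory locking as multi-worker training code typically uses it, sketched with Python's `fcntl` on Linux. On a FUSE mount that no-ops locks, the second attempt below would "succeed", which is worse than failing:

```python
import fcntl
import os
import tempfile

lock_path = os.path.join(tempfile.mkdtemp(), "rank0.lock")

# First holder takes the exclusive lock...
holder = open(lock_path, "w")
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)

# ...a second open file description is refused instead of blocking.
contender = open(lock_path, "w")
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    got_second_lock = True
except BlockingIOError:
    got_second_lock = False

holder.close()  # closing the fd releases the flock
print(got_second_lock)  # False
```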

Consistency

POSIX: strict within a host, well-defined across NFS mounts.

S3: strongly consistent on AWS since December 2020, for both new writes and overwrites; S3-compatible implementations vary.

  • All non-trivial architectures above layer in their own consistency logic on top.

What ML Workloads Actually Need

ML training is (mercifully) forgiving of a subset of POSIX:

Required:

  • Read files by name
  • Stat (for size/exists checks)
  • Directory listing
  • mmap on large shards (for some frameworks)
  • Atomic write-then-rename for checkpoints

Nice to have:

  • Hard links (for efficient "copy" of snapshots)
  • Locking (multi-worker coordination)

Usually unneeded:

  • POSIX ACLs beyond owner/group/other
  • Sparse files
  • Named pipes, device nodes

A well-designed NFS gateway over object storage hits the "required" list cleanly, and usually the "nice to have" list too. FUSE-based tools often miss parts of the required list silently.

The Training Pipes Approach

We pick the NFS-gateway model because it's the one that:

  1. Gives training code an honest filesystem (not a polite lie)
  2. Doesn't require a metadata service you have to babysit
  3. Works with unmodified clients (standard mount -t nfs4)
  4. Lets you use any object storage backend (managed by us, or BYO)

What you get:

  • NFSv4.0 and NFSv4.1
  • Atomic rename
  • Real directory semantics
  • POSIX permissions (mapped to per-mount identities)
  • Full cache coherency across clients of the same mount
  • S3-compatible API to the same data when you want it

You don't get a perfect POSIX experience because nobody does on top of object storage. You get the subset that matters for real workloads, with honest semantics about the rest.

Mount object storage as a real filesystem →