Checkpointing Large Models: A Storage Guide for ML Engineers
Writing a 500GB checkpoint every hour stresses your storage in ways that training data doesn't. Here's how to design a checkpoint pipeline that's fast, reliable, and doesn't cost a fortune.
PyTorch DataLoader Storage Benchmarks: Throughput That Actually Matters
Synthetic storage benchmarks lie about what DataLoader performance feels like in practice. Here's how to measure what your training pipeline actually cares about.
POSIX Filesystems on Object Storage: The Good, the Bad, the Fast
Layering POSIX semantics on top of object storage is an old and messy problem. Here's what's possible, what's impossible, and what ML teams should actually demand from a storage layer.
How Regional Caching Gateways Cut ML Data Loading Time by 10x
A caching gateway colocated with your GPUs is the single biggest lever for training throughput. Here's how the architecture works and why it produces such dramatic speedups.
Mounting S3 as NFS: Why FUSE Isn't Enough for Production
Searching for 'mount S3 as NFS' turns up a dozen FUSE-based tools. Here's why none of them survive production ML workloads, and what actually works.