Monday, June 1, 2026 · 4 min read

Sharing Datasets Across Training Runs Without Copying Terabytes

Training Pipes Team
Interconnected server infrastructure

"We have the dataset on three different VMs, nobody remembers who has the latest version, and we just spent another two hours re-downloading it." This is the state of dataset management at many ML teams past their first year, and it gets worse as the team grows.

This post is about building a sane, shared dataset layer that multiple engineers, multiple jobs, and multiple clusters can use without copying bytes unnecessarily.

The Symptoms

You probably have this problem if:

  • Engineers keep personal copies of datasets on their dev VMs
  • Every new training job starts with an aws s3 sync that takes hours
  • Nobody is quite sure which version of the dataset a given experiment used
  • Local disk on training nodes fills up with dataset copies nobody is actively using
  • Someone occasionally deletes a dataset copy that turned out to be the canonical one

The Principles

A good shared-dataset layer has three properties:

  1. Single source of truth. One canonical location per dataset version.
  2. Cheap and fast local access. Anyone in any cluster can read it at local-NVMe speeds without copying.
  3. Explicit versioning. Immutable snapshots with clear identifiers.

Here's how to get there.

Step 1: One Canonical Location Per Dataset

Pick an object-storage bucket (or a Training Pipes managed bucket) as the system of record. Name it something like prod-datasets or research-datasets. Write the rule down once: this is where datasets live.

Organize by dataset and version:

s3://prod-datasets/
  imagenet/
    v1/               # immutable; never modified
    v2/
  ms-coco/
    v1/
  internal/user-logs/
    2026-03-01/       # dated snapshots
    2026-04-01/

Immutability is key. Once v1 is written, it doesn't change. New data = new version. This is the single biggest lever for reproducibility.
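
A cheap way to back that rule with code is to have the publish step refuse to write into a version prefix that already contains objects. A minimal sketch using boto3 — the bucket name, prefix, and assert_version_is_new helper are illustrative, not existing tooling:

import boto3

s3 = boto3.client("s3")

def assert_version_is_new(bucket: str, prefix: str) -> None:
    """Refuse to publish into a version prefix that already has objects."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    if resp.get("KeyCount", 0) > 0:
        raise RuntimeError(f"s3://{bucket}/{prefix} already exists; cut a new version instead")

# Fail fast before an ETL job starts writing imagenet v2
assert_version_is_new("prod-datasets", "imagenet/v2/")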

Step 2: Mount It Read-Only Everywhere

Every cluster, VM, and dev box gets a read-only NFS mount of the dataset bucket. No copies. No syncs. Just a mount.

sudo mount -t nfs4 -o ro $gateway:/prod-datasets /mnt/datasets

Training jobs reference the path:

train_data = WebDataset("/mnt/datasets/imagenet/v1/train-{000000..001023}.tar")

Dev scripts reference the path:

df = pd.read_parquet("/mnt/datasets/internal/user-logs/2026-04-01/sample.parquet")

One mount, many consumers, zero copies.
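
For a concrete sense of the training side, here is a minimal sketch of streaming those ImageNet shards straight off the mount with the webdataset library. The "jpg"/"cls" keys and the PIL decoder assume shards built as image/label pairs; adjust to however your shards were actually written:

import webdataset as wds

# Stream shards directly from the read-only mount -- no local copy, no sync.
dataset = (
    wds.WebDataset("/mnt/datasets/imagenet/v1/train-{000000..001023}.tar")
    .decode("pil")             # decode image bytes with PIL
    .to_tuple("jpg", "cls")    # yield (image, label) pairs
)

for image, label in dataset:
    ...  # hand off to the training loop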

Step 3: Let the Caching Layer Do Its Job

This is where object storage alone falls short. Direct S3 reads across five training jobs and eight engineers mean the same data gets fetched dozens of times.

A regional caching gateway (like Training Pipes) sits between the mount and the object storage. The first reader pays the object-storage fetch. Every subsequent reader in the same region gets it at local NVMe speed.

Multiply this across a research team running experiments on the same few datasets: one shared cache serves everyone, and your egress bill drops by an order of magnitude.
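
The back-of-the-envelope math is easy to run yourself. The numbers below (dataset size, reader count, per-GB transfer rate) are illustrative assumptions, not measurements:

dataset_gb = 180        # e.g. the imagenet v1 build
readers = 10            # engineers plus scheduled jobs reading the same data
cost_per_gb = 0.09      # assumed object-storage transfer rate, USD/GB

naive = dataset_gb * readers * cost_per_gb   # every reader fetches its own copy
cached = dataset_gb * 1 * cost_per_gb        # one fetch fills the shared cache

print(f"naive: ${naive:.0f}  cached: ${cached:.2f}")   # naive: $162  cached: $16.20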

Step 4: Version Releases Explicitly

When you produce a new dataset version, write it to a new path. Never overwrite. If you need to correct a bad build, write v1.1 — don't rewrite v1.

For reproducibility, pin experiments to a specific version path:

DATA_VERSION = "v1"
train_path = f"/mnt/datasets/imagenet/{DATA_VERSION}/train-*.tar"

Or use symlinks within your dataset layout if you want a "current" pointer that some jobs follow:

/mnt/datasets/imagenet/current -> v2/
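
If a job does follow current, it is worth resolving the symlink at startup and logging the concrete version, so the run stays reproducible even after the pointer moves. A small sketch against the layout above:

import os

data_root = "/mnt/datasets/imagenet/current"
resolved = os.path.realpath(data_root)       # e.g. /mnt/datasets/imagenet/v2
data_version = os.path.basename(resolved)    # "v2"

# Record the resolved version with the rest of the experiment config.
print(f"training against imagenet {data_version} ({resolved})")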

Step 5: Write New Data to a Separate Mount

Read-only mounts are for consumption. For producing new datasets (ETL outputs, feature snapshots), use a separate write mount into a staging bucket.

sudo mount -t nfs4 $gateway:/dataset-staging /mnt/staging

Your ETL job writes to /mnt/staging/imagenet/v3/. Once done, an atomic move (or a dataset-registry update) promotes it to the read-only canonical location.

This separation keeps the "consume" and "produce" paths clean and prevents accidental overwrites.
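
One way to implement the promotion, assuming both the staging and canonical buckets live in S3 and the bucket names follow the examples above — the promote helper is a sketch, not existing tooling. The copy is server-side, so nothing round-trips through the training nodes:

import boto3

s3 = boto3.client("s3")

def promote(staging_bucket: str, prod_bucket: str, prefix: str) -> None:
    """Server-side copy of a finished staging build into the canonical bucket."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=staging_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.copy(
                CopySource={"Bucket": staging_bucket, "Key": obj["Key"]},
                Bucket=prod_bucket,
                Key=obj["Key"],
            )

promote("dataset-staging", "prod-datasets", "imagenet/v3/")

Readers only see the new version once the registry (or a current pointer) references it, which is the real promotion step.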

Step 6: Keep a Dataset Registry (Even If Simple)

A YAML file in a repo is enough:

datasets:
  imagenet:
    versions:
      v1:
        path: s3://prod-datasets/imagenet/v1/
        size_gb: 180
        sample_count: 1281167
        created: 2026-01-15
        notes: "Standard ILSVRC 2012 splits"
      v2:
        path: s3://prod-datasets/imagenet/v2/
        size_gb: 185
        sample_count: 1281167
        created: 2026-03-20
        notes: "Added EXIF metadata; rebuilt shards for WebDataset"

Automate enforcement when you can — require jobs to reference a version from the registry, not arbitrary paths.
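
A minimal sketch of that enforcement, assuming the registry above is checked in as a datasets.yaml file (the file name and the resolve_dataset helper are hypothetical):

import yaml

def resolve_dataset(name: str, version: str, registry_path: str = "datasets.yaml") -> str:
    """Look up a dataset version in the registry and return its canonical path."""
    with open(registry_path) as f:
        registry = yaml.safe_load(f)
    try:
        return registry["datasets"][name]["versions"][version]["path"]
    except KeyError:
        raise ValueError(f"{name}:{version} is not in the dataset registry")

train_path = resolve_dataset("imagenet", "v2")   # s3://prod-datasets/imagenet/v2/

Jobs that resolve paths this way can only train against registered versions.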

What This Looks Like with Training Pipes

We designed the product around this pattern, so it maps cleanly:

  • Managed bucket or BYO connection = your canonical dataset storage
  • NFS mounts per region = read paths for every cluster
  • Write mounts (rw) for staging buckets = the ETL output path
  • Cache shared across mounts = no duplicate fetches across jobs or users
  • Preload = warm the cache before a big research sprint starts

Multiple engineers running experiments in parallel see the same dataset, read at local speed, from a single shared cache.

Antipatterns to Avoid

Per-engineer aws s3 sync of the same dataset. Egress costs × number of engineers.

Copying datasets into container images. Bloats your registry, slows pulls, and invalidates layer caches constantly.

Mutating datasets in place. Breaks reproducibility. Always version.

Storing datasets only on local NVMe. Loses on node rotation. Not shared.

Using git-lfs or git-based versioning for datasets >10GB. Just don't.

The Payoff

Teams that get this right tell us the same things: new engineers are productive on day one (no waiting on a 40TB download), experiments are reproducible, storage bills stop surprising them, and they never re-run an epoch just because someone accidentally overwrote the training split.

The primitive is simple: a shared, versioned, mounted dataset layer with regional caching. Training Pipes bundles it into a product.

Build a shared dataset layer →