Saturday, June 13, 2026·4 min read

Bring Your Own S3 Bucket: Unifying AI Storage Across Clouds

Training Pipes Team

Most teams building AI infrastructure don't start from a blank slate. They have data — lots of data — already sitting in S3 or GCS or some S3-compatible store. They don't want to migrate it. They want to use it, over NFS, with caching, without the migration headache.

This is what the Training Pipes bring-your-own (BYO) storage path is for. This post explains how it works, when you'd use it versus our managed buckets, and what you get either way.

The Two Paths

Training Pipes gives you two ways to back your mounts:

Managed Buckets (the default path)

You create a bucket on Training Pipes. We handle the backing object storage, lifecycle, and durability. You get S3-compatible API credentials plus NFS/SMB mounts.

This is the fastest path to get started — everything is handled.

BYO Connections (the advanced path)

You point Training Pipes at a bucket you already own: in AWS S3, Google Cloud Storage, Cloudflare R2, Backblaze B2, MinIO, Wasabi, or any S3-compatible endpoint. We don't move the data. We expose it through our gateways.

Same mount experience. Same caching. Same protocol options. Your bucket stays yours.

When BYO Makes Sense

BYO is the right choice when:

  • You already have terabytes or petabytes in a cloud bucket. Migration is expensive and risky.
  • Compliance requires data to stay in your cloud account. Sensitive datasets under your org's IAM control.
  • You have specific lifecycle / replication rules on your existing bucket that you want to keep.
  • You want to use existing reserved capacity or negotiated rates with your cloud provider.
  • Multi-cloud is a strategic requirement. Your data needs to live in GCS, but your compute runs on AWS.

When Managed Buckets Make More Sense

Managed buckets are the right choice when:

  • You're starting fresh. Nothing to migrate.
  • You want a single vendor relationship for storage, caching, and access.
  • You don't want to manage a separate cloud storage account just for this dataset.
  • You want predictable pricing that includes storage in the plan.

Most teams end up with a mix: managed buckets for new work, BYO for existing corpora.

How BYO Works Under the Hood

When you create a BYO connection, you provide:

  • Provider (AWS S3, GCS, R2, Azure, S3-compatible)
  • Bucket name and endpoint
  • Credentials (read or read/write, scoped as tightly as you like)
  • Region (so we know where to place gateways for locality)

Training Pipes stores the credentials encrypted. Gateways in your chosen region use them to fetch and (optionally) write objects on your behalf. Your data is never copied into our account; it only passes through a gateway in transit.

The CLI Flow

# Create a BYO connection to an existing S3 bucket
bucketfs connections create \
  --name my-existing-data \
  --provider aws-s3 \
  --bucket my-company-ml-data \
  --region us-east-1 \
  --access-key $AWS_ACCESS_KEY_ID \
  --secret-key $AWS_SECRET_ACCESS_KEY

# Create an NFS mount backed by that connection
bucketfs mount create \
  --connection my-existing-data \
  --region us-east-1 \
  --protocol nfs \
  --cache-size 500GB

From your training script's perspective, you can't tell the difference between a managed-bucket mount and a BYO mount. The filesystem path is the same, the protocol is the same, the caching is the same.
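To make that concrete, here is a minimal sketch of what "can't tell the difference" means in practice: the training code only ever sees a POSIX path, so the same loader works against either kind of mount. The mount path shown in the comment is a hypothetical example, not a fixed Training Pipes convention.

```python
import os

def iter_samples(mount_path):
    """Yield (filename, bytes) pairs from a dataset directory.

    The code only sees a POSIX path, so it behaves identically whether
    mount_path is backed by a managed bucket or a BYO connection.
    """
    for name in sorted(os.listdir(mount_path)):
        full = os.path.join(mount_path, name)
        if os.path.isfile(full):
            with open(full, "rb") as f:
                yield name, f.read()

# Hypothetical mount point; substitute wherever your mount is attached:
# for name, data in iter_samples("/mnt/bucketfs/my-existing-data"):
#     train_step(data)
```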

Multi-Cloud in Practice

A common BYO scenario looks like this:

  • Raw ingested data in GCS (your analytics pipeline puts it there)
  • Training compute on AWS (where your GPU quota lives)
  • Published model artifacts in R2 (cheap egress for your downstream consumers)

Without a unifying layer, you're copying data across clouds at internet-egress rates. With BYO connections, you create three connections (one per cloud), mount them into your AWS training cluster, and only the bytes your cache actually fetches cross the cloud boundary — and only once.
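The "only once" claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below compares bytes crossing the cloud boundary with and without a cache; the dataset size and epoch count are illustrative, and real traffic depends on your access pattern.

```python
def cross_cloud_bytes(dataset_gb, epochs, working_set_fraction=1.0):
    """Rough comparison of bytes crossing the cloud boundary.

    Naive: every epoch re-reads the full dataset from the remote bucket.
    Cached: the working set crosses once (on cache miss), then is served
    locally. Illustrative arithmetic only.
    """
    naive = dataset_gb * epochs
    cached = dataset_gb * working_set_fraction
    return naive, cached

naive, cached = cross_cloud_bytes(dataset_gb=10_000, epochs=5)
# Five epochs over a 10 TB dataset: 50 TB of cross-cloud traffic
# without a cache, one 10 TB transfer with one.
```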

Permissions: Keep Them Tight

We recommend a scoped IAM role for every BYO connection:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-company-ml-data",
        "arn:aws:s3:::my-company-ml-data/*"
      ]
    }
  ]
}

Add s3:PutObject and s3:DeleteObject only if your mount needs to write. Everything above is read-only; we encourage that as the default.

For organizations using AWS STS, you can provide a cross-account role to assume instead of long-lived credentials. Ask us — we have an enterprise docs page on this.
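For illustration, a cross-account role's trust policy might look like the sketch below. The account ID and external ID here are placeholders, not real Training Pipes values; the actual gateway account ID and per-connection external ID would come from the enterprise docs.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111111111111:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "your-connection-external-id" }
      }
    }
  ]
}
```

The external-ID condition is the standard AWS guard against the confused-deputy problem when a third party assumes a role in your account.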

Cost Model for BYO

BYO doesn't change your object-storage bill — you keep paying your provider for storage and whatever egress happens. It adds:

  • Gateway + cache cost (included in plan tiers)
  • One-time egress from your bucket to the gateway on cache miss (cached thereafter)

For most training workloads, the cache-hit ratio after warmup is high enough that net egress drops below what you were paying for direct training reads of the same data.
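A quick worked example of that claim, with an assumed (not measured) 95% hit ratio after warmup:

```python
def net_egress_gb(reads_gb, hit_ratio):
    """Egress billed by your provider when reads go through the cache.

    Only cache misses reach the origin bucket; direct reads would pay
    egress on every byte. Numbers are illustrative.
    """
    return reads_gb * (1.0 - hit_ratio)

reads = 20_000  # GB read by the cluster over a training run
through_cache = net_egress_gb(reads, hit_ratio=0.95)  # ~1,000 GB billed
direct = net_egress_gb(reads, hit_ratio=0.0)          # 20,000 GB billed
```

At a 95% hit ratio, the cluster reads 20 TB but the origin bucket only bills egress on roughly 1 TB of misses.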

The Hybrid Pattern

Many customers use both paths:

  • Managed buckets for working datasets actively being iterated on
  • BYO connections for large historical corpora that already live in their cloud

Both flow through the same gateways, both show up as mounts, both benefit from the shared cache. You pick per-dataset, not per-account.

Quick Checklist

Before setting up BYO, have:

  • Bucket name and region on file
  • Scoped IAM credentials (read-only or read-write as needed)
  • Confirmation that your bucket policy allows the required access
  • A plan for which Training Pipes region to place gateways in (usually: wherever your training cluster is)

Five minutes of setup; no data migration.

Summary

BYO is how Training Pipes fits into existing data estates. Your canonical data stays where it is, in your cloud account, governed by your existing policies. We give you the regional gateway, the cache, the NFS/SMB mount, the shared access across clusters. Same product experience as managed buckets — different backing store.

Connect an existing bucket →