# Data Mounting

## How C3 stores data

C3 keeps all data (datasets you upload and artifacts your jobs produce) on a centralised storage server with high-bandwidth connections to GPU nodes around the world. Think of it as a warehouse on a motorway network: once goods are in the warehouse, they reach any destination quickly. The bottleneck is the upload from your local machine, because home and office connections to remote servers are much slower than server-to-server links. The good news is that you only need to upload once. After that, C3 moves data between its storage and the GPUs over those fast links.

## Paths

Datasets and job artifacts can be addressed by global paths or by project-scoped paths:

| Path                                | What it contains                  |
| ----------------------------------- | --------------------------------- |
| `/datasets/{name}/`                 | Uploaded datasets                 |
| `/jobs/{jobId}/`                    | Job output artifacts              |
| `/projects/{project}/data/{name}/`  | Datasets scoped to a project      |
| `/projects/{project}/jobs/{jobId}/` | Job artifacts scoped to a project |

Use whichever path style suits your workflow. The short form `/jobs/{jobId}/` resolves the project automatically.
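If you build these paths in scripts, a pair of small helpers keeps the two styles consistent. This is a minimal sketch based only on the table above; `dataset_path` and `job_path` are hypothetical names, not part of C3.

```python
from typing import Optional


def dataset_path(name: str, project: Optional[str] = None) -> str:
    """Build a dataset path, project-scoped when a project is given."""
    if project is None:
        return f"/datasets/{name}/"
    return f"/projects/{project}/data/{name}/"


def job_path(job_id: str, project: Optional[str] = None) -> str:
    """Build a job artifact path, project-scoped when a project is given."""
    if project is None:
        return f"/jobs/{job_id}/"
    return f"/projects/{project}/jobs/{job_id}/"


print(dataset_path("my-dataset"))            # /datasets/my-dataset/
print(dataset_path("my-dataset", "vision"))  # /projects/vision/data/my-dataset/
print(job_path("1234", "vision"))            # /projects/vision/jobs/1234/
```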

## Upload a dataset

```
c3 data cp ./local-data/ /datasets/my-dataset/
```

This uploads your data to C3's centralised storage. After this initial upload, every `c3 deploy` that references the dataset reads it directly from the storage network, with no re-upload needed.

C3 uses content-addressed deduplication: each file is hashed (SHA-256) before upload, and if a file with the same content already exists in storage, its upload is skipped. Re-uploading a dataset with minor changes therefore transfers only the files that actually changed, and total storage usage can be lower than with per-copy storage because identical files are never stored twice (see [How deduplication works](#how-deduplication-works) below).
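The skip-if-present check can be sketched in a few lines of Python. This illustrates the general technique, not C3's actual client code: `files_to_upload` is a hypothetical helper, and `known_hashes` stands in for whatever the storage server reports as already stored.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_upload(root: Path, known_hashes: set) -> list:
    """Return files under root whose content hash is not already stored.

    known_hashes is a hypothetical stand-in for the hashes the storage
    server already holds.
    """
    pending = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if sha256_of(path) not in known_hashes:
            pending.append(path)
    return pending


# Demo on a throwaway directory: a.csv is already "stored", b.csv is new.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.csv").write_bytes(b"1,2,3\n")
    (root / "b.csv").write_bytes(b"4,5,6\n")
    known = {sha256_of(root / "a.csv")}
    pending_names = [p.name for p in files_to_upload(root, known)]

print(pending_names)  # ['b.csv']
```

Note that every file is still read and hashed locally; only the transfer is skipped.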

## Browse data

Use `c3 data ls` to browse datasets, versions, and files:

```
c3 data ls /datasets/                          # List all datasets
c3 data ls /datasets/my-dataset/               # List versions
c3 data ls -l /datasets/my-dataset/@latest/    # List files in latest version
```

## Mount a dataset in a job

Reference the dataset in your `.c3` config:

```
datasets:
  - ref: /datasets/my-dataset
    mount: /data/my-dataset
```

Once referenced, C3 handles moving the data to whichever GPU your job lands on. From your script's perspective, the files are simply local at the mount path. You read them like any other files:

```
import numpy as np

data = np.loadtxt("/data/my-dataset/measurements.csv", delimiter=",")
```

### Mount path rules

* Mount paths must be **absolute** (start with `/`). Relative paths are rejected at submission time with a clear error, with one exception noted below.
* If `mount` is omitted, it is auto-derived as `/data/<dataset-name>` (e.g., `/datasets/cifar10` becomes `/data/cifar10`).
* Exception: in `.c3` YAML, a relative mount like `mydata` is not rejected but auto-prefixed to `/data/mydata`.
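The rules above can be sketched as a small resolver. This is a hypothetical helper, not C3's implementation: `resolve_mount` is an illustrative name, and it models the `.c3` YAML behaviour, where a relative mount is prefixed with `/data/` rather than rejected.

```python
import posixpath
from typing import Optional


def resolve_mount(ref: str, mount: Optional[str] = None) -> str:
    """Resolve the mount path for a dataset reference (illustrative only)."""
    if not mount:
        # Omitted: derive from the last component of the dataset reference.
        name = posixpath.basename(ref.rstrip("/"))
        return f"/data/{name}"
    if mount.startswith("/"):
        # Already absolute: use as given.
        return mount
    # Relative mount, as written in .c3 YAML: prefix with /data/.
    return f"/data/{mount}"


print(resolve_mount("/datasets/cifar10"))                 # /data/cifar10
print(resolve_mount("/datasets/cifar10", "mydata"))       # /data/mydata
print(resolve_mount("/datasets/cifar10", "/data/train"))  # /data/train
```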

### Local directories

You can reference a local directory as a dataset. C3 auto-uploads it before submitting the job:

```
datasets:
  - ref: ./local-data
    mount: /data/train
```

This is equivalent to running `c3 data cp ./local-data/ /datasets/...` yourself, but handled automatically.

## Versioning

Every upload creates a new version, so a job can always be tied to exactly the data it ran against. Use `c3 data log` to list a dataset's versions:

```
c3 data log /datasets/my-dataset/
```

```
VERSION   CREATED              FILES   SIZE
v3        2024-01-15 10:00:00  1000    2.5GB
v2        2024-01-10 09:00:00  1000    2.4GB
v1        2024-01-05 08:00:00  500     1.2GB
```

Jobs reference the latest version by default, or you can pin to a specific version for reproducibility.
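If you want to record which version a run used, for example in an experiment log, the version table can be parsed from a script. The sketch below assumes the column layout of the sample output above; real output may differ, and C3 may offer a machine-readable format that is not covered here. `parse_log` is a hypothetical helper.

```python
from datetime import datetime

# Sample text copied from the `c3 data log` output shown above.
SAMPLE = """\
VERSION   CREATED              FILES   SIZE
v3        2024-01-15 10:00:00  1000    2.5GB
v2        2024-01-10 09:00:00  1000    2.4GB
v1        2024-01-05 08:00:00  500     1.2GB
"""


def parse_log(text: str) -> list:
    """Parse the whitespace-separated version table into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        version, date, time, files, size = line.split()
        rows.append({
            "version": version,
            "created": datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
            "files": int(files),
            "size": size,
        })
    return rows


versions = parse_log(SAMPLE)
latest = max(versions, key=lambda row: row["created"])
print(latest["version"])  # v3
```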

## How deduplication works

All data in C3 (datasets, workspaces, and job artifacts) uses the same content-addressed storage. Every file is stored as a **blob** keyed by its SHA-256 hash, and a **manifest** lists which blobs make up each dataset, workspace, or set of job artifacts.

This means:

* **Cross-job dedup**: If two jobs produce identical output files, the data is stored once
* **Workspace dedup**: Re-deploying the same code skips uploading unchanged files
* **Cross-dataset dedup**: Identical files shared across datasets use the same storage
* **Fast re-uploads**: `c3 data cp` uploads only files whose content has changed

Deduplication is automatic and transparent. Artifacts still appear per-job (each job has its own listing), but identical files across jobs share storage behind the scenes.
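The blob-plus-manifest model can be illustrated with a toy in-memory store. This is a sketch of the general content-addressing idea only; `ContentStore` and its methods are hypothetical and say nothing about how C3 implements its storage.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (illustrative only)."""

    def __init__(self) -> None:
        self.blobs = {}      # SHA-256 hex digest -> file content
        self.manifests = {}  # manifest name -> {relative path: digest}

    def put(self, name: str, files: dict) -> int:
        """Store files under a manifest; return how many new blobs were written."""
        written = 0
        manifest = {}
        for path, content in files.items():
            key = hashlib.sha256(content).hexdigest()
            if key not in self.blobs:
                # Content not seen before: store it once.
                self.blobs[key] = content
                written += 1
            manifest[path] = key
        self.manifests[name] = manifest
        return written

    def read(self, name: str, path: str) -> bytes:
        """Look up a file through its manifest, then fetch the shared blob."""
        return self.blobs[self.manifests[name][path]]


store = ContentStore()
first = store.put("/jobs/job-a/", {"model.bin": b"weights", "log.txt": b"run a"})
second = store.put("/jobs/job-b/", {"model.bin": b"weights", "log.txt": b"run b"})
print(first, second, len(store.blobs))  # 2 1 3
```

Each job keeps its own manifest, so listings stay per-job, while the identical `model.bin` content is held only once.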

## Data commands

| Command                | Description                                   |
| ---------------------- | --------------------------------------------- |
| `c3 data ls /path/`    | List files, datasets, or job artifacts        |
| `c3 data cp SRC DST`   | Copy files (upload or download)               |
| `c3 data rm -r /path/` | Delete a dataset (requires `-r` for datasets) |
| `c3 data du /path/`    | Show disk usage                               |
| `c3 data log /path/`   | Show version history                          |
