Forgather Dataset Server¶
A Uvicorn + FastAPI server that exposes the Forgather
fast_load_iterable_dataset machinery over HTTP. The same
fast_load_iterable_dataset(...) call routes locally when the
FORGATHER_DATASET_SERVER env var is unset and through the
server when it's set. What flows over the wire is a thin
RemoteBackend wrapped in the usual ComposableIterableDataset,
so client-side .shuffle(), .shard(), .map(), .state_dict()
etc. all "just work." No template / config changes on the client.
Two use cases this is built for:
- Multi-node training. One node in the cluster hosts the datasets (HF cache, named local datasets); every other node consumes them remotely. This avoids each rank re-downloading a multi-TB dataset just to start training.
- Remote workstation, single-host data. You have one machine with all your large datasets (HuggingFaceTB/smollm-corpus, allenai/c4, …) and you want to run training from a different site that doesn't have them mirrored. Forward the server's port over SSH and the loader on the remote workstation transparently uses the data on the dataset host, without exposing the dataset port on the public internet:
# On the dataset host (one-time):
forgather dataset-server start
# → prints `dataset_server auth token: <hex>`; copy it.
# On the remote workstation:
ssh -N -L 8766:127.0.0.1:8766 datahost &
export FORGATHER_DATASET_SERVER=http://localhost:8766
export FORGATHER_DATASET_SERVER_TOKEN=<hex from above>
forgather train -t my_config.yaml
The token must be set explicitly here. Localhost auto-discovery
would otherwise try to read
~/.config/forgather/dataset_server/8766.token on the workstation,
because the loader sees a loopback URL and can't tell that the
tunnel terminates on a different host. That file either doesn't
exist or holds a token unrelated to the dataset_server you're
tunneling to. Setting FORGATHER_DATASET_SERVER_TOKEN
short-circuits the file lookup. See Authentication.
Quick start¶
Same host (zero-config auth)¶
# Terminal 1 — start the server. Token is auto-generated and
# written to ~/.config/forgather/dataset_server/8766.token (mode 0600).
forgather dataset-server start
# Terminal 2 — point clients at it. The localhost token is
# auto-discovered by the loader and the diagnostic CLI; no token
# env var needed.
export FORGATHER_DATASET_SERVER=http://localhost:8766
# Confirm it's reachable.
forgather dataset-server status
# Any forgather command that loads datasets now routes through
# the server transparently.
forgather train
Cross-host (explicit token)¶
# Server (data host) — bind to a real interface; auto-token is
# printed to stderr on startup. Distribute that token to the
# clients via your usual secret channel (config-management,
# secret store, scp, etc.) and write it to ~/.fdss.token there
# with mode 0600.
#
# Cross-host bind without TLS is refused unless --insecure. Run
# `forgather tls init` on this host first (and `forgather tls
# install` on every client) — see docs/operations/tls.md.
forgather dataset-server start --host 0.0.0.0 \
--local stories=/data/tinystories
# Client (training nodes) — read the token from the file rather
# than pasting it inline (keeps it out of shell history and `ps`
# output).
chmod 600 ~/.fdss.token
export FORGATHER_DATASET_SERVER=http://datahost:8766
export FORGATHER_DATASET_SERVER_TOKEN="$(cat ~/.fdss.token)"
# Sanity check: status hits /v1/health + /v1/auth/status.
forgather dataset-server status
# Then your normal training command — datasets are now served
# over the wire:
forgather train
Cross-host via the webui bundle¶
For installs where the dataset_server was spawned via the forgather_server webui (Tools → Start Dataset Server…), there's a one-click cross-host transfer that bundles the URL and token into a single string:
- On the source machine, open Datasets → Servers, find the running dataset_server in the Local list, and click Copy bundle. The clipboard now contains a forgather-dataset://host:port/?token=<urlencoded> URI.
- On the destination machine, open the same view and click + Add server. In the modal, click Paste bundle from clipboard; the URL and Auth-token fields are populated in one step.
Treat the bundle as a credential — it is equivalent to an SSH private
key on the wire. Clipboard managers often sync content across devices
and may not redact token= query strings the way they do
password=… patterns. SSH-forward the port and skip the bundle
entirely if you don't trust the channel.
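To make it concrete what the bundle carries, the sketch below splits a forgather-dataset:// URI into host, port, and token using only the standard library. The helper name parse_bundle is hypothetical and this is not the webui's actual parser; it assumes nothing beyond the URI shape quoted above.

```python
from urllib.parse import parse_qs, urlsplit


def parse_bundle(bundle: str) -> tuple[str, int, str]:
    """Split a forgather-dataset:// bundle into (host, port, token).

    Hypothetical helper: assumes only the URI shape shown in this section.
    """
    parts = urlsplit(bundle)
    if parts.scheme != "forgather-dataset":
        raise ValueError(f"not a dataset bundle: scheme {parts.scheme!r}")
    if parts.hostname is None or parts.port is None:
        raise ValueError("bundle must carry both host and port")
    tokens = parse_qs(parts.query).get("token", [])
    if len(tokens) != 1:
        raise ValueError("bundle must carry exactly one token")
    # parse_qs has already percent-decoded the token value.
    return parts.hostname, parts.port, tokens[0]
```

Because the token travels in the query string, anything that logs or syncs the full URI also captures the credential, which is why the bundle deserves the same care as the token itself.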
For the full token-resolution order (explicit kwarg →
FORGATHER_DATASET_SERVER_TOKEN → per-port localhost file →
none) see Authentication below.
Installation¶
The base Forgather install is sufficient — fastapi, uvicorn,
and httpx are already installed for the other servers.
The server runs as a stand-alone executable, as a Python module, or via the Forgather CLI. See Running below.
Authentication¶
The server requires a bearer token by default. Multi-user hosts share the same loopback addresses, so without auth any local user on the box could pull whatever datasets the server has cached.
This token model mirrors tools/inference_server exactly — same
file layout, same auto-discovery semantics, same --no-auth knob.
For the broader threat-model picture (how this token interacts with
forgather-server's bearer token and the per-job trainer-control
token), see the
forgather server threat model.
Default behaviour — on first start, the server generates a random
64-hex-char token and writes it to a per-port file under
~/.config/forgather/dataset_server/<port>.token (mode 0600 in a 0700
directory). The token is printed on stderr:
dataset_server auth token: 8f5b...
clients must send 'Authorization: Bearer <token>'
curl -H "Authorization: Bearer 8f5b..." http://127.0.0.1:8766/v1/datasets
persisted token file: /home/<you>/.config/forgather/dataset_server/8766.token
The token persists across restarts (mirroring how
forgather server handles its own bearer). On every subsequent
startup the server loads the existing per-port file rather than
minting a fresh value — so peers on other nodes that pulled the
token last week keep working through a restart. The stderr banner
on a reused token shows:
Rotating the token — pass --regen-token to mint a fresh value
and overwrite the persisted file. The startup banner gains a louder
header so the operator sees that any client still using the old
token is about to start 401-ing. After rotation, redistribute the
new token to peers (e.g. via the Datasets → Servers view's "+ Add server"
modal on each consumer).
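The persist-then-reuse behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: the function name load_or_create_token and its arguments are hypothetical, and the token directory is passed in rather than derived from the forgather config dir.

```python
import os
import secrets
from pathlib import Path


def load_or_create_token(token_dir: Path, port: int, regen: bool = False) -> str:
    """Return the persisted per-port token, minting one when needed.

    Hypothetical helper; `regen=True` stands in for --regen-token.
    """
    token_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    token_file = token_dir / f"{port}.token"
    if token_file.exists() and not regen:
        # Reuse the existing token so peers survive a server restart.
        return token_file.read_text().strip()
    token = secrets.token_hex(32)  # 32 random bytes -> 64 hex characters
    # Create with 0600 from the start so the token is never world-readable.
    fd = os.open(token_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token + "\n")
    return token
```

Opening the file with an explicit mode, rather than writing first and running chmod afterwards, closes the short window in which the file would otherwise exist with default permissions.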
RemoteBackend, the loader's _remote_load_iterable_dataset, and
the forgather dataset-server diagnostic CLI all auto-discover the
token when their URL is loopback (127.0.0.1, ::1, localhost).
So a client running on the same host as the server picks up the
token with no flags on either side.
The lookup is keyed by port. A server bound to 0.0.0.0:8766 and a
client connecting to http://127.0.0.1:8766 share the same file.
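A minimal sketch of the client-side lookup, assuming the resolution order stated in this document (explicit token, then FORGATHER_DATASET_SERVER_TOKEN, then the per-port file for loopback URLs, then none). The function name resolve_token and the token_dir parameter are hypothetical, and skipping the file lookup when the URL has no explicit port is an assumption of this sketch.

```python
import os
from pathlib import Path
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"127.0.0.1", "::1", "localhost"}


def resolve_token(url, explicit=None, environ=None, token_dir=None):
    """Resolve a bearer token for `url`, or return None if nothing applies."""
    environ = os.environ if environ is None else environ
    if explicit:
        return explicit
    env_token = environ.get("FORGATHER_DATASET_SERVER_TOKEN")
    if env_token:
        return env_token
    parts = urlsplit(url)
    # The file lookup is keyed by port and only attempted for loopback URLs.
    if parts.hostname in LOOPBACK_HOSTS and parts.port is not None:
        base = Path(token_dir) if token_dir else Path.home() / ".config/forgather/dataset_server"
        token_file = base / f"{parts.port}.token"
        if token_file.is_file():
            return token_file.read_text().strip()
    return None
```

This also shows why an SSH-tunnelled URL needs the env var: the URL looks like loopback, so the lookup would read a file on the wrong machine.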
Supplying a known token — --auth-token TOKEN or
--auth-token-file PATH (mode 0600). Prefer the file form for
orchestrators since --auth-token is visible to other local users
via ps. When you supply a token explicitly the server does not
publish it to the shared file — operator-managed tokens stay where
the operator put them.
Disabling auth — --no-auth removes the bearer-token gate
entirely. The startup banner warns prominently. Only use it on
hosts where you're the only user, or where the bind address is on
a trusted network (e.g. an internal cluster interface).
Override env / explicit token on the client side:
# Force a specific token (override the file lookup):
export FORGATHER_DATASET_SERVER_TOKEN="..."
# Or pass per-invocation:
forgather dataset-server status --token "..."
Loading policy¶
The server has an explicit policy gate over POST /v1/load. This
controls what kinds of datasets the server is willing to
serve — separate from auth (who is allowed to ask).
Three knobs (all default to the safe option):
| Flag | Default | What it gates |
|---|---|---|
| --no-hf | off (HF cache enabled) | Loading any HuggingFace dataset id (e.g. allenai/c4). With --no-hf only local/* mappings work. |
| --allow-paths | off (paths rejected) | Loading by absolute filesystem path. Off by default; clients should use named local/* mappings instead. |
| --allow-downloads | off (cache-only) | Letting HF downloads happen when the dataset isn't cached. The server runs HF loads with HF_DATASETS_OFFLINE=1 unless this flag is set, so a cache miss surfaces as a 404 instead of starting a multi-hour download. |
Resolution order for a POST /v1/load request:
- If path starts with local/<name>: look up the local mapping. 404 if unknown. If found, the resolved filesystem path is loaded; this is always allowed, no --allow-paths needed.
- Else if path is an existing filesystem path: requires --allow-paths. Otherwise 403.
- Else (assume HF dataset id): requires HF cache to be enabled (so not --no-hf). The load runs offline unless --allow-downloads; a cache miss surfaces as 404.
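The resolution order above can be sketched as a pure function. This is an illustration, not the server's code: classify_load and its return shape are hypothetical, the 403 returned for an HF id under --no-hf is an assumption (the text does not name that status), and the cache-miss 404 is left out because it depends on the actual HF load.

```python
import os


def classify_load(path, local_map, no_hf=False, allow_paths=False, exists=os.path.exists):
    """Return (status, detail) for a POST /v1/load path under the policy flags."""
    if path.startswith("local/"):
        name = path[len("local/"):]
        if name not in local_map:
            return 404, "unknown local mapping"
        # Named locals are always allowed; --allow-paths is not consulted.
        return 200, local_map[name]
    if exists(path):
        if not allow_paths:
            return 403, "path loading disabled"
        return 200, path
    if no_hf:
        # Assumed status code; the document only says HF must be enabled.
        return 403, "HF loading disabled"
    return 200, f"hf:{path}"
```

The exists parameter is injectable so the ordering can be exercised without touching the real filesystem.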
Named local datasets¶
The preferred way to expose a local dataset is via --local:
forgather dataset-server start \
--local stories=/data/tinystories \
--local mycorpus=/data/corpora/2024-01
Clients then request local/stories or local/mycorpus, with
no need to know the server-side filesystem path.
--local is repeatable. The path must exist at server startup
(checked by argparse). Names must not contain /. The
local/ prefix in the client request is fixed (similar to the
HF namespace/name convention).
Why is path-loading off by default? Two reasons:
- It exposes the server's filesystem layout to whoever can reach the bind port.
- It makes the client request non-portable across nodes: the same path may not exist on every server. Named locals are an abstraction that keeps the client request stable.
Configuration file¶
For repeated invocations or many local mappings it's often easier
to load the server's options from a YAML file instead of typing
flags. Pass it via --config FILE.
The YAML keys mirror the CLI flag names with - replaced by _.
Boolean flags (--no-auth, --no-hf, --allow-paths,
--allow-downloads) are written as YAML booleans. The local
field uses a mapping (more idiomatic in YAML than the CLI's
repeated NAME=PATH form):
# ~/dataset_server.yaml
host: 0.0.0.0
port: 8766
log_level: INFO
# Auth (pick at most one)
no_auth: false # default: bearer-token auth on
# auth_token_file: ~/.fdss.token
# Loading policy
no_hf: false # default: HF cache enabled
allow_paths: false # default: path loading off
allow_downloads: false # default: cache-only HF
# Named local datasets — mapping form.
local:
stories: /data/tinystories
mycorpus: /data/saved_corpus
fineweb: /data/hf_caches/fineweb-edu/snapshots/abc123
Then start it with:
forgather dataset-server start --config ~/dataset_server.yaml
Default config path¶
If you omit --config, the server looks for
<forgather_config_dir>/dataset_server/config.yaml (on Linux,
~/.config/forgather/dataset_server/config.yaml) and loads it
when present. Missing default is silently ignored; an explicit
--config that points at a missing path still errors out.
The directory is the same one that already holds the per-port
auth-token files (<port>.token), so a single dataset_server/
directory under your forgather config dir contains all of the
tool's persistent state.
A startup line at INFO level (loaded default config: …) records
which file the server picked up so it's obvious whether the
default was applied. CLI-flag precedence is unchanged: anything
you pass on the command line still overrides the file values.
Precedence¶
CLI flags always win. Anything you don't pass on the command line falls back to the config-file value, and anything missing from both falls back to the script's built-in default.
--local is the one merge case rather than override case: the
file's local: mapping and any --local NAME=PATH flags on the
command line are unioned (CLI wins on a name conflict). This
matches the typical workflow of keeping permanent local mappings
in a config file and adding a one-off --local foo=/tmp/scratch
on the command line for an ad-hoc experiment.
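The precedence rules can be sketched as a small merge function. It is a sketch under assumptions: merge_options is a hypothetical name, and it assumes the argument parser reports flags that were not passed as None so they can be told apart from explicit values.

```python
def merge_options(defaults, file_cfg, cli_args):
    """Combine options so that CLI beats config file beats built-in default.

    `local` is the one merged key: the file mapping and the CLI mapping are
    unioned, with the CLI entry winning on a name conflict.
    """
    merged = dict(defaults)
    merged.update({k: v for k, v in file_cfg.items() if k != "local"})
    # None means "flag not passed on the command line" in this sketch.
    merged.update({k: v for k, v in cli_args.items() if k != "local" and v is not None})
    local = dict(file_cfg.get("local") or {})
    local.update(cli_args.get("local") or {})
    merged["local"] = local
    return merged
```

For example, a file that sets port 9000 and a stories mapping, combined with a command line that adds foo=/tmp/scratch, yields port 9000 and both mappings.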
What's allowed¶
Every CLI flag is a legal config key, except --config itself
(can't recurse) and --help. Unknown keys cause the server to
exit with an error so typos surface immediately rather than being
silently ignored.
PyYAML is required to parse the file (the server imports it
lazily); install via pip install pyyaml if it isn't already
present from the rest of forgather's dependencies.
Running¶
Stand-alone executable¶
# In-tree:
./tools/dataset_server/server.py --help
# Or via the interpreter:
python tools/dataset_server/server.py --help
python -m tools.dataset_server --help
Via the forgather CLI¶
start is a REMAINDER passthrough — every flag after start
goes to the underlying script unchanged. --help is forwarded.
Examples¶
# Default: HF cache enabled (cache-only, no downloads), no
# locals, paths disabled, auth on (auto-token).
forgather dataset-server start
# Cache-only HF + a couple of named locals.
forgather dataset-server start \
--local stories=/data/tinystories \
--local mycorpus=/data/saved_corpus
# Lock down to local mappings only — no HF, no paths.
forgather dataset-server start \
--no-hf \
--local foo=/data/foo \
--local bar=/data/bar
# Trusted-LAN mode: bind everywhere, disable auth.
forgather dataset-server start -H 0.0.0.0 --no-auth
# Allow path-based loads (development convenience):
forgather dataset-server start --allow-paths
# Allow HF downloads on cache miss (rare for a server role):
forgather dataset-server start --allow-downloads
# Rotate the persisted per-port auth token. Use after a suspected
# token compromise; existing peers will need to re-pull the new
# token from the server's stderr banner.
forgather dataset-server start --regen-token
The default port is 8766. The forgather orchestration server uses 8765, so the defaults do not collide; if you override either port on a shared host, keep the two distinct.
Client routing (the env-var workflow)¶
# Terminal 1
forgather dataset-server start --local stories=/data/tinystories
# Terminal 2
export FORGATHER_DATASET_SERVER=http://localhost:8766
# Now any forgather command that calls fast_load_iterable_dataset
# transparently routes through the server. Example using a stock
# dataset config:
forgather -t fast-iter.yaml dataset --target train_dataset_split -n 3
The same call path works in training: every config template that
uses fast_load_iterable_dataset gets routed automatically when
the env var is set. No template edits, no client-side code changes.
To go back to local loading, unset the env var.
Cluster auto-routing (FORGATHER_DATASET_SERVER=auto)¶
When a forgather_server is running with --cluster <name>, the
client supports a special sentinel:
export FORGATHER_DATASET_SERVER=auto
In this mode each fast_load_iterable_dataset call talks to the
local forgather_server's
/api/cluster/dataset_router/resolve?path=... endpoint, which
returns {base_url, auth_token} for a healthy dataset_server
somewhere in the cluster. The choice is uniform-random across
healthy candidates for crude load balance.
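The selection step described above amounts to a uniform draw over the healthy subset. A minimal sketch, assuming a hypothetical candidate record with base_url, auth_token, and healthy keys (the router's real data structures are not shown in this document):

```python
import random


def pick_server(candidates, rng=random):
    """Pick one healthy candidate uniformly at random.

    `candidates` is a hypothetical list of dicts with "base_url",
    "auth_token", and "healthy" keys.
    """
    healthy = [c for c in candidates if c.get("healthy")]
    if not healthy:
        raise LookupError("no healthy dataset_server available")
    choice = rng.choice(healthy)
    return {"base_url": choice["base_url"], "auth_token": choice["auth_token"]}
```

Passing a seeded random.Random instance as rng makes the draw reproducible in tests.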
The client wrapper
(forgather.ml.datasets.resilient_remote_backend.ResilientRemoteBackend)
re-runs the resolver on every (re)connect, so a server that goes
down mid-iteration causes the next fetch to land on a different
healthy server — training continues from the captured stream
position with no operator intervention. The retry budget is
unbounded by default; set
FORGATHER_DATASET_CLIENT_MAX_RETRY_SECONDS to cap cumulative
backoff time and abort training after a sustained outage.
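The reconnect-and-resume idea can be sketched as a generator that tracks how many examples it has delivered. This is not ResilientRemoteBackend itself: resilient_iter and open_stream are hypothetical names, the exponential backoff schedule is an assumption, and so is resetting the retry budget whenever an example is delivered.

```python
import time


def resilient_iter(open_stream, position=0, max_retry_seconds=None, sleep=time.sleep):
    """Yield examples, reconnecting at the last delivered position on failure.

    open_stream(position) is a hypothetical callable returning an iterator of
    examples starting at flat index `position`; it may raise ConnectionError.
    A real client would re-run the cluster resolver inside open_stream.
    """
    waited = 0.0
    delay = 1.0
    while True:
        try:
            for example in open_stream(position):
                position += 1
                waited, delay = 0.0, 1.0  # progress resets the outage budget
                yield example
            return  # stream ended normally
        except ConnectionError:
            # max_retry_seconds=None mirrors the unbounded default.
            if max_retry_seconds is not None and waited + delay > max_retry_seconds:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, 30.0)
```

Because every request carries its own start position, the replacement server needs no knowledge of what the failed one had already sent.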
local/<name> is a global routing key: two dataset_servers
that both register local/stories (even pointing at different
filesystem paths underneath) are treated as interchangeable
replicas. This is the intended redundancy/load-balance model;
operators are responsible for choosing distinct names for distinct
datasets, as they already are within a single server.
For HF datasets and absolute paths the router picks any healthy
server and lets it /v1/load on demand. If that server can't
serve the request (cache miss, path loading not enabled, etc.)
the client catches the failure, re-resolves, and tries another.
See tools/forgather_server/README.md → Multi-node dataset routing
for the master-side machinery (collect / health / refresh loops),
the CLI diagnostics, and the Cluster tab in the webui.
Diagnostic CLI¶
forgather dataset-server status # health + auth + policy
forgather dataset-server list # loaded handles
forgather dataset-server cache # HF cache contents on the server
forgather dataset-server local # configured local mappings
All accept --server URL (default $FORGATHER_DATASET_SERVER,
falling back to http://127.0.0.1:8766), --token TOKEN
(falling back to $FORGATHER_DATASET_SERVER_TOKEN, then to the
per-port localhost file), and --json for machine output.
The cache action is the introspection feature — it reports
which HuggingFace datasets are already in the server's
~/.cache/huggingface/datasets/ (or $HF_DATASETS_CACHE),
broken down by config and split:
$ forgather dataset-server cache
cache_root: /home/dinalt/.cache/huggingface/datasets
datasets: 19
- allenai/c4 (1.4 TB)
en @ 0.0.0 -- train=364,868,892, validation=364,608
- HuggingFaceTB/smollm-corpus (1.1 TB)
cosmopedia-v2 @ 0.0.0 -- train=39,134,000
fineweb-edu-dedup @ 0.0.0 -- train=190,168,005
python-edu @ 0.0.0 -- train=7,678,448
- ...
Use this when you're not sure whether a particular HF dataset is already pre-warmed on the host you're about to serve from.
HTTP API¶
All /v1/* endpoints (other than the open ones) require
Authorization: Bearer <token> unless the server was started
with --no-auth. JSON bodies / responses; the streaming format
on /iter is newline-delimited JSON.
Open endpoints¶
- GET /v1/health: returns service / version / current policy. Doesn't require auth so health checks work without managing tokens.
- GET /v1/auth/status: {"auth_required": bool}. Lets clients detect --no-auth mode.
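The gate implied by the open/gated split can be sketched as a single predicate. This is an assumption-laden illustration rather than the server's middleware: is_authorized and OPEN_PATHS are hypothetical names, and the constant-time comparison is a common practice that this document does not confirm the server uses.

```python
import hmac

# The two endpoints this section lists as open.
OPEN_PATHS = {"/v1/health", "/v1/auth/status"}


def is_authorized(path, authorization, expected_token, no_auth=False):
    """Decide whether a request may proceed, given its Authorization header."""
    if no_auth or path in OPEN_PATHS:
        return True
    if not authorization or not authorization.startswith("Bearer "):
        return False
    presented = authorization[len("Bearer "):]
    # Constant-time comparison avoids leaking token prefixes via timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```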
Gated endpoints¶
- GET /v1/datasets: list currently loaded handles, with length, source, and load_args.
- GET /v1/datasets/{handle}: handle metadata.
- GET /v1/datasets/{handle}/length: {"length": int}.
- GET /v1/datasets/{handle}/iter?seed=&position=&limit=: NDJSON stream of examples. seed reshuffles before iterating; position seeks to a flat example index first; limit caps the number returned.
- POST /v1/load: body {"path", "name", "split", "data_files", "revision"} mirroring fast_load_iterable_dataset(...). Returns {"handle", "length", "source", "load_args"}. Subject to the loading-policy gate above. Cached by hash of the canonicalized args, so the same request returns the same handle.
- GET /v1/cache/hf: list HF datasets cached on the server's host. Walks $HF_DATASETS_CACHE (default ~/.cache/huggingface/datasets/).
- GET /v1/local: list registered local/* mappings.
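For readers who want to script against these endpoints directly, the sketch below builds the requests with the standard library without sending them. The helper names load_request, iter_url, and parse_ndjson are hypothetical, and treating seed, position, and limit as optional query parameters is an assumption; only the paths, body keys, and header format come from this document.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def load_request(base_url, token, path, **load_args):
    """Build (but do not send) a POST /v1/load request."""
    body = {"path": path, **load_args}
    return Request(
        f"{base_url.rstrip('/')}/v1/load",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


def iter_url(base_url, handle, seed=None, position=None, limit=None):
    """Build the /iter URL, omitting query parameters left as None."""
    pairs = (("seed", seed), ("position", position), ("limit", limit))
    query = {k: v for k, v in pairs if v is not None}
    url = f"{base_url.rstrip('/')}/v1/datasets/{handle}/iter"
    return f"{url}?{urlencode(query)}" if query else url


def parse_ndjson(lines):
    """Yield one decoded example per non-empty NDJSON line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)
```

A request built this way can be passed to urllib.request.urlopen, and the response object iterated line by line through parse_ndjson.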
Architecture notes¶
- Stateless wrt clients: every /iter call carries the (seed, position) it should start from. No per-client state on the server; multiple clients can share a handle without trampling each other's iteration cursors.
- Handle cache: keyed by sha256(canonicalize(load_args)). Loaded backends live for the lifetime of the server (no LRU eviction; intentional, see Out of scope).
- Loaded backend = pure storage: the server stores the ArrowBackend underneath the ComposableIterableDataset the loader returned. The wrapper layer (slice, shard, map, shuffle buffer, state_dict, multi-worker, length estimation) lives client-side on the RemoteBackend wrapper. Clients apply their own slice / shard / map without round-tripping it through the server.
- Anti-recursion: the server lazy-loads via _local_load_iterable_dataset (a public-but-underscore helper in fast_hf_loader.py) which always loads locally, bypassing the FORGATHER_DATASET_SERVER env var. This means the server can have the env var set in its own environment without looping back to itself.
- Per-port token: matches the inference server. Each dataset_server instance has its own token file (<port>.token), so you can run multiple instances on the same host without aliasing.
- Streaming: StreamingResponse over application/x-ndjson. Clients consume line-by-line. BrokenPipeError / ConnectionResetError on the server side are normal end-of-stream conditions when the client closes early; both are logged at INFO with the example count actually emitted.
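The handle-cache key named above, sha256(canonicalize(load_args)), can be illustrated as follows. The canonicalization shown here (drop None values, sort keys, compact JSON) is an assumption made for the sketch; the document does not spell out what the server's canonicalize step does, and handle_key is a hypothetical name.

```python
import hashlib
import json


def handle_key(load_args: dict) -> str:
    """Derive a stable cache key from load arguments.

    Assumed canonical form: None values dropped, keys sorted, compact JSON.
    """
    canonical = json.dumps(
        {k: v for k, v in load_args.items() if v is not None},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this scheme two clients that send the same arguments in a different key order land on the same handle, which is the property the handle cache relies on.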
Security considerations¶
The dataset_server is intentionally minimal and the threat model is narrow. This section is the operator's "what am I signing up for" checklist when deciding whether (and how) to expose it on a network.
Trust the dataset_server you point training at¶
The most important rule: every byte the server returns ends up in your training pipeline. A malicious or compromised dataset_server can:
- Poison the model. Crafted examples can teach the model whatever the attacker wants: backdoors, biased completions, content that violates policy, prompt-injection payloads that surface during inference. There is no integrity check on dataset content; the wire format is "trust what the server sends." Treat a dataset_server URL the same way you treat a pip install source.
- Exhaust resources. The NDJSON /iter stream is unbounded. A hostile server can keep emitting examples until the trainer fills its tokenizer cache / disk / RAM. Mitigation: set training-side max_steps / dataset budgets so a run is bounded regardless. See Trainer Options for the full list of TrainingArguments knobs.
- Probe internals. Examples are JSON. A server can return strings containing HTML/JS, which the webui's Datasets → Servers tab renders inside <pre> (inert), but the same strings flow into the trainer's example queue and through to checkpoints. If those ever surface in a downstream tool that does render HTML, you've shipped XSS through your training data.
If you didn't deploy the server yourself, or you can't audit who else has shell access to its host, don't add its URL to the registry. The "+ Add server" dialog in the webui is gated by the operator's explicit consent for exactly this reason — registering a URL is the authorization decision; the proxy will then happily forward to it.
Network exposure¶
- TLS is opt-in. Run forgather tls init once on the host and the dataset server picks up HTTPS off the shared CA at ~/.config/forgather/tls/. Without it, the server refuses to bind non-loopback hosts unless --insecure is passed. Full walkthrough: docs/operations/tls.md. Alternatives:
  - keep the default loopback bind (--host 127.0.0.1) and reach it via SSH port forwarding;
  - put it behind a reverse proxy that terminates TLS.
- --no-auth removes the bearer-token gate. Anyone who can reach the bind port can list and stream every dataset the server has cached, plus the local/* mappings. On a multi-user host, loopback ports are not isolated by uid, so --no-auth on 127.0.0.1 is still reachable by every local account. Only use --no-auth on a trusted-LAN bind to a single-user box.
- Auth token storage. The auto-generated token lives at <forgather_config_dir>/dataset_server/<port>.token (mode 0600 in a mode-0700 dir) and persists across restarts. Anyone with root on the host, or anyone who compromises the user account that started the server, gets the token. Treat it as a uid-level credential. To rotate after a suspected compromise, run the server with --regen-token once.
Server-side knobs and what they expose¶
All three loading-policy flags default to the safe choice. Each opt-in widens the server's exposure in a specific way:
- --allow-paths lets clients request loads by absolute filesystem path. Off by default. Turning it on means:
  - A client can probe for the existence of paths on the server host (POST /v1/load with various paths returns 200 vs. 404).
  - A client can read any HF-loadable dataset under any path the server's uid has read access to, including paths the operator might not consider "datasets" (e.g. cached artefacts in /tmp, /home/<other_user>/... if the perms allow it).
  - Prefer named local/* mappings: they're an explicit allowlist and they hide the server-side path from the client.
- --allow-downloads lets cache misses trigger HF downloads on the server. Off by default. Turning it on means:
  - A client can fill the server's HF cache by requesting datasets that aren't there: useful for warming caches, abusive for filling disks. There's no per-client quota.
  - The server now makes outbound HTTP requests to HuggingFace on behalf of clients. If your network policy forbids that for compliance reasons, leave the default in place.
- --no-hf disables HF cache loads entirely; only local/* mappings are servable. This is the most restrictive policy, useful on a server whose only job is to host a curated dataset set.
Webui registry caveats¶
The Datasets → Servers tab's "+ Add server" registry stores {label, url,
auth_token} triples in
<forgather_config_dir>/server/dataset_server_registry.json (mode
0600). Points to keep in mind:
- It is not a credential safe. Anyone with the forgather-server bearer token can read the file via the API (or by reading the file directly if they have shell access). Treat stored dataset_server tokens with the same care as the forgather-server bearer itself.
- No URL validation beyond scheme + host parse. The proxy will refuse to forward to a non-registered, non-loopback URL — but it won't tell you the URL you registered is reachable, has a valid cert, or actually runs a dataset_server. Use the Status / Handles buttons to confirm before relying on it.
- The registry is the SSRF allowlist. That means a stolen forgather-server bearer is enough to register a new URL and then proxy to it. The token is uid-level (see the forgather-server threat model); registry abuse is bounded by what the bearer can do anyway.
Out of scope¶
The server is intentionally minimal. The following are explicit non-goals — call them out in any future PR if you want them discussed:
- Web UI. The forgather orchestration server already provides one (sidebar 🗂 Datasets) that drives a dataset_server over its same-origin proxy; see tools/forgather_server/README.md "Datasets view". The dataset_server itself stays a headless API.
- LRU / size-bound eviction of cached backends. Handles live for the lifetime of the server. For long-running servers with many distinct dataset configs this is a known limitation.
- Compression on the NDJSON stream. The wire format is plain JSON for now; the bottleneck in early measurements has been GIL contention and example construction, not bytes-on-wire.
- Rate limiting / quota.
- Inter-server auth carve-outs (the cluster-peer pattern in forgather_server). Not needed for the dataset use case.
- Sharing the forgather_server's global token. Per-port (inference-style) is the chosen model.
Related docs¶
- Loader internals: Fast HF Loader, covering what fast_load_iterable_dataset actually does on the local path.
- Checkpointing: Fast HF Loader Checkpoints, covering state_dict / load_state_dict semantics, which the RemoteBackend-based wrapper inherits unchanged.
- Other forgather servers: forgather_server, inference_server.