Multi-node training with the Forgather server¶
This guide is for operators who want to run a training job across more than one machine. It covers the practical setup, the submit flow, common failure modes you should expect to see, and the diagnostic tools available when something hangs.
If you've used the Forgather server in single-node mode (one host, one or more GPUs), the multi-node flow is a small extension: a few extra command-line flags at server startup, a new panel in the Run dialog, and the same Cluster Jobs view to monitor what's happening.
What you'll do:
- Decide whether multi-node fits your workload
- Set up the prerequisites
- Start the cluster
- Submit a multi-node job
- Monitor and diagnose
- Operational notes
Companion references for when you need them:
- Forgather Server reference — the cluster section there has the full feature/API reference; this guide extracts the operator-facing parts.
- Pipeline Parallel — when to pick PP over DDP/FSDP, how to size stages.
- Checkpointing — distributed checkpoint behaviour, including the shared-FS single-writer rules multi-node depends on.
When to use multi-node¶
Multi-node training across consumer Ethernet (1G or 2.5G) is only viable for pipeline parallel (PP). The rule of thumb:
| Strategy | Communication pattern | Practical inter-node bandwidth needed |
|---|---|---|
| Pipeline Parallel (PP) | activations between adjacent stages, once per microbatch | works on 1G+ Ethernet; this is the strategy used in Forgather's reference run |
| DDP | full gradient all-reduce every step | needs ~10G+ for non-trivial models; viable on 1G for ~1B-class |
| FSDP | parameter all-gather on every layer | needs NVLink or fast InfiniBand; near-unusable on Ethernet |
| Tensor Parallel | per-layer all-reduce | needs NVLink |
Forgather's multi-node infra is agnostic to the trainer choice — you can submit any DDP/FSDP/PP config across the cluster — but on a typical home-lab LAN, only PP will actually train at a useful rate. DDP across two boxes on a 1G link will work, but the gradient all-reduce will dominate and step time will be much longer than the single-host equivalent.
The reference end-to-end run (Llama2-7B + Samantha + 1F1B PP across a 1+2 GPU layout on a 1Gbit link) achieves all-three-GPUs-at-100% util, ~28% link utilisation, ~40% MFU. That's a healthy regime: compute is the bottleneck, network has headroom.
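To make the rule of thumb concrete, here is a back-of-envelope sketch of per-step wire time for DDP versus a single PP stage boundary. It assumes an ideal link with no protocol overhead and no compute/communication overlap, a ring all-reduce for DDP, and 2-byte (bf16) values. The parameter count, token count, and hidden size are illustrative inputs, not measurements from the reference run.

```python
GBIT_BYTES_PER_S = 1e9 / 8  # ideal 1 Gbit/s link, in bytes per second


def ddp_allreduce_seconds(n_params, bytes_per_param, n_nodes, link_bytes_per_s):
    """Wire time for one ring all-reduce of the full gradient buffer.

    In a ring all-reduce each node sends 2 * (N - 1) / N times the buffer size.
    """
    grad_bytes = n_params * bytes_per_param
    sent_per_node = 2 * (n_nodes - 1) / n_nodes * grad_bytes
    return sent_per_node / link_bytes_per_s


def pp_boundary_seconds(tokens_per_step, hidden_size, bytes_per_elem, link_bytes_per_s):
    """Wire time for one stage boundary: activations forward, gradients backward.

    Conservatively treats both directions as sharing the same link capacity.
    """
    one_way_bytes = tokens_per_step * hidden_size * bytes_per_elem
    return 2 * one_way_bytes / link_bytes_per_s


# Illustrative inputs (assumptions): 7e9 parameters, bf16, two nodes,
# 65,536 tokens per optimizer step, hidden size 4096.
ddp_s = ddp_allreduce_seconds(7e9, 2, 2, GBIT_BYTES_PER_S)
pp_s = pp_boundary_seconds(65_536, 4096, 2, GBIT_BYTES_PER_S)
print(f"DDP all-reduce: {ddp_s:.0f} s/step, PP boundary: {pp_s:.1f} s/step")
```

Under these assumptions the gradient all-reduce costs roughly an order of magnitude more wire time per step than the pipeline boundary, which is the gap the table above summarises.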
Prerequisites¶
Multi-node operation has a small but firm list of preconditions. Skip any of these and you'll hit a hard-to-diagnose failure later.
Shared filesystem¶
Every selected node must see identical absolute paths for:
- The Forgather repo (so forgather train finds the same templates, scripts, generated model code).
- The project directory (-p /path/to/project).
- The dataset (whatever the project's dataset config resolves to, typically a HuggingFace cache or a local data root).
- The output directory (where checkpoints land).
The simplest setup is an NFS export mounted at the same path on every node. Less popular but workable: a synchronised local layout (rsync, syncthing) where the absolute paths happen to match.
The submit flow assumes shared paths and does not stage anything for you. There is no automatic config replication or dataset proxy.
Beware host symlinks. Local-host conveniences like
~/ai_assets/forgather → /mnt/rust/aiassets/forgather resolve on
the originating host but the symlink itself isn't visible inside
container peers. Pass the canonical path (the symlink's
target, not the symlink) to anything that flows verbatim to other
peers — the webui's project_dir field, forgather -p ...,
PROJECT_DIR= in the smoke test. readlink -f <path>
canonicalises programmatically.
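If you are scripting a submit from Python rather than the shell, the same canonicalisation can be sketched with the standard library. The helper name canonical_path is hypothetical; os.path.realpath resolves symlinks on the machine where it runs, so run it on the originating host before handing the path to anything cluster-wide.

```python
import os


def canonical_path(path: str) -> str:
    """Expand ~ and resolve every symlink in the path, like `readlink -f`."""
    return os.path.realpath(os.path.expanduser(path))
```

For example, canonical_path("~/ai_assets/forgather") would return the symlink's target on a host laid out like the one described above.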
Same Linux user / UID¶
Files written by one node need to be readable by the others. The
easiest way is to make sure all nodes run Forgather under the same
username + matching UID/GID. The example cluster used dinalt
(UID 1001) on every node.
Matching software versions¶
Multi-node training is exquisitely sensitive to torch / NCCL version
mismatches. The cluster's pre-flight probe surfaces these in two
places — the sidebar Nodes group flips the affected peer's dot to
yellow with a tooltip naming the diverging version, and the Cluster
view's Nodes tab shows the full per-key chip row with the cluster
header carrying a version mismatch tag:
- forgather: should match exactly.
- torch: must match exactly. NCCL version is bundled with the torch wheel.
- transformers: minor mismatches usually work but can produce subtle bugs.
- python: patch version differences are usually fine.
If you submit with mismatches, the server returns HTTP 409 unless you explicitly acknowledge with the "I understand — submit anyway" checkbox in the Multi-node panel.
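The "diverges from the cluster majority" idea can be sketched in a few lines. This is not Forgather's probe code: the function name, the input shape (node name mapped to a package/version dict), and the sample versions are all assumptions made for illustration.

```python
from collections import Counter


def diverging_versions(versions_by_node):
    """Return {package: {node: version}} for nodes that differ from the majority."""
    packages = sorted({pkg for versions in versions_by_node.values() for pkg in versions})
    report = {}
    for pkg in packages:
        seen = {node: versions.get(pkg) for node, versions in versions_by_node.items()}
        majority, _count = Counter(seen.values()).most_common(1)[0]
        outliers = {node: ver for node, ver in seen.items() if ver != majority}
        if outliers:
            report[pkg] = outliers
    return report


# Hypothetical probe results; node names and version strings are made up.
probe = {
    "node-a": {"torch": "2.4.0", "transformers": "4.44.0"},
    "node-b": {"torch": "2.4.0", "transformers": "4.44.0"},
    "node-c": {"torch": "2.3.1", "transformers": "4.44.0"},
}
print(diverging_versions(probe))
```

A non-empty report corresponds to the situation where the submit would need the explicit acknowledgement.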
Trusted LAN¶
Inter-node API calls inside the cluster carve out auth for known
peers — they're identified by a CA-signed TLS client certificate
(mTLS, see docs/operations/tls.md) and don't need a bearer
token. The threat model still assumes the cluster as a whole is
trusted (consistent with torch.distributed's own assumption — any
peer can submit jobs on any other peer, which is arbitrary code
execution). If you don't trust the operators of every node in your
cluster, don't enable cluster mode.
Container PID 1¶
If you run the Forgather server inside a Docker container, PID 1 must be a proper init process that reaps orphan grandchildren. Otherwise, when torchrun gets killed, its worker subprocesses re-parent to PID 1, never get reaped, and pile up as zombies inside the container.
The bundled docker/run passes --init for exactly this
reason — Docker's tini becomes PID 1 and reaps orphans regardless
of parentage. If you bring your own container, make sure either you
add --init to the docker run invocation, or run a tini/dumb-init
process as the entrypoint.
To pick up the --init change on an existing forgather-dev
container:
That recreates the container; existing zombies vanish with the old PID 1.
Starting the cluster¶
On every node, start the Forgather server with --cluster <name>
and bind to all interfaces so peers can reach the API:
The cluster name is per-invocation (not persisted). Two unrelated clusters on the same LAN that use different names won't auto-merge.
Each node mints a stable UUID at first cluster startup (saved at
~/.config/forgather/cluster/node_id, mode 0600). Master selection is
deterministic — the lowest UUID among reachable members wins. No
election round-trip; if the master goes down, the survivors keep
running and a new master is selected within ~30 seconds.
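Because the rule is a pure function of the reachable set, every node can compute the same answer locally. A minimal sketch, assuming node ids are standard UUID strings and that "lowest" means numeric UUID order (the exact comparison Forgather uses is not stated here); the ids below are made up.

```python
import uuid


def select_master(reachable_node_ids):
    """Pick the lowest UUID among the currently reachable members."""
    if not reachable_node_ids:
        raise ValueError("no reachable members")
    return str(min(uuid.UUID(node_id) for node_id in reachable_node_ids))


# Hypothetical membership table.
members = [
    "7d9f1c1e-3b0a-4a57-9e55-0f6c2d1a8b11",
    "1b4e28ba-2fa1-4d3b-a3f5-ef19b5a7633b",
    "c2a1f0de-9d7b-4c53-8e1e-5a4f3b2c1d00",
]
master = select_master(members)
# If the master drops out, the survivors agree on the next-lowest id.
survivor_master = select_master([m for m in members if m != master])
```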
Deployment options¶
Three reasonable ways to get the server running on every node:
1. Host install — install Forgather on every node directly,
run forgather server from each. Right when you're iterating on
Forgather itself or already have host venvs everywhere.
2. Dev image (docker/run) — long-lived dev container with
the host source bind-mounted. Fine for testing on your own
machine; awkward to deploy to N nodes because each needs a host
clone.
3. Runtime image (docker/runtime/run.sh) — pre-built,
self-contained image baked from one commit. The supported way
to deploy a Forgather cluster: build once, push or
docker save/load, run identical copies on N nodes. The
runtime image's --cluster plumbing is first-class — see
Docker images for the full env-var
surface. Quick start:
NETWORK=host CLUSTER=lab docker/runtime/run.sh
# Or for testing on a trusted LAN with no token gate:
NETWORK=host CLUSTER=lab NO_AUTH=1 docker/runtime/run.sh
NETWORK=host is required for multi-node — mDNS multicast
doesn't traverse Docker bridge networks. NO_AUTH=1 skips the
bearer-token gate so cluster CLIs across N containers don't have
to fetch tokens; trusted-LAN only.
For an end-to-end script that builds, deploys, runs, verifies, and tears down a multi-node test against the runtime image, see Smoke-testing multi-node end-to-end below.
Verify cluster discovery¶
Open the webui on any node and click Nodes in the sidebar. You should see one bounded box per cluster member, with:
- Hostname and IP address
- "this node" / "master" tags as appropriate
- GPU summary (one card per GPU)
- Version chips (forgather / torch / nccl / transformers) — yellow if a node diverges from the cluster majority
- A collapsible network interfaces panel
If a node is missing, check:
- Is the server actually running there? curl http://<peer-ip>:8765/api/cluster/self should return its identity.
- Is mDNS reaching across the LAN? Some VPN configs and managed switches block multicast.
- Did both servers start with the same --cluster name?
Container address advertisement¶
Inside a container without --network host, the server's psutil
view of network interfaces only sees the container's namespace —
not the host's real LAN interfaces. The auto-detector falls back to
127.0.0.1 and emits a WARNING; peers on other hosts cannot reach
that address.
Tell the server which routable address to advertise:
--cluster-address is repeatable if your node has multiple routable
addresses you want advertised.
You can also use --network host on the docker run command (the
default in docker/run) so the container shares the host's
network stack and address advertisement Just Works.
Optional: network probe¶
The Cluster view's network tab has a Refresh button that runs two probes against each reachable peer in turn:
- Latency: 30 keep-alive round-trips over HTTPS, warmup-trimmed, reported as min / median / max ms.
- Bandwidth: adaptive parallel-stream raw TCP throughput test (4 streams in flight, sized for ~2 s of steady-state transfer per stream). Coordination flows over the authenticated HTTPS channel; the actual bytes go over a one-shot ephemeral plain-TCP listener so Python's ssl module isn't the bottleneck on fast links. Numbers should match what netperf -t TCP_STREAM reports on the same wire.
The probes run sequentially across peers because two simultaneous bulk transfers would saturate the local NIC and under-report each link. The row being measured shows "Measuring…" so the operator sees per-peer progress. Results are cached for 1 hour.
This is a sanity check — if the measured bandwidth or latency between two hosts is much worse than your link's nominal numbers, you have a network problem to fix (wrong interface routing, broken cable, duplex mismatch) before submitting any training jobs.
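For readers who want to reproduce the latency summary on their own measurements, a warmup-trimmed min / median / max can be sketched as below. The function name and the warmup count of 3 are assumptions; the source only says the samples are warmup-trimmed.

```python
import statistics


def summarize_latency(samples_ms, warmup=3):
    """Drop the first `warmup` samples, then report min / median / max in ms."""
    steady = samples_ms[warmup:]
    if not steady:
        raise ValueError("need more samples than the warmup count")
    return {
        "min": min(steady),
        "median": statistics.median(steady),
        "max": max(steady),
    }


# Made-up round-trip times: the first few include connection setup.
samples = [9.0, 4.0, 1.2, 0.5, 0.4, 0.6, 0.5, 0.7]
summary = summarize_latency(samples)
```

Trimming matters because the first round-trips pay for TCP and TLS setup and would otherwise inflate the maximum.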
Submitting a multi-node job¶
There is no separate "Multi-node" submit button. Multi-node submits go through the same dialog as single-node submits — the Run… action on any training config.
Open the Run dialog¶
In the project tree on the left, navigate to a project that contains
a training config (e.g. examples/finetune/samantha). Click on a
config (e.g. llama2_7b/2gpu_pp.yaml) to open it in the right
pane, then click ▶ Run at the top of the config viewer. Or
right-click the config in the tree and pick ▶ Run….
When the server is in cluster mode, the Run dialog shows two collapsible sections alongside the usual GPUs / Priority fields:
- Multi-node: present only when --cluster is active.
- Dynamic arguments: same as in single-node mode.
Both default to open. Each is capped at 50% of the dialog body when both are open, so neither pushes the other off-screen.
Pick participants¶
The Multi-node panel lists every cluster member as a row with five columns:
| Column | Purpose |
|---|---|
| Use | checkbox — opt this node into the run |
| Node | hostname + address, with "this node" / "master" / "unreachable" tags |
| GPUs | how many GPUs this node contributes (= per-node nproc_per_node); bounded by the node's hardware count, with (N idle of M) hint |
| NCCL iface | dropdown of the node's up-and-running interfaces, defaulting to (auto) — see below |
| rdzv host | radio — which member hosts the rendezvous TCPStore |
By default, the local node is checked and every other node is unchecked. Adding peers turns the submit into a fanout. The Submit button label changes to "Submit to N nodes" so you know which path you're taking.
If you leave only the local node checked, Submit goes through the regular single-node enqueue path (no rendezvous overhead). If you deselect every node, Submit is blocked.
Pick the rendezvous host¶
The default is the cluster master. For a typical 2-node setup, you can leave this alone. For larger clusters, prefer the node with the lowest cross-node latency — the rdzv host is contacted by every other node during initialization.
Pick the network interface (or leave it on auto)¶
The NCCL iface dropdown shows interfaces from each node's pre-flight probe. Pick the one you want NCCL/Gloo/tensorpipe to use for cross-node traffic.
Leaving it on (auto) is the right default in most cases: the
server matches each member's cluster-advertised IP against its probe
interface table and picks the matching iface name automatically. You
only need to override this if:
- You have multiple routable interfaces on a node (e.g. a slow management interface and a fast training interface) and the cluster picked the wrong one for advertisement.
- You're testing with VLAN-tagged or specially-routed traffic.
If no interface can be derived (probe data missing, address
mismatch), the submit returns HTTP 422 rather than spawning a job
that would deadlock in connectFullMesh.
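The (auto) behaviour amounts to a reverse lookup from advertised IP to interface name. A minimal sketch, assuming the probe's interface table is a mapping of interface name to a list of addresses (a hypothetical shape) and using LookupError to stand in for the HTTP 422 case:

```python
def pick_iface(advertised_ip, interfaces):
    """Return the name of the interface that carries `advertised_ip`.

    `interfaces` maps interface name -> list of IP addresses (assumed shape).
    """
    matches = [name for name, addrs in interfaces.items() if advertised_ip in addrs]
    if not matches:
        # Corresponds to the "no interface can be derived" case in the text.
        raise LookupError(f"no interface carries {advertised_ip}")
    return matches[0]


# Hypothetical probe table for one node.
probe_table = {"lo": ["127.0.0.1"], "eno1": ["192.168.1.20"], "enp5s0": ["10.0.0.20"]}
print(pick_iface("10.0.0.20", probe_table))
```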
Pick the rendezvous port¶
The Rendezvous port field at the top of the panel defaults to
29400. Change it if 29400 is in use (or if you're running multiple
multi-node submits side-by-side, which would clash on the same
port). Any free TCP port on the rdzv host works.
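To check whether a candidate port is free before submitting, a bind test is usually enough. This sketch has to run on the rdzv host itself, and the helper name is hypothetical; note that the result is only a snapshot, since another process can take the port between the check and the submit.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

For example, port_is_free(29400) returning False on the rdzv host suggests picking a different value for the Rendezvous port field.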
Asymmetric topologies are fine¶
You can mix nodes with different GPU counts. The example reference cluster ran 1 GPU on one node + 2 GPUs on another, world_size=3, 1F1B PP. The trainer's local-process-group machinery is hostname-discovered, so heterogeneous layouts produce correct groups.
The rule for picking per-node GPUs depends on the trainer:
- PP: pick the GPUs that maximize compute and minimize cross-node hops. World size = total GPUs across all nodes; stages = world size × stages_per_rank (default 1). Each rank owns one or more stages. Asymmetric layouts mean some nodes own more stages than others; the splitter's auto-balancer handles this fine for typical decoder-only LMs.
- DDP: pick the same number of GPUs per node when possible. Asymmetric DDP works but the small-rank-count node will be the straggler at every allreduce.
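The world-size and stage arithmetic for an asymmetric layout can be sketched as follows. Assigning global ranks in member order is an assumption made for illustration (the real assignment comes from torchrun and the rendezvous), and the function name is hypothetical.

```python
def plan_layout(members, stages_per_rank=1):
    """Compute rank placement, world size, and PP stage count.

    `members` is a list of (hostname, gpu_count) pairs in node-rank order.
    """
    ranks_by_host = {}
    next_rank = 0
    for host, gpus in members:
        ranks_by_host[host] = list(range(next_rank, next_rank + gpus))
        next_rank += gpus
    world_size = next_rank
    return ranks_by_host, world_size, world_size * stages_per_rank


# The 1 + 2 GPU layout from the reference cluster.
ranks, world_size, stages = plan_layout([("muthur", 1), ("wopr", 2)])
```

With the default stages_per_rank of 1 this gives a world size of 3 and 3 stages; passing stages_per_rank=2 would give 6 stages over the same 3 ranks.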
Submit¶
Click Submit to N nodes. The master:
- Validates participants are reachable and version-compatible.
- Generates a rendezvous ID and computes the rdzv endpoint.
- Fans out a per-rank training enqueue to each peer with their torchrun args (--nnodes, --node-rank, --rdzv-backend=c10d, etc.) and the right NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME / TP_SOCKET_IFNAME environment.
- Records a ClusterJob bundle linking the per-rank queue ids back to a single cluster_job_id.
If a fanout step fails partway through, the master rolls back by issuing cancels to the participants it already enqueued on. There's no half-submitted state.
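To illustrate what each peer ends up launching, here is a sketch of the per-node torchrun arguments and socket-interface environment. It is not Forgather's fanout code: the function name, the rendezvous id, and the interface name are hypothetical, while the torchrun flags and environment variable names are the ones listed above plus torchrun's standard --nproc-per-node, --rdzv-endpoint, and --rdzv-id.

```python
def torchrun_launch(node_rank, nnodes, nproc_per_node, rdzv_host, rdzv_port, rdzv_id, iface):
    """Build the torchrun argv prefix and interface environment for one node."""
    argv = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        f"--node-rank={node_rank}",
        "--rdzv-backend=c10d",
        f"--rdzv-endpoint={rdzv_host}:{rdzv_port}",
        f"--rdzv-id={rdzv_id}",
    ]
    # Pin NCCL, Gloo, and tensorpipe to the same interface on this node.
    env = {
        "NCCL_SOCKET_IFNAME": iface,
        "GLOO_SOCKET_IFNAME": iface,
        "TP_SOCKET_IFNAME": iface,
    }
    return argv, env


# Hypothetical values for the second node of a two-node run.
argv, env = torchrun_launch(1, 2, 2, "muthur", 29400, "demo-rdzv", "eno1")
```

Only --node-rank, --nproc-per-node, and the interface name differ between nodes; the rendezvous endpoint and id must be identical everywhere or the ranks will never find each other.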
Last-used Multi-node settings (participants, per-node GPUs, iface, rdzv host/port, mismatch acknowledgement) persist alongside the config's dynamic-args overrides — so a config "opens where you left off" for both single- and multi-node submits. Reset to defaults in the dialog clears everything we cached for this config.
Driving multi-node from the CLI¶
Everything the webui submit dialog does has a corresponding
terminal command — useful for shell-driven smoke tests, scripted
deployments, and CI. The forgather cluster subcommand mirrors
the existing forgather sched / job / gpu shape:
forgather cluster nodes # member table + GPU summary
forgather cluster jobs # list bundles with rolled-up status
forgather cluster jobs <bundle-id> # bundle detail (per-rank status)
forgather cluster cancel <bundle-id> # fan out cancel to every participant
# Submit (note that -p / -t are GLOBAL forgather flags, like train):
forgather -p <project-dir> -t <config> cluster submit \
[--member host:gpus[:iface] ...] # repeatable; default = every reachable peer's idle GPUs
[--rdzv-host hostname] [--rdzv-port 29400]
[--priority N] [--dynamic-arg KEY=VAL ...]
[--allow-version-mismatch] [--wait]
All commands accept --server URL or $FORGATHER_SERVER_URL
(default http://127.0.0.1:8765).
Hostname-based topology¶
Hostnames in --member resolve to UUIDs via the cluster's
membership table — you never type a UUID. So:
is much easier to read than the equivalent webui modal would be in JSON.
--member defaults¶
Omit --member entirely and submit defaults to "every reachable
peer with all of its idle GPUs." That's the right call for a
smoke test or a "use everything you've got" run:
--wait for shell scripts¶
--wait polls the bundle until it terminates and exits 0 on
done, non-zero on failed/cancelled/timeout — handy when
scripting:
forgather -p <proj> -t <cfg> cluster submit --wait \
&& echo "training succeeded" \
|| echo "training failed"
Path resolution gotcha¶
The -p <project-dir> path is sent verbatim to every peer.
If your local path goes through a host symlink (e.g.
~/ai_assets/forgather → /mnt/rust/aiassets/forgather), the
remote peer receives the symlink path, which may not exist
inside its container. Always pass the canonical path that
resolves identically on every peer:
# Wrong if ~/ai_assets/forgather is a symlink not visible inside containers:
forgather -p ~/ai_assets/forgather/examples/foo cluster submit ...
# Right — canonical NFS path that every peer mounts at the same location:
forgather -p /mnt/rust/aiassets/forgather/examples/foo cluster submit ...
readlink -f <path> resolves a path to its canonical form if you
need to compute it programmatically.
Monitoring and diagnosing¶
Cluster Jobs panel¶
In the Cluster view's jobs tab (sidebar 🖧 Cluster), the Cluster Jobs card lists running and recently-finished bundles. Each row shows:
- Truncated bundle id
- Project / config
- Rendezvous endpoint + id
- Per-rank assignments (rank → hostname × GPUs, with current status)
- Rolled-up status: running / done / failed / cancelled
- Cancel button (active only while running)
The rolled-up status is computed from per-rank queue item statuses
across the cluster; partial completion shows up as running, and
done requires every rank to be terminal. Slow/unreachable peers
contribute unknown for their rank without blocking the list.
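One way to picture the roll-up is as a reduction over per-rank statuses. The sketch below is an assumption-heavy illustration, not the server's implementation: in particular, the precedence of failed over cancelled, and treating unknown or queued ranks as still running, are guesses that merely stay consistent with the description above.

```python
TERMINAL = {"done", "failed", "cancelled"}


def rolled_up_status(rank_statuses):
    """Reduce {rank: status} to a single bundle status (assumed rules)."""
    statuses = list(rank_statuses.values())
    # Any rank that is not terminal (running, queued, unknown) keeps the bundle running.
    if any(status not in TERMINAL for status in statuses):
        return "running"
    if "failed" in statuses:
        return "failed"
    if "cancelled" in statuses:
        return "cancelled"
    return "done"
```

For example, {0: "done", 1: "unknown", 2: "done"} rolls up to running under these rules, because the unreachable peer's rank is not yet known to be terminal.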
Per-node TTY logs¶
To watch a specific rank's torchrun output, open that node's webui and find its queue item in the Jobs panel. Each peer's webui shows its own ranks; there is no cross-node TTY aggregation in v1.
The local TTY view supports live tail and full dump. WebSocket streaming follows the file.
Hang diagnosis with kill -USR1¶
Multi-node training hangs are common while you're getting a new config working. The two failure modes that produce no Python traceback are:
- Silent rank death: a worker exits from a CUDA driver assertion, kernel OOM-kill, or unhandled C++ exception in a background thread.
- Distributed deadlock: every rank is blocked in a dist.* collective, waiting for partners that aren't coming.
Forgather's train_script.py enables Python's faulthandler so
both cases produce useful diagnostic output:
- Crashes (SIGSEGV / SIGFPE / SIGABRT / SIGBUS / SIGILL) dump every thread's Python stack to stderr → per-rank TTY log → visible in the Jobs view.
- kill -USR1 <pid> against any rank's worker python dumps that rank's full thread stacks to its TTY without killing the process. Same idiom as py-spy dump, but works inside containers that strip CAP_SYS_PTRACE.
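If you want the same behaviour in your own training script, the standard-library calls involved are faulthandler.enable and faulthandler.register. The sketch below shows the general pattern rather than the exact code in train_script.py; the function name is hypothetical, and faulthandler.register is only available on POSIX platforms.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handlers(file=sys.stderr):
    """Dump Python stacks on fatal signals and on SIGUSR1 (POSIX only).

    `file` must stay open for the life of the process; faulthandler writes
    straight to its file descriptor.
    """
    # Fatal signals (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL): dump all threads.
    faulthandler.enable(file=file, all_threads=True)
    # SIGUSR1: dump all threads, then keep running.
    faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
```

Calling install_stack_dump_handlers() once near the top of the worker entry point is enough; the handlers run at the C level, so they still fire when the interpreter is blocked inside a collective.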
To find the worker PIDs: in the Jobs panel, click on the rank's entry and the PID is shown in the metadata. Or from a shell on the node:
The torchrun parent and one or more worker python processes will
be listed. SIGUSR1 the workers, not torchrun.
After SIGUSR1, look at the rank's TTY log (tail -200
~/.config/forgather/server/jobs/q_*.tty). faulthandler prints each
thread's stack with the most recent call first, so the top frame of
each thread is the deadlock site.
Common deadlock sites and what they mean:
| Stack tip | What's stuck | Common cause |
|---|---|---|
| dist.new_group in _new_process_group_helper | one rank is creating a process group that other ranks aren't | mismatched local_world_size assumptions; topology bug |
| dist.all_gather / all_gather_object | one rank hasn't reached the call yet | uneven progress between ranks |
| dist.barrier | one rank failed earlier; others now wait forever | scroll up in TTYs to find the silent failure |
| _PipelineStage._receive | upstream stage hasn't sent the activation | upstream rank crashed or is in a different scheduler op |
| NCCL ncclSendRecv | inter-stage activation/gradient transfer | NCCL_SOCKET_IFNAME wrong on one side; firewall |
Topology log¶
When the cluster forms its first node-local process group, rank 0 logs the discovered topology:
[Rank 0] forgather.ml.distributed - INFO - get_local_process_group:
discovered topology {'muthur': [0], 'wopr': [1, 2]}
This tells you, at a glance, which ranks live on which hosts. If you ever see ranks distributed unexpectedly, the topology log is the first place to look.
Per-rank host= tag¶
Every rank's DistributedEnvironment(...) log line includes a
host=<hostname> field. So when SIGUSR1 dumps a stack, you can
correlate "rank N is hung at
Operational notes¶
Cancelling a job¶
The Cluster Jobs panel's Cancel button fans out a cancel to every peer of the bundle. Each peer's local scheduler then aborts its rank's queue item. You can issue this from any node — non-master nodes proxy the request to master.
If Cancel doesn't take effect within a few seconds (e.g. one rank
is stuck in an uninterruptible kernel call), you can use the
per-node Jobs panel's right-click context menu to Force kill
(SIGKILL) that specific rank. The backend polls for the PID to
actually exit (up to 2 seconds) and stamps the JobRecord's error
field if it's still alive afterwards, so a stuck-in-CUDA process
surfaces in the UI instead of silently leaving a phantom GPU
consumer.
Cleaning up stale endpoints¶
If the Forgather server itself was restarted while a torchrun was
running, the in-flight trainer's endpoint files (under
~/.config/forgather/jobs/job_*/) can outlive their process. They surface
as endpoint-only entries in the Jobs panel — entries with no
source: record or merged, just endpoint.
By default these are filtered out of the Jobs list when their PID is dead/zombie/recycled. To see them, toggle "Include dead endpoints" on the Jobs panel header. Then right-click → ✕ Remove stale endpoint to rmtree the directory and stop the entry from showing up.
Trainer control endpoint is local-only¶
forgather control list and the trainer's save / stop / abort
HTTP endpoint are bound to 127.0.0.1 on whichever node hosts rank
0. So:
- forgather control list shows the run on the rank-0 host only (e.g. muthur in the reference cluster); empty on other nodes.
- Save/stop/abort commands targeting rank 0 must be issued from a shell or webui on the rank-0 host.
- The Cluster Jobs panel's Cancel button still works from anywhere; it goes through the JobRecord-level cancel-fanout, not the per-trainer control endpoint.
This is a known v1 limitation. For a save/stop on the running run, ssh to the node hosting rank 0 (look at the topology log to find which one) and run the command there.
MFU and heterogeneous clusters¶
The reported MFU number is correct for homogeneous clusters
(every rank's GPU is the same model). It's computed as
achieved_aggregate_flops / (world_size × per_GPU_peak), where
per_GPU_peak is auto-detected from rank 0's device.
If your cluster is heterogeneous (e.g. mixed RTX 3090 + RTX 4090,
or pairing a Spark with a desktop GPU), the aggregate peak is
wrong and the MFU number is meaningless. Workaround: set
peak_hardware_flops explicitly per-config based on the
slowest-tier device, or stick to homogeneous training clusters
until probe-driven aggregation lands.
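A short worked example shows how far the reported number can drift on a mixed cluster. The peak figures below are round hypothetical values chosen to make the arithmetic easy, not vendor specifications, and both function names are illustrative.

```python
def reported_mfu(achieved_flops, world_size, rank0_peak_flops):
    """MFU as described above: every rank is assumed to match rank 0's device."""
    return achieved_flops / (world_size * rank0_peak_flops)


def aggregate_mfu(achieved_flops, per_rank_peak_flops):
    """MFU against the true aggregate peak of a heterogeneous cluster."""
    return achieved_flops / sum(per_rank_peak_flops)


# Hypothetical cluster: rank 0 on a slower device, ranks 1 and 2 on faster ones.
peaks = [100e12, 200e12, 200e12]
achieved = 150e12
print(reported_mfu(achieved, len(peaks), peaks[0]))  # uses rank 0's peak for all ranks
print(aggregate_mfu(achieved, peaks))                # uses the real aggregate peak
```

Here the rank-0-based formula reports 0.5 while the aggregate-based figure is 0.3, so the same run looks far more efficient than it is simply because rank 0 happens to sit on the slower device.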
Checkpoints on shared FS¶
When several ranks share a filesystem (NFS, the typical multi-node
setup), only one rank globally writes the model shard files. The
CheckpointManager honours save_on_each_node=False (the documented
default for shared storage) by gating the shard-file save on a
single-writer rank, so concurrent writers can't race on the same
shard paths.
For pipeline-parallel training (save_on_all_ranks=True in the
trainer), every rank writes its own non-overlapping shard files
based on which FQNs (parameter names) its stage owns — different
stages own disjoint FQN sets so there's no collision.
This means a multi-node PP run on shared NFS produces a complete, correct checkpoint that any other Forgather process (e.g. the inference server) can load as a normal model. The reference run verified this end-to-end: Llama2-7B+Samantha trained 3-stage across 1+2 GPUs, saved to NFS, loaded for inference, generated coherent output.
Smoke-testing multi-node end-to-end¶
A pre-built shell script exercises the entire flow against the runtime image, with no manual steps:
tests/smoke_runtime_multinode.sh # default: REMOTE=muthur, port 18765
tests/smoke_runtime_multinode.sh --no-build # skip image rebuild + deploy (after first run)
tests/smoke_runtime_multinode.sh --keep # leave containers running on success
tests/smoke_runtime_multinode.sh --remote box2 # alternate remote host
SMOKE_PORT=28765 tests/smoke_runtime_multinode.sh # override the test port
What it does:
- Builds the runtime image locally (Dockerfile.runtime with FORGATHER_SOURCE_DIR=. so it tests your working tree, not whatever's on dev).
- Saves the image to a shared-FS tarball (default /mnt/rust/aiassets/.tmp/forgather_smoke.tar) and docker loads it on the remote via ssh.
- Starts a server container on each host with --cluster <name>, --no-auth, and --network host.
- Waits up to 90 seconds for the cluster to form (both members reachable in the membership table).
- Submits Tiny Llama v2 across every reachable GPU using forgather cluster submit from inside one of the containers.
- Polls the bundle until terminal status; asserts done.
- Verifies a checkpoint landed on the shared FS with at least one shard file.
- Tears the cluster down (unless --keep is passed).
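The "wait up to 90 seconds" and "poll until terminal" steps share one shape: poll a condition against a deadline. The script itself is shell; the sketch below shows the same pattern in Python for anyone driving the flow from a test harness. The function name is hypothetical, and the predicate stands in for whatever check you use (for example, counting reachable members).

```python
import time


def wait_until(predicate, timeout_s, interval_s=1.0):
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Using time.monotonic rather than wall-clock time keeps the deadline stable if the system clock is adjusted mid-test.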
On any failure, the EXIT trap dumps a single triage file under
/mnt/rust/aiassets/.tmp/smoke-<cluster>-failure-<ts>.log:
docker logs from both hosts, cluster nodes/jobs JSON, the latest
TTY log from each peer, and nvidia-smi output. One place to
look.
Prerequisites¶
- Passwordless ssh from the local host to the remote (ssh <REMOTE> returns immediately with no password prompt).
- Same NFS path on both hosts: the script bind-mounts /mnt/rust/aiassets into both containers; the project_dir for Tiny Llama lives there. If your shared mount is elsewhere, edit the script's bind-mount lines (search for /mnt/rust/aiassets) or set PROJECT_DIR= to the correct host path.
- Both hosts have docker with --gpus support (NVIDIA Container Toolkit installed).
- A non-default port available on both hosts: defaults to 18765 to avoid colliding with forgather-dev containers that may already be using 8765 under host networking. Override with SMOKE_PORT= if even 18765 is taken.
What it tests¶
The smoke test is the canonical end-to-end check that the runtime image's multi-node story works: image build, image distribution via shared FS, cluster discovery via mDNS, the trusted-LAN no-auth path, the cluster CLI's submit fanout, the trainer's per-rank coordination, the checkpoint coordinator's single-writer behaviour on shared FS, and cleanup. If you change anything in the runtime image, the cluster runtime, the cluster CLI, or the checkpoint code, run this first.
The reference run on a 1+2 GPU cluster (1 RTX 3090 on the remote, 2 RTX 3090s on the local) trains Tiny Llama v2 (4M params, TinyStories) in ~60 seconds of compute time, ~80 seconds total (cluster form + submit + train + verify) on a warm cache.
Using it as a template¶
The script is meant to be copyable. To smoke-test a different
config (e.g. a custom finetune), pass --project-dir and
--config:
tests/smoke_runtime_multinode.sh \
--project-dir /mnt/rust/aiassets/my-project \
--config train.yaml
For more elaborate tests (multi-step submit, mid-run cancel,
checkpoint resume), copy the script and adapt — the helpers
(reachable_count, the diagnostic dump trap, the bundle-polling
loop) are independently useful.
Known limitations and caveats¶
- No global scheduler. Per-peer scheduling decisions are independent. Cluster job submits use a static fanout at submit time; there is no live re-balancing or cross-node preemption. If you submit a multi-node job to a cluster where one peer is busy with another single-node job, the multi-node job will wait for that peer's GPUs to free up.
- No automatic config staging. Project paths must resolve to the same files on every participant.
- No cross-node TTY aggregation. Open each peer's webui to watch its rank's torchrun output. Cross-node Jobs / Files / TTY proxying through the master is on the roadmap (Phase 3.5) but not in v1.
- No automatic master failover. The master is whichever reachable member has the lowest UUID; if it goes down the cluster keeps running with a new master, but in-flight global state (queue mutations during the gap) is lost. Master failover with state replication is on the roadmap (Phase 4 + Phase 5).
- No cross-architecture training. The version probe surfaces a platform mismatch (e.g. ARM Spark + x86_64 desktop) but torch wheels and CUDA kernels won't actually interoperate across architectures. The check is advisory; the operator is on the hook for whether their cluster makes sense.
Where to learn more¶
- Forgather Server reference: full feature/API reference. The Cluster mode section has the authoritative threat model, mDNS discovery details, and API table.
- Pipeline Parallel: scheduling choices (1F1B, ZBV, GPipe, interleaved), stage layout, and the splitter API.
- Checkpointing: distributed checkpoint behaviour, resume, sharing patterns.
- Training Performance Metrics: how MFU, tok/s, and FLOPs are computed.
- Forgather Server CLI: --enqueue, forgather sched, forgather job, forgather gpu from the terminal. The forgather cluster subcommand documented in Driving multi-node from the CLI is the multi-node counterpart.
- Docker images: full reference for the dev and runtime images, including the multi-node CLUSTER/NO_AUTH env-var surface on the runtime image and the --init / HEALTHCHECK plumbing.