Trainer Options Reference¶

This document is a consolidated reference for all training-argument fields and constructor parameters across Forgather's built-in trainers. It complements the docstrings and per-field comments that live next to the dataclasses themselves (see the Source of Truth section at the bottom).

Trainers covered:

forgather.ml.trainer.Trainer - lightweight single-device trainer
forgather.ml.trainer.accelerate.AccelTrainer - multi-GPU via HF Accelerate
forgather.ml.trainer.ddp.DDPTrainer - multi-GPU via raw PyTorch DDP
forgather.ml.trainer.fsdp2.FSDP2Trainer - sharded DP via torch.distributed.fsdp.fully_shard
forgather.ml.trainer.pipeline.PipelineTrainer - pipeline parallel

The torchtitan-based trainer is out of scope - see the torchtitan docs.

Training-Argument Hierarchy¶

Training arguments are defined as a chain of dataclasses, each subclass adding fields on top of the previous:

MinimalTrainingArguments    (HF-compatible baseline)
   v
BaseTrainingArguments       (adds checkpoint, AMP, FP8, SDPA, anomaly, ...)
   v
TrainingArguments           (adds memory/compile options for simple Trainer)
   v
 +-- AccelTrainingArguments     (no new fields; trainer overrides behaviour)
 +-- DDPTrainingArguments       (adds dispatch_batches, DDP, Post-Local-SGD)
 +-- FSDP2TrainingArguments     (adds dispatch_batches, fsdp2.* policies)
 +-- PipelineTrainingArguments  (adds pipeline-schedule/microbatch fields)

When you configure a trainer via YAML, every field from every class in the chain above is available. Defaults propagate down and may be overridden at any level.

MinimalTrainingArguments¶

HuggingFace-compatible baseline. Defined in trainer_types.py.

Output and execution¶

Field	Type	Default	Description
`output_dir`	str	`tmp_trainer`	Directory for model output and checkpoints.
`logging_dir`	str \| None	auto	TensorBoard log dir. Defaults to `<output_dir>/runs/<timestamp>_<host>`.
`device`	Any	None	Device override (cuda, cpu, etc.). Auto-detected if None.
`seed`	int	-1	RNG seed. `-1` disables seeding.
`use_cpu`	bool	False	Force CPU even if CUDA is available.

Steps and epochs¶

Field	Type	Default	Description
`num_train_epochs`	int	1	Total epochs to train. May be fractional. A negative value disables the epoch cap so training runs until `max_steps` (or an external stop signal) -- useful with schedulers like WSD / InfiniteLR where the budget is expressed purely in steps.
`max_steps`	int	-1	If > 0, overrides `num_train_epochs` with an absolute step count. When `num_train_epochs < 0`, `max_steps` is the only training-length cap, so it must be > 0.
`epoch_train_steps`	int	100000	Fallback epoch length for datasets that don't implement `len()`. Forgather extension.

Batching and data loading¶

Field	Type	Default	Description
`per_device_train_batch_size`	int	16	Training batch size per device. Global batch = per-device * num-devices * `gradient_accumulation_steps`.
`per_device_eval_batch_size`	int	16	Eval batch size per device.
`gradient_accumulation_steps`	int	1	Accumulate gradients over N micro-batches before stepping the optimizer.
`max_grad_norm`	float \| None	None	Clip gradients to this L2 norm. `None` disables clipping.
`dataloader_num_workers`	int	0	Worker processes for data loading. `0` = main process.
`dataloader_pin_memory`	bool	True	Pin memory in DataLoader for faster host-to-device copies.
`dataloader_persistent_workers`	bool	False	Keep worker processes alive between epochs.
`dataloader_prefetch_factor`	int \| None	auto	Batches prefetched per worker. Defaults to 2 when `num_workers > 0`.
`dataloader_drop_last`	bool	False	Drop the last incomplete batch.

Eval / logging cadence¶

Field	Type	Default	Description
`eval_strategy`	str	`"no"`	When to run eval: `"no"`, `"steps"`, or `"epoch"`.
`eval_steps`	int	100	Eval frequency in steps (when `eval_strategy="steps"`).
`eval_delay`	int	0	Epochs/steps to wait before first eval.
`logging_strategy`	str	`"steps"`	`"no"`, `"steps"`, or `"epoch"`.
`logging_steps`	int	50	Log cadence in steps.
`logging_first_step`	bool	False	Whether to log the very first global step.

torch.compile¶

Field	Type	Default	Description
`torch_compile`	bool	False	Enable `torch.compile()` on the model.
`torch_compile_backend`	str \| None	`"inductor"`	Backend passed to `torch.compile(backend=...)`.
`torch_compile_mode`	str \| None	`"default"`	`"default"`, `"reduce-overhead"`, `"max-autotune"`, or `"max-autotune-no-cudagraphs"`.
`torch_compile_dynamic`	bool	True	Allow dynamic input shapes.
`torch_compile_full_graph`	bool	False	Force full-graph compilation (no graph breaks).

Notes:

max-autotune is incompatible with gradient_accumulation_steps > 1 and with pipeline parallel training. Use max-autotune-no-cudagraphs in those cases.
Zero-bubble pipeline schedules (ScheduleInterleavedZeroBubble, ScheduleZBVZeroBubble) are incompatible with torch_compile regardless of mode, because AOTAutograd flattens the compiled stage interior and the split backward cannot walk the resulting graph.

Checkpointing¶

Field	Type	Default	Description
`save_strategy`	str	`"steps"`	`"no"`, `"steps"`, or `"epoch"`.
`save_steps`	int	1000	Save cadence (when `save_strategy="steps"`).
`save_total_limit`	int	2	Maximum checkpoints to keep. Best-marked checkpoints are preserved beyond this.
`save_safetensors`	bool	True	Use safetensors format for weights.
`save_on_each_node`	bool	False	In multi-node runs, save on each node. Don't use with shared storage.
`overwrite_output_dir`	bool	False	Overwrite existing contents in `output_dir`.
`resume_from_checkpoint`	bool \| str	True	`True` = auto-resume from latest; `False` = force fresh; path = resume from specific.
`load_best_model_at_end`	bool	False	Load best checkpoint at end of training. Requires `save_strategy == eval_strategy`.
`metric_for_best_model`	str	`"loss"`	Metric used to compare checkpoints.
`greater_is_better`	bool \| None	auto	Override the metric direction. Inferred from metric name if None.

See docs/checkpointing/user_guide.md for a practical guide to checkpointing.

Notes:

save_safetensors is incompatible with saving tied weights. The format was created to address a real security issue with PyTorch's native format, which allowed arbitrary code execution. PyTorch has since addressed the security and performance issues.

Optimizer and LR scheduler (HF compat)¶

These fields exist for HuggingFace compatibility. If you pass an explicit optimizer_factory / lr_scheduler_factory to the trainer constructor, most of these are ignored.

Field	Type	Default	Description
`learning_rate`	float	5e-5	Initial LR for the default AdamW optimizer.
`weight_decay`	float	0.0	Weight decay for the default AdamW optimizer.
`adam_beta1`	float	0.9	AdamW beta1.
`adam_beta2`	float	0.999	AdamW beta2.
`adam_epsilon`	float	1.0e-8	AdamW epsilon.
`lr_scheduler_type`	str	`"linear"`	`"linear"`, `"cosine"`, `"polynomial"`, etc.
`lr_scheduler_kwargs`	dict \| None	`{}`	Extra kwargs passed to the scheduler.
`warmup_steps`	int	0	Linear warmup from 0 to `learning_rate` over this many steps.

Gradient checkpointing¶

Field	Type	Default	Description
`gradient_checkpointing`	bool	False	Enable activation checkpointing on models that support the HF API. Trades compute for memory.

Customise the enable function by passing enable_activation_checkpoint_fn to the Trainer constructor (see below).

BaseTrainingArguments¶

Adds checkpoint-preservation, AMP/FP8, SDPA backend selection, and runtime PyTorch tweaks. Defined in base_trainer.py.

Default dtype and eval¶

Field	Type	Default	Description
`default_dtype`	str \| None	None	Default dtype used during model construction (`"float32"`, `"bfloat16"`, `"float16"`).
`max_eval_steps`	int	-1	Maximum eval batches per eval run. `-1` = unlimited.

Checkpoint preservation¶

Field	Type	Default	Description
`preserve_best_model`	bool	False	Keep the best-metric checkpoint safe from cleanup.
`best_model_metric`	str	`"loss"`	Metric used for best-model tracking.
`best_model_greater_is_better`	bool \| None	auto	Override direction; inferred from metric name if None.
`preserve_n_best`	int	1	Number of best checkpoints to preserve.
`eval_on_save`	bool	False	Run eval immediately before each save (decouples save/eval schedules).

Memory and debugging¶

Field	Type	Default	Description
`enable_activation_offloading`	bool	False	Offload saved activations to CPU during backward (via `torch.autograd.graph.save_on_cpu`). Best combined with activation checkpointing.
`detect_anomaly`	bool	False	Enable `torch.autograd.set_detect_anomaly(True)` to help track down NaNs. Adds overhead - debug only.

Notes: - enable_activation_offloading appears to be incompatible with flex-attention.

Scaled Dot-Product Attention¶

Field	Type	Default	Description
`sdpa_backend`	str \| list[str] \| None	None	Force SDPA backend: `"math"`, `"flash"`, `"efficient"`, or `"cudnn"` (or a list).
`sdpa_set_priority`	bool	False	When `sdpa_backend` is a list, interpret it as a priority order.

Matmul precision and dynamo¶

Field	Type	Default	Description
`float32_matmul_precision`	str \| None	None	`"highest"`, `"high"` (TF32), or `"medium"`. Passed to `torch.set_float32_matmul_precision`.
`dynamo_recompile_limit`	int \| None	None	Override `torch._dynamo.config.recompile_limit`. Raise if you see frequent recompiles with `torch.compile`.

On any GPU that supports TF32 (Ampere or newer), you usually want float32_matmul_precision = "high". Leaving it at the PyTorch default ("highest") disables TF32 and triggers a runtime warning from torch.set_float32_matmul_precision on modern hardware.

Mixed precision and FP8¶

Field	Type	Default	Description
`mixed_precision`	str \| None	None	`None` / `"no"` disabled, `"bf16"` (no GradScaler), or `"fp16"` (with GradScaler).
`fp8_recipe`	str \| None	None	`"tensorwise"`, `"rowwise"`, or `"rowwise_with_gw_hp"`. Converts `nn.Linear` to `Float8Linear` via torchao. Orthogonal to `mixed_precision`. Mutually exclusive with `qat_recipe`.
`fp8_dim_alignment`	int	16	Minimum alignment for FP8 Linear layer dimensions; non-conforming layers are skipped.
`qat_recipe`	str \| None	None	`"int8-dynamic-act-int4-weight"`, `"int4-weight-only"`, or `"float8-dynamic-act-float8-weight"`. Installs `FakeQuantizedLinear` via torchao QAT (prepare phase). Run `forgather finalize --quantize <recipe>` after training to produce the deployable low-bit artifact. Mutually exclusive with `fp8_recipe`.

FP8 requires CUDA SM >= 8.9 (RTX 4090, H100, etc.). See fp8-training.md. QAT has no hardware gate (runs on any CUDA GPU or CPU); see qat-training.md.

TrainingArguments (simple `Trainer`)¶

Adds memory-optimisation options specific to the simple single-device trainer. Defined in trainer.py.

Field	Type	Default	Description
`gc_threshold`	float	0.5	Ratio of allocated-to-total GPU memory that triggers `empty_cache()` (and possibly `gc.collect()`). See the note below - there is a real tuning trade-off.
`construct_model_on`	str	`"default"`	Where to build the model. `"default"` builds on CPU then moves; `"device"` builds directly on device (faster, but can fail when sharding); `"meta"` builds on meta device and materialises empty tensors on target device (fastest). See the note below on `"meta"` semantics and the pipeline trainer.
`activation_memory_budget`	float \| None	None	Sets `torch._functorch.config.activation_memory_budget`. Requires `torch_compile=True`. See PyTorch activation checkpointing techniques.
`fuse_optim_with_backward`	bool	False	Apply optimizer step from the backward gradient hook so each parameter's gradient is consumed and freed immediately after it's produced. Biggest savings when combined with activation checkpointing. Supported by the basic `Trainer` and `PipelineTrainer`, not by `DDPTrainer` or `AccelTrainer` (they need gradients intact for the all-reduce). Also incompatible with `max_grad_norm` gradient clipping (clipping needs the full gradient vector before any update is applied), `gradient_accumulation_steps > 1`, and fp16 mixed precision.
`speed_metrics_start_step`	int	1	Step at which to start collecting speed metrics. `1` excludes the first step (to skip `torch.compile` warmup). Set higher for longer compile warmups.
`set_dataset_epoch`	bool	True	If the train dataset has `set_epoch(epoch: int)`, call it at the start of each epoch.
`debug_optimizer_groups`	bool	False	Log the parameter-name -> optimizer-group assignments produced from the `optimizer_groups` constructor argument when the optimizer is built. Useful to verify that an `optimizer_groups` mapping picks up the parameters you expect.

Notes on construct_model_on = "meta":

Meta construction produces a model with uninitialised (empty) tensors. Normally you'd load a checkpoint right after construction to fill them. If you set "meta" and no checkpoint is available, the basic trainer falls back to constructing on CPU automatically so you never end up training against uninitialised weights.
PipelineTrainer always builds the model on meta - the field is effectively ignored for that trainer. When no checkpoint is present, rank-0 builds the full model on CPU just long enough to initialise weights, then dispatches the initialised parameters/buffers to each rank's empty stage. With a checkpoint, the meta model is filled directly from the checkpoint with no CPU materialisation step.
Transformers v5 exposes an API for initialising weights after meta construction, which would let us skip the rank-0 CPU materialisation step entirely. This is not wired up yet.

gc_threshold is a real trade-off, not just an "OOM knob":

Too low: empty_cache() / gc.collect() fire frequently. Each call synchronises the CUDA stream and can noticeably slow training.
Too high: the caching allocator holds onto fragmented blocks until you actually OOM.

Start with the default and only lower it if you see OOM from fragmentation (reserved >> allocated). If you lower it and throughput drops, you've gone too far - raise it back toward the OOM edge.

Not supported by AccelTrainer or DDPTrainer:

fuse_optim_with_backward (both trainers assert against it - they need the complete gradient vector after backward for their all-reduce / gradient synchronisation step, and the fused hook would consume and free each gradient before that could happen). PipelineTrainer does support fuse_optim_with_backward.

Not supported by PipelineTrainer:

mixed_precision="fp16" (use "bf16")
torch_compile=True combined with zero-bubble schedules

DDPTrainingArguments¶

Adds DDP-specific options. Defined in ddp/ddp_trainer.py.

Field	Type	Default	Description
`dispatch_batches`	bool	True	When True, rank-0 loads and dispatches every batch. When False, each rank reads its own shard and a `SynchronizedDataLoader` agrees on when to stop. See the note below - False is usually the right default for performance.
`ddp`	DDPArguments	default	Nested DDP wrapper options (see below).
`post_local_sgd`	PostLocalSGDArguments	default	Nested Post-Local-SGD options (see below).

When to set dispatch_batches = True:

Primary: the dataset implementation doesn't support sharding. Rank-0 iterating it centrally is the only way to feed every rank.
Secondary: you want the DDP run to see the exact same sequence of training examples as a single-GPU or pipeline run, so training curves are directly comparable. Sharding gives each rank a different subset, so curves from sharded vs. non-sharded runs are not directly comparable.

Otherwise prefer dispatch_batches = False. Centralised loading adds per-step dispatch overhead and serialises data loading on rank-0; sharded loading parallelises it across ranks.

`DDPArguments` (nested as `ddp.*`)¶

Passed directly to torch.nn.parallel.DistributedDataParallel. See PyTorch DDP docs.

Field	Type	Default	Description
`broadcast_buffers`	bool	True	Broadcast module buffers at each forward.
`init_sync`	bool	True	Synchronise parameters/buffers during init.
`bucket_cap_mb`	int \| None	None	Bucket size in MB for grad allreduce.
`find_unused_parameters`	bool	False	Enable to allow unused parameters at the cost of an extra traversal.
`gradient_as_bucket_view`	bool	True	Memory-saving grad layout optimization.
`static_graph`	bool	False	Enable DDP static-graph optimization.
`skip_all_reduce_unused_params`	bool	False	Skip the allreduce for unused-parameter handling.

`PostLocalSGDArguments` (nested as `post_local_sgd.*`)¶

Post-Local-SGD wraps the optimizer so that each rank takes local steps and periodically averages parameters instead of performing all-reduce every step.

Field	Type	Default	Description
`enabled`	bool	False	Enable Post-Local-SGD.
`start_step`	int	500	Step at which local SGD kicks in (before this, standard all-reduce is used).
`period`	int	4	Average every `period` optimizer steps.
`post_local_gradient_allreduce`	bool	False	Whether to still all-reduce gradients after the switch-over.

FSDP2TrainingArguments¶

Adds FSDP2 (torch.distributed.fsdp.fully_shard) options. Defined in fsdp2/fsdp2_trainer.py.

Field	Type	Default	Description
`dispatch_batches`	bool	True	Same semantics as `DDPTrainingArguments.dispatch_batches`. True = rank-0 loads and dispatches every batch; False = each rank reads its own shard via `SynchronizedDataLoader`.
`fsdp2`	FSDP2Arguments	default	Nested FSDP2 policy options (see below).

fuse_optim_with_backward is not supported - FSDP2 needs gradients intact for the reduce-scatter in its backward hook.

`FSDP2Arguments` (nested as `fsdp2.*`)¶

Passed to fully_shard(). See the PyTorch FSDP2 docs.

Field	Type	Default	Description
`reshard_after_forward`	bool \| int	True	True behaves like ZeRO-3 (reshard parameters after forward, minimum memory, more communication). False behaves like ZeRO-2 (keep params unsharded between forward and backward, lower comm, higher memory). An int N enables ZeRO++ hybrid sharding across N ranks.
`param_dtype`	str \| None	None	FSDP `MixedPrecisionPolicy` parameter dtype (e.g., `"bfloat16"`). None disables FSDP-level mixed precision; the trainer's existing AMP autocast still applies.
`reduce_dtype`	str \| None	None	Gradient reduce-scatter accumulation dtype. Typically `"float32"` when `param_dtype="bfloat16"` for numerical stability.
`buffer_dtype`	str \| None	None	Non-parameter buffer dtype.
`cpu_offload`	bool	False	Enable `CPUOffloadPolicy` - offload parameters (and gradients) to CPU between uses. Dramatically reduces GPU memory at the cost of PCIe transfer time.
`shard_transformer_layers`	bool	True	Apply `fully_shard` layer-by-layer on transformer blocks before the root module. Required for meaningful memory savings - root-only wrapping can't reshard per-layer during forward/backward.
`transformer_layers_path`	str	`"causal_lm.layer_stack.layers"`	Dotted attribute path that resolves to the iterable of transformer blocks. The default matches Forgather's standard causal-LM structure (see `modelsrc/transformer/`). Override for models with a different block-list path; set to an empty/unresolvable path to fall back to root-only sharding.

Checkpointing: model and optimizer state are saved as per-rank DTensor shards (SharingPattern.PER_RANK) via torch.distributed.checkpoint.state_dict.get_model_state_dict / get_optimizer_state_dict with default (sharded) StateDictOptions. The model is saved under key fsdp2_model (not model) so it routes through the CheckpointCoordinator's PER_RANK path rather than the safetensors path, which cannot handle DTensors. Checkpoints are tied to the world size they were saved at - resuming at a different world size is not supported without going through torch.distributed.checkpoint (DCP).

When world_size == 1, the trainer transparently degrades to the single-device path without calling fully_shard.

PipelineTrainingArguments¶

Adds pipeline-parallel options. Defined in pipeline/pipeline_trainer.py.

Field	Type	Default	Description
`n_microbatches`	int	4	Micro-batches each full batch is split into. Must evenly divide `per_device_train_batch_size`. Higher = better pipeline utilisation, but more memory.
`stages_per_rank`	int	1	Pipeline stages per GPU. Only set > 1 with `is_multistage=True` schedules (e.g., `ScheduleZBVZeroBubble`).
`pp_stage_type`	str	`"loop"`	`"loop"` for round-robin stage assignment, `"v"` for V-layout used by `ScheduleZBVZeroBubble`.
`is_multistage`	bool	False	Set True when using a multi-stage scheduler (inherits from `PipelineScheduleMulti`).
`debug_pipeline`	bool	False	Internal: enable DEBUG-level logging.
`debug_split_model`	bool	False	Internal: dump model-split details.
`debug_model_params`	bool	False	Internal: dump per-parameter placement/dtype.
`debug_model_init`	bool	False	Internal: dump init sequence.

See pipeline-parallel.md for a full walkthrough.

Constructor Arguments¶

The dataclass fields above define the training configuration. Each trainer class also takes a handful of non-primitive constructor arguments for the model, optimizer, callbacks, etc. These are passed directly when you instantiate the trainer (or are wired up by the YAML templates).

`BaseTrainer` (abstract)¶

All concrete trainers inherit these keyword arguments:

Argument	Type	Description
`args`	`BaseTrainingArguments` \| dict	Required. Training configuration. Dicts are converted via `dacite`.
`model`	`nn.Module` \| None	Model instance. Either `model` or `model_init` must be set.
`model_init`	`Callable[[], nn.Module]` \| None	Zero-arg factory called to construct the model. Required for trainers that build on meta device (e.g., `PipelineTrainer`).
`data_collator`	callable \| None	Batches a list of dataset items into tensors. Defaults to `torch.utils.data.default_collate` in the simple `Trainer`.
`train_dataset`	dataset \| None	Training dataset or pre-built dataloader.
`eval_dataset`	dataset \| None	Evaluation dataset or pre-built dataloader.
`processing_class`	callable \| None	Tokenizer or feature extractor (saved alongside the model for HF compat).
`callbacks`	list[TrainerCallback] \| None	Callbacks to attach. Defaults to the class's `default_callbacks()`.
`compute_loss_func`	callable \| None	Optional fused or custom loss function. Required when using `gradient_accumulation_steps > 1`.

`Trainer` (simple, single-device)¶

Adds:

Argument	Type	Description
`distributed_env`	`DistributedEnvInterface`	Required. Provides rank / world-size / device info. Single-device subclasses assert `world_size == 1`.
`optimizer_factory`	`Callable[[params], Optimizer]`	Builds the optimizer from model parameters. If omitted and `optimizer_cls_and_kwargs` is also None, a default AdamW is built from `args`.
`optimizer_cls_and_kwargs`	`tuple[type, dict]`	HF-Trainer-compatible alternative to `optimizer_factory`.
`optimizer_groups`	`Mapping[str, Mapping \\| None]` \| None	Optional regex-to-group mapping used to assign parameters to named optimizer param groups with per-group hyperparameter overrides. See Parameter groups via `optimizer_groups` below.
`lr_scheduler_factory`	`Callable[[Optimizer], LRScheduler]`	Builds the LR scheduler from the optimizer.
`enable_activation_checkpoint_fn`	`Callable[[int, nn.Module], None]`	Called to enable activation checkpointing on the model. Defaults to `enable_hf_activation_checkpointing`.
`fused_loss_factory`	`Callable[[nn.Module], LossFunction]`	Builds a loss function that shares state with the model (e.g., fused cross entropy).

Parameter groups via `optimizer_groups`¶

optimizer_groups lets you split a model's parameters into named groups, each matched by a regex against the fully-qualified parameter name, and each contributing its own hyperparameter overrides to the optimizer. The mapping has the form:

optimizer_groups = {
    "<group_name>": {
        "regex": r"<pattern>",
        "config": {"<override_key>": <override_value>, ...},
    },
    # ...
}

config may be omitted (or set to None) when a group just carves parameters out without changing hyperparameters. Setting an entry's value to None removes the group entirely - useful for cancelling an inherited group from a child template without replacing the whole mapping.

Each parameter is assigned to the first group whose regex matches (dict insertion order), so more-specific patterns should come first. Parameters that match no group fall through to an implicit default group with no overrides - every parameter is guaranteed to end up somewhere. Empty groups are dropped from the result.

The canonical use case is excluding biases, norms, and embeddings from weight decay:

[optimizer_groups]
optimizer_groups: &optimizer_groups
    no_decay:
        regex: 'norm|bias|embed|lm_head'
        config:
            weight_decay: 0.0

Named entries are deliberate: they let a child template override an individual group by re-declaring its name (and replacing the entire spec), rather than having to re-specify every group in the mapping. Set debug_optimizer_groups: True in the trainer args to log each parameter -> group assignment when the optimizer is built - useful for verifying that the regex picks up the parameters you expected.

Supported by Trainer, DDPTrainer, AccelTrainer, FSDP2Trainer, and PipelineTrainer.

`AccelTrainer`¶

Extends Trainer, adds:

Argument	Type	Description
`accelerator`	`accelerate.Accelerator`	Required. Pre-configured Accelerate object. AMP/FP8 and mixed-precision should be configured on the Accelerator, not via `args.mixed_precision` (which is ignored with a warning). `fuse_optim_with_backward` is not supported (Accelerate needs gradients intact for its synchronisation step).

args.gradient_accumulation_steps is reconciled with accelerator.gradient_accumulation_steps at init time (Accelerator wins, with a warning).

`DDPTrainer`¶

Extends Trainer, adds:

Argument	Type	Description
`fused_loss_factory`	`Callable[[nn.Module], LossFunction]`	Same as `Trainer`, but passed through explicitly.

No new required arguments; DDP wrapping is done internally using the DDPArguments in args.ddp. fuse_optim_with_backward is not supported: DDP needs the complete gradient vector after backward for the all-reduce, and the fused hook would consume and free each gradient before that could happen.

When world_size == 1, the trainer transparently degrades to the single-device path without wrapping in DDP.

`FSDP2Trainer`¶

Extends Trainer. Same keyword arguments as Trainer; no FSDP2-specific constructor parameters. fully_shard wrapping happens inside _prepare_model(), before the optimizer is built, so the optimizer is constructed from the already-sharded DTensor parameters.

When world_size == 1, the trainer transparently degrades to the single-device path without calling fully_shard.

`PipelineTrainer`¶

Extends Trainer, adds:

Argument	Type	Description
`model_splitter`	`ModelSplitter`	Required. Function that splits the model (built on meta device) into stage modules and `PipelineStage` objects. See `model_splitter.py`.
`pipe_schedule_factory`	scheduler class	Scheduler class from `torch.distributed.pipelining`, e.g., `ScheduleGPipe`, `Schedule1F1B`, `ScheduleInterleaved1F1B`, `ScheduleInterleavedZeroBubble`, `ScheduleZBVZeroBubble`. Default: `ScheduleGPipe`.

Notes and constraints:

model must be None - pipeline must construct via model_init.
per_device_train_batch_size and per_device_eval_batch_size must be evenly divisible by n_microbatches.
dataloader_drop_last is forced to True (pipeline needs fixed shapes).
mixed_precision="fp16" is rejected; use "bf16".
torch_compile=True with zero-bubble schedulers is rejected at init time.
world_size must be > 1.

Source of Truth¶

The definitive documentation lives next to the code:

src/forgather/ml/trainer/trainer_types.py - MinimalTrainingArguments
src/forgather/ml/trainer/base_trainer.py - BaseTrainingArguments, BaseTrainer
src/forgather/ml/trainer/trainer.py - TrainingArguments, Trainer
src/forgather/ml/trainer/accelerate/accel_trainer.py - AccelTrainingArguments, AccelTrainer
src/forgather/ml/trainer/ddp/ddp_trainer.py - DDPTrainingArguments, DDPTrainer, DDPArguments, PostLocalSGDArguments
src/forgather/ml/trainer/fsdp2/fsdp2_trainer.py - FSDP2TrainingArguments, FSDP2Trainer, FSDP2Arguments
src/forgather/ml/trainer/pipeline/pipeline_trainer.py - PipelineTrainingArguments, PipelineTrainer

When updating an option, update the dataclass comment first and then synchronise the table above.

Trainer Options Reference¶

Training-Argument Hierarchy¶

MinimalTrainingArguments¶

Output and execution¶

Steps and epochs¶

Batching and data loading¶

Eval / logging cadence¶

torch.compile¶

Checkpointing¶

Optimizer and LR scheduler (HF compat)¶

Gradient checkpointing¶

BaseTrainingArguments¶

Default dtype and eval¶

Checkpoint preservation¶

Memory and debugging¶

Scaled Dot-Product Attention¶

Matmul precision and dynamo¶

Mixed precision and FP8¶

TrainingArguments (simple Trainer)¶

DDPTrainingArguments¶

DDPArguments (nested as ddp.*)¶

PostLocalSGDArguments (nested as post_local_sgd.*)¶

FSDP2TrainingArguments¶

FSDP2Arguments (nested as fsdp2.*)¶

PipelineTrainingArguments¶

Constructor Arguments¶

BaseTrainer (abstract)¶

Trainer (simple, single-device)¶

Parameter groups via optimizer_groups¶

AccelTrainer¶

DDPTrainer¶

FSDP2Trainer¶

PipelineTrainer¶

Source of Truth¶

TrainingArguments (simple `Trainer`)¶

`DDPArguments` (nested as `ddp.*`)¶

`PostLocalSGDArguments` (nested as `post_local_sgd.*`)¶

`FSDP2Arguments` (nested as `fsdp2.*`)¶

`BaseTrainer` (abstract)¶

`Trainer` (simple, single-device)¶

Parameter groups via `optimizer_groups`¶

`AccelTrainer`¶

`DDPTrainer`¶

`FSDP2Trainer`¶

`PipelineTrainer`¶