
FSDP2 Trainer Integration Test

Minimal integration test project for the FSDP2 (fully_shard) trainer. Parallels the ddp_trainer example but exercises the FSDP2-specific code paths: layer-wise fully_shard wrapping, sharded DTensor state_dict save/load, and CPUOffloadPolicy.

Trainer implementation: src/forgather/ml/trainer/fsdp2/fsdp2_trainer.py

Configurations

Config                   Purpose
2gpu.yaml                Default 2-GPU smoke test: verifies init, forward/backward, and gradient reduction
checkpoint_train.yaml    Trains 500 steps, saving per-rank sharded checkpoints every 200 steps
checkpoint_resume.yaml   Resumes from the sharded checkpoint and continues to 1000 steps
cpu_offload.yaml         Enables fsdp2.cpu_offload=True (CPUOffloadPolicy)

Usage

# 2-GPU smoke test
forgather -t 2gpu.yaml train

# Sharded checkpoint round-trip
forgather -t checkpoint_train.yaml train
forgather -t checkpoint_resume.yaml train

# CPU offload
forgather -t cpu_offload.yaml train

# View logs
forgather logs summary --all --format one-line
forgather logs plot --loss-curves

Notes

  • Sharded checkpoints: the model and optimizer are saved with SharingPattern.PER_RANK, so each rank writes its own file built from get_model_state_dict(model) and get_optimizer_state_dict(model, optim). The resulting checkpoints are tied to the world size at which they were saved. Resuming at a different world size is not supported in this first cut and would require routing through torch.distributed.checkpoint.
  • Layer-wise sharding: the default fsdp2.transformer_layers_path="model.layer_stack.layers" matches the standard Forgather transformer (modelsrc/transformer/). Override in a child config if your model uses a different block-list path.
  • The tiny.yaml base project uses a ~4M-parameter model on Tiny Stories — small enough that FSDP2 offers no real memory savings; this project exists solely to validate that the code paths work.
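
To make the layer-wise sharding note concrete, here is a minimal sketch of how a dotted path such as "model.layer_stack.layers" could be resolved and used to wrap each block before the root module. This is not the trainer's actual code: resolve_attr_path and wrap_layerwise are hypothetical helper names, and the sketch assumes the path is resolved by plain attribute lookup from the object handed to the trainer. The wrap callable is left abstract so the sketch runs without GPUs; in a real run it would be something like PyTorch's fully_shard.

```python
from functools import reduce
from typing import Any, Callable, List


def resolve_attr_path(root: Any, dotted_path: str) -> Any:
    """Follow a dotted attribute path (e.g. 'model.layer_stack.layers') from root.

    Hypothetical helper: assumes every path component is a plain attribute.
    """
    return reduce(getattr, dotted_path.split("."), root)


def wrap_layerwise(
    root: Any, layers_path: str, wrap: Callable[[Any], Any]
) -> List[Any]:
    """Call wrap on every block found at layers_path, then on root itself.

    Wrapping the blocks first and the root last follows the usual FSDP2
    convention, so the root group only owns parameters that no block claimed.
    Returns the objects in the order they were wrapped.
    """
    blocks = resolve_attr_path(root, layers_path)
    wrapped = []
    for block in blocks:
        wrap(block)  # in a real run: fully_shard(block, ...)
        wrapped.append(block)
    wrap(root)  # the root is wrapped last
    wrapped.append(root)
    return wrapped
```

A model with a different layout would only need a different layers_path, which is the role fsdp2.transformer_layers_path plays in a child config. A path that does not exist raises AttributeError in this sketch, which surfaces a misconfigured path early.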