FSDP2 Trainer Integration Test
Minimal integration test project for the FSDP2 (fully_shard) trainer.
Parallels the ddp_trainer example but exercises the FSDP2-specific
code paths: layer-wise fully_shard wrapping, sharded DTensor
state_dict save/load, and CPUOffloadPolicy.
Trainer implementation: src/forgather/ml/trainer/fsdp2/fsdp2_trainer.py
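
The layer-wise wrapping can be pictured with a minimal sketch. This is not the code in fsdp2_trainer.py: `resolve_path` and `shard_layerwise` are hypothetical helpers, the stand-in objects replace real modules, and the dotted path is assumed to be resolved relative to the object passed in. The sharding function is injected so the wrapping order can be inspected without GPUs or a process group.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_path(root, dotted_path):
    """Follow a dotted attribute path such as 'model.layer_stack.layers'."""
    return reduce(getattr, dotted_path.split("."), root)


def shard_layerwise(root, layers_path, shard_fn, **shard_kwargs):
    """Apply shard_fn to each block under layers_path, then to the root.

    shard_fn is injected so the sketch runs anywhere; in a real run it
    would be torch.distributed.fsdp.fully_shard.
    """
    for layer in resolve_path(root, layers_path):
        shard_fn(layer, **shard_kwargs)
    # Wrapping the root last picks up whatever lives outside the block
    # list (for example embeddings and the output head).
    shard_fn(root, **shard_kwargs)
    return root


# Stand-in objects (hypothetical) instead of real nn.Module instances.
layers = [SimpleNamespace(name=f"layer{i}") for i in range(3)]
root = SimpleNamespace(
    name="root",
    model=SimpleNamespace(layer_stack=SimpleNamespace(layers=layers)),
)

wrapped = []
shard_layerwise(
    root,
    "model.layer_stack.layers",
    lambda module, **kwargs: wrapped.append(module.name),
)
print(wrapped)  # ['layer0', 'layer1', 'layer2', 'root']
```

With PyTorch 2.6 or later and an initialized process group, `shard_fn` would be `fully_shard` from `torch.distributed.fsdp`, and CPU offload would correspond to passing `offload_policy=CPUOffloadPolicy()` as a keyword argument.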
Configurations
| Config | Purpose |
|---|---|
| `2gpu.yaml` | Default 2-GPU smoke test: verifies init, forward/backward, and gradient reduction |
| `checkpoint_train.yaml` | Trains 500 steps, saving per-rank sharded checkpoints every 200 steps |
| `checkpoint_resume.yaml` | Resumes from the sharded checkpoint and continues to 1000 steps |
| `cpu_offload.yaml` | Enables `fsdp2.cpu_offload=True` (`CPUOffloadPolicy`) |
Usage
# 2-GPU smoke test
forgather -t 2gpu.yaml train
# Sharded checkpoint round-trip
forgather -t checkpoint_train.yaml train
forgather -t checkpoint_resume.yaml train
# CPU offload
forgather -t cpu_offload.yaml train
# View logs
forgather logs summary --all --format one-line
forgather logs plot --loss-curves
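
The checkpoint round-trip only means something if the resume step runs after a successful training run. A small wrapper along these lines could enforce that ordering; `run_round_trip` is a hypothetical helper and not part of Forgather, and it assumes the `forgather` CLI is on `PATH` and is invoked from the project directory.

```python
import shlex
import subprocess

# The two commands of the sharded checkpoint round-trip, in order.
ROUND_TRIP = [
    "forgather -t checkpoint_train.yaml train",
    "forgather -t checkpoint_resume.yaml train",
]


def run_round_trip(commands=ROUND_TRIP, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so the resume step never runs against a
        # failed training run.
        runner(shlex.split(command), check=True)
```

Calling `run_round_trip()` executes both steps. The `runner` parameter exists only so the command sequence can be checked without launching a training job.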
Notes
- Sharded checkpoints: the model and optimizer are saved with `SharingPattern.PER_RANK`; each rank writes its own file with `get_model_state_dict(model)` / `get_optimizer_state_dict(model, optim)`. The resulting checkpoints are tied to the world size they were saved at. Resuming at a different world size is not supported in this first cut and would require routing through `torch.distributed.checkpoint`.
- Layer-wise sharding: the default `fsdp2.transformer_layers_path="model.layer_stack.layers"` matches the standard Forgather transformer (`modelsrc/transformer/`). Override it in a child config if your model uses a different block-list path.
- The `tiny.yaml` base project uses a ~4M-parameter model on Tiny Stories, small enough that FSDP2 offers no real memory savings; this project exists solely to validate that the code paths work.
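
The world-size restriction on per-rank checkpoints can be illustrated with a small sketch. Everything here is an assumption made for illustration: the file names, the `meta.json` sidecar, and the helper functions are hypothetical and do not describe how Forgather lays out or validates its checkpoints.

```python
import json
import tempfile
from pathlib import Path


def rank_file(checkpoint_dir, rank):
    """Hypothetical per-rank file name; the real layout may differ."""
    return Path(checkpoint_dir) / f"model_rank_{rank}.pt"


def write_meta(checkpoint_dir, world_size):
    """Record the world size the checkpoint was saved at."""
    meta = {"world_size": world_size}
    Path(checkpoint_dir, "meta.json").write_text(json.dumps(meta))


def check_world_size(checkpoint_dir, world_size):
    """Refuse to resume when the world size differs from the saved one."""
    meta = json.loads(Path(checkpoint_dir, "meta.json").read_text())
    saved = meta["world_size"]
    if saved != world_size:
        raise RuntimeError(
            f"checkpoint saved with world_size={saved}, resuming with {world_size}"
        )


with tempfile.TemporaryDirectory() as tmp:
    write_meta(tmp, world_size=2)
    check_world_size(tmp, world_size=2)  # same world size: passes silently
    try:
        check_world_size(tmp, world_size=4)
    except RuntimeError as err:
        print(err)
```

In a real run, each rank would save its own shard, for example with `torch.save(get_model_state_dict(model), rank_file(checkpoint_dir, rank))`, where `get_model_state_dict` comes from `torch.distributed.checkpoint.state_dict`. Because each file holds only that rank's shards, a run with a different number of ranks has no matching file to load, which is why a guard like this fails early.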