Pipeline Parallel Training
Test harness for the Pipeline Parallel trainer, covering all supported PyTorch pipeline schedules, checkpoint save/resume, and activation checkpointing.
For documentation on the PipelineTrainer module, see docs/trainers/pipeline-parallel.md.
Configurations
Baseline (no pipeline):
| Config | Description |
|---|---|
| `control.yaml` | Standard trainer, single GPU -- baseline for comparison |
Single-stage schedules (stages_per_rank=1):
| Config | GPUs | Schedule | Description |
|---|---|---|---|
| `tiny_gpipe_2gpu.yaml` | 2 | GPipe | Quick debug test with tiny model |
| `gpipe_2gpu.yaml` | 2 | GPipe | GPipe on 2 GPUs |
| `gpipe_4gpu.yaml` | 4 | GPipe | GPipe on 4 GPUs |
| `1f1b_4gpu.yaml` | 4 | 1F1B | 1-forward-1-backward on 4 GPUs |
Multi-stage schedules (stages_per_rank=2):
| Config | GPUs | Schedule | Description |
|---|---|---|---|
| `interleaved_1f1b_4gpu.yaml` | 4 | Interleaved 1F1B | 2 stages per rank, interleaved |
| `looped_bfs_4gpu.yaml` | 4 | Looped BFS | Breadth-first scheduling |
| `zero_bubble_4gpu.yaml` | 4 | Interleaved Zero Bubble | Minimal pipeline bubble |
| `zbvz_4gpu.yaml` | 4 | ZBV Zero Bubble | V-pattern stage assignment |
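With `stages_per_rank=2` on 4 GPUs, the model is split into 8 stages, and the schedules differ in which stages each rank owns. The sketch below illustrates the two common placements: a looped (round-robin) assignment, as typically used by interleaved schedules, and the V-pattern used by ZBV. The function names are hypothetical and are not part of Forgather or PyTorch; whether the trainer computes the mapping exactly this way is an assumption.

```python
def looped_stages(rank: int, world_size: int, stages_per_rank: int) -> list[int]:
    """Round-robin placement: rank r owns stages r, r + W, r + 2W, ..."""
    return [rank + i * world_size for i in range(stages_per_rank)]


def v_stages(rank: int, world_size: int) -> list[int]:
    """V-pattern placement (two stages per rank): stage indices run down the
    ranks and then back up, so rank r owns stages r and 2W - 1 - r."""
    return [rank, 2 * world_size - 1 - rank]


# Print both placements for 4 ranks and 2 stages per rank (8 stages total).
world_size = 4
for rank in range(world_size):
    print(rank, looped_stages(rank, world_size, 2), v_stages(rank, world_size))
```

In the V-pattern, rank 0 holds both the first and the last stage, so the loss is computed on the same rank that receives the input batch.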
Activation checkpointing:
| Config | Description |
|---|---|
| `gpipe_4gpu_activation_checkpoint.yaml` | GPipe with gradient checkpointing enabled |
Checkpoint save/resume:
| Config | Description |
|---|---|
| `checkpoint_test_train.yaml` | Train with frequent checkpoints (save_steps: 50) |
| `checkpoint_test_resume.yaml` | Resume from latest checkpoint |
Cross-architecture:
| Config | Description |
|---|---|
| `tiny_llama_gpipe_3gpu.yaml` | Llama model (instead of default causal) on 3 GPUs |
Usage
# Run a 4-GPU pipeline test
forgather -t gpipe_4gpu.yaml train -d 0,1,2,3
# Compare schedules
forgather -t 1f1b_4gpu.yaml train -d 0,1,2,3
forgather -t interleaved_1f1b_4gpu.yaml train -d 0,1,2,3
# Test checkpoint round-trip
forgather -t checkpoint_test_train.yaml train -d 0,1
forgather -t checkpoint_test_resume.yaml train -d 0,1
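To compare every 4-GPU schedule in one pass, the command lines can be generated rather than typed by hand. The following is a minimal sketch that only prints the commands (a dry run). It reuses the `forgather -t <config> train -d <devices>` form shown above and assumes devices are numbered from 0; the helper name and the config list are illustrative.

```python
import shlex

# 4-GPU schedule configs taken from the tables in this README.
FOUR_GPU_CONFIGS = [
    "gpipe_4gpu.yaml",
    "1f1b_4gpu.yaml",
    "interleaved_1f1b_4gpu.yaml",
    "looped_bfs_4gpu.yaml",
    "zero_bubble_4gpu.yaml",
    "zbvz_4gpu.yaml",
]


def forgather_train_command(config: str, num_gpus: int) -> list[str]:
    """Build the argv for one training run (hypothetical helper).

    Assumes device IDs 0..num_gpus-1, matching the usage examples above.
    """
    devices = ",".join(str(i) for i in range(num_gpus))
    return ["forgather", "-t", config, "train", "-d", devices]


# Dry run: print each command instead of executing it.
for config in FOUR_GPU_CONFIGS:
    print(shlex.join(forgather_train_command(config, 4)))
```

Each printed line can be pasted into a shell, or the list returned by the helper can be passed to `subprocess.run` to launch the runs sequentially.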