Checkpoint Testing Project

This project demonstrates the new optimizer and learning rate scheduler checkpoint functionality in Forgather.

Features

  • Optimizer State Checkpointing: Save and restore optimizer state including momentum and other parameter-specific state
  • LR Scheduler State Checkpointing: Maintain learning rate schedules across training interruptions
  • Automatic Checkpoint Discovery: Find the most recent checkpoint by modification time
  • Robust Resume Logic: Handle multiple training sessions with out-of-order checkpoint names

Project Structure

checkpointing/
├── README.md
├── meta.yaml                    # Project metadata
├── project_index.ipynb          # Interactive demo and documentation
└── templates/
    ├── project.yaml            # Base project configuration
    └── configs/
        ├── train.yaml          # Initial training with checkpointing
        └── resume.yaml         # Resume from latest checkpoint

Quick Start

1. Initial Training

Train a model with checkpointing enabled:

cd examples/tiny_experiments/checkpointing
python ../../../bin/forgather -t train.yaml train -d 0

This should:

  • Train for 500 steps (currently a full epoch is run instead, due to configuration inheritance)
  • Save checkpoints every 100 steps
  • Save optimizer and scheduler state with each checkpoint

2. Resume Training

Continue from the latest checkpoint:

python ../../../bin/forgather -t resume.yaml train -d 0

This should:

  • Automatically find the latest checkpoint
  • Restore model, optimizer, and scheduler state
  • Continue training from step 500 to 800

Current Status

The checkpoint functionality is implemented and covered by unit tests, but the project configuration still needs refinement so that it properly overrides the inherited training settings. The core checkpoint features themselves work, as the unit tests in tests/unit/ml/test_checkpoints.py demonstrate.

Configuration Options

The checkpoint functionality is controlled by these training arguments:

Option                     Description                               Default
save_optimizer_state       Save optimizer state in checkpoints       true
save_scheduler_state       Save LR scheduler state in checkpoints    true
restore_optimizer_state    Restore optimizer state when resuming     true
restore_scheduler_state    Restore scheduler state when resuming     true
resume_from_checkpoint     true (auto-find) or path string           false
save_total_limit           Maximum checkpoints to keep               3
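
As an illustration of what save_total_limit implies, the sketch below prunes all but the newest N checkpoint directories. It is not Forgather's implementation: the function name prune_checkpoints, the "checkpoint-" directory prefix, and the choice to rank by modification time are assumptions made for the example.

```python
import shutil
from pathlib import Path


def prune_checkpoints(output_dir, save_total_limit=3, prefix="checkpoint-"):
    """Delete the oldest checkpoint directories beyond save_total_limit.

    Hypothetical helper: the name, the prefix, and the mtime ordering are
    assumptions for illustration, not Forgather's actual API.
    Returns the list of directories that were removed.
    """
    if save_total_limit is None or save_total_limit <= 0:
        return []  # treat a missing or non-positive limit as "keep everything"

    # Oldest first, ranked by modification time rather than by step number.
    checkpoints = sorted(
        (p for p in Path(output_dir).glob(f"{prefix}*") if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
    )

    # Everything except the newest `save_total_limit` entries is excess.
    excess = checkpoints[:-save_total_limit]
    for path in excess:
        shutil.rmtree(path)
    return excess
```

With the default limit of 3, a run that saves every 100 steps for 500 steps would end up keeping only the three most recently written checkpoints.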

Implementation Details

Checkpoint Discovery

  • Uses file modification time rather than step numbers
  • Robust across multiple training sessions
  • Validates checkpoints before selection
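
The discovery rules above can be sketched roughly as follows. This is a minimal illustration, not Forgather's code: find_latest_checkpoint, is_valid_checkpoint, and the "checkpoint-" prefix are hypothetical names, and the validity check (a non-empty directory) is only a stand-in for whatever validation the library actually performs.

```python
from pathlib import Path
from typing import Optional


def is_valid_checkpoint(path: Path) -> bool:
    """Stand-in validity check (assumption): a directory with at least one entry."""
    return path.is_dir() and any(path.iterdir())


def find_latest_checkpoint(output_dir, prefix="checkpoint-",
                           is_valid=is_valid_checkpoint) -> Optional[Path]:
    """Return the most recently modified valid checkpoint, or None if there is none.

    Ranking by modification time means a later session that wrote
    checkpoint-100 wins over an earlier session's checkpoint-500.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None

    # Validate first, then rank only the candidates that passed.
    candidates = [p for p in root.glob(f"{prefix}*") if is_valid(p)]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Selecting by modification time is what makes the lookup tolerant of out-of-order checkpoint names: the step number embedded in the directory name is never parsed.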

State Management

  • Optimizer state includes momentum, variance estimates, etc.
  • Scheduler state preserves step counts and learning rate schedules
  • Graceful handling of missing state files
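
A rough sketch of this save/restore pattern, including the "missing file" case, is shown below. It relies only on the state_dict() / load_state_dict() convention that PyTorch optimizers and schedulers follow; the helper names, the use of pickle, the file name scheduler.pkl, and the ToyScheduler stand-in are all assumptions for illustration and do not reflect Forgather's actual file layout or serialization.

```python
import pickle
from pathlib import Path


class ToyScheduler:
    """Stand-in for an LR scheduler (assumption): exponential decay per step."""

    def __init__(self, base_lr=0.1, gamma=0.5):
        self.base_lr = base_lr
        self.gamma = gamma
        self.step_count = 0

    def step(self):
        self.step_count += 1

    def current_lr(self):
        return self.base_lr * self.gamma ** self.step_count

    def state_dict(self):
        return {"step_count": self.step_count}

    def load_state_dict(self, state):
        self.step_count = state["step_count"]


def save_state(obj, path):
    """Serialize obj.state_dict() to path, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj.state_dict(), f)


def restore_state(obj, path) -> bool:
    """Load state into obj if the file exists; return False instead of raising."""
    path = Path(path)
    if not path.is_file():
        return False  # missing state file: caller continues with fresh state
    with open(path, "rb") as f:
        obj.load_state_dict(pickle.load(f))
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        scheduler = ToyScheduler()
        for _ in range(3):
            scheduler.step()
        save_state(scheduler, Path(tmp) / "scheduler.pkl")

        resumed = ToyScheduler()
        print(restore_state(resumed, Path(tmp) / "scheduler.pkl"), resumed.step_count)
```

Returning a boolean rather than raising lets the caller decide whether a missing optimizer or scheduler file is fatal or merely means training resumes with freshly initialized state.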

Multi-Trainer Support

  • Works with Trainer, AccelTrainer, and PipelineTrainer
  • Proper synchronization in distributed training
  • Trainer-specific optimizations for state handling

Testing

The project includes unit tests in tests/unit/ml/test_checkpoints.py covering:

  • Checkpoint validation and discovery
  • State saving and loading
  • Configuration option handling
  • Integration with trainer workflow

Run tests with:

pytest tests/unit/ml/test_checkpoints.py -v