Testing Guide

This document covers how to run and work with the Forgather test suite. It is intended for developers contributing to the project.

Prerequisites

pytest, pytest-cov, and pytest-mock are installed by the base pip install -e . described in the Installation guide, so there is no separate test extra to install. You also need a working PyTorch installation. Some tests additionally require:

  • CUDA -- a GPU with CUDA support (tests that need it are skipped automatically when unavailable)
  • torchrun -- PyTorch's distributed launcher, included with PyTorch
  • torchdata -- the torchdata.stateful_dataloader package (needed by test_fast_hf_loader.py; tests are skipped if missing)
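
To see which of these optional prerequisites are present before running the suite, a small probe script can help. The sketch below is not part of Forgather: the function names are hypothetical, and it uses only the standard library, importing torch only when it is installed.

```python
import importlib.util
import shutil


def has_module(name: str) -> bool:
    """Return True if the module `name` appears to be importable."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. torchdata) is itself missing.
        return False


def probe_environment() -> dict:
    """Report which optional test prerequisites appear to be available."""
    cuda = False
    if has_module("torch"):
        import torch

        cuda = torch.cuda.is_available()
    return {
        "torch": has_module("torch"),
        "cuda": cuda,
        "torchrun": shutil.which("torchrun") is not None,
        "torchdata.stateful_dataloader": has_module("torchdata.stateful_dataloader"),
    }


if __name__ == "__main__":
    for name, available in probe_environment().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

Anything reported as missing only means the corresponding tests will be skipped or cannot be launched; the rest of the unit suite should still run.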

Quick Reference

# Run the full unit test suite
pytest tests/unit/

# Run a specific test group
pytest tests/unit/forgather/
pytest tests/unit/ml/
pytest tests/unit/ml/diloco/

# Run a single test file
pytest tests/unit/ml/test_checkpoints.py

# Run with coverage
pytest tests/unit/ --cov=forgather --cov-report=term-missing

# Integration tests (require GPU -- see docs/development/integration-testing.md)
pytest tests/integration/ -m smoke                        # Smoke test (~40s)
pytest tests/integration/ -m "integration and not slow"   # Training tests (~75s)
pytest tests/integration/ -m integration                  # Full suite incl. inference (~2 min)

# Run distributed tests (require torchrun)
torchrun --nproc_per_node=2 tests/test_checkpoint_integration.py
./tests/run_dataloader_dispatcher_tests.sh

Test Organization

tests/
├── conftest.py                        # Shared fixtures (temp_dir, mock_model, etc.)
├── CHECKPOINT_TESTING.md              # Checkpoint integration test documentation
├── run_dataloader_dispatcher_tests.sh # Shell driver for distributed dataloader tests
├── unit/                              # Standard pytest unit tests
│   ├── forgather/                     # Core framework tests
│   └── ml/                            # ML subsystem tests
│       ├── datasets/                  # Dataset loading and processing
│       └── diloco/                    # DiLoCo (Distributed Low-Communication) training
├── integration/                       # Integration tests (spec-driven, see integration-testing.md)
│   ├── specs/                         # YAML test specifications
│   ├── test_training.py               # Training project tests
│   └── test_inference.py              # Inference + perplexity tests
├── pipeline_split/                    # Pipeline parallelism tests (torchrun)
├── fixtures/                          # Test fixtures (placeholder)
├── utils/                             # Test utilities (placeholder)
└── [root-level test files]            # Standalone tests, benchmarks, profiling scripts

Test Markers

The following pytest markers are defined in pytest.ini:

| Marker | Meaning |
|---|---|
| unit | Unit tests |
| integration | Integration tests (end-to-end training projects) |
| slow | Slow-running tests (inference + perplexity, multi-GPU) |
| smoke | Fast smoke tests (subset of integration) |

Use -m to filter by marker:

pytest tests/ -m "unit"
pytest tests/ -m "not slow"

Unit Tests (tests/unit/)

Unit tests are the primary test suite. They run with plain pytest, require no GPUs, and mock distributed state where needed. This is what you should run most often during development.

Core Framework (tests/unit/forgather/)

Tests for the configuration, template, and code generation systems:

| Area | Files | What They Test |
|---|---|---|
| Configuration | test_config.py | ConfigText, Config, ConfigDict parsing and handling |
| Preprocessing | test_preprocess.py | Jinja2 preprocessing with custom line statement syntax |
| Code generation | test_codegen.py | PyEncoder -- generating Python code from configuration objects |
| YAML | test_yaml_encoder.py, test_yaml_utils.py | YAML serialization with custom tags (!partial, !singleton, !factory, !var) |
| Templates | test_template_utils.py | Template inheritance and inclusion utilities |
| Utilities | test_dotdict.py, test_dynamic.py, test_utils.py | Dynamic objects, nested dict access, general utilities |
| Graph | test_graph_encoder.py, test_latent.py | Graph representation and encoding |

When to run: After modifying anything in the core forgather package -- configuration parsing, template resolution, code generation, or YAML handling.

ML Subsystem (tests/unit/ml/)

Tests for the machine learning infrastructure. These cover training, checkpointing, optimization, model construction, and distributed coordination:

| Area | Files | What They Test |
|---|---|---|
| Checkpointing | test_checkpoints.py, test_checkpoint_types.py, test_sharded_checkpoint.py | Checkpoint saving/loading, state sharing patterns (replicated, per_rank, global), distributed checkpoint coordination, manifest generation |
| Trainer | test_trainer_components.py, test_training_script.py | PeriodicFunction, AMPContext, JsonLogger, TrainerState, TrainerControl, training script generation |
| Distributed | test_distributed.py, test_replication_validation.py | Rank/world-size utilities, DDP replication validation (uses mocks, no actual distributed init) |
| Model | test_construct.py, test_model_conversion.py, test_resize_embeddings.py, test_no_init_weights.py, test_remap_params.py | Model construction, vLLM conversion, embedding resizing, weight remapping |
| Optimization | test_optim_components.py | Optimizer construction, scheduling, parameter groups |
| Loss | test_loss.py | CausalLoss and ChunkedCausalLoss for memory-efficient large-vocabulary training |
| Data | test_data_collator.py, test_tokenizer.py | Data collation for padded sequences, tokenizer utilities |
| Analysis | test_analysis.py | Training log parsing and summary statistics |
| Infrastructure | test_file_locking.py, test_memory_monitor.py, test_utils.py | File locking for multi-GPU saves, memory tracking, ML utilities |

When to run: After modifying trainers, checkpointing, model construction, optimizers, loss functions, or any ML infrastructure code.

Note on CUDA: Most tests in this group run on CPU. A few tests in test_trainer_components.py (mixed-precision / AMP tests), test_optim_components.py, and test_loss.py are gated behind torch.cuda.is_available() and skip automatically on CPU-only machines.

Datasets (tests/unit/ml/datasets/)

| File | What It Tests |
|---|---|
| test_dataset_utils.py | Dataset utility functions |
| test_soft_sequential.py | Soft sequential dataset interleaving probabilities |
| test_fast_hf_loader.py | Fast HuggingFace dataset loader (requires torchdata.stateful_dataloader) |
| test_interleaved_checkpoint_bug.py | Checkpoint state restoration for interleaved datasets |

When to run: After modifying dataset loading, interleaving, or the HuggingFace loader.

DiLoCo (tests/unit/ml/diloco/)

Tests for the DiLoCo (Distributed Low-Communication) training system:

| File | What It Tests |
|---|---|
| test_worker.py | Pseudo-gradient computation and synchronization logic |
| test_server.py | Outer optimizer (SGD with Nesterov), state serialization |
| test_server_client.py | Client-server communication |
| test_streaming.py | Streaming and buffering during sync |
| test_async.py | Asynchronous operations |
| test_dashboard.py | Monitoring dashboard |
| test_fault_tolerance.py | Failure recovery mechanisms |
| test_diloco_callback.py | DiLoCo training callback integration |

When to run: After modifying anything in the DiLoCo subsystem.
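
For orientation, here is a minimal pure-Python sketch of one outer step as DiLoCo is usually described: a worker reports a pseudo-gradient (global parameters minus its locally trained parameters), and the server applies SGD with Nesterov momentum to it. The function names, the flat list-of-floats parameter representation, and the learning-rate and momentum values are illustrative assumptions, not Forgather's API. The momentum update follows the formulation used by PyTorch's SGD.

```python
from typing import List, Tuple


def pseudo_gradient(global_params: List[float], local_params: List[float]) -> List[float]:
    """Pseudo-gradient: how far local training moved away from the global weights."""
    return [g - l for g, l in zip(global_params, local_params)]


def nesterov_outer_step(
    params: List[float],
    pseudo_grads: List[float],
    momentum_buf: List[float],
    lr: float = 0.7,        # illustrative value, not taken from Forgather
    momentum: float = 0.9,  # illustrative value, not taken from Forgather
) -> Tuple[List[float], List[float]]:
    """Apply one SGD-with-Nesterov update and return (new_params, new_buffer)."""
    # Accumulate the pseudo-gradient into the momentum buffer.
    new_buf = [momentum * b + g for b, g in zip(momentum_buf, pseudo_grads)]
    # Nesterov look-ahead: step along grad + momentum * updated buffer.
    new_params = [
        p - lr * (g + momentum * b)
        for p, g, b in zip(params, pseudo_grads, new_buf)
    ]
    return new_params, new_buf


if __name__ == "__main__":
    global_params = [1.0, 2.0]
    local_params = [0.5, 2.5]  # weights after some local training steps
    grads = pseudo_gradient(global_params, local_params)
    updated, buf = nesterov_outer_step(global_params, grads, [0.0, 0.0])
    print(grads, updated, buf)
```

With several workers, the server would typically average the pseudo-gradients before applying this step; that averaging is omitted here for brevity.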

Root-Level Tests (tests/)

Root-level test files are a mix of standalone integration tests, benchmarks, and profiling scripts. Some of these run via pytest, but several are standalone scripts that require torchrun or manual execution.

Data Processing and Packing

These test the sequence packing and data pipeline logic. They run with pytest:

pytest tests/test_bin_packing.py
pytest tests/test_packing_comparison.py
pytest tests/test_packing_batch_behavior.py
pytest tests/test_shuffle_output.py
pytest tests/test_qwen3_packing.py
pytest tests/test_document_boundaries.py

| File | What It Tests |
|---|---|
| test_bin_packing.py | Bin-packing algorithm for fitting documents into fixed-size containers |
| test_packing_comparison.py | Greedy vs. optimized packing strategy efficiency |
| test_packing_batch_behavior.py | Batch-level packing behavior |
| test_shuffle_output.py | Shuffling correctness in packed sequences |
| test_qwen3_packing.py | Packing with the Qwen3 tokenizer (151k vocabulary) |
| test_document_boundaries.py | Document boundary tracking in packed sequences |

When to run: After modifying sequence packing, bin packing, or the data collation pipeline.
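
As background for what these tests exercise, the sketch below shows one common bin-packing heuristic, first-fit decreasing, applied to document lengths. It is only an illustration of the general technique: the function name is hypothetical, and the packing strategies Forgather actually implements may differ.

```python
from typing import List


def first_fit_decreasing(doc_lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack document lengths into bins holding at most `capacity` tokens each."""
    bins: List[List[int]] = []
    free: List[int] = []  # remaining space in each bin, parallel to `bins`
    # Placing the longest documents first tends to leave less wasted space.
    for length in sorted(doc_lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"document of length {length} exceeds capacity {capacity}")
        for i, space in enumerate(free):
            if length <= space:
                # First existing bin with enough room wins.
                bins[i].append(length)
                free[i] -= length
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([length])
            free.append(capacity - length)
    return bins


if __name__ == "__main__":
    print(first_fit_decreasing([5, 3, 7, 2, 4, 1], capacity=8))
    # prints [[7, 1], [5, 3], [4, 2]]
```

A test for real packing code would typically assert the same invariants this example satisfies: no bin exceeds the capacity, and every document appears exactly once.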

Other Standalone Tests (pytest-compatible)

pytest tests/test_divergence_detection.py
pytest tests/test_model_equivalence.py
pytest tests/test_optimizer_state_dict.py

| File | What It Tests |
|---|---|
| test_divergence_detection.py | Loss divergence detection via dual exponential moving average windows |
| test_model_equivalence.py | HuggingFace vs. Forgather model output comparison and weight transfer |
| test_optimizer_state_dict.py | Optimizer state serialization and deserialization |
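
To illustrate the idea behind dual-EMA divergence detection, the sketch below tracks a fast and a slow exponential moving average of the loss and flags divergence when the fast one rises well above the slow one. The class name, smoothing factors, and threshold are assumptions made for this example; Forgather's detector may use different parameters and a different trigger rule.

```python
from typing import Optional


class DualEmaDivergenceDetector:
    """Hypothetical detector comparing a short-window and a long-window EMA."""

    def __init__(self, short_alpha: float = 0.5, long_alpha: float = 0.05,
                 threshold: float = 1.5) -> None:
        self.short_alpha = short_alpha  # larger alpha reacts faster
        self.long_alpha = long_alpha    # smaller alpha gives a slow baseline
        self.threshold = threshold      # ratio of short/long that signals divergence
        self.short: Optional[float] = None
        self.long: Optional[float] = None

    def update(self, loss: float) -> bool:
        """Feed one loss value; return True if the loss appears to be diverging."""
        if self.short is None or self.long is None:
            # Seed both averages with the first observation.
            self.short = self.long = loss
            return False
        self.short += self.short_alpha * (loss - self.short)
        self.long += self.long_alpha * (loss - self.long)
        return self.short > self.threshold * self.long


if __name__ == "__main__":
    detector = DualEmaDivergenceDetector()
    for step, loss in enumerate([2.0] * 5 + [10.0]):
        print(step, loss, detector.update(loss))
```

In this run the steady losses never trigger the detector, while the spike to 10.0 pushes the fast average to 6.0 against a slow average of 2.4, which exceeds the 1.5x threshold.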

Tests Requiring torchrun (Distributed)

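These tests are standalone scripts that launch multiple processes through torchrun (the checkpoint integration test can also run as a single process with plain python). They cannot be run with plain pytest.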

Checkpoint Integration Test

# Single process (no torchrun needed)
python tests/test_checkpoint_integration.py

# Multi-process DDP simulation
torchrun --nproc_per_node=2 tests/test_checkpoint_integration.py
torchrun --nproc_per_node=4 tests/test_checkpoint_integration.py

# Run specific scenario
python tests/test_checkpoint_integration.py --scenario basic
python tests/test_checkpoint_integration.py --scenario spike
python tests/test_checkpoint_integration.py --scenario all

Tests checkpoint preservation, race-condition fixes, divergence detection, and DDP coordination with synthetic metrics (no actual training). See tests/CHECKPOINT_TESTING.md for full details.

Dataloader Dispatcher Tests

# Run full suite with gloo backend (CPU-only, no GPUs needed)
./tests/run_dataloader_dispatcher_tests.sh

# Run with nccl backend (requires GPUs)
./tests/run_dataloader_dispatcher_tests.sh nccl

Tests multi-dimensional batch dispatching across data-parallel and model-parallel dimensions. The shell script runs multiple configurations: 1D pure DP, 1D pure MP, single-rank edge cases, and 2D hybrid meshes.
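
The configurations above can be pictured with a small sketch that maps a global rank to its position in a 2D mesh and decides which batch it should receive. It assumes a row-major layout with the model-parallel dimension innermost, and it assumes that ranks in the same model-parallel group consume the same batch. Both are assumptions made for illustration, and the function names are hypothetical rather than taken from the dispatcher.

```python
from typing import Tuple


def mesh_coords(rank: int, dp_size: int, mp_size: int) -> Tuple[int, int]:
    """Return (dp_index, mp_index) for `rank` in a row-major dp x mp mesh."""
    if not 0 <= rank < dp_size * mp_size:
        raise ValueError(f"rank {rank} outside mesh of size {dp_size * mp_size}")
    return divmod(rank, mp_size)


def batch_index_for_rank(rank: int, dp_size: int, mp_size: int) -> int:
    """Ranks sharing a dp_index (one model-parallel group) get the same batch."""
    dp_index, _ = mesh_coords(rank, dp_size, mp_size)
    return dp_index


if __name__ == "__main__":
    # 2D hybrid mesh: two data-parallel replicas, each split across two ranks.
    print([batch_index_for_rank(r, 2, 2) for r in range(4)])  # prints [0, 0, 1, 1]
    # 1D pure DP: every rank gets its own batch.
    print([batch_index_for_rank(r, 4, 1) for r in range(4)])  # prints [0, 1, 2, 3]
    # 1D pure MP: every rank sees the same batch.
    print([batch_index_for_rank(r, 1, 4) for r in range(4)])  # prints [0, 0, 0, 0]
```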

Distributed Pipeline Test

# Requires CUDA -- uses nccl backend
torchrun --nnodes 1 --nproc_per_node 2 tests/pipeline_split/test_distributed_pipeline.py
torchrun --nnodes 1 --nproc_per_node 4 tests/pipeline_split/test_distributed_pipeline.py

Tests pipeline parallelism using PyTorch's PipelineStage, including BlockMask attention transport between stages. Requires CUDA.

Distributed build_sync Test

# Uses gloo backend (CPU-only)
torchrun --nproc_per_node 4 --standalone tests/unit/ml/test_build_sync_distributed.py

# Test with local process group
torchrun --nproc_per_node 4 --standalone tests/unit/ml/test_build_sync_distributed.py --local

Tests the build_sync context manager's barrier-based synchronization across ranks.

Tests Requiring CUDA

The following tests or test subsets require a CUDA-capable GPU. When CUDA is unavailable, they are skipped automatically via @pytest.mark.skipif or @unittest.skipUnless decorators.
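
A minimal sketch of this gating pattern, using unittest.skipUnless, might look like the following. The helper and test names are hypothetical. The pytest equivalent is a @pytest.mark.skipif(not torch.cuda.is_available(), reason="...") decorator on the test function.

```python
import importlib.util
import unittest


def cuda_available() -> bool:
    """True only if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


class TestDeviceGating(unittest.TestCase):
    def test_runs_everywhere(self):
        # No GPU needed, so this always runs.
        self.assertEqual(sum([1, 2, 3]), 6)

    @unittest.skipUnless(cuda_available(), "CUDA not available")
    def test_needs_gpu(self):
        # Only executed when a CUDA device is present; reported as skipped otherwise.
        import torch

        x = torch.ones(4, device="cuda")
        self.assertEqual(x.sum().item(), 4.0)


if __name__ == "__main__":
    unittest.main()
```

pytest collects unittest.TestCase classes as well, so a file written this way runs under plain pytest and reports the gated test as skipped on CPU-only machines.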

Selectively skipped tests within pytest-compatible files:

  • tests/unit/ml/test_trainer_components.py -- 8 tests for mixed-precision (AMPContext, FP16State, gradient scaling)
  • tests/unit/ml/test_optim_components.py -- 4 tests for CUDA-specific optimizer behavior
  • tests/unit/ml/test_loss.py -- 2 tests (memory profiling, chunked cross-entropy on GPU)

Standalone scripts that require CUDA:

  • tests/pipeline_split/test_distributed_pipeline.py -- uses nccl backend and torch.cuda.set_device()
  • tests/benchmark_adafactor_triton.py -- Triton kernel benchmarks (GPU-only)
  • tests/test_adafactor_triton.py -- AdaFactor Triton kernel functional tests
  • tests/profile_large_vocab_memory.py -- CUDA memory profiling for large-vocabulary models

Benchmarks and Profiling Scripts

These are not part of the regular test suite. Run them manually when profiling performance:

# AdaFactor Triton kernel benchmark (requires CUDA)
python tests/benchmark_adafactor_triton.py

# Large vocabulary memory profiling (requires CUDA)
python tests/profile_large_vocab_memory.py

# AdaFactor Triton functional tests (requires CUDA)
python tests/test_adafactor_triton.py

Shared Fixtures

tests/conftest.py provides fixtures available to all pytest-based tests:

| Fixture | Description |
|---|---|
| temp_dir | Creates and cleans up a temporary directory |
| training_args | Default TrainingArguments with sensible test defaults |
| mock_model | A SimpleMockModel (single nn.Linear layer) |
| mock_dataset | A mock dataset with __len__ returning 10 |
| mock_optimizer | Factory function that creates an Adam optimizer |
| mock_scheduler | Factory function that creates a StepLR scheduler |

Before Submitting a PR

Run the full unit test suite and the non-slow integration tests to catch regressions:

pytest tests/unit/
pytest tests/integration/ -m "integration and not slow"

If your changes touch model code generation or templates:

pytest tests/integration/ -m smoke

If your changes touch data loading or packing:

pytest tests/test_bin_packing.py tests/test_packing_comparison.py tests/test_document_boundaries.py

If your changes touch checkpointing:

pytest tests/unit/ml/test_checkpoints.py tests/unit/ml/test_checkpoint_types.py
python tests/test_checkpoint_integration.py --scenario all

With GPU Access

Run the unit suite on the GPU machine so that the CUDA-gated tests execute instead of being skipped, then run the distributed tests:

pytest tests/unit/

# Distributed tests
torchrun --nproc_per_node=2 tests/test_checkpoint_integration.py --scenario all
./tests/run_dataloader_dispatcher_tests.sh
torchrun --nproc_per_node 4 --standalone tests/unit/ml/test_build_sync_distributed.py

If you have multiple GPUs, also run the pipeline test:

torchrun --nnodes 1 --nproc_per_node 2 tests/pipeline_split/test_distributed_pipeline.py

Focused Development

Run only the tests relevant to what you changed:

# Configuration / templates
pytest tests/unit/forgather/

# Trainers and training loop
pytest tests/unit/ml/test_trainer_components.py tests/unit/ml/test_training_script.py

# DiLoCo
pytest tests/unit/ml/diloco/

# Datasets
pytest tests/unit/ml/datasets/

# Optimizers
pytest tests/unit/ml/test_optim_components.py

# Loss functions
pytest tests/unit/ml/test_loss.py

Configuration

Pytest configuration lives in pytest.ini at the project root:

[tool:pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts =
    -v
    --tb=short
    --strict-markers
    --disable-warnings

--strict-markers means any undeclared marker will cause an error. If you add a new marker, register it in pytest.ini.
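
To make the effect of --strict-markers concrete, the sketch below mimics the check with the standard library: it reads the declared markers from an ini file and reports any marker used but not declared. The markers entries shown are an assumption about how the four markers listed earlier are registered (the excerpt above omits that part of the file), the gpu marker is a made-up example, and the helper names are hypothetical. pytest performs the real check itself during collection.

```python
import configparser
from typing import Iterable, Set

# Assumed layout of the markers section; section name matches the excerpt above.
PYTEST_INI = """
[tool:pytest]
markers =
    unit: Unit tests
    integration: Integration tests
    slow: Slow-running tests
    smoke: Fast smoke tests
"""


def declared_markers(ini_text: str, section: str = "tool:pytest") -> Set[str]:
    """Return the marker names registered under `markers` in the ini text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get(section, "markers", fallback="")
    names = set()
    for line in raw.splitlines():
        line = line.strip()
        if line:
            # Each entry has the form "name: description".
            names.add(line.split(":", 1)[0].strip())
    return names


def undeclared_markers(used: Iterable[str], ini_text: str) -> Set[str]:
    """Markers that --strict-markers would reject because they are not registered."""
    return set(used) - declared_markers(ini_text)


if __name__ == "__main__":
    # "gpu" is a made-up marker that has not been registered.
    print(undeclared_markers({"slow", "gpu"}, PYTEST_INI))  # prints {'gpu'}
```

Adding a line such as gpu: <description> under markers would make the hypothetical gpu marker pass the check.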