Forgather Documentation
Forgather is a configuration-driven ML framework that uses template inheritance and code generation to eliminate configuration duplication and enable systematic experimentation. Instead of copying and modifying entire training scripts, you inherit from base templates and specify only what changes.
Most research ML codebases accrete: one training script becomes ten, each a near-copy with subtle differences. Every variation is expensive to try. Small bugs — a loss function wired wrong, a scheduler silently reset on resume, a CLI flag that never reached the tokenizer — hide across forks.
Source code and examples: github.com/jdinalt/forgather
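The "inherit and specify only what changes" idea can be illustrated with a minimal sketch. This is not Forgather's template syntax or API, which is covered in the Configuration pages; it uses plain Python dictionaries and a hypothetical `inherit` helper purely to show how a derived configuration states only its differences from a base.

```python
from copy import deepcopy


def inherit(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` merged in recursively.

    Hypothetical helper for illustration; not part of Forgather.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = inherit(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the base value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base configuration shared by every experiment.
base_config = {
    "model": {"hidden_size": 256, "num_layers": 4},
    "optimizer": {"name": "adamw", "lr": 1e-3},
    "trainer": {"max_steps": 1000},
}

# The experiment states only the value that differs from the base.
experiment = inherit(base_config, {"optimizer": {"lr": 3e-4}})
```

Everything the experiment does not mention is taken from the base unchanged, so a fix to the base reaches every derived configuration without editing each copy.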
Quick Navigation
Start here:
- Installation - Host venv (pip / uv) or the Docker images
- Docker images - Full reference for the dev and runtime (distributable) images: CLI flags, env vars, multi-node operation, troubleshooting. Docker is the recommended install path on Linux
- Getting Started - First training run, key CLI commands, and the web UI tour
- Forgather Server Walkthrough - End-to-end tour of the web UI from a fresh install to chatting with a freshly-trained model
- Forgather Server Reference - CLI flags, config-file syntax, persistent state, full API and panel reference
- Core Concepts - Configuration pipeline, projects, templates, trainers
- Release Notes - Per-release change summaries; current release is 1.2.1
Configuration:
- Configuration Overview - Template system and YAML configuration
- Syntax Reference - Complete syntax reference for tags and directives
- Model Initialization - Regex-based parameter initialization
- Project Templates - LM Training and Auto LR project templates
- High-level API - The "Project" abstraction
- Low-level API - The API upon which the Project abstraction is built
Training:
- Trainer Options Reference - Every training-argument field and constructor parameter across all built-in trainers
- Pipeline Parallel - Pipeline parallelism for consumer GPUs and limited interconnects
- Multi-node Training - Practical setup, submit flow, and hang diagnosis for training across multiple machines on a LAN
- Trainer Control - External control of running training jobs (save, stop, abort)
- Training Performance Metrics - Token throughput, FLOP tracking, and MFU
- DiLoCo - Distributed Local-SGD training across heterogeneous machines on LAN
- FP8 Training - FP8 training via torchao
- QAT Training - Quantization-aware training via torchao; pair with `forgather finalize --quantize` (also works alone as post-training quantization)
- Checkpointing - Distributed checkpoint system for multi-GPU and multi-node training
- Torch Titan Integration - Forgather integration with PyTorch's Torch Titan training framework
- Adafactor Triton Performance - Performance analysis for the Triton-optimized Adafactor kernel
- Distributed Eval: Zero Batches - Diagnose and fix the "produced zero batches" / "did not yield any examples" eval errors in DDP/distributed training
Models and inference:
- Model Architecture - Transformer module inventory, composition patterns, and optimization flags
- Model Conversion - Bidirectional HuggingFace / Forgather model conversion
- Update Model - Migrate a saved Forgather model to newer Forgather sources via versioned config + state_dict migrations; preserves saved hyperparameters
- Finalize Model - Build a clean handoff directory after pre-training: source + tokenizer + chat template + generation_config + a single preserved checkpoint
- Add-Tokens Config - YAML format for `--add-tokens` (ChatML / new EOS / pad)
- EOS Tokens and `generate()` Stopping Criteria - Theory of operation: how HF's `generate()` resolves stopping across multiple EOS-bearing files
- vLLM Integration - Distributed inference with vLLM (currently blocked on Transformers v5)
Guides:
- Forgather Server Walkthrough - End-to-end tour of the web UI: install through training a small model and chatting with it
- Forgather Server Reference - Full feature + API reference for the server's panels, modals, and endpoints
- Creating a Model Project - Define a custom model architecture from scratch
- Model CLI Reference - `forgather model` command: construct, test, checkpoint, and use models
- Creating a Dataset Project - Load, pack, and interleave HuggingFace datasets
- Dataset CLI Reference - `forgather dataset` command: inspect, sample, and histogram datasets
- Dataset Server - Multi-node training: serve HF cache + named local datasets via `FORGATHER_DATASET_SERVER`
- Working with Tokenizer Projects - CLI commands for tokenizer projects
- Debugging Configuration Errors - Systematic troubleshooting and common error patterns
- Interactive CLI - Interactive shell with tab completion and editor integration
- Evaluating Models - Loss/perplexity evaluation via `forgather eval`
- Log Analysis - Training log summaries, plots, and heatmaps
- TensorBoard - Launch TensorBoard against a model's `runs/` directory from the webui or `forgather tb`
- MkDocs - Serve the bundled Forgather docs locally with live-reload via the Services menu or `forgather mkdocs`
Operations:
- TLS - Enable HTTPS for `forgather server`, `dataset_server`, and `inference_server` off a single per-host CA + cert. Single-host bring-up, cluster cert distribution, renewal, Docker runtime integration, command reference, threat model.
Tutorials
- Tiny Llama - Demonstration of basic usage
- Projects Overview - Learn about the Forgather Project abstraction
- Project Composition - How the template system works
- Dynamic LM - Demonstrates how models are dynamically composed
- Samantha - Demonstrates how to use Forgather to finetune a 7B parameter model on the Samantha dataset
- H.P. Lovecraft Project - Learn how to create workspaces and projects, while training a model to summon the Elder Gods
Featured Examples
| Journey | Project |
|---|---|
| Pretrain from scratch | pretrain/small-llm |
| Fine-tune a 7B model (multi-GPU) | finetune/samantha |
| Instruction / reasoning fine-tune | finetune/open-orca |
| Long-context fine-tuning + RoPE recipes | tutorials/hp_lovecraft_project |
| Cut peak memory | tiny_experiments/peak_memory |
| Pick an optimizer | tiny_experiments/optimizers |
| Pipeline-parallel recipes | tiny_experiments/pipeline_parallel |
| Decentralised / bandwidth-limited training | tiny_experiments/diloco |
pretrain/small-llm — A 162M-parameter Llama trained from scratch on the SmolLM corpus (FineWeb-Edu + Cosmopedia) with packed sequences and Flex Attention. Ten production-ready configs covering 1× and 10× Chinchilla budgets, AdamW / Adafactor / bf16 variants. Includes reproducible Chinchilla scaling-law plots.
finetune/samantha — Fine-tune Mistral-7B or Llama-3.2-1B on the Samantha conversational dataset across every trainer backend. Configs cover single-GPU, 2/4-GPU pipeline parallel, FSDP-2, and DDP. Documented throughput (~8.9K tok/s on 4× RTX 4090). The most-referenced finetune project in the library.
finetune/open-orca — Instruction and reasoning fine-tune on Open-Orca with ChatML-formatted evaluation prompts covering chain-of-thought math, logic puzzles, reading comprehension, and summarisation. 1B Llama 3.2 on a 1B-token budget completes in ~11 hours on 4× RTX 4090.
tutorials/hp_lovecraft_project — Fine-tune Mistral-7B / Llama-2-7B on the complete works of H.P. Lovecraft on a single 24 GB GPU, with up to 53K tokens of context. Includes a four-way RoPE comparison (plain, YaRN, Llama-3 NTK-by-parts, bumped θ) evaluating 8K-trained models out to 16K.
tiny_experiments/peak_memory — A systematic 9-way ablation of memory-optimisation techniques on a 1.6B model: BF16, activation checkpointing, torch.compile, fused optimizer step, activation-memory budget. Headline: 81% peak-memory reduction at ~2.7× throughput over the unoptimised baseline.
tiny_experiments/optimizers — Empirical comparison of ten optimizers (Muon, Apollo, AdamW, Adafactor, SinkGD, SGD, and more) on a 30M Llama. Headline: Muon wins at small batch (eval loss 2.6778 vs AdamW 2.7392). Includes per-optimizer memory and throughput tiers.
tiny_experiments/pipeline_parallel — Test harness and reference configs for PyTorch's pipeline-parallel schedules (GPipe, 1F1B, ZBV, interleaved), with checkpoint save/resume coverage across 2/4-GPU setups.
tiny_experiments/diloco — DiLoCo (distributed local SGD) on a 4M-parameter model. Pseudo-gradient compression, streaming-fragment overlap with the backward pass, sync and async modes. The lowest-communication-bandwidth trainer in the library.
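As a rough worked example of what the 1× and 10× Chinchilla budgets mentioned for pretrain/small-llm could mean in tokens, the sketch below applies the commonly cited rule of thumb of about 20 training tokens per parameter. That ratio is an assumption brought in for illustration; the project's own configs may define their budgets differently, and `token_budget` is a hypothetical helper.

```python
# Assumption: ~20 tokens per parameter (common Chinchilla rule of thumb).
# This ratio is not taken from the project's configs.
TOKENS_PER_PARAM = 20


def token_budget(num_params: int, multiple: int = 1) -> int:
    """Estimate a training-token budget as a multiple of the assumed ratio."""
    return num_params * TOKENS_PER_PARAM * multiple


params = 162_000_000  # model size quoted for pretrain/small-llm

budget_1x = token_budget(params)
budget_10x = token_budget(params, multiple=10)

print(f"1x budget:  {budget_1x / 1e9:.2f}B tokens")   # 3.24B tokens
print(f"10x budget: {budget_10x / 1e9:.2f}B tokens")  # 32.40B tokens
```

Under that assumption the two budgets come to roughly 3.2B and 32B tokens; treat these as order-of-magnitude estimates rather than the values used by the shipped configs.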
Example Project Collections
- Tiny Experiments - A collection of experiments and integration tests using (mostly) small models
- Dataset Projects - A collection of demonstration dataset configurations
- Finetune - A collection of finetuning examples
- Tokenizers - Tokenizer definition examples
- Models - Example model definitions
Development
- API Reference - Auto-generated Python API documentation
- Debugging Guide - Tools and techniques for debugging configurations
- Known Bugs - Known bugs in top-level modules, with corresponding xfail tests
- Testing Guide - How to create and run unit tests
- Integration Testing - How to create and run integration tests
Getting Help
- Documentation Issues: Report documentation problems
- Feature Requests: Request new features
- Questions: Ask questions in discussions
Documentation Structure
docs/
├── getting-started/ # Installation and first training run
├── core-concepts/ # Configuration pipeline, projects, templates
├── configuration/ # Template and configuration system
├── project-templates/ # Reusable project templates (LM Training, Auto LR)
├── trainers/ # Training system (PP, DiLoCo, control, metrics, FP8)
├── checkpointing/ # Distributed checkpoint system
├── datasets/ # Data loading, packing, and preprocessing
├── inference/ # vLLM integration guide
├── guides/ # How-to guides (models, datasets, CLI, conversion)
├── development/ # Testing and development workflow
├── fused_loss/ # Fused linear cross-entropy loss