Forgather Documentation

Forgather is a configuration-driven ML framework that uses template inheritance and code generation to eliminate configuration duplication and enable systematic experimentation. Instead of copying and modifying entire training scripts, you inherit from base templates and specify only what changes.

Most research ML codebases accrete: one training script becomes ten, each a near-copy with subtle differences. Every variation is expensive to try. Small bugs — a loss function wired wrong, a scheduler silently reset on resume, a CLI flag that never reached the tokenizer — hide across forks.
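The "inherit and specify only what changes" idea can be illustrated with a minimal plain-Python sketch. This is an analogy only: Forgather itself works through templates and code generation, and the `inherit` helper, `BASE_CONFIG`, and every key below are hypothetical names chosen for illustration, not part of Forgather's API.

```python
from copy import deepcopy


def inherit(base: dict, overrides: dict) -> dict:
    """Return a new config: `base` with `overrides` merged in recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = inherit(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the base value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base configuration shared by every experiment.
BASE_CONFIG = {
    "model": {"hidden_size": 512, "num_layers": 8},
    "optimizer": {"name": "adamw", "lr": 3e-4},
    "trainer": {"max_steps": 1000, "batch_size": 32},
}

# An experiment states only what differs from the base.
high_lr = inherit(BASE_CONFIG, {"optimizer": {"lr": 1e-3}})
```

Because the experiment carries only its delta, a fix made to the base reaches every derived configuration, which is the property that near-copy training scripts lack.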

Source code and examples: github.com/jdinalt/forgather

Quick Navigation

Start here:

  • Installation - Host venv (pip / uv) or the Docker images
  • Docker images - Full reference for the dev and runtime (distributable) images: CLI flags, env vars, multi-node operation, troubleshooting. Docker is the recommended install path on Linux
  • Getting Started - First training run, key CLI commands, and the web UI tour
  • Forgather Server Walkthrough - End-to-end tour of the web UI from a fresh install to chatting with a freshly-trained model
  • Forgather Server Reference - CLI flags, config-file syntax, persistent state, full API and panel reference
  • Core Concepts - Configuration pipeline, projects, templates, trainers
  • Release Notes - Per-release change summaries; current release is 1.2.1

Configuration:

Training:

  • Trainer Options Reference - Every training-argument field and constructor parameter across all built-in trainers
  • Pipeline Parallel - Pipeline parallelism for consumer GPUs and limited interconnects
  • Multi-node Training - Practical setup, submit flow, and hang diagnosis for training across multiple machines on a LAN
  • Trainer Control - External control of running training jobs (save, stop, abort)
  • Training Performance Metrics - Token throughput, FLOP tracking, and MFU
  • DiLoCo - Distributed Local-SGD training across heterogeneous machines on a LAN
  • FP8 Training - FP8 training via torchao
  • QAT Training - Quantization-aware training via torchao; pair with forgather finalize --quantize (also works alone as post-training quantization)
  • Checkpointing - Distributed checkpoint system for multi-GPU and multi-node training
  • Torch Titan Integration - Forgather integration with PyTorch's Torch Titan training framework
  • Adafactor Triton Performance - Performance analysis for the Triton-optimized Adafactor kernel
  • Distributed Eval: Zero Batches - Diagnose and fix the "produced zero batches" / "did not yield any examples" eval errors in DDP/distributed training

Models and inference:

  • Model Architecture - Transformer module inventory, composition patterns, and optimization flags
  • Model Conversion - Bidirectional HuggingFace / Forgather model conversion
  • Update Model - Migrate a saved Forgather model to newer Forgather sources via versioned config + state_dict migrations; preserves saved hyperparameters
  • Finalize Model - Build a clean handoff directory after pre-training: source + tokenizer + chat template + generation_config + a single preserved checkpoint
  • Add-Tokens Config - YAML format for --add-tokens (ChatML / new EOS / pad)
  • EOS Tokens and generate() Stopping Criteria - Theory of operation: how HF's generate() resolves stopping across multiple EOS-bearing files
  • vLLM Integration - Distributed inference with vLLM (currently blocked on Transformers v5)

Guides:

Operations:

  • TLS - Enable HTTPS for forgather server, dataset_server, and inference_server off a single per-host CA + cert. Single-host bring-up, cluster cert distribution, renewal, Docker runtime integration, command reference, threat model.

Tutorials

  • Tiny Llama - Demonstration of basic usage
  • Projects Overview - Learn about the Forgather Project abstraction
  • Project Composition - How the template system works
  • Dynamic LM - Demonstrates how models are dynamically composed
  • Samantha - Demonstrates how to use Forgather to fine-tune a 7B-parameter model on the Samantha dataset
  • H.P. Lovecraft Project - Learn how to create workspaces and projects, while training a model to summon the Elder Gods

Journey                                       Project
Pretrain from scratch                         pretrain/small-llm
Fine-tune a 7B model (multi-GPU)              finetune/samantha
Instruction / reasoning fine-tune             finetune/open-orca
Long-context fine-tuning + RoPE recipes       tutorials/hp_lovecraft_project
Cut peak memory                               tiny_experiments/peak_memory
Pick an optimizer                             tiny_experiments/optimizers
Pipeline-parallel recipes                     tiny_experiments/pipeline_parallel
Decentralised / bandwidth-limited training    tiny_experiments/diloco

pretrain/small-llm — A 162M-parameter Llama trained from scratch on the SmolLM corpus (FineWeb-Edu + Cosmopedia) with packed sequences and Flex Attention. Ten production-ready configs covering 1× and 10× Chinchilla budgets, AdamW / Adafactor / bf16 variants. Includes reproducible Chinchilla scaling-law plots.
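For a rough sense of what the 1× and 10× budgets mean for a 162M-parameter model, here is a back-of-envelope sketch. It assumes "1× Chinchilla" means about 20 training tokens per parameter (the commonly quoted rule of thumb) and uses the standard 6·N·D estimate for training FLOPs. The project's configs may define their budgets differently, so treat the numbers as illustrative.

```python
# Assumption: 1x Chinchilla is taken as ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20


def chinchilla_tokens(num_params: int, multiple: float = 1.0) -> int:
    """Token budget for a model of `num_params` at `multiple` x Chinchilla."""
    return round(num_params * TOKENS_PER_PARAM * multiple)


params = 162_000_000
one_x = chinchilla_tokens(params)        # 3.24 billion tokens
ten_x = chinchilla_tokens(params, 10.0)  # 32.4 billion tokens

# Common approximation: training FLOPs ~= 6 * parameters * tokens.
approx_flops = 6 * params * one_x

print(f"1x:  {one_x:,} tokens")
print(f"10x: {ten_x:,} tokens")
print(f"1x training compute: ~{approx_flops:.2e} FLOPs")
```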

finetune/samantha — Fine-tune Mistral-7B or Llama-3.2-1B on the Samantha conversational dataset across every trainer backend. Configs cover single-GPU, 2/4-GPU pipeline parallel, FSDP-2, and DDP. Documented throughput (~8.9K tok/s on 4× RTX 4090). The most-referenced finetune project in the library.

finetune/open-orca — Instruction and reasoning fine-tune on Open-Orca with ChatML-formatted evaluation prompts covering chain-of-thought math, logic puzzles, reading comprehension, and summarisation. 1B Llama 3.2 on a 1B-token budget completes in ~11 hours on 4× RTX 4090.
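The quoted budget and wall time imply an average throughput, which can be worked out as below. This is only arithmetic on the figures above, assuming the budget is exactly 1e9 tokens and the ~11 hours covers the whole run. The result is an estimate, not a measured number.

```python
def tokens_per_second(total_tokens: float, hours: float) -> float:
    """Average throughput implied by a token budget and a wall-clock time."""
    return total_tokens / (hours * 3600)


aggregate = tokens_per_second(1e9, 11)  # all four GPUs together
per_gpu = aggregate / 4

print(f"{aggregate:,.0f} tok/s aggregate, {per_gpu:,.0f} tok/s per GPU")
```

That works out to roughly 25K tok/s across the four GPUs, or a little over 6K tok/s per GPU.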

tutorials/hp_lovecraft_project — Fine-tune Mistral-7B / Llama-2-7B on the complete works of H.P. Lovecraft on a single 24 GB GPU, with up to 53K tokens of context. Includes a four-way RoPE comparison (plain, YaRN, Llama-3 NTK-by-parts, bumped θ) evaluating 8K-trained models out to 16K.

tiny_experiments/peak_memory — A systematic 9-way ablation of memory-optimisation techniques on a 1.6B model: BF16, activation checkpointing, torch.compile, fused optimizer step, activation-memory budget. Headline: 81% peak-memory reduction at ~2.7× throughput over the unoptimised baseline.

tiny_experiments/optimizers — Empirical comparison of ten optimizers (Muon, Apollo, AdamW, Adafactor, SinkGD, SGD, and more) on a 30M Llama. Headline: Muon wins at small batch (eval loss 2.6778 vs AdamW 2.7392). Includes per-optimizer memory and throughput tiers.

tiny_experiments/pipeline_parallel — Test harness and reference configs for PyTorch's pipeline-parallel schedules (GPipe, 1F1B, ZBV, interleaved), with checkpoint save/resume coverage across 2/4-GPU setups.

tiny_experiments/diloco — DiLoCo (distributed local SGD) on a 4M-parameter model. Pseudo-gradient compression, streaming-fragment overlap with the backward pass, sync and async modes. The lowest-communication-bandwidth trainer in the library.
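As a minimal sketch of the pseudo-gradient idea behind DiLoCo: each worker trains locally for a number of steps, the difference between the shared parameters and each worker's result is averaged, and an outer optimizer applies that average to the shared parameters. The code below is not Forgather's implementation. The function names are hypothetical, parameters are plain Python lists, and the outer update is plain SGD for brevity (the published DiLoCo recipe uses Nesterov momentum for the outer optimizer).

```python
from typing import List

Vector = List[float]


def pseudo_gradient(global_params: Vector, worker_params: List[Vector]) -> Vector:
    """Average over workers of (global - local) after their local steps."""
    n = len(worker_params)
    return [
        sum(g - w[i] for w in worker_params) / n
        for i, g in enumerate(global_params)
    ]


def outer_step(global_params: Vector, delta: Vector, outer_lr: float) -> Vector:
    """Plain-SGD outer update applied to the shared parameters."""
    return [g - outer_lr * d for g, d in zip(global_params, delta)]


# Two workers start from the same global parameters and drift apart locally.
global_params = [1.0, 2.0]
worker_params = [[0.8, 2.2], [0.6, 2.0]]

delta = pseudo_gradient(global_params, worker_params)
new_global = outer_step(global_params, delta, outer_lr=1.0)
```

With `outer_lr=1.0` and no momentum, the update reduces to averaging the workers' parameters. Only `delta` has to cross the network once per round, which is why this family of methods suits bandwidth-limited links.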

Example Project Collections

  • Tiny Experiments - A collection of experiments and integration tests using (mostly) small models
  • Dataset Projects - A collection of demonstration dataset configurations
  • Finetune - A collection of finetuning examples
  • Tokenizers - Tokenizer definition examples
  • Models - Example model definitions

Development

Getting Help

Documentation Structure

docs/
├── getting-started/     # Installation and first training run
├── core-concepts/       # Configuration pipeline, projects, templates
├── configuration/       # Template and configuration system
├── project-templates/   # Reusable project templates (LM Training, Auto LR)
├── trainers/            # Training system (PP, DiLoCo, control, metrics, FP8)
├── checkpointing/       # Distributed checkpoint system
├── datasets/            # Data loading, packing, and preprocessing
├── inference/           # vLLM integration guide
├── guides/              # How-to guides (models, datasets, CLI, conversion)
├── development/         # Testing and development workflow
├── fused_loss/          # Fused linear cross-entropy loss