Forgather Documentation¶

Forgather is a configuration-driven ML framework that uses template inheritance and code generation to eliminate configuration duplication and enable systematic experimentation. Instead of copying and modifying entire training scripts, you inherit from base templates and specify only what changes.

Most research ML codebases accrete: one training script becomes ten, each a near-copy with subtle differences. Every variation is expensive to try. Small bugs — a loss function wired wrong, a scheduler silently reset on resume, a CLI flag that never reached the tokenizer — hide across forks.
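
The inherit-and-override idea can be illustrated without Forgather itself. The following is a minimal sketch in plain Python, assuming a configuration held as nested dictionaries. The names `deep_merge` and `BASE_CONFIG`, and every key shown, are hypothetical; this is not Forgather's template syntax or API, which the Configuration pages describe.

```python
import copy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict: `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the base value.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base configuration shared by every experiment.
BASE_CONFIG = {
    "model": {"hidden_size": 512, "num_layers": 8},
    "optimizer": {"name": "adamw", "lr": 1e-3},
    "trainer": {"batch_size": 32, "max_steps": 1000},
}

# An experiment states only what differs from the base.
experiment = deep_merge(BASE_CONFIG, {"optimizer": {"lr": 3e-4}})

print(experiment["optimizer"])  # {'name': 'adamw', 'lr': 0.0003}
print(experiment["model"] == BASE_CONFIG["model"])  # True
```

The experiment definition is one line long, so a bug fixed in the base reaches every variant instead of hiding in a fork.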

Source code and examples: github.com/jdinalt/forgather

Start here:

Installation - Host venv (pip / uv) or the Docker images
Docker images - Full reference for the dev and runtime (distributable) images: CLI flags, env vars, multi-node operation, troubleshooting. Docker is the recommended install path on Linux.
Getting Started - First training run, key CLI commands, and the web UI tour
Forgather Server Walkthrough - End-to-end tour of the web UI from a fresh install to chatting with a freshly-trained model
Forgather Server Reference - CLI flags, config-file syntax, persistent state, full API and panel reference
Core Concepts - Configuration pipeline, projects, templates, trainers
Release Notes - Per-release change summaries; current release is 1.2.1

Configuration:

Configuration Overview - Template system and YAML configuration
Syntax Reference - Complete syntax reference for tags and directives
Model Initialization - Regex-based parameter initialization
Project Templates - LM Training and Auto LR project templates
High-level API - The "Project" abstraction
Low-level API - The API upon which the Project abstraction is built

Training:

Trainer Options Reference - Every training-argument field and constructor parameter across all built-in trainers
Pipeline Parallel - Pipeline parallelism for consumer GPUs and limited interconnects
Multi-node Training - Practical setup, submit flow, and hang diagnosis for training across multiple machines on a LAN
Trainer Control - External control of running training jobs (save, stop, abort)
Training Performance Metrics - Token throughput, FLOP tracking, and MFU
DiLoCo - Distributed Local-SGD training across heterogeneous machines on LAN
FP8 Training - FP8 training via torchao
QAT Training - Quantization-aware training via torchao; pair with forgather finalize --quantize (also works alone as post-training quantization)
Checkpointing - Distributed checkpoint system for multi-GPU and multi-node training
Torch Titan Integration - Forgather integration with PyTorch's Torch Titan training framework
Adafactor Triton Performance - Performance analysis for the Triton-optimized Adafactor kernel
Distributed Eval: Zero Batches - Diagnose and fix the "produced zero batches" / "did not yield any examples" eval errors in DDP/distributed training

Models and inference:

Model Architecture - Transformer module inventory, composition patterns, and optimization flags
Model Conversion - Bidirectional HuggingFace / Forgather model conversion
Update Model - Migrate a saved Forgather model to newer Forgather sources via versioned config + state_dict migrations; preserves saved hyperparameters
Finalize Model - Build a clean handoff directory after pre-training: source + tokenizer + chat template + generation_config + a single preserved checkpoint
Add-Tokens Config - YAML format for --add-tokens (ChatML / new EOS / pad)
EOS Tokens and generate() Stopping Criteria - Theory of operation: how HF's generate() resolves stopping across multiple EOS-bearing files
vLLM Integration - Distributed inference with vLLM (currently blocked on Transformers v5)

Guides:

Forgather Server Walkthrough - End-to-end tour of the web UI: install through training a small model and chatting with it
Forgather Server Reference - Full feature + API reference for the server's panels, modals, and endpoints
Creating a Model Project - Define a custom model architecture from scratch
Model CLI Reference - forgather model command: construct, test, checkpoint, and use models
Creating a Dataset Project - Load, pack, and interleave HuggingFace datasets
Dataset CLI Reference - forgather dataset command: inspect, sample, and histogram datasets
Dataset Server - Multi-node training: serve HF cache + named local datasets via FORGATHER_DATASET_SERVER
Working with Tokenizer Projects - CLI commands for tokenizer projects
Debugging Configuration Errors - Systematic troubleshooting and common error patterns
Interactive CLI - Interactive shell with tab completion and editor integration
Evaluating Models - Loss/perplexity evaluation via forgather eval
Log Analysis - Training log summaries, plots, and heatmaps
TensorBoard - Launch TensorBoard against a model's runs/ directory from the webui or forgather tb
MkDocs - Serve the bundled Forgather docs locally with live-reload via the Services menu or forgather mkdocs

Operations:

TLS - Enable HTTPS for forgather server, dataset_server, and inference_server from a single per-host CA and certificate. Covers single-host bring-up, cluster cert distribution, renewal, Docker runtime integration, command reference, and threat model.

Tutorials¶

Tiny Llama - Demonstration of basic usage
Projects Overview - Learn about the Forgather Project abstraction
Project Composition - How the template system works
Dynamic LM - Demonstrates how models are dynamically composed
Samantha - Demonstrates how to use Forgather to fine-tune a 7B-parameter model on the Samantha dataset
H.P. Lovecraft Project - Learn how to create workspaces and projects, while training a model to summon the Elder Gods

Featured Examples¶

Journey	Project
Pretrain from scratch	pretrain/small-llm
Fine-tune a 7B model (multi-GPU)	finetune/samantha
Instruction / reasoning fine-tune	finetune/open-orca
Long-context fine-tuning + RoPE recipes	tutorials/hp_lovecraft_project
Cut peak memory	tiny_experiments/peak_memory
Pick an optimizer	tiny_experiments/optimizers
Pipeline-parallel recipes	tiny_experiments/pipeline_parallel
Decentralised / bandwidth-limited training	tiny_experiments/diloco

pretrain/small-llm — A 162M-parameter Llama trained from scratch on the SmolLM corpus (FineWeb-Edu + Cosmopedia) with packed sequences and Flex Attention. Ten production-ready configs covering 1× and 10× Chinchilla budgets, AdamW / Adafactor / bf16 variants. Includes reproducible Chinchilla scaling-law plots.
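
For a sense of scale, the token budgets behind those configs can be estimated with simple arithmetic. The sketch below assumes the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter. The exact ratio and token counts used by the project's configs are not stated here, so treat the results as approximate.

```python
# Assumption: ~20 training tokens per parameter (Chinchilla rule of thumb).
# The project's configs may use a different ratio.
TOKENS_PER_PARAM = 20
PARAMS = 162_000_000  # the 162M-parameter model described above


def chinchilla_tokens(params: int, multiple: int = 1) -> int:
    """Approximate token budget at `multiple` times the Chinchilla ratio."""
    return params * TOKENS_PER_PARAM * multiple


for multiple in (1, 10):
    tokens = chinchilla_tokens(PARAMS, multiple)
    print(f"{multiple}x Chinchilla: ~{tokens / 1e9:.2f}B tokens")
# 1x Chinchilla: ~3.24B tokens
# 10x Chinchilla: ~32.40B tokens
```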

finetune/samantha — Fine-tune Mistral-7B or Llama-3.2-1B on the Samantha conversational dataset across every trainer backend. Configs cover single-GPU, 2/4-GPU pipeline parallel, FSDP-2, and DDP. Documented throughput (~8.9K tok/s on 4× RTX 4090). The most-referenced finetune project in the library.

finetune/open-orca — Instruction and reasoning fine-tune on Open-Orca with ChatML-formatted evaluation prompts covering chain-of-thought math, logic puzzles, reading comprehension, and summarisation. 1B Llama 3.2 on a 1B-token budget completes in ~11 hours on 4× RTX 4090.

tutorials/hp_lovecraft_project — Fine-tune Mistral-7B / Llama-2-7B on the complete works of H.P. Lovecraft on a single 24 GB GPU, with up to 53K tokens of context. Includes a four-way RoPE comparison (plain, YaRN, Llama-3 NTK-by-parts, bumped θ) evaluating 8K-trained models out to 16K.

tiny_experiments/peak_memory — A systematic 9-way ablation of memory-optimisation techniques on a 1.6B model: BF16, activation checkpointing, torch.compile, fused optimizer step, activation-memory budget. Headline: 81% peak-memory reduction at ~2.7× throughput over the unoptimised baseline.

tiny_experiments/optimizers — Empirical comparison of ten optimizers (Muon, Apollo, AdamW, Adafactor, SinkGD, SGD, and more) on a 30M Llama. Headline: Muon wins at small batch (eval loss 2.6778 vs AdamW 2.7392). Includes per-optimizer memory and throughput tiers.

tiny_experiments/pipeline_parallel — Test harness and reference configs for PyTorch's pipeline-parallel schedules (GPipe, 1F1B, ZBV, interleaved), with checkpoint save/resume coverage across 2/4-GPU setups.

tiny_experiments/diloco — DiLoCo (distributed local SGD) on a 4M-parameter model. Pseudo-gradient compression, streaming-fragment overlap with the backward pass, sync and async modes. The lowest-communication-bandwidth trainer in the library.

Example Project Collections¶

Tiny Experiments - A collection of experiments and integration tests using (mostly) small models
Dataset Projects - A collection of demonstration dataset configurations
Finetune - A collection of finetuning examples
Tokenizers - Tokenizer definition examples
Models - Example model definitions

Development¶

API Reference - Auto-generated Python API documentation
Debugging Guide - Tools and techniques for debugging configurations
Known Bugs - Known bugs in top-level modules, with corresponding xfail tests
Testing Guide - How to create and run unit tests
Integration Testing - How to create and run integration tests

Getting Help¶

Documentation Issues: Report documentation problems
Feature Requests: Request new features
Questions: Ask questions in discussions

Documentation Structure¶

docs/
├── getting-started/     # Installation and first training run
├── core-concepts/       # Configuration pipeline, projects, templates
├── configuration/       # Template and configuration system
├── project-templates/   # Reusable project templates (LM Training, Auto LR)
├── trainers/            # Training system (PP, DiLoCo, control, metrics, FP8)
├── checkpointing/       # Distributed checkpoint system
├── datasets/            # Data loading, packing, and preprocessing
├── inference/           # vLLM integration guide
├── guides/              # How-to guides (models, datasets, CLI, conversion)
├── development/         # Testing and development workflow
├── fused_loss/          # Fused linear cross-entropy loss