Tiny Experiments
A collection of small-scale experiments and integration tests built around the TinyStories dataset. Because a ~4M parameter model trained for one epoch on TinyStories takes only a few minutes on a single GPU, this workspace is well-suited for quickly validating trainer features, comparing architectures, and benchmarking training techniques without committing to a long run. Projects in this collection serve dual roles: as working examples of Forgather features and as regression tests for the library itself.
The workspace shares a common template library and base project configuration (forgather_workspace/) so that individual projects only need to declare what differs from the common defaults.
Model Architectures
- tiny_models - Train and compare small (~4M parameter) causal LM architectures — Vanilla Transformer, Llama, DeepOne, Mistral, Qwen3, and Llama+Canon — under identical conditions with a shared BPE tokenizer.
- canon - Experiments with Canon layers (depthwise causal 1D convolutions inserted into transformer blocks) from Physics of Language Models: Part 4.1. Investigates how Canon interacts with and partially substitutes for positional encoding.
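The core operation behind a Canon layer, as described above, is a depthwise causal 1D convolution: each channel is mixed only with its own earlier timesteps. The following is a minimal pure-Python sketch of that operation, not the canon project's code. The function name, the kernel layout, and the example values are hypothetical, and any residual connection or placement inside the transformer block is left out.

```python
def causal_depthwise_conv1d(x, kernels):
    """Depthwise causal 1D convolution (illustrative sketch).

    x:       list of T timesteps, each a list of C floats.
    kernels: list of C kernels; kernels[c][k] weights x[t - k][c],
             so k = 0 is the current token and k > 0 looks backwards.
    """
    out = []
    for t in range(len(x)):
        row = []
        for c, kernel in enumerate(kernels):
            acc = 0.0
            for k, w in enumerate(kernel):
                # Positions before the sequence start act as zero padding,
                # and future positions are never read (causality).
                if t - k >= 0:
                    acc += w * x[t - k][c]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical input: 3 timesteps, 2 channels, kernel size 2.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
kernels = [[0.5, 0.5],   # channel 0: average of current and previous token
           [1.0, 0.0]]   # channel 1: identity
y = causal_depthwise_conv1d(x, kernels)
print(y)
```

Because each channel has its own kernel and there is no cross-channel mixing, the parameter count is only C × K, which keeps such a layer cheap relative to attention or MLP blocks.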
Trainers
- compare_trainers - Side-by-side comparison of Forgather's built-in trainer, the Accelerate-based trainer, and the HuggingFace Transformers trainer on the same workload.
- ddp_trainer - Integration tests and usage examples for the DDP trainer: distributed checkpointing, dataset distribution strategies, and gradient accumulation across data-parallel ranks.
- fsdp2_trainer - Integration test for the FSDP2 (fully_shard) trainer. Exercises layer-wise sharding, sharded DTensor checkpoint save/resume, and CPU offload.
- pipeline_parallel - Test harness for the Pipeline Parallel trainer covering all supported PyTorch pipeline schedules, checkpoint save/resume, and activation checkpointing.
- diloco - Demonstrates DiLoCo (Distributed Low-Communication training, a Local-SGD variant) via DiLoCoCallback: multiple independent workers synchronize outer gradients at configurable intervals over standard network interfaces.
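The outer-gradient synchronization mentioned for diloco can be sketched on a toy scalar problem. This is an illustration of the general idea only and does not use or reflect the DiLoCoCallback API. All names, the quadratic loss, and the plain-SGD outer update are assumptions made for the example; real setups may use a different outer optimizer, such as SGD with momentum.

```python
def local_steps(theta, target, lr, steps):
    """Run inner SGD steps on the toy loss 0.5 * (theta - target) ** 2."""
    for _ in range(steps):
        theta = theta - lr * (theta - target)  # gradient is (theta - target)
    return theta


def diloco_round(global_theta, targets, inner_lr, inner_steps, outer_lr):
    """One synchronization round over all workers (hypothetical helper)."""
    # Each worker starts from the shared parameters and trains independently.
    local_thetas = [
        local_steps(global_theta, target, inner_lr, inner_steps)
        for target in targets
    ]
    # Outer gradient: how far each worker moved away from the shared parameters.
    outer_grads = [global_theta - theta for theta in local_thetas]
    avg_outer_grad = sum(outer_grads) / len(outer_grads)
    # Assumed outer optimizer: plain SGD on the averaged outer gradient.
    return global_theta - outer_lr * avg_outer_grad


theta = 0.0
for _ in range(20):
    # Two workers whose local data pull towards 1.0 and 3.0 respectively.
    theta = diloco_round(theta, targets=[1.0, 3.0],
                         inner_lr=0.1, inner_steps=5, outer_lr=1.0)
print(theta)
```

With an outer learning rate of 1.0 and plain SGD, a round reduces to averaging the workers' parameters. Communication happens once per round rather than once per inner step, which is why slower network links can be tolerated.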
Training Techniques
- checkpointing - Demonstrates optimizer and learning-rate scheduler checkpoint save/restore, automatic checkpoint discovery, and robust resume logic across multiple training sessions.
- grad_accumulation - Validates that gradient accumulation produces training curves equivalent to those obtained with proportionally larger batch sizes across the standard, pipeline parallel, and Accelerate DDP trainers.
- peak_memory - Systematic comparison of memory optimization techniques on a 1.6B parameter model: bfloat16, activation checkpointing, torch.compile, fused optimizer kernels, and combinations thereof.
- optimizers - Benchmarks ten optimizer configurations (AdamW, SGD variants, AMP, pure bfloat16, gradient accumulation) on a 30M parameter Llama model, examining whether adaptive optimizers are necessary for small-batch LLM pre-training.
- sinkgd - Hyperparameter sweep for the Sinkhorn GD optimizer, a rounding-aware optimizer that applies doubly-stochastic normalization to weight matrices.
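The equivalence that grad_accumulation checks can be illustrated with a small worked example: for a mean-reduced loss and equal-sized micro-batches, summing micro-batch gradients scaled by 1/N gives the same gradient as one batch N times larger. The sketch below uses a hypothetical one-parameter linear model and made-up data. It is not the project's test code.

```python
def mse_grad(w, batch):
    """Gradient w.r.t. w of the mean over the batch of 0.5 * (w * x - y) ** 2."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical (x, y) pairs and starting weight.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
w = 0.5

# One large batch of 4 examples.
full_batch_grad = mse_grad(w, data)

# The same examples as 2 micro-batches of 2, accumulated before the update.
micro_batches = [data[0:2], data[2:4]]
accumulated_grad = 0.0
for micro_batch in micro_batches:
    # Scale each micro-batch gradient so the sum is a mean over all examples.
    accumulated_grad += mse_grad(w, micro_batch) / len(micro_batches)

print(full_batch_grad, accumulated_grad)
```

The match is exact only when micro-batches are the same size and the loss is averaged per example. With variable-length sequences or token-level averaging, the scaling needs more care, which is one reason to test the equivalence empirically.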
Scaffold
- tiny_experiment - Minimal project that exercises the default projects/tiny.yaml base template. Used to validate that the base project templates work correctly and as a starting point for new ablations.