Tiny Models

Train and compare small language model architectures on the TinyStories dataset using a shared 2K-token BPE tokenizer. Every model has roughly 4M parameters, so architecture choices can be compared directly under identical conditions.

Configurations

Forgather model implementations:

| Config | Architecture | Key features |
|---|---|---|
| `tiny_causal.yaml` | Vanilla Transformer | Basic decoder-only transformer, loosely based on "Attention is All You Need" |
| `tiny_fg_llama.yaml` | Llama | Pre-layer-norm, RoPE, SiLU activation, GQA |
| `tiny_deepone.yaml` | DeepOne | Post-layer-norm, Deepnet initialization, ALiBi positional encoding |
| `tiny_fg_mistral.yaml` | Mistral | Llama variant with sliding-window attention and GQA |
| `tiny_fg_qwen3.yaml` | Qwen3 | Qwen3 architecture |
| `tiny_llama_canon.yaml` | Llama + Canon | Llama with Canon convolutional layers for local token mixing |
| `tiny_single_head.yaml` | Single-Head ALiBi | Single attention head with ALiBi, using eager attention |

HuggingFace model implementations (for comparison with Forgather equivalents):

| Config | Architecture | Notes |
|---|---|---|
| `tiny_hf_llama.yaml` | HF LlamaForCausalLM | Same architecture as `tiny_fg_llama`, HF implementation |
| `tiny_hf_gpt2.yaml` | HF GPT2LMHeadModel | GPT-2 architecture |

Usage

# Train a single model
forgather -t tiny_fg_llama.yaml train

# Train all models for comparison
for cfg in $(forgather ls | grep -oP '\S+\.yaml'); do
    forgather -t "$cfg" train
done

# Compare loss curves
forgather logs plot --compare output_models/*/runs/*/trainer_logs.json --loss-curves
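
When training every config in one go, it can help to keep going after a failed run and get a summary at the end. The following is a minimal sketch of that idea, not part of Forgather: `train_all` and the `FORGATHER` override variable are hypothetical names introduced here, and the only Forgather command it relies on is the `forgather -t <config> train` invocation shown above.

```shell
# Hypothetical helper (not part of Forgather): train every config named on
# stdin, one per line, and report which runs failed instead of stopping early.
# Set FORGATHER to override the executable (assumed to default to "forgather").
train_all() {
    cmd="${FORGATHER:-forgather}"
    failed=""
    while IFS= read -r cfg; do
        [ -n "$cfg" ] || continue    # skip blank lines
        echo "=== training $cfg ==="
        # </dev/null keeps the trainer from consuming the config list on stdin
        if ! "$cmd" -t "$cfg" train </dev/null; then
            failed="$failed $cfg"
        fi
    done
    if [ -n "$failed" ]; then
        echo "failed:$failed" >&2
        return 1
    fi
    return 0
}

# Usage, with the same config discovery as the loop above:
#   forgather ls | grep -oP '\S+\.yaml' | train_all
```

The function returns a non-zero status if any run failed, so it can be chained with the plotting command only when every model trained successfully.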

Adding a new model

1. Create a model project under `examples/models/` (or use an existing one).
2. Add a config in `templates/configs/` that sets `ns.model_project_dir` and `ns.model_project_config` to point to your model.
3. Run `forgather ls` to verify it parses correctly.