Canon Layer Experiments
Experiments with Canon layers from "Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers" (Allen-Zhu, 2025).
Canon layers are small depthwise causal 1D convolutions (kernel_size=4) inserted at four positions in each transformer block (A: pre-attention, B: on QKV, C: pre-FFN, D: on gate/up). They provide local token mixing with minimal overhead.
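To make "depthwise causal 1D convolution" concrete, here is a minimal pure-Python sketch. The function name `canon_conv`, the list-of-lists layout, and the optional residual connection are assumptions of this sketch, not the project's implementation (which this page describes as using a Triton kernel). Each channel has its own 4-tap kernel, and position `t` only sees positions `t`, `t-1`, `t-2`, and `t-3`.

```python
def canon_conv(x, weights, residual=True):
    """Depthwise causal 1D convolution over a sequence.

    x:       list of T token vectors, each a list of C floats.
    weights: list of C kernels; weights[c][k] multiplies x[t - k][c],
             so tap 0 is the current token and tap 3 is three tokens back.
    residual: add the input back to the output (an assumption of this sketch).
    """
    T = len(x)
    C = len(x[0]) if T else 0
    out = []
    for t in range(T):
        row = []
        for c in range(C):
            acc = 0.0
            for k, w in enumerate(weights[c]):
                if t - k >= 0:  # causal: positions before the sequence start count as zero
                    acc += w * x[t - k][c]
            row.append(acc + (x[t][c] if residual else 0.0))
        out.append(row)
    return out


# Two channels, three tokens, kernel_size=4 as in the experiments.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
kernels = [[1.0, 0.5, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
mixed = canon_conv(tokens, kernels)
```

Because the kernel is depthwise, channels never mix with each other; only neighboring tokens within the same channel do. This is why the parameter cost is small: one 4-tap kernel per channel.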
Questions
- How much does RoPE contribute when Canon layers provide local token mixing?
- Does Canon compensate for the absence of positional encoding?
- Which subset of Canon positions (A, B, C, D) gives the best quality/speed tradeoff?
- Can Canon-A alone replace RoPE while maintaining quality?
Results
All models have 4M parameters and are trained for 1 epoch on TinyStories (abridged) with batch_size=32, max_seq_len=512, and AdamW at lr=1e-3 with the InfiniteLR scheduler.
Positional Encoding Ablation
| Model | Eval Loss | Tok/s | Memory | Notes |
|---|---|---|---|---|
| Llama 4M | 1.3011 | 291K | 1.50 GiB | Reference |
| Canon ABCD (RoPE) | 1.2256 | 227K | 2.12 GiB | Full Canon baseline |
| Canon ABCD (NoPE) | 1.2379 | 229K | 2.11 GiB | Canon, no RoPE |
| Canon-A (NoPE) | 1.3120 | 323K | 1.54 GiB | Minimal Canon, no RoPE |
| Llama NoPE | 1.3667 | 348K | 1.49 GiB | No Canon, no RoPE |
Removing RoPE from full Canon raises eval loss by 1.0% (1.2256 to 1.2379). Removing RoPE from Llama raises it by 5.0% (1.3011 to 1.3667). Canon layers largely compensate for the lack of positional encoding.
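The percentages here are relative increases in eval loss. A short check, using only the values from the table above (the helper name `rope_penalty_pct` is just for illustration):

```python
def rope_penalty_pct(loss_with_rope, loss_without_rope):
    """Relative eval-loss increase (in percent) from removing RoPE."""
    return (loss_without_rope - loss_with_rope) / loss_with_rope * 100


canon_penalty = rope_penalty_pct(1.2256, 1.2379)  # Canon ABCD: RoPE vs NoPE
llama_penalty = rope_penalty_pct(1.3011, 1.3667)  # Llama 4M:   RoPE vs NoPE

print(f"Canon ABCD: +{canon_penalty:.1f}%  Llama: +{llama_penalty:.1f}%")
```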
Canon-A NoPE is particularly interesting: at 323K tok/s it is 11% faster than Llama+RoPE, with only 0.8% worse eval loss (1.3120 vs 1.3011). The single pre-attention convolution recovers most of the benefit that RoPE provides, at a fraction of the cost. At larger model sizes this tradeoff could become even more favorable if RoPE's complex-valued computation and memory footprint grow in significance, but these 4M-parameter runs do not test that.
Canon Position Ablation
| Model | Positions | Eval Loss | Tok/s | Memory | Tok/s and memory vs Llama |
|---|---|---|---|---|---|
| Llama 4M | -- | 1.3011 | 291K | 1.50 GiB | -- |
| Canon-A | A | 1.2643 | 316K | 1.56 GiB | +9% tok/s, +4% mem |
| Canon-A NoPE | A (no RoPE) | 1.3120 | 323K | 1.54 GiB | +11% tok/s, +3% mem |
| Canon-AB | A, B | 1.2601 | 262K | 1.73 GiB | -10% tok/s, +15% mem |
| Canon-AC | A, C | 1.2420 | 296K | 1.62 GiB | +2% tok/s, +8% mem |
| Canon-B | B | 1.2149 | 278K | 1.67 GiB | -4% tok/s, +11% mem |
| Canon-BD | B, D | 1.2557 | 250K | 2.01 GiB | -14% tok/s, +34% mem |
| Canon-ABCD | A, B, C, D | 1.2256 | 227K | 2.12 GiB | -22% tok/s, +41% mem |
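The last column of the table is the relative change in throughput and peak memory against the Llama 4M row. A small sketch that reproduces it from the raw columns (the names `LLAMA`, `RUNS`, and `vs_llama` are illustrative only):

```python
LLAMA = (291, 1.50)  # (K tok/s, GiB) for the Llama 4M reference row

RUNS = {
    "Canon-A": (316, 1.56),
    "Canon-A NoPE": (323, 1.54),
    "Canon-AB": (262, 1.73),
    "Canon-AC": (296, 1.62),
    "Canon-B": (278, 1.67),
    "Canon-BD": (250, 2.01),
    "Canon-ABCD": (227, 2.12),
}


def vs_llama(tok_s, mem_gib):
    """Return (throughput change %, memory change %) relative to Llama 4M."""
    speed_pct = round((tok_s / LLAMA[0] - 1) * 100)
    mem_pct = round((mem_gib / LLAMA[1] - 1) * 100)
    return speed_pct, mem_pct


for name, (tok_s, mem) in RUNS.items():
    speed_pct, mem_pct = vs_llama(tok_s, mem)
    print(f"{name:13s} {speed_pct:+d}% tok/s, {mem_pct:+d}% mem")
```

A positive throughput number means the variant is faster than Llama; a positive memory number means it uses more memory.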
Analysis
Best quality: Canon-B alone achieves the lowest eval loss (1.2149), even beating full ABCD (1.2256). This is unexpected and suggests that at this model size, the QKV convolution is the single most impactful position.
Fastest with meaningful gains: Canon-A alone runs at 316K tok/s -- 9% faster than plain Llama (291K) despite adding a convolution per layer. The speedup likely comes from the Triton kernel being faster than the standard module overhead it replaces in the code path. Canon-A achieves 2.8% better eval loss (1.2643 vs 1.3011) with no throughput penalty and about 4% more memory.
Canon-A and induction heads: Canon-A sits before attention, mixing adjacent token representations. This is the pattern that induction heads rely on: because the convolution has already copied previous-token information forward into each position, an attention head can match the current token against those keys directly, which should make sequence copying easier to learn. The improvement from Canon-A alone, both with and without RoPE, fits this hypothesis, although these runs do not test it directly.
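The matching rule behind this hypothesis can be illustrated with a toy, non-neural sketch. It assumes that after Canon-A each position's key carries the identity of the previous token, so a query for the current token can find the position that followed an earlier occurrence and copy what appeared there. The function `induction_predict` is purely illustrative and is not part of the experiment code.

```python
def induction_predict(tokens):
    """Toy induction rule: predict the token that followed the most recent
    earlier occurrence of the current token.

    Position j is treated as having key tokens[j - 1] (the previous token,
    as if copied forward by a local convolution) and value tokens[j].
    """
    predictions = []
    for i, current in enumerate(tokens):
        guess = None
        for j in range(i, 0, -1):          # causal: only positions j <= i
            if tokens[j - 1] == current:   # key at j matches the query token
                guess = tokens[j]          # copy the value stored at j
                break
        predictions.append(guess)
    return predictions


# In "a b c a b", seeing the second "a" retrieves "b", then "b" retrieves "c".
print(induction_predict(list("abcab")))
```

In a real transformer the match is soft and learned; the point of the sketch is only that the key must contain previous-token information, which a pre-attention convolution supplies within a single layer.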
Canon-A as RoPE replacement: Canon-A NoPE (1.3120) nearly matches Llama+RoPE (1.3011) -- just 0.8% worse -- while running 11% faster. The pre-attention convolution provides enough local positional signal that explicit positional encoding becomes nearly redundant. This may matter more for larger models if RoPE's per-head rotation (O(d_head) work per token) becomes a meaningful cost, which these 4M-parameter runs cannot confirm.
Best quality/speed tradeoff: Canon-AC runs at 296K tok/s (2% faster than Llama) while achieving 4.5% better eval loss (1.2420 vs 1.3011).
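One way to examine the quality/speed tradeoff is to list the runs that are not dominated, meaning no other run is both at least as fast and at least as accurate. The sketch below uses the eval loss and throughput from both results tables; it ignores memory, and the names `RUNS` and `pareto_front` are illustrative.

```python
# (eval loss, K tok/s) taken from the two results tables.
RUNS = {
    "Llama 4M": (1.3011, 291),
    "Llama NoPE": (1.3667, 348),
    "Canon-A": (1.2643, 316),
    "Canon-A NoPE": (1.3120, 323),
    "Canon-AB": (1.2601, 262),
    "Canon-AC": (1.2420, 296),
    "Canon-B": (1.2149, 278),
    "Canon-BD": (1.2557, 250),
    "Canon-ABCD": (1.2256, 227),
    "Canon-ABCD NoPE": (1.2379, 229),
}


def pareto_front(runs):
    """Return run names for which no other run has lower-or-equal loss AND
    higher-or-equal throughput (with at least one strictly better)."""
    front = []
    for name, (loss, speed) in runs.items():
        dominated = any(
            o_loss <= loss and o_speed >= speed
            and (o_loss < loss or o_speed > speed)
            for other, (o_loss, o_speed) in runs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(RUNS))
```

Canon-AC is on this front along with Canon-A, Canon-A NoPE, Canon-B, and Llama NoPE; which point is "best" depends on how much loss one is willing to trade for throughput.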
Canon-B surprise: Canon-B at 278K tok/s delivers the best eval loss (1.2149) at moderate overhead (4% lower throughput and 11% more memory than Llama). This suggests the QKV convolution provides benefits that pre-attention mixing alone does not.
Combining A+B hurts: Canon-AB (1.2601) is worse than B alone (1.2149) and barely better than A alone (1.2643). The pre-attention convolution (A) and QKV convolution (B) may interfere -- A mixes tokens before attention, then B mixes the already-mixed representations again, potentially blurring the signal. Canon-B works best in isolation at this scale.
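The "blurring" intuition can be made slightly more concrete. Treating A and B as plain linear filters along the time axis, and ignoring the per-token QKV projection between them and any residual paths (both simplifying assumptions of this sketch), stacking two 4-tap causal kernels is equivalent to one wider kernel:

```python
def compose_causal_kernels(k1, k2):
    """Convolve two causal kernels; index i is the weight on the token i steps back."""
    out = [0.0] * (len(k1) + len(k2) - 1)
    for i, a in enumerate(k1):
        for j, b in enumerate(k2):
            out[i + j] += a * b
    return out


uniform = [0.25, 0.25, 0.25, 0.25]           # a hypothetical 4-tap averaging kernel
stacked = compose_causal_kernels(uniform, uniform)

print(len(stacked))  # effective window: 4 + 4 - 1 taps
print(stacked)       # triangular weights peaking three tokens back
```

With these illustrative uniform weights, the combined filter spans 7 tokens and puts its largest weight on the token three steps back rather than on the current one. Learned kernels need not behave this way, so this only shows how double mixing can smear local information, not that it does so in these runs.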
Canon-D adds overhead without benefit: Comparing B-only (1.2149) vs BD (1.2557), adding Canon-D actually hurts quality while costing 10% throughput. The feedforward gate/up convolutions may interfere at this model size.
Configurations
Model configs (in models/ sub-project):
- baseline.yaml -- Canon ABCD 4M (full Canon, identical to llama_canon 4M)
- nope.yaml -- Canon ABCD 4M without RoPE
- canon_a.yaml -- Canon-A: only pre-attention convolution
- canon_a_nope.yaml -- Canon-A without RoPE
- canon_ab.yaml -- Canon-AB: pre-attention (A) and attn QKV (B)
- canon_ac.yaml -- Canon-AC: pre-attention (A) and pre-FFN (C)
- canon_b.yaml -- Canon-B: only attention QKV convolution
- canon_bd.yaml -- Canon-BD: attention QKV (B) and FFN gate/up (D)
Model configs (in llama_nope/ sub-project):
- nope_4M.yaml -- Plain Llama 4M without RoPE
Training configs (prefix train_):
- train_baseline.yaml, train_nope.yaml, train_llama_nope.yaml
- train_canon_a.yaml, train_canon_a_nope.yaml
- train_canon_ab.yaml, train_canon_ac.yaml
- train_canon_b.yaml, train_canon_bd.yaml
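For quick reference, the listing above can be summarized as a lookup table. The pairing of each training config with a model config is inferred from the file names and is an assumption, as are the `Variant` class and its field names; none of this reflects the actual YAML schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    subproject: str    # sub-project that holds the model config
    model_config: str  # model config file name
    positions: str     # enabled Canon positions ("" means no Canon layers)
    rope: bool         # whether RoPE is enabled


# Keys are training configs; the train -> model pairing is inferred from names.
VARIANTS = {
    "train_baseline.yaml": Variant("models", "baseline.yaml", "ABCD", True),
    "train_nope.yaml": Variant("models", "nope.yaml", "ABCD", False),
    "train_llama_nope.yaml": Variant("llama_nope", "nope_4M.yaml", "", False),
    "train_canon_a.yaml": Variant("models", "canon_a.yaml", "A", True),
    "train_canon_a_nope.yaml": Variant("models", "canon_a_nope.yaml", "A", False),
    "train_canon_ab.yaml": Variant("models", "canon_ab.yaml", "AB", True),
    "train_canon_ac.yaml": Variant("models", "canon_ac.yaml", "AC", True),
    "train_canon_b.yaml": Variant("models", "canon_b.yaml", "B", True),
    "train_canon_bd.yaml": Variant("models", "canon_bd.yaml", "BD", True),
}


def without_rope(variants):
    """Names of training configs whose model has no positional encoding."""
    return sorted(name for name, v in variants.items() if not v.rope)


print(without_rope(VARIANTS))
```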
Project Structure
examples/tiny_experiments/canon/    # Main experiment project
    meta.yaml                       # Training project metadata
    templates/
        project.yaml                # Training base (extends projects/tiny.yaml)
        configs/
            train_*.yaml            # Training configs
    models/                         # Model definitions sub-project
        meta.yaml                   # Adds llama_canon templates to search path
        templates/configs/
            baseline.yaml           # Model: Canon ABCD (extends llama_canon 4M)
            nope.yaml               # Model: Canon ABCD, no RoPE
            canon_a.yaml            # Model: Canon A only
            canon_a_nope.yaml       # Model: Canon A only, no RoPE
            canon_ab.yaml           # Model: Canon A+B
            canon_ac.yaml           # Model: Canon A+C
            canon_b.yaml            # Model: Canon B only
            canon_bd.yaml           # Model: Canon B+D
    llama_nope/                     # Companion model sub-project
        meta.yaml                   # Adds llama templates to search path
        templates/configs/
            nope_4M.yaml            # Model: Llama 4M without RoPE