Canon Layer Experiments

Experiments with Canon layers from "Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers" (Allen-Zhu, 2025).

Canon layers are small depthwise causal 1D convolutions (kernel_size=4) inserted at up to four positions in each transformer block (A: pre-attention, B: on the attention QKV projections, C: pre-FFN, D: on the FFN gate/up projections). They provide local token mixing with little overhead.
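To make "depthwise causal 1D convolution" concrete, here is a minimal dependency-free sketch. It is not the implementation used in these experiments (the analysis below mentions a Triton kernel). The function name `canon_conv`, the weight layout, and the optional residual connection are assumptions made for illustration.

```python
from typing import List

Tokens = List[List[float]]  # shape [seq_len][channels]


def canon_conv(x: Tokens, weights: List[List[float]], residual: bool = True) -> Tokens:
    """Depthwise causal 1D convolution over the sequence axis.

    weights[k][c] scales channel c of the token k steps in the past
    (k = 0 is the current token). Positions before the start of the
    sequence are treated as zeros, i.e. implicit left padding.
    """
    kernel_size = len(weights)
    channels = len(x[0]) if x else 0
    out: Tokens = []
    for t in range(len(x)):
        row = []
        for c in range(channels):  # depthwise: channels never mix
            acc = 0.0
            for k in range(kernel_size):
                if t - k >= 0:  # causal: never read future tokens
                    acc += weights[k][c] * x[t - k][c]
            # Assumed residual connection: add the unmixed input back.
            row.append(x[t][c] + acc if residual else acc)
        out.append(row)
    return out


# Toy input: 5 tokens, 2 channels, uniform kernel with kernel_size=4.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
w = [[0.25, 0.25] for _ in range(4)]
y = canon_conv(x, w, residual=False)
print(y)
```

With the uniform kernel, each output is a trailing average over at most four tokens, so token t only sees tokens t-3..t. That is the local mixing referred to throughout this page.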

Questions

How much does RoPE contribute when Canon layers provide local token mixing?
Does Canon compensate for the absence of positional encoding?
Which subset of Canon positions (A, B, C, D) gives the best quality/speed tradeoff?
Can Canon-A alone replace RoPE while maintaining quality?

Results

All models have 4M parameters and are trained for 1 epoch on TinyStories (abridged) with batch_size=32, max_seq_len=512, and AdamW (lr=1e-3) with the InfiniteLR scheduler.

Positional Encoding Ablation

Model	Eval Loss	Tok/s	Memory	Notes
Llama 4M	1.3011	291K	1.50 GiB	Reference
Canon ABCD (RoPE)	1.2256	227K	2.12 GiB	Full Canon baseline
Canon ABCD (NoPE)	1.2379	229K	2.11 GiB	Canon, no RoPE
Canon-A (NoPE)	1.3120	323K	1.54 GiB	Minimal Canon, no RoPE
Llama NoPE	1.3667	348K	1.49 GiB	No Canon, no RoPE

Removing RoPE from full Canon raises eval loss by 1.0% (1.2256 to 1.2379). Removing RoPE from Llama raises it by 5.0% (1.3011 to 1.3667). At this scale, Canon layers largely compensate for the lack of positional encoding.

Canon-A NoPE is particularly interesting: at 323K tok/s it is 11% faster than Llama+RoPE, with only 0.8% worse eval loss (1.3120 vs 1.3011). The single pre-attention convolution recovers most of the eval-loss gap between Llama NoPE (1.3667) and Llama+RoPE (1.3011), at a fraction of the throughput cost. This tradeoff may become more favorable at larger model sizes if RoPE's computation and memory footprint become significant, but that was not tested here.
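The percentages quoted above are relative changes in eval loss. The short sketch below recomputes them from the table values. The helper name `rel_change_pct` is illustrative and not part of the project.

```python
def rel_change_pct(new: float, ref: float) -> float:
    """Relative change of `new` versus `ref`, in percent."""
    return (new - ref) / ref * 100.0


# Eval losses copied from the positional encoding ablation table.
eval_loss = {
    "Llama 4M": 1.3011,
    "Canon ABCD (RoPE)": 1.2256,
    "Canon ABCD (NoPE)": 1.2379,
    "Canon-A (NoPE)": 1.3120,
    "Llama NoPE": 1.3667,
}

# (model, reference) pairs discussed in the text.
comparisons = [
    ("Canon ABCD (NoPE)", "Canon ABCD (RoPE)"),
    ("Llama NoPE", "Llama 4M"),
    ("Canon-A (NoPE)", "Llama 4M"),
]

deltas = {
    model: round(rel_change_pct(eval_loss[model], eval_loss[ref]), 1)
    for model, ref in comparisons
}
for model, ref in comparisons:
    print(f"{model} vs {ref}: {deltas[model]:+.1f}% eval loss")
```

Each number comes from a single training run, so differences of a fraction of a percent should be read with that in mind.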

Canon Position Ablation

Model	Positions	Eval Loss	Tok/s	Memory	Throughput / memory vs Llama
Llama 4M	--	1.3011	291K	1.50 GiB	--
Canon-A	A	1.2643	316K	1.56 GiB	9% faster, +4% mem
Canon-A NoPE	A (no RoPE)	1.3120	323K	1.54 GiB	11% faster, +3% mem
Canon-AB	A, B	1.2601	262K	1.73 GiB	10% slower, +15% mem
Canon-AC	A, C	1.2420	296K	1.62 GiB	2% faster, +8% mem
Canon-B	B	1.2149	278K	1.67 GiB	4% slower, +11% mem
Canon-BD	B, D	1.2557	250K	2.01 GiB	14% slower, +34% mem
Canon-ABCD	A, B, C, D	1.2256	227K	2.12 GiB	22% slower, +41% mem
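The relative throughput and memory figures in the last column follow directly from the Tok/s and Memory columns. As a sketch of that arithmetic (the helper name `pct_vs_baseline` is illustrative):

```python
# (tok/s in thousands, peak memory in GiB), copied from the table above.
LLAMA = (291, 1.50)
runs = {
    "Canon-A": (316, 1.56),
    "Canon-A NoPE": (323, 1.54),
    "Canon-AB": (262, 1.73),
    "Canon-AC": (296, 1.62),
    "Canon-B": (278, 1.67),
    "Canon-BD": (250, 2.01),
    "Canon-ABCD": (227, 2.12),
}


def pct_vs_baseline(value: float, baseline: float) -> int:
    """Signed relative difference to the baseline, rounded to whole percent."""
    return round((value / baseline - 1.0) * 100.0)


# Positive throughput means faster than Llama; positive memory means more memory.
overhead = {
    name: (pct_vs_baseline(tps, LLAMA[0]), pct_vs_baseline(mem, LLAMA[1]))
    for name, (tps, mem) in runs.items()
}
for name, (tps_pct, mem_pct) in overhead.items():
    print(f"{name}: {tps_pct:+d}% throughput, {mem_pct:+d}% memory")
```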

Analysis

Best quality: Canon-B alone achieves the lowest eval loss (1.2149), even beating full ABCD (1.2256). This is unexpected and suggests that at this model size, the QKV convolution is the single most impactful position.

Fastest with meaningful gains: Canon-A alone runs at 316K tok/s -- 9% faster than plain Llama (291K) despite adding a convolution per layer. The speedup likely comes from the Triton kernel being faster than the standard module overhead it replaces in the code path. Canon-A achieves 2.8% better eval loss (1.2643 vs 1.3011) at no throughput cost and about 4% more memory.

Canon-A and induction heads: Canon-A sits before attention, mixing adjacent token representations. This is the kind of pattern that could facilitate induction head formation: an attention head can match keys that the convolution has copied forward from the previous token, enabling better sequence copying. The improvement from Canon-A alone is consistent with this hypothesis, although these experiments do not test it directly.

Canon-A as RoPE replacement: Canon-A NoPE (1.3120) nearly matches Llama+RoPE (1.3011) -- just 0.8% worse -- while running 11% faster. The pre-attention convolution provides enough local positional signal that explicit positional encoding becomes nearly redundant at this scale. This could be promising for larger models if RoPE's per-head rotation cost becomes significant, but that was not measured here.

Best quality/speed tradeoff: Canon-AC runs at 296K tok/s (2% faster than Llama) while achieving 4.5% better eval loss (1.2420 vs 1.3011).

Canon-B surprise: Canon-B at 278K tok/s delivers the best eval loss (1.2149) at moderate overhead (4% slower and 11% more memory than Llama). This suggests the QKV convolution provides complementary benefits beyond what pre-attention mixing achieves.

Combining A+B hurts: Canon-AB (1.2601) is worse than B alone (1.2149) and barely better than A alone (1.2643). The pre-attention convolution (A) and QKV convolution (B) may interfere -- A mixes tokens before attention, then B mixes the already-mixed representations again, potentially blurring the signal. Canon-B works best in isolation at this scale.

Canon-D adds overhead without benefit: Comparing B-only (1.2149) vs BD (1.2557), adding Canon-D actually hurts quality while costing 10% throughput. The feedforward gate/up convolutions may interfere at this model size.
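One way to summarize the quality/speed observations above is to check which configurations are Pareto-optimal in eval loss (lower is better) and throughput (higher is better). The sketch below uses only the numbers from the two tables. It ignores memory, and since each number comes from a single run, small differences may not be significant. The function names are illustrative.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, int]  # (eval loss, tok/s in thousands)

# Values copied from the two results tables.
runs: Dict[str, Point] = {
    "Llama 4M": (1.3011, 291),
    "Llama NoPE": (1.3667, 348),
    "Canon-A": (1.2643, 316),
    "Canon-A NoPE": (1.3120, 323),
    "Canon-AB": (1.2601, 262),
    "Canon-AC": (1.2420, 296),
    "Canon-B": (1.2149, 278),
    "Canon-BD": (1.2557, 250),
    "Canon-ABCD": (1.2256, 227),
    "Canon-ABCD NoPE": (1.2379, 229),
}


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    (loss_a, tps_a), (loss_b, tps_b) = a, b
    return loss_a <= loss_b and tps_a >= tps_b and (loss_a < loss_b or tps_a > tps_b)


def pareto_front(points: Dict[str, Point]) -> List[str]:
    """Names of non-dominated runs, ordered from lowest to highest eval loss."""
    front = [
        name
        for name, point in points.items()
        if not any(dominates(other, point) for other in points.values())
    ]
    return sorted(front, key=lambda name: points[name][0])


front = pareto_front(runs)
print(front)
```

Under these assumptions, Canon-B, Canon-AC, Canon-A, Canon-A NoPE, and Llama NoPE are non-dominated, while plain Llama, Canon-AB, Canon-BD, and both ABCD variants are each matched or beaten on both axes by some other configuration.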

Configurations

Model configs (in models/ sub-project):
- baseline.yaml -- Canon ABCD 4M (full Canon, identical to llama_canon 4M)
- nope.yaml -- Canon ABCD 4M without RoPE
- canon_a.yaml -- Canon-A: only pre-attention convolution
- canon_a_nope.yaml -- Canon-A without RoPE
- canon_ab.yaml -- Canon-AB: pre-attention (A) and attn QKV (B)
- canon_ac.yaml -- Canon-AC: pre-attention (A) and pre-FFN (C)
- canon_b.yaml -- Canon-B: only attention QKV convolution
- canon_bd.yaml -- Canon-BD: attention QKV (B) and FFN gate/up (D)

Model configs (in llama_nope/ sub-project):
- nope_4M.yaml -- Plain Llama 4M without RoPE

Training configs (prefix train_):
- train_baseline.yaml, train_nope.yaml, train_llama_nope.yaml
- train_canon_a.yaml, train_canon_a_nope.yaml
- train_canon_ab.yaml, train_canon_ac.yaml
- train_canon_b.yaml, train_canon_bd.yaml

Project Structure

examples/tiny_experiments/canon/        # Main experiment project
  meta.yaml                             # Training project metadata
  templates/
    project.yaml                        # Training base (extends projects/tiny.yaml)
    configs/
      train_*.yaml                      # Training configs

  models/                               # Model definitions sub-project
    meta.yaml                           # Adds llama_canon templates to search path
    templates/configs/
      baseline.yaml                     # Model: Canon ABCD (extends llama_canon 4M)
      nope.yaml                         # Model: Canon ABCD, no RoPE
      canon_a.yaml                      # Model: Canon A only
      canon_a_nope.yaml                 # Model: Canon A only, no RoPE
      canon_ab.yaml                     # Model: Canon A+B
      canon_ac.yaml                     # Model: Canon A+C
      canon_b.yaml                      # Model: Canon B only
      canon_bd.yaml                     # Model: Canon B+D

  llama_nope/                           # Companion model sub-project
    meta.yaml                           # Adds llama templates to search path
    templates/configs/
      nope_4M.yaml                      # Model: Llama 4M without RoPE