Custom Canon (Canon-A + NoPE + QK-Norm)

A custom-model sub-project for the small-llm pretraining example. It defines a single architecture variant - "Canon-A only, no positional encoding, with QK-Norm" - sized to match the project's default Llama medium so the two can be compared head-to-head on identical training runs.

This sub-project is the model half of the example pattern in the parent project: training projects can point at it via ns.model_project_dir / ns.model_project_config (see the parent's canon.yaml).

What "Canon" means here

Canon layers are small depthwise causal 1D convolutions (kernel=4) inserted inside each transformer block to provide cheap, local token mixing. The layers and their insertion points come from Allen-Zhu's Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers (arXiv:2512.17351, 2025), and the four positions defined there are:

Position	Where it sits	Tensor dim
A	after input LayerNorm, before attention	`hidden_size`
B	on the QKV projections, before attention compute	`(n_heads + 2·n_kv_heads) · d_head`
C	after post-attention LayerNorm, before the MLP	`hidden_size`
D	on the gate/up projections in the SwiGLU MLP	`2 · intermediate_size`
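The "depthwise causal 1D convolution" described above can be sketched in plain Python as follows. This is a minimal illustration rather than the llama_canon implementation: the function name, the list-of-lists layout, and the weight ordering (last tap multiplies the current token) are assumptions made for the example, and it leaves out the residual connection that the canon_residual knob presumably controls.

```python
from typing import List


def causal_depthwise_conv1d(
    x: List[List[float]], weight: List[List[float]]
) -> List[List[float]]:
    """Depthwise causal conv over the sequence axis.

    x:      [seq_len][channels] activations.
    weight: [channels][kernel] taps, one independent filter per channel
            (that is what "depthwise" means: channels are never mixed).
    The output at position t reads only positions t-kernel+1 .. t, so no
    future token leaks in. Positions before the sequence start are
    treated as zeros (left padding).
    """
    seq_len = len(x)
    channels = len(weight)
    kernel = len(weight[0])
    out = [[0.0] * channels for _ in range(seq_len)]
    for t in range(seq_len):
        for c in range(channels):
            acc = 0.0
            for k in range(kernel):
                src = t - (kernel - 1) + k  # k = kernel-1 is the current token
                if src >= 0:
                    acc += weight[c][k] * x[src][c]
            out[t][c] = acc
    return out


# One channel, kernel=4: add the previous token to the current one.
x = [[1.0], [2.0], [3.0], [4.0], [5.0]]
w = [[0.0, 0.0, 1.0, 1.0]]
y = causal_depthwise_conv1d(x, w)
# y == [[1.0], [3.0], [5.0], [7.0], [9.0]]
```

With kernel=4 each output token sees itself and the three tokens before it, which is the "cheap, local token mixing" the Canon layers provide.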

A given model can use any subset of {A, B, C, D}. The reusable implementation lives in examples/models/llama_canon/ - the same Llama backbone with optional Canon layers at each position. This sub-project consumes that model project and selects a specific subset (A only) plus a couple of orthogonal architecture choices.
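To get a feel for what each subset costs, the tensor widths in the table can be turned into a rough parameter count. The sketch below plugs in the shape this sub-project uses (hidden_size = 768, intermediate_size = 2048, 8 heads, 16 layers). It assumes no grouped-query attention (n_kv_heads = n_heads) and bias-free convolutions, neither of which is stated here; the helper names are hypothetical, and the authoritative overhead numbers are the ones in examples/models/llama_canon/.

```python
from typing import Dict, Iterable

KERNEL = 4  # Canon kernel size


def canon_position_widths(
    hidden_size: int, intermediate_size: int, n_heads: int, n_kv_heads: int
) -> Dict[str, int]:
    """Channel count seen by the convolution at each Canon position."""
    d_head = hidden_size // n_heads
    return {
        "A": hidden_size,
        "B": (n_heads + 2 * n_kv_heads) * d_head,
        "C": hidden_size,
        "D": 2 * intermediate_size,
    }


def canon_params_per_layer(
    positions: Iterable[str],
    widths: Dict[str, int],
    kernel: int = KERNEL,
    bias: bool = False,  # assumption: no conv bias
) -> int:
    """A depthwise conv has `kernel` weights (plus optional bias) per channel."""
    per_channel = kernel + (1 if bias else 0)
    return sum(widths[p] * per_channel for p in positions)


N_LAYERS = 16
widths = canon_position_widths(
    hidden_size=768, intermediate_size=2048, n_heads=8, n_kv_heads=8
)
canon_a_total = N_LAYERS * canon_params_per_layer("A", widths)      # 49_152
all_four_total = N_LAYERS * canon_params_per_layer("ABCD", widths)  # 507_904
```

Under these assumptions Canon-A alone adds roughly 49K parameters across the whole model, and even all four positions together stay near half a million.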

What this variant changes from baseline Llama

Concretely, relative to the parent project's default examples/models/llama/medium.yaml:

Canon-A enabled, Canon-B/C/D disabled. The pre-attention depthwise convolution mixes adjacent token representations before attention sees them.
NoPE - no rotary or relative positional encoder. Canon-A's local convolution is expected to provide enough positional signal for attention to recover sequence order on its own.
QK-Norm - RMSNorm applied to Q and K before the attention score computation, Qwen3-style. Not part of Canon proper; included here as a stabiliser that's cheap and tends to help at small scales.
Same backbone shape as the baseline: hidden_size = 768, intermediate_size = 2048, 16 layers, 8 attention heads. Canon-A and QK-Norm add a small number of parameters on top of that shape (examples/models/llama_canon/ lists the Canon overhead), so the eval loss comparison in the parent project is head-to-head at closely matched size and compute.
Same WikiText 32k tokenizer as the baseline.
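The QK-Norm item above can be illustrated for a single query/key pair. This is a sketch under assumptions: it uses one learned RMSNorm weight vector of length d_head for Q and one for K (the Qwen3-style arrangement), an epsilon of 1e-6, and hypothetical function names; the actual module in the model project may differ.

```python
import math
from typing import Sequence, List


def rms_norm(v: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Scale v to unit root-mean-square, then apply a learned per-dim weight."""
    mean_sq = sum(x * x for x in v) / len(v)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * x * scale for x, w in zip(v, weight)]


def qk_norm_score(
    q: Sequence[float],
    k: Sequence[float],
    q_weight: Sequence[float],
    k_weight: Sequence[float],
) -> float:
    """Attention logit for one (query, key) pair with QK-Norm applied.

    Both vectors are normalised over the head dimension *before* the dot
    product, so the logit no longer grows with the raw magnitude of q or k.
    """
    d_head = len(q)
    qn = rms_norm(q, q_weight)
    kn = rms_norm(k, k_weight)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(d_head)


ones = [1.0, 1.0]
base = qk_norm_score([3.0, 4.0], [1.0, 2.0], ones, ones)
scaled = qk_norm_score([300.0, 400.0], [1.0, 2.0], ones, ones)
# base and scaled agree to within the epsilon term: rescaling q leaves the logit unchanged.
```

Bounding the logits this way is the stabilising effect referred to above.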

Why this specific configuration

The choice of "Canon-A only + NoPE" is informed by the more thorough ablation in examples/tiny_experiments/canon/, which sweeps the {A, B, C, D} subset on a 4M Llama trained on TinyStories. Two findings drove the decision here:

Canon-A alone runs faster than baseline Llama, despite adding a convolution per layer. In the 4M ablation, Canon-A clocks 316K tok/s vs 291K for Llama (+9%), with a 2.8% better eval loss (1.264 vs 1.301).
Canon-A nearly replaces RoPE. Canon-A NoPE achieves 1.312 eval vs Llama+RoPE's 1.301 - about 0.8% worse - while running 11% faster (323K vs 291K tok/s). The single pre-attention convolution provides enough local positional signal that explicit positional encoding becomes nearly redundant. At larger model sizes, where RoPE's complex-valued computation and memory footprint grow, this trade is expected to look more favourable.
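The percentages quoted in the two findings follow directly from the raw 4M-ablation numbers. A small check, using only the figures stated above (the dictionary layout and helper name are just for illustration):

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return 100.0 * (new - base) / base


llama = {"tok_s": 291_000, "eval": 1.301}          # Llama + RoPE baseline
canon_a = {"tok_s": 316_000, "eval": 1.264}        # Canon-A (with RoPE)
canon_a_nope = {"tok_s": 323_000, "eval": 1.312}   # Canon-A, no positional encoding

a_speed = round(pct_change(canon_a["tok_s"], llama["tok_s"]), 1)          # 8.6  -> "+9%"
a_eval = round(pct_change(canon_a["eval"], llama["eval"]), 1)             # -2.8 -> "2.8% better"
nope_speed = round(pct_change(canon_a_nope["tok_s"], llama["tok_s"]), 1)  # 11.0 -> "11% faster"
nope_eval = round(pct_change(canon_a_nope["eval"], llama["eval"]), 1)     # 0.8  -> "0.8% worse"
```

For eval loss, lower is better, so a negative change is an improvement and a positive one a regression.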

So the design intent of custom_canon is "minimal Canon, no RoPE, should be at least as fast as the baseline and competitive on eval." The 4M tiny-experiments numbers say this should win on throughput and land within about 1% of the baseline on eval at small scale.

The parent project's empirical result at ~162M scale is more sobering - see below.

Configurations

forgather ls

Config	Purpose
`custom_canon.yaml`	Canon-A + NoPE + QK-Norm, sized to match Llama medium.

Only the one config is shipped. The training entry point is canon.yaml in the parent project, which points its ns.model_project_dir at this sub-project.

Empirical result in the parent project

In the parent project's 1x Chinchilla bundle, this configuration finishes with a best eval loss of 2.678 - the worst of the 1x runs, ~0.06 above high_lr and ~0.03 above the two pure-bf16 runs. Throughput was also lower than the Llama variants (~39 samp/s vs ~45-49 for the rest of the bundle), which contradicts the 4M tiny-experiments expectation that Canon-A alone should be faster than baseline.

Likely contributors to the throughput regression at this scale: the Canon convolution implementation has not been re-tuned for larger models, and the configuration that ran 9-11% faster than Llama at 4M may have lost its edge as hidden_size and sequence length grew. The eval-loss gap is more interesting and needs a proper LR sweep for this variant before drawing any conclusion: the parent project used LRs tuned for the baseline Llama, and Canon-A + NoPE + QK-Norm is a different enough architecture that its optimum may sit elsewhere.
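One simple way to lay out such a sweep is a geometric grid centred on the LR the baseline was tuned for. The sketch below is only an illustration: the helper name is hypothetical, and the centre value is a placeholder rather than the parent project's actual setting.

```python
from typing import List


def lr_grid(center: float, factor: float = 2.0, steps_each_side: int = 2) -> List[float]:
    """Log-spaced learning rates: center * factor**i for i in [-n, ..., n]."""
    return [center * factor**i for i in range(-steps_each_side, steps_each_side + 1)]


PLACEHOLDER_BASELINE_LR = 1e-3  # assumption: substitute the LR used for the Llama baseline
candidates = lr_grid(PLACEHOLDER_BASELINE_LR, factor=2.0, steps_each_side=2)
# Five values spanning a quarter of the centre LR up to four times it.
```

Spacing the candidates multiplicatively rather than additively covers both "the optimum is lower" and "the optimum is higher" with the same number of runs on each side.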

The ablation in tiny_experiments/canon also identifies better Canon variants at 4M scale - notably Canon-B alone (1.215 eval, the lowest of any Canon subset) and Canon-AC (best quality-per-throughput at 1.242 eval and 296K tok/s). Either would be a more natural next experiment at 162M than re-running Canon-A, especially Canon-AC, since A and C both act on hidden_size-wide tensors, the narrowest of the four positions.

Cross-references

Parent project README for the head-to-head 1x Chinchilla results and how canon.yaml plugs this sub-project into training.
examples/tiny_experiments/canon/ for the full Canon position-and-RoPE ablation at 4M scale.
examples/models/llama_canon/ for the underlying model project this one extends, including parameter-overhead numbers and the canon_kernel / canon_residual knobs.
Allen-Zhu, Physics of Language Models: Part 4.1, arXiv:2512.17351, 2025 - the original Canon paper.