H.P. Lovecraft fine-tunes: investigation of the 4K-periodic NLL spike
This document is the investigation log for the 4096-token-periodic per-token
NLL spikes observed in all Lovecraft long-context fine-tunes. See
long_context_experiments.md for the parent experiment.
TL;DR. The spikes were a configuration artefact, not a model phenomenon.
The tutorial's old base_finetune_proj.yaml template declared --window-size
as a CLI flag but its dataset_pp_kwargs block only forwarded chat_template
to the dataset project. The block tokenizer therefore received the default
max_length=4096 regardless of what was passed on the command line. Every
"16K training" was 4K-tokenized data padded to 16K by the collator; the model
learned a 4K block structure and reproduced it as periodic NLL spikes on
continuous-text evaluation. The fix was to migrate to projects/finetune_v2.yaml,
which correctly threads ns.seq_len through datasets_preprocessor_args.max_length.
The surviving reasoning trail below is kept because two earlier root-cause theories (packed-data document boundaries, RoPE resonance) looked plausible enough to be written up before the plumbing gap was found. They are wrong; what ruled them out and what replaced them is described at the bottom.
Observation
All five trained Lovecraft variants (Mistral / Llama × default / YaRN × sliding / noslide) show exactly-4096-token-periodic NLL spikes when evaluated on a continuous long-context held-out story. The spike phase varies by story (4352 in At the Mountains of Madness, 2304 in The Case of Charles Dexter Ward), but within any given story the spacing is always exactly 4096. Peak values are 2-3 NLL higher than the adjacent low, and NLL decays smoothly over ~4K tokens back to the low before the next spike.
The untrained base model does not show this pattern -- it has a smooth NLL with gradual degradation past the pretrain context. So the spike is introduced by fine-tuning.
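The spacing and phase claims can be checked with a few lines of arithmetic. This is a minimal sketch, not the archived analysis script; the peak positions are the baseline values quoted in this log, and the helper names are made up for illustration.

```python
# Peak positions quoted in this log for the theta=10000 baseline run.
peaks = [4352, 8448, 12544]
PERIOD = 4096  # suspected period, in tokens


def spacings(positions):
    """Distance between each pair of consecutive peaks."""
    return [b - a for a, b in zip(positions, positions[1:])]


def phases(positions, period=PERIOD):
    """Offset of each peak inside its period-sized window."""
    return [p % period for p in positions]


print(spacings(peaks))  # [4096, 4096]
print(phases(peaks))    # [256, 256, 256]
```

A constant spacing together with a constant offset is what "periodic with a story-dependent phase" means here: the period is fixed at 4096, and only the offset changes from story to story.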
Refuted hypotheses
Story-content explanation (refuted)
Three spikes spaced exactly 4K apart cannot plausibly be a coincidence of story content. Conversely, the fact that the spike phase depends on the story argues against pure position-locking. Partial evidence: the story text at the spike positions is ordinary prose with no structural markers. Subsequent experiments confirmed the refutation.
Mistral sliding-window hypothesis (refuted)
Initially suspected that Mistral's sliding_window=4096 caused the first spike via the BOS-drops-out-of-window mechanism (Xiao et al. 2023, Efficient Streaming Language Models with Attention Sinks, arxiv:2309.17453). Refuted by two observations:
- Sliding window is continuous past the first transition; it doesn't predict additional spikes at 8K and 12K.
- Investigation of the Forgather code found that the SDPA path was ignoring the sliding_window setting entirely -- causal_mask.py read config.window_size but the field is named sliding_window, so the create_sliding_window_causal_mask branch was never taken. Both "with sliding" and "without sliding" variants were architecturally identical. (Separately fixed.)
RoPE frequency resonance (refuted)
Observed that the geometric distribution of RoPE inv-freqs with
rope_theta=10000, d_head=128 places many dimensions (roughly dims 28-45) at
periods that are integer-fraction harmonics of 4096. Hypothesised that
collective phase alignment at 4K multiples might be driving the spikes.
Refuted by direct experiment. Trained four Mistral variants with
rope_theta ∈ {5000, 20000, 40000, 500000} (spanning a 100× range) and one
Llama-2-7B with rope_type: "llama3". If the hypothesis were correct, spike
spacing should scale with rope_theta^0.703:
| Variant | rope_theta | Predicted dim-45 period | Observed spikes | Observed spacing |
|---|---|---|---|---|
| theta=5000 | 5000 | 2506 | 4224, 8320, 12672 | 4096 |
| theta=10000 (baseline) | 10000 | 4080 | 4352, 8448, 12544 | 4096 |
| theta=20000 | 20000 | 6643 | 4224, 8320, 12672 | 4096 |
| theta=40000 | 40000 | 10814 | 4224, 8320, 12672 | 4096 |
| theta=500000 | 500000 | 63866 (past eval window) | 4224, 8320, 12672 | 4096 |
| llama3_rope | — | — | 4224, 8320, 12416 | 4096 |
Spike spacing is independent of rope_theta across a 100× range. Even at theta=500000, where the predicted dim-45 period (63866) is almost 4× the entire 16K eval window, spikes still land at 4K intervals. RoPE is not the cause.
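The "Predicted dim-45 period" column can be recomputed from the standard RoPE frequency formula. The sketch below assumes "dim 45" means frequency pair index 45 out of 64 for d_head=128, which gives the exponent 2·45/128 ≈ 0.703 used above. The function name is illustrative.

```python
import math


def rope_period(theta: float, pair_index: int, head_dim: int = 128) -> float:
    """Wavelength, in tokens, of one RoPE frequency pair.

    Standard RoPE uses inv_freq_i = theta ** (-2 * i / head_dim), so pair i
    completes one full rotation every 2 * pi / inv_freq_i tokens.
    """
    inv_freq = theta ** (-2.0 * pair_index / head_dim)
    return 2.0 * math.pi / inv_freq


for theta in (5000, 10000, 20000, 40000, 500000):
    print(f"theta={theta:>6}  dim-45 period ~ {rope_period(theta, 45):.0f} tokens")
```

Under that assumption the printed periods should land within a token or so of the table's predictions, growing as rope_theta^0.703 while the observed spacing stays at 4096.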
Packed-data document-boundary distribution (refuted)
A working hypothesis at one point was that the spikes came from the
document-boundary statistics of the packed training data -- the Lovecraft
corpus has a median story length near 4K tokens, so greedy packing into long
blocks would concentrate document_starts near 4K multiples. The model was
assumed to be learning "attention resets every ~4K tokens" from these
boundaries. Supporting evidence at the time: non-packed training (single
document per block, padded) showed a flat NLL profile with no 4K spikes.
This theory is also wrong -- or rather, it was a correct description of the symptom but not the mechanism. The real mechanism is below; the document-boundary statistics of the 4K-tokenized data happened to align with the 4K spike pattern because the blocks were literally 4K tokens long.
Actual cause: the window_size plumbing gap
At this point the eval-side hypotheses had been exhausted; further progress required instrumenting the training side to confirm what the model was actually being trained on.
A single print in block_tokenize_fn settled it:
...despite every training command passing --window-size 16384. The
training command line was accepting the flag without error, and the
preprocessed config was showing window_size: 16384, but the value was never
reaching the tokenizer.
Tracing back through the templates, the problem was in the old
templatelib/finetune/projects/base_finetune_proj.yaml. That template
declared window_size as a dynamic CLI arg, but its [dataset_pp_kwargs]
block only forwarded chat_template to the dataset project. Nothing in the
chain fed ns.window_size into the dataset's block_tokenize_fn call --
so the tokenizer kept its built-in default of max_length=4096.
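The failure mode is easy to reproduce in miniature. The sketch below is not Forgather code: every function and key name in it is a hypothetical stand-in, chosen only to show how an argument that is declared but never forwarded silently falls back to the leaf function's default.

```python
# Hypothetical stand-ins; none of these names are Forgather's real API.
DEFAULT_MAX_LENGTH = 4096


def block_tokenize(text, max_length=DEFAULT_MAX_LENGTH, chat_template=None):
    """Leaf function: report the max_length it actually used."""
    return max_length


def old_template(cli_args):
    # Accepts window_size on the command line but forwards only chat_template,
    # so the leaf never sees the requested length.
    forwarded = {"chat_template": cli_args.get("chat_template")}
    return block_tokenize("...", **forwarded)


def new_template(cli_args):
    # Threads the sequence length through explicitly as max_length.
    forwarded = {
        "chat_template": cli_args.get("chat_template"),
        "max_length": cli_args["seq_len"],
    }
    return block_tokenize("...", **forwarded)


print(old_template({"window_size": 16384}))  # 4096
print(new_template({"seq_len": 16384}))      # 16384
```

Neither call raises an error, which is the point: only the value observed at the leaf distinguishes the two.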
At 16K sequence length, the data collator then padded each real 4K block out
to 16K tokens. Pre-packed data (most of this tutorial's experiments) was
slightly different: 4K-tokenized blocks got concatenated into 16K blocks at
collator time, but each 4K region came from a single block_tokenize_fn
call with max_length=4096, which is exactly what creates the 4K structural
cadence the model eventually learned.
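For the pre-packed case, the resulting boundary positions are simple arithmetic. This sketch assumes blocks are concatenated back to back with no gap; the helper name is illustrative.

```python
def block_starts(seq_len: int, block_len: int) -> list[int]:
    """Offsets inside a seq_len-token sequence where a new tokenizer block begins."""
    return list(range(0, seq_len, block_len))


print(block_starts(16384, 4096))   # [0, 4096, 8192, 12288]
print(block_starts(16384, 16384))  # [0]
```

With max_length=4096, every 16K training sequence carries internal block starts at 4096, 8192 and 12288; with max_length=16384 there are none.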
The newer projects/finetune_v2.yaml + newer dataset API does this
correctly: the training project sets ns.seq_len, which feeds into
[datasets_preprocessor_args] under the max_length key, which the dataset
template unpacks into block_tokenize_fn's max_length kwarg. After
migrating the reference project, instrumenting the tokenizer again shows:
which is what it should have been saying all along.
Why this produces exactly-4K-periodic NLL spikes
With max_length=4096 at tokenization time, every training block is a
4K-token packed bundle of complete-or-truncated documents. Whatever
statistical structure the corpus has -- story boundaries, <BOS> tokens,
long-range coreference decays -- it repeats every 4K tokens across
training. The model learns the joint distribution of "token at position p
within a 4K bundle" rather than the joint distribution over a true 16K context.
At eval on a continuous 16K story, absolute positions 0..4095 look like familiar within-bundle positions and NLL is low. Positions 4096..8191 still look like within-bundle positions to the model's learned weights, but the content continues from the previous 4K smoothly -- which conflicts with the learned expectation that attention structure resets at each 4K boundary. Peak NLL lands a few hundred tokens past each 4K boundary (4352, 8448, 12544) because it takes a little content to wedge against the model's expectation; NLL then decays over the following 4K.
Why the story phase varies
The model's learned expectation is "after N tokens of familiar within-bundle context, something structural happens." The exact NLL peak inside each 4K window is where story content first surprises the learned expectation -- a content-dependent phase, which is why ATMoM peaks at 4352 and Charles Dexter Ward peaks at 2304.
Why non-packed training had no spikes
Non-packed training produces one document per 4K block (padded), so the learned joint is effectively "token position within a single document, documents up to 4K long." There is no periodic reset of content structure at 4K -- the content doesn't continue past the document end, it hits padding. Evaluating on a continuous story doesn't trigger the same expectation-collision.
This is consistent with the earlier observation that non-packed training had a flat NLL profile. That observation was correct; the interpretation (packed document boundaries are the cause) was wrong. The actual cause is the 4K tokenization window, which non-packed training exposes to the model as document-length-truncation rather than as a periodic boundary.
What the parent doc's experiments actually measured
Every result in long_context_experiments.md was produced with
max_length=4096 at tokenization time, not the intended 16K. What was
being compared across variants (YaRN on/off, sliding-window on/off,
rope_theta sweep, context-length sweep) is still a valid comparison of
fine-tunes on 4K-tokenized data evaluated at 16K, because the plumbing
gap affected every variant identically. But the headline numbers should
not be read as "what long-context fine-tuning buys you"; they are "what a
4K fine-tune buys you when evaluated at 16K, with/without each
intervention."
See the parent doc for updated caveats.
Lessons
- Instrument the leaf, not the config. The preprocessed config (forgather pp) looked right -- window_size: 16384 was everywhere -- which made the plumbing gap invisible from the template side. A single print in block_tokenize_fn closed the investigation that multiple months of hypothesis-testing couldn't.
- Template inheritance hides declarations. The old template accepted --window-size without forwarding it, which is worse than rejecting it: the user gets no error, and the config that was "obviously right" silently runs on default data. finetune_v2.yaml narrows the declaration surface and explicitly passes ns.seq_len as max_length in datasets_preprocessor_args, which is harder to forget.
- Believe the unusual but consistent behaviour. Three spikes at exactly 4096-token spacing, independent of rope_theta, context length, model family, and variant, is not a subtle architectural resonance -- it is 4K shouting its existence at the evaluator. The early "4K is weirdly special" framing in the parent doc should have been read as a hint that 4K was literally the block size, not an emergent property.
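One way to make "instrument the leaf" routine is a small decorator that logs the fully resolved arguments, defaults included, the first time a function is called. This is a generic Python sketch, not part of Forgather; the decorated block_tokenize below is a hypothetical stand-in for the real leaf function.

```python
import functools
import inspect
import logging

logger = logging.getLogger("leaf_args")


def log_effective_args(fn):
    """Log the resolved arguments (defaults included) on the first call only."""
    sig = inspect.signature(fn)
    seen = []  # holds the argument dict from the first call

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not seen:
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()  # fill in anything the caller did not pass
            seen.append(dict(bound.arguments))
            logger.warning("%s effective args: %s", fn.__name__, seen[0])
        return fn(*args, **kwargs)

    wrapper.first_call_args = seen
    return wrapper


@log_effective_args
def block_tokenize(text, max_length=4096):  # hypothetical leaf function
    return text[:max_length]


block_tokenize("some text")  # logs max_length=4096 even though it was never passed
```

Because the log line is produced from the bound call and not from the config, it reports what the function received, which is exactly the information the preprocessed config could not provide.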
Infrastructure (archived)
- /tmp/run_theta_sweep_eval.sh -- parallel eval across 5 GPUs
- /tmp/analyze_spikes.py -- extract spike positions from eval reports
- /tmp/rope_phase_analysis.py, /tmp/rope_similarity_to_origin.py -- RoPE phase analysis that partially invalidated the resonance hypothesis before the training sweep sealed it