Training Log Analysis¶
Forgather provides tools for analyzing and visualizing training logs through the forgather logs command. The logger automatically records training metrics to a trainer_logs.json file in each training run directory.
Log Format¶
Training logs are stored as JSON arrays with the following structure:
[
{"timestamp": 1769495119.472016, "global_step": 100, "epoch": 0.030, "loss": 6.171, "grad_norm": 1.022, "learning_rate": 0.000282},
{"timestamp": 1769495123.907347, "global_step": 200, "epoch": 0.060, "loss": 3.771, "grad_norm": 0.593, "learning_rate": 0.000564},
{"timestamp": 1769495136.678786, "global_step": 500, "epoch": 0.151, "eval_loss": 2.314},
...
{"timestamp": 1769495257.024434, "global_step": 3312, "epoch": 1.0, "train_runtime": 141.48, "train_samples": 211904}
]
Each record contains:
- Training records: timestamp, global_step, epoch, loss, grad_norm, max_grad_norm, learning_rate
- Evaluation records: timestamp, global_step, epoch, eval_loss
- Final summary: train_runtime, train_samples, train_samples_per_second, train_steps_per_second, effective_batch_size
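The three record kinds can be told apart by which keys are present. The following is a minimal sketch, assuming only the key names listed above; classify_record is a hypothetical helper for illustration, not part of Forgather.

```python
import json


def classify_record(record: dict) -> str:
    """Classify a log record by its keys (hypothetical helper, not a Forgather API)."""
    if "train_runtime" in record:
        return "summary"
    if "eval_loss" in record:
        return "eval"
    if "loss" in record:
        return "train"
    return "other"


# Inline sample mirroring the structure above. A real log would be read with
# json.load() on a trainer_logs.json file instead.
sample = json.loads("""
[
  {"timestamp": 1769495123.907347, "global_step": 200, "epoch": 0.060,
   "loss": 3.771, "grad_norm": 0.593, "learning_rate": 0.000564},
  {"timestamp": 1769495136.678786, "global_step": 500, "epoch": 0.151,
   "eval_loss": 2.314},
  {"timestamp": 1769495257.024434, "global_step": 3312, "epoch": 1.0,
   "train_runtime": 141.48, "train_samples": 211904}
]
""")

kinds = [classify_record(r) for r in sample]
print(kinds)
```

Checking for train_runtime first keeps the final summary separate even if it also carries loss-like fields.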
Available Commands¶
List Training Logs¶
List all available training logs in the current project:
forgather logs list
Output:
Found 4 training log(s):
1. default_model/iterable_2026-01-27T06-25-07
Path: output_models/default_model/runs/iterable_2026-01-27T06-25-07/trainer_logs.json
Modified: 2026-01-27 06:27:37
2. default_model/sharded_fast_2026-01-26T11-09-33
Path: output_models/default_model/runs/sharded_fast_2026-01-26T11-09-33/trainer_logs.json
Modified: 2026-01-26 11:11:27
Generate Summary Statistics¶
Generate summary statistics from a training log:
# Auto-detect latest log in project
forgather logs summary
# Specify log file explicitly
forgather logs summary path/to/trainer_logs.json
# Or use run directory
forgather logs summary path/to/run_dir
Output formats:
# Text (default)
forgather logs summary --format text
# JSON
forgather logs summary --format json
# Markdown
forgather logs summary --format md
# One-line (compact summary)
forgather logs summary --format one-line
Process all logs:
# Summarize all logs in project (one-line format recommended)
forgather logs summary --all --format one-line
# All logs in text format
forgather logs summary --all --format text
# All logs as JSON array
forgather logs summary --all --format json
Save to file:
forgather logs summary --output summary.txt
forgather logs summary --format json --output summary.json
Example output (text format):
Training Run Summary
============================================================
Run: iterable_2026-01-27T06-25-07
Duration: 141.48s
Total Steps: 3300
Final Epoch: 0.9964
Metrics:
Final Loss: 1.4422
Best Loss: 1.4422 (step 3300)
Average Loss: 1.9881
Final Eval Loss: 1.3666 (step 3000)
Best Eval Loss: 1.3666 (step 3000)
Training Speed:
Samples/sec: 1497.78
Steps/sec: 23.40
Effective Batch Size: 64
Gradient Statistics:
Average Grad Norm: 0.4614
Max Grad Norm: 1.0218 (step 100)
Learning Rate:
Initial: 0.000282
Final: 0.001400
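The loss figures in a summary like the one above can be derived directly from the training records. This is a rough sketch with made-up sample values; summarize_losses is a hypothetical helper, and the actual summary code in Forgather may compute or name these fields differently.

```python
def summarize_losses(records):
    """Return final, best, and average training loss from a list of log records."""
    train = [r for r in records if "loss" in r]
    if not train:
        return {}
    best = min(train, key=lambda r: r["loss"])
    return {
        "final_loss": train[-1]["loss"],
        "best_loss": best["loss"],
        "best_loss_step": best["global_step"],
        "average_loss": sum(r["loss"] for r in train) / len(train),
    }


# Illustrative values only, not taken from a real run.
records = [
    {"global_step": 100, "loss": 6.0},
    {"global_step": 200, "loss": 4.0},
    {"global_step": 300, "eval_loss": 3.5},  # eval records are skipped
    {"global_step": 400, "loss": 2.0},
]
stats = summarize_losses(records)
print(stats)
```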
Generate Plots¶
Generate plots from training logs.
Default behavior: Saves the plot to the tmp/ directory (gitignored, easy to clean up).
Basic usage:
# Save to tmp/ directory (default)
forgather logs plot
# Creates: tmp/training_plot.png
# Open plot in editor after generation
forgather logs plot -e
# Creates: tmp/training_plot.png and opens it in VS Code (works over remote SSH)
# Loss curves plot
forgather logs plot --loss-curves
# Creates: tmp/loss_curves.png
# Custom output location
forgather logs plot --output figures/training.png
# Specify output format (default: png)
forgather logs plot --output plot --format svg # Creates: plot.svg
forgather logs plot --output plot --format pdf # Creates: plot.pdf
# With config name prefix
forgather -t my_config.yaml logs plot
# Creates: tmp/my_config_training_plot.png
X-axis options:
# Plot by global step (default)
forgather logs plot --x-axis step
# Plot by epoch
forgather logs plot --x-axis epoch
# Plot by time (minutes)
forgather logs plot --x-axis time
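For the time axis, each record's timestamp has to be turned into elapsed minutes. The sketch below assumes the origin is the first record's timestamp; the tool may instead measure from the start of training, so treat the origin as an assumption. elapsed_minutes is a hypothetical helper.

```python
def elapsed_minutes(records):
    """Minutes elapsed since the first record's timestamp (assumed origin)."""
    t0 = records[0]["timestamp"]
    return [(r["timestamp"] - t0) / 60.0 for r in records]


# Illustrative timestamps in seconds, not taken from a real run.
records = [
    {"timestamp": 1000.0, "global_step": 100},
    {"timestamp": 1030.0, "global_step": 200},
    {"timestamp": 1120.0, "global_step": 500},
]
minutes = elapsed_minutes(records)
print(minutes)
```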
Select metrics to plot:
# Plot specific metrics
forgather logs plot --metrics "loss,eval_loss,learning_rate"
# Plot gradient statistics
forgather logs plot --metrics "grad_norm,max_grad_norm"
Smoothing:
# Apply moving average smoothing
forgather logs plot --smooth 10
# Smoothing with window size 20
forgather logs plot --metrics "loss" --smooth 20
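To make the effect of the window size concrete, here is a minimal moving-average sketch. It assumes a trailing window in which early points average over however many values have been seen so far; the smoothing actually applied by forgather logs plot may use a different window alignment or edge handling.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over the values seen so far."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        count = min(i + 1, window)
        smoothed.append(running / count)
    return smoothed


losses = [6.0, 4.0, 2.0, 2.0]
smoothed = moving_average(losses, 2)
print(smoothed)
```

A larger window gives a smoother curve but reacts more slowly to real changes in the loss.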
Loss curves plot:
Generate a specialized loss curves plot with learning rate on a secondary axis:
forgather logs plot --loss-curves
Other options:
# Use log scale for y-axis
forgather logs plot --log-scale
# Plot by time instead of steps
forgather logs plot --x-axis time
# Plot gradient norm (own layout; cannot combine with --loss-curves/--metrics)
forgather logs plot --grad-norm --smooth 10
# Plot losses as perplexity (exp(loss)); composes with --log-scale
forgather logs plot --loss-curves --perplexity --log-scale
# Clip the plot window in data space (respects --x-axis units)
forgather logs plot --x-max 36448 --metrics eval_loss
# Override y-axis bounds; takes precedence over --ignore-outliers
forgather logs plot --y-min 2.2 --y-max 3.5
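The --perplexity option above plots exp(loss). As a small worked example, assuming the logged loss is a cross-entropy in natural-log units, the conversion looks like this (to_perplexity is a hypothetical helper):

```python
import math


def to_perplexity(losses):
    """Convert natural-log cross-entropy losses to perplexity via exp(loss)."""
    return [math.exp(loss) for loss in losses]


# 1.3666 is the final eval loss from the example summary earlier on this page.
ppl = to_perplexity([0.0, 1.3666])
print(ppl)
```

A loss of 0 maps to a perplexity of 1, and an eval loss of 1.3666 maps to a perplexity of roughly 3.9.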
Outlier-aware auto-scaling
By default, forgather logs plot computes the y-axis window for loss-like
metrics (loss, eval_loss, train_loss, grad_norm, max_grad_norm) from
the 5th/95th percentiles of the plotted series instead of the raw min/max.
This keeps the huge early-training values from squashing the tail of the
curve, similar to TensorBoard's "Ignore outliers in chart scaling" option.
Use --no-ignore-outliers to restore the previous full-range behaviour, or
--y-min / --y-max to pin the bounds explicitly. Learning-rate subplots
and secondary axes are never auto-clipped.
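The idea behind percentile-based scaling can be sketched in a few lines. This is an illustration only: the interpolation method and any padding applied by forgather logs plot are not documented here, so the linear interpolation below is an assumption, and percentile and robust_ylim are hypothetical helpers.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("empty series")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def robust_ylim(values, low_q=5.0, high_q=95.0):
    """Y-axis bounds taken from percentiles instead of the raw min/max."""
    ordered = sorted(values)
    return percentile(ordered, low_q), percentile(ordered, high_q)


# 21 points: one large early spike followed by a slowly decaying tail.
series = [100.0] + [2.0 - 0.01 * i for i in range(20)]
y_min, y_max = robust_ylim(series)
print(y_min, y_max)
```

With the raw min/max the axis would span up to 100 and flatten the tail; with the percentile bounds the spike falls outside the window and the tail fills the plot.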
Plot mode and scaling options:
| Flag | Description |
|---|---|
| --loss-curves | Dual-axis train/eval loss with learning rate on a secondary axis. |
| --grad-norm | Gradient-norm plot; mutually exclusive with --loss-curves / --metrics. |
| --ignore-outliers / --no-ignore-outliers | Enable (default) or disable percentile-based y-axis auto-scaling. |
| --perplexity | Plot loss-like metrics as exp(loss) with relabelled axes. |
| --x-min / --x-max | Clip plotted data and x-axis to the given domain (in --x-axis units). |
| --y-min / --y-max | Override y-axis bounds; takes precedence over --ignore-outliers. |
Compare Multiple Runs¶
Compare metrics across multiple training runs:
# Compare two runs
forgather logs plot --compare run1/trainer_logs.json run2/trainer_logs.json
# Compare with loss curves plot
forgather logs plot --compare run1/trainer_logs.json run2/trainer_logs.json --loss-curves
# Save comparison
forgather logs plot --compare run1/trainer_logs.json run2/trainer_logs.json \
--output comparison.png --no-show
Usage Examples¶
Quick Training Analysis¶
After training, quickly check how it went:
Compare All Runs at a Glance¶
View all training runs in a compact table:
forgather logs summary --all --format one-line
Output:
Run Name | Steps | Time | Loss | Eval | Throughput
------------------------------------------------------------------------------------------------
iterable_2026-01-27T06-25-07 | steps=3300 | time=02:21 | loss=1.4422 | eval=1.3666 | samp/s=1497.8
sharded_fast_2026-01-26T11-09- | steps=3300 | time=01:47 | loss=1.4302 | eval=1.3100 | samp/s=1968.0
sharded-iterable_2026-01-26T10 | steps=3300 | time=01:47 | loss=1.4304 | eval=1.3153 | samp/s=1977.3
Compare Optimizer Experiments¶
Compare training runs with different optimizers:
cd examples/tiny_experiments/optimizers
forgather logs plot --compare \
output_models/tiny_causal/runs/adamw_*/trainer_logs.json \
output_models/tiny_causal/runs/apollo_*/trainer_logs.json \
--loss-curves --smooth 5 --output optimizer_comparison.png --no-show
Analyze Loss Convergence¶
Plot smoothed loss curves to see convergence behavior:
Export for Further Analysis¶
Export summary statistics as JSON for programmatic analysis:
forgather logs summary --format json --output summary.json
Generate Training Report¶
Create a markdown report with summary and plots:
# Generate summary
forgather logs summary --format md --output report.md
# Generate plots
forgather logs plot --loss-curves --output loss_plot.png --no-show
forgather logs plot --metrics "grad_norm" --output grad_plot.png --no-show
Per-Parameter Diagnostic Logging¶
Forgather provides callbacks for logging per-parameter weight norms, spectral norms, and gradient norms to JSON files. These are useful for diagnosing training instability, such as the issues described in arXiv:2510.04212.
Diagnostic logs are written on each evaluation step to keep overhead low. They are saved alongside trainer_logs.json in the run directory.
Enabling Diagnostic Callbacks¶
Add the callbacks to your trainer:
from forgather.ml.trainer.callbacks import ParameterNormLogger, GradNormLogger
callbacks = [
# Per-parameter L2 norms and spectral norms
ParameterNormLogger(
log_norms=True, # Log per-parameter L2 norms
log_spectral_norms=True, # Log per-parameter spectral norms
power_iter_steps=10, # Power iteration steps (accuracy vs speed)
),
# Per-parameter gradient norms
GradNormLogger(),
# ... your other callbacks
]
trainer = Trainer(model=model, args=args, callbacks=callbacks)  # plus any other Trainer arguments
trainer.train()
Both callbacks support checkpoint resume via the Stateful protocol. On resume, log files are truncated to the checkpoint step and appending continues.
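The truncate-on-resume behaviour can be pictured as a filter on global_step. The sketch below is an assumption about the semantics (records at the checkpoint step itself are kept), not the callbacks' actual implementation; truncate_to_step is a hypothetical helper.

```python
def truncate_to_step(records, checkpoint_step):
    """Keep only records logged at or before the checkpoint step (inclusive, assumed)."""
    return [r for r in records if r["global_step"] <= checkpoint_step]


# A run that logged up to step 2000 but is resumed from a step-1000 checkpoint.
records = [{"global_step": s} for s in (500, 1000, 1500, 2000)]
kept = truncate_to_step(records, checkpoint_step=1000)
print([r["global_step"] for r in kept])
```

Dropping the later records avoids duplicate entries when the resumed run reaches steps 1500 and 2000 again.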
ParameterNormLogger¶
Writes parameter_norms.json containing per-parameter L2 norms and/or spectral norms:
[
{"timestamp": 1700000000.0, "global_step": 500, "epoch": 0.5,
"norms": {"model.layers.0.attention.q.weight": 1.23, "model.layers.0.attention.k.weight": 1.18, ...},
"spectral_norms": {"model.layers.0.attention.q.weight": 0.98, "model.layers.0.attention.k.weight": 0.91, ...}},
...
]
Spectral norms are estimated via power iteration, which is significantly faster than full SVD for large weight matrices. Direction vectors are cached across evaluation steps for warm-starting, improving accuracy over time.
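For readers unfamiliar with the technique, here is a small pure-Python sketch of power iteration on the Gram matrix (A transposed times A), including the warm-start idea. It is illustrative only: ParameterNormLogger presumably operates on tensors, and its exact iteration, initialization, and caching are not shown in this document. spectral_norm, matvec, and transpose are hypothetical helpers.

```python
import math


def matvec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def spectral_norm(a, steps=10, v=None):
    """Estimate the largest singular value of `a` by power iteration.

    Returns (sigma, v). Passing v back in on a later call warm-starts the
    iteration from the previous direction estimate.
    """
    at = transpose(a)
    if v is None:
        v = [1.0] * len(a[0])  # deterministic start; a random start is more usual
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(steps):
        w = matvec(at, matvec(a, v))  # w = A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(x * x for x in matvec(a, v)))  # ||A v|| with ||v|| = 1
    return sigma, v


weights = [[3.0, 0.0], [0.0, 1.0]]  # singular values are 3 and 1
sigma, direction = spectral_norm(weights, steps=10)
# Warm start: a single step from the cached direction is already accurate.
sigma_warm, _ = spectral_norm(weights, steps=1, v=direction)
print(sigma, sigma_warm)
```

Each step costs two matrix-vector products, which is why this is much cheaper than a full SVD for large matrices.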
GradNormLogger¶
Writes gradient_norms.json containing per-parameter gradient L2 norms. Gradients are captured after clipping but before the optimizer step (via the on_pre_optimizer_step hook):
[
{"timestamp": 1700000000.0, "global_step": 500, "epoch": 0.5,
"grad_norms": {"model.layers.0.attention.q.weight": 0.012, "model.layers.0.attention.k.weight": 0.045, ...}},
...
]
When fuse_optim_with_backward is enabled, gradients are consumed during the backward pass and this callback disables itself with a warning.
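Because gradient_norms.json is plain JSON, it is easy to inspect without any plotting. The sketch below, which assumes the record layout shown above, finds the parameters with the largest gradient norm at each logged step; largest_grad_norms is a hypothetical helper.

```python
import json


def largest_grad_norms(records, top_k=1):
    """For each record, return the top_k (name, norm) pairs by gradient norm."""
    result = {}
    for rec in records:
        ranked = sorted(rec["grad_norms"].items(), key=lambda kv: kv[1], reverse=True)
        result[rec["global_step"]] = ranked[:top_k]
    return result


# Inline sample using the values from the example above. A real log would be
# read with json.load() on a gradient_norms.json file.
sample = json.loads("""
[
  {"timestamp": 1700000000.0, "global_step": 500, "epoch": 0.5,
   "grad_norms": {"model.layers.0.attention.q.weight": 0.012,
                  "model.layers.0.attention.k.weight": 0.045}}
]
""")

top = largest_grad_norms(sample)
print(top)
```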
Diagnostic Log Location¶
Diagnostic logs are saved to the same directory as trainer_logs.json:
output_models/MODEL_NAME/runs/RUN_NAME/
trainer_logs.json # Standard training metrics
parameter_norms.json # Per-parameter weight/spectral norms
gradient_norms.json # Per-parameter gradient norms
Per-Parameter Heatmap Plots¶
The forgather plot heatmap command generates grid heatmaps from diagnostic log files, with parameter FQN labels on the y-axis and training steps on the x-axis. Each cell is color-coded by the metric value.
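The grid described above is a matrix with one row per parameter and one column per logged step. A minimal sketch of building it from diagnostic records follows; build_grid and the parameter names are hypothetical, and the real command may order or label rows differently.

```python
def build_grid(records, key):
    """Arrange per-parameter values as rows (parameters) by columns (steps)."""
    steps = [r["global_step"] for r in records]
    # Preserve first-seen order so rows follow the order in the log.
    names = list(dict.fromkeys(name for r in records for name in r[key]))
    grid = [[r[key].get(name, float("nan")) for r in records] for name in names]
    return names, steps, grid


# Illustrative records with made-up parameter names and values.
records = [
    {"global_step": 500, "norms": {"a.weight": 1.0, "b.weight": 2.0}},
    {"global_step": 1000, "norms": {"a.weight": 1.5, "b.weight": 2.5}},
]
names, steps, grid = build_grid(records, "norms")
print(names, steps, grid)
```

Each inner list of grid is one heatmap row; a missing value becomes NaN so it can be rendered as an empty cell.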
Basic Usage¶
# Auto-detect latest diagnostic log in project
forgather plot heatmap
# Specify a log file
forgather plot heatmap path/to/parameter_norms.json
# Open in editor after generation
forgather plot heatmap -e
The metric type is auto-detected from the file content (norms, spectral_norms, or grad_norms). To override:
# Plot spectral norms from a file that contains both norms and spectral_norms
forgather plot heatmap path/to/parameter_norms.json --metric spectral_norm
Filtering Parameters¶
Use --filter / -f with a regex pattern to show only matching parameter names:
# Only attention parameters
forgather plot heatmap parameter_norms.json -f "attention"
# Only query and key matrices
forgather plot heatmap parameter_norms.json -f "(q_proj|k_proj)"
# Only a specific layer
forgather plot heatmap parameter_norms.json -f "layers\.2\."
# Only MLP parameters across all layers
forgather plot heatmap gradient_norms.json -f "mlp"
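To preview which parameters a pattern would select, the same regex can be tried in Python. This sketch assumes the filter matches anywhere in the name (re.search semantics), which this document does not confirm; the mlp parameter names are hypothetical.

```python
import re


def filter_parameters(names, pattern):
    """Keep names where the regex matches anywhere in the string (assumed semantics)."""
    regex = re.compile(pattern)
    return [n for n in names if regex.search(n)]


names = [
    "model.layers.0.attention.q.weight",
    "model.layers.2.attention.k.weight",
    "model.layers.2.mlp.up.weight",    # hypothetical name
    "model.layers.20.mlp.up.weight",   # hypothetical name
]
layer_2 = filter_parameters(names, r"layers\.2\.")
mlp = filter_parameters(names, "mlp")
print(layer_2)
print(mlp)
```

The trailing escaped dot in layers\.2\. is what keeps layer 20 out of the layer 2 selection.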
Controlling Step Density¶
For long training runs, use --step-stride to reduce the number of plotted steps:
Color Scale Options¶
# Log scale (useful when values span multiple orders of magnitude)
forgather plot heatmap parameter_norms.json --log-scale
# Manual color range
forgather plot heatmap parameter_norms.json --vmin 0.5 --vmax 2.0
Output Options¶
# Custom output path
forgather plot heatmap parameter_norms.json -o figures/spectral_heatmap.png
# SVG format
forgather plot heatmap parameter_norms.json --format svg
# Custom figure size (width height in inches)
forgather plot heatmap parameter_norms.json --figsize 20 30
# Custom title
forgather plot heatmap parameter_norms.json --title "Spectral Norms: BF16 Run"
Typical Workflow¶
# 1. Train with diagnostic callbacks enabled
forgather -t config.yaml train
# 2. List available logs
forgather logs list
# 3. Check overall spectral norms
forgather plot heatmap output_models/my_model/runs/latest/parameter_norms.json --metric spectral_norm -e
# 4. Zoom in on attention layers
forgather plot heatmap output_models/my_model/runs/latest/parameter_norms.json \
--metric spectral_norm -f "attention" -e
# 5. Check gradient norms for the same layers
forgather plot heatmap output_models/my_model/runs/latest/gradient_norms.json \
-f "attention" -e
Programmatic API¶
You can also use the analysis tools programmatically in Python:
from forgather.ml.analysis import TrainingLog, compute_summary_statistics, plot_training_metrics
# Load log
log = TrainingLog.from_file("path/to/trainer_logs.json")
# Get summary statistics
summary = compute_summary_statistics(log)
print(f"Best loss: {summary['best_loss']} at step {summary['best_loss_step']}")
# Generate plots
from forgather.ml.analysis.plotting import plot_loss_curves
fig = plot_loss_curves([log], smooth_window=10, output_path="loss.png")
Heatmap API¶
from forgather.ml.analysis import plot_parameter_heatmap
# Plot spectral norms for attention parameters only
fig = plot_parameter_heatmap(
"path/to/parameter_norms.json",
metric="spectral_norm",
filter_pattern="attention",
step_stride=2,
log_scale=True,
output_path="spectral_heatmap.png",
)
Log Location¶
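Training logs are automatically saved to:
output_models/MODEL_NAME/runs/RUN_NAME/trainer_logs.json
For example:
output_models/default_model/runs/iterable_2026-01-27T06-25-07/trainer_logs.json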
Dependencies¶
The plotting functionality requires matplotlib:
pip install matplotlib
For additional analysis capabilities, pandas is recommended but optional:
pip install pandas