Evaluating Models with `forgather eval`¶

The forgather eval subcommand runs a loss/perplexity evaluation against a named dataset, writes the results as JSON alongside the model, and prints a human-readable summary to the console.

Unlike forgather train, an eval run does not require you to construct a training project, locate a dataset project, hand-craft a long command line, or dig through runs/ for the loss. Named eval configs (c4, tinystories, ...) are discovered automatically.

Quick start¶

# List available eval configs
forgather eval list

# Show the details of a config
forgather eval show c4

# Evaluate a model on TinyStories with all visible GPUs (DDP)
forgather eval test tinystories -M /path/to/model

# Evaluate on FineWeb-Edu-Dedup using pipeline parallelism for a model that
# doesn't fit on a single GPU
forgather eval test fineweb-edu-dedup \
    -M /path/to/model \
    --trainer pipeline \
    --max-length 4096 \
    --dtype bfloat16

Output¶

Rank 0 writes:

<model_dir>/evals/<eval_name>_<timestamp>/results.json

results.json contains:

{
  "eval_name": "tinystories",
  "config_name": "Eval: TinyStories",
  "model_path": "/path/to/model",
  "checkpoint_path": null,
  "dataset_proj": "/.../examples/datasets/roneneldan",
  "dataset_config": "tinystories-packed.yaml",
  "dataset_target": "test_dataset",
  "batch_size": 16,
  "max_length": 512,
  "dtype": "bfloat16",
  "attn_implementation": "sdpa",
  "trainer": "ddp",
  "world_size": 2,
  "eval_loss": 1.699,
  "perplexity": 5.47,
  "wall_time_s": 12.3,
  "timestamp": "2026-04-18T05:11:34Z"
}

The same information is printed as a summary table at the end of the run.
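
The sample values above are consistent with perplexity being the exponential of eval_loss, which is the usual definition for a mean token-level cross-entropy measured in nats. The following is a minimal sketch, assuming that relationship holds for every run, of how a results.json payload could be sanity-checked. The helpers `perplexity_from_loss` and `check_results` are hypothetical and not part of forgather.

```python
import json
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(eval_loss)


def check_results(results: dict, rel_tol: float = 1e-2) -> bool:
    """Return True if the reported perplexity matches exp(eval_loss).

    Assumption: the tool reports perplexity = exp(eval_loss). The tolerance
    is loose because the sample file shows both values rounded.
    """
    expected = perplexity_from_loss(results["eval_loss"])
    return math.isclose(results["perplexity"], expected, rel_tol=rel_tol)


# Values copied from the sample results.json above.
sample = json.loads('{"eval_loss": 1.699, "perplexity": 5.47}')
print(check_results(sample))
```

A check like this is mostly useful when comparing numbers across tools, since some report loss in bits rather than nats.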

Subcommands¶

`forgather eval list`¶

Prints one row per discovered eval config: name, description, and the project file that defines it.

`forgather eval show NAME`¶

Prints the resolved metadata for NAME: dataset project, dataset config, target, default batch size / max length. Pass --pp to dump the full preprocessed YAML.

`forgather eval test NAME [flags]`¶

Runs the eval. All flags are optional; defaults come from the config.

| Flag | Default | Purpose |
| --- | --- | --- |
| `-M`, `--model PATH` | current project's output dir | Model to evaluate |
| `-d`, `--devices "0,1"` | all visible GPUs | `CUDA_VISIBLE_DEVICES` for the child process |
| `--trainer {ddp,simple,pipeline}` | `ddp` | Trainer backend (see below) |
| `--checkpoint PATH` | (none) | Resume from an explicit checkpoint |
| `--no-checkpoint` | off | Load via `AutoModelForCausalLM.from_pretrained` instead of resuming |
| `--batch-size N` | config default | Per-device eval batch size |
| `--max-length N` | config default | Max sequence length |
| `--stride N` | config default (0) | Packing stride (forwarded to the dataset project) |
| `--max-steps N` | `-1` (all) | Cap the number of eval batches |
| `--dtype {bfloat16,float16,float32}` | `bfloat16` | Model dtype |
| `--attn-implementation NAME` | `sdpa` | `sdpa`, `flash_attention_2`, `flex_attention`, ... |
| `--compile` | off | `torch.compile` the model |
| `--output-dir PATH` | `$MODEL` | Override where `evals/` is written |
| `--dry-run` | off | Print the torchrun command without executing |
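
When evals are launched from a script, it can help to assemble the command line programmatically. The sketch below builds an argument list using only flags from the table above; `build_eval_argv` is a hypothetical helper, not something forgather ships, and it covers just a subset of the flags.

```python
import shlex

TRAINERS = {"ddp", "simple", "pipeline"}


def build_eval_argv(name, model=None, trainer=None, max_length=None,
                    batch_size=None, dtype=None, dry_run=False):
    """Build a `forgather eval test` argv from the documented flags.

    Only explicitly set options are emitted, so everything else falls
    back to the defaults defined by the eval config.
    """
    argv = ["forgather", "eval", "test", name]
    if model is not None:
        argv += ["-M", str(model)]
    if trainer is not None:
        if trainer not in TRAINERS:
            raise ValueError(f"unknown trainer: {trainer!r}")
        argv += ["--trainer", trainer]
    if max_length is not None:
        argv += ["--max-length", str(max_length)]
    if batch_size is not None:
        argv += ["--batch-size", str(batch_size)]
    if dtype is not None:
        argv += ["--dtype", dtype]
    if dry_run:
        argv.append("--dry-run")
    return argv


argv = build_eval_argv("tinystories", model="/path/to/model",
                       max_length=1024, dry_run=True)
print(shlex.join(argv))
# To launch for real: subprocess.run(argv, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with model paths that contain spaces.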

Trainer backends¶

ddp (default) — DDPTrainer under torchrun --nproc-per-node gpu. Uses every visible GPU unless you pass -d. This is the standard choice for models that fit on a single GPU.
simple — single-process base Trainer. Faster startup, no torchrun. Useful for very small models or debugging.
pipeline — PipelineTrainer under torchrun. Splits the model across GPUs via create_manual_causal_lm_splitter() with a ScheduleGPipe schedule. Use this when a single GPU cannot hold the model. The data collator switches to padding="max_length" for pipeline runs so every microbatch has a matching shape.
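
The padding remark for pipeline runs can be illustrated without any ML libraries. The sketch below is a toy model of the idea, not forgather's collator: `pad_batch` and `batch_shape` are hypothetical helpers, and "longest" refers to the Hugging Face padding strategy that pads each batch to its own longest sequence.

```python
def pad_batch(batch, pad_id, max_length=None):
    """Right-pad lists of token ids.

    max_length=None mimics padding="longest" (pad to the longest sequence
    in this batch); an integer mimics padding="max_length".
    """
    target = max_length if max_length is not None else max(len(s) for s in batch)
    if any(len(seq) > target for seq in batch):
        raise ValueError("sequence longer than max_length; truncate first")
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


def batch_shape(batch):
    """Return (batch_size, sequence_length) for a padded batch."""
    return (len(batch), len(batch[0]))


microbatches = [
    [[1, 2, 3], [4, 5]],
    [[6, 7, 8, 9, 10], [11]],
]

longest = [batch_shape(pad_batch(mb, pad_id=0)) for mb in microbatches]
fixed = [batch_shape(pad_batch(mb, pad_id=0, max_length=8)) for mb in microbatches]

print(longest)  # sequence length varies from one microbatch to the next
print(fixed)    # every microbatch has the same shape
```

The trade-off is extra compute spent on pad tokens in exchange for shapes that stay constant across microbatches.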

Model loading¶

By default the trainer constructs the model on the target device with no_init_weights(), then loads the latest checkpoint under --model (the same mechanism forgather train uses). Pass an explicit path with --checkpoint PATH to pin a specific one, or --no-checkpoint to load via AutoModelForCausalLM.from_pretrained on the model directory.

Quantized models (artifacts produced by forgather finalize --quantize) load transparently through the standard checkpoint-resume path. Forgather's native loader (forgather.ml.sharded_checkpoint.load_checkpoint) detects torchao quantization from config.json's quantization_config block (or falls back to scanning the saved state_dict) and installs the matching quantized linear modules before load_state_dict runs. No extra flag or caller-side recipe argument is needed. See QAT Training § Evaluating Quantized Models.

The tokenizer is always loaded directly from --model via AutoTokenizer.from_pretrained.
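
For orientation, the --no-checkpoint path described above corresponds roughly to a plain Hugging Face load. The sketch below is an approximation under that assumption; `load_for_eval` is a hypothetical helper, and forgather's actual loading code may pass different arguments.

```python
ALLOWED_DTYPES = ("bfloat16", "float16", "float32")


def load_for_eval(model_dir, dtype="bfloat16", attn_implementation="sdpa"):
    """Load a model and tokenizer directly from a model directory.

    Hypothetical helper mirroring the documented defaults for --dtype and
    --attn-implementation; not forgather's real implementation.
    """
    if dtype not in ALLOWED_DTYPES:
        raise ValueError(f"dtype must be one of {ALLOWED_DTYPES}, got {dtype!r}")

    # Imported lazily so the argument check works without the heavy deps.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=getattr(torch, dtype),  # e.g. torch.bfloat16
        attn_implementation=attn_implementation,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model.eval(), tokenizer
```

Calling `load_for_eval("/path/to/model")` would return the model in eval mode together with its tokenizer, which is handy for spot-checking a checkpoint outside the eval harness.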

Adding a new eval config¶

Eval configs live under examples/evaluation/. Each is a small YAML file that extends the base eval template test/test_type.yaml (from templatelib/base/test/) and sets a few metadata fields:

-- extends "test/test_type.yaml"

[config_metadata]
    == super()
    -- set ns.config_name = "Eval: My Dataset"
    -- set ns.config_description = "Evaluate on MyDataset test split"
    -- set ns.eval_name = "mydataset"
    -- set ns.dataset_proj = joinpath(ns.forgather_dir, "examples", "datasets", "my_org")
    -- set ns.dataset_config = "my-packed.yaml"
    -- set ns.default_max_length = 1024
    -- set ns.default_batch_size = 8

The base template stamps config_class = "type.evaluation" into the meta block, which is what forgather eval list and forgather eval show use to discover configs.

User-level search paths¶

Drop a file at ~/.config/forgather/config.yaml to extend (or replace) the default search path:

eval:
  search_paths:
    - /path/to/my/extra/eval_projects
  # replace_default: true   # drop the builtin examples/evaluation/ path

forgather eval list will pick up configs from every listed directory.
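
The extend-or-replace behaviour can be sketched as a small function over the parsed config. This is an illustration only: `resolve_search_paths` is a hypothetical name, the input is assumed to be the dictionary obtained by parsing config.yaml (for example with PyYAML's `yaml.safe_load`), and the ordering of builtin versus user paths is an assumption.

```python
from pathlib import Path


def resolve_search_paths(user_config, builtin_paths):
    """Combine builtin eval search paths with user-level ones.

    Assumptions: builtin paths come first, duplicates are dropped, and
    `replace_default: true` removes the builtin paths entirely.
    """
    eval_cfg = (user_config or {}).get("eval") or {}
    extra = [Path(p).expanduser() for p in eval_cfg.get("search_paths") or []]
    base = [] if eval_cfg.get("replace_default", False) else [Path(p) for p in builtin_paths]

    resolved, seen = [], set()
    for path in base + extra:
        if path not in seen:  # keep first occurrence, preserve order
            seen.add(path)
            resolved.append(path)
    return resolved


builtin = ["examples/evaluation"]
config = {"eval": {"search_paths": ["/path/to/my/extra/eval_projects"]}}
print(resolve_search_paths(config, builtin))
```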

Example: evaluating a pre-trained model across datasets¶

MODEL=/mnt/rust/aiassets/models/qwen3-1.7b-base

for NAME in tinystories c4 openorca fineweb-edu-dedup; do
    forgather eval test "$NAME" -M "$MODEL" --max-length 2048 --batch-size 4
done

# Results land in $MODEL/evals/<name>_<timestamp>/results.json for each.
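
After a loop like the one above, the per-dataset numbers can be gathered into a single summary. The sketch below relies only on the documented layout (evals/<eval_name>_<timestamp>/results.json) and the fields shown in the sample results.json; `latest_results` and `print_summary` are hypothetical helpers.

```python
import json
from pathlib import Path


def latest_results(model_dir):
    """Return {eval_name: newest results dict} for one model directory."""
    latest = {}
    for path in Path(model_dir).glob("evals/*/results.json"):
        with path.open() as f:
            result = json.load(f)
        name = result["eval_name"]
        # ISO 8601 UTC timestamps such as "2026-04-18T05:11:34Z" sort
        # lexicographically, so a string comparison finds the newest run.
        if name not in latest or result["timestamp"] > latest[name]["timestamp"]:
            latest[name] = result
    return latest


def print_summary(model_dir):
    """Print one line per eval with its loss and perplexity."""
    for name, r in sorted(latest_results(model_dir).items()):
        print(f"{name:<20} loss={r['eval_loss']:.3f}  ppl={r['perplexity']:.2f}")
```

Calling `print_summary("/path/to/model")` lists the newest run for each eval. If --output-dir was used, pass that directory instead of the model directory.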
