Project Index¶
import forgather.nb.notebooks as nb
nb.display_project_index(show_available_templates=True)
Tiny Llama¶
In this tutorial we will train a very small Llama model (about 4.4M parameters) on 10% of the Tiny Stories dataset. On a single RTX-4090, this takes a little under four minutes. Once training is complete, we will load the model and use it for text generation, and the output will be reasonably coherent for a model that is only a few minutes old.
Project Directory: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama"¶
Meta Config¶
Meta Config: /home/dinalt/rust/forgather/examples/tutorials/tiny_llama/meta.yaml
Template Search Paths:
- /home/dinalt/rust/forgather/examples/tutorials/tiny_llama/templates
- /home/dinalt/rust/forgather/forgather_workspace
- /home/dinalt/rust/forgather/templatelib/modellib
- /home/dinalt/rust/forgather/templatelib/examples
- /home/dinalt/rust/forgather/templatelib/base
Available Configurations¶
Default Configuration: train_tiny_llama.yaml
Available Templates¶
- base_directories.yaml
- meta_defaults.yaml
- tokenizers/tiny_8k.yaml
- tokenizers/tiny_2k.yaml
- tokenizers/wikitext/32k.yaml
- tokenizers/wikitext/8k.yaml
- flex_kernel_options/default.yaml
- config_type.yaml
- trainers/base_trainer.yaml
- models/base_language_model.yaml
- callbacks/base_callbacks.yaml
This example makes extensive use of the Forgather templates library. Take a look at the various files which go into the configuration and compare these to the pre-processed output.
nb.display_config(config_template="train_tiny_llama.yaml", show_pp_config=True, show_generated_code=False)
Included Templates¶
Config Metadata:¶
{'config_class': 'type.training_script.causal_lm',
'config_description': 'A demo of training a tiny llama model from scratch',
'config_name': 'Tiny Llama',
'datasets_dir': '/home/dinalt/rust/forgather/datasets',
'forgather_dir': '/home/dinalt/rust/forgather',
'logging_dir': './output_models/tiny_llama/runs/log_2026-04-04T01-20-47',
'model_src_dir': '/home/dinalt/rust/forgather/model_src',
'models_dir': './output_models',
'nproc_per_node': 1,
'output_dir': './output_models/tiny_llama',
'project_dir': '.',
'tokenizers_dir': '/home/dinalt/rust/forgather/tokenizers',
'workspace_root': '/home/dinalt/rust/forgather'}
Modules¶
Output Targets¶
- distributed_env
- tokenizer
- model
- tokenizer_args
- train_dataset
- eval_dataset
- data_collator
- experiment_info
- generation_config
- text_gen_callback_args
- trainer_callbacks
- optimizer
- lr_scheduler
- trainer_args
- model_preprocessor
- trainer
- dynamic_args
- meta
- main
Preprocessed Config¶
#---------------------------------------
# Tiny Llama
#---------------------------------------
# 2026-04-04T01:20:47+00:00
# Description: A demo of training a tiny llama model from scratch
# Project Dir: /home/dinalt/rust/forgather/examples/tutorials/tiny_llama
# Current Working Dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama"
# Forgather Config Dir: "/home/dinalt/.config/forgather"
# Model: tiny_llama
# Hostname: hal9000
# Versions:
# python: 3.12.3
# torch: 2.10.0
# transformers: 5.1.0
# accelerate: 1.12.0
############# Config Vars ##############
# ns.forgather_dir: "/home/dinalt/rust/forgather"
# ns.models_dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama/output_models"
# ns.project_model_src_dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama/model_src"
# ns.tokenizers_dir: "/home/dinalt/rust/forgather/tokenizers"
# ns.datasets_dir: "/home/dinalt/rust/forgather/datasets"
# ns.model_src_dir: "/home/dinalt/rust/forgather/model_src"
# ns.output_dir: "./output_models/tiny_llama"
# ns.logging_dir: "./output_models/tiny_llama/runs/log_2026-04-04T01-20-47"
# ns.nproc_per_node: 1
####### Distributed Environment ########
distributed_env: &distributed_env !singleton:forgather.ml.distributed:DistributedEnvironment@distributed_env
backend: cuda:nccl,cpu:gloo
no_accelerator: False
############# Dependencies #############
################ Model #################
# https://huggingface.co/docs/transformers/en/model_doc/auto
.define: &model_constructor_args
# See: https://huggingface.co/docs/transformers/en/attention_interface
attn_implementation: "sdpa"
.define: &model_dict !call:forgather:from_project
project_dir: "/home/dinalt/rust/forgather/examples/models/llama"
config_template: "4M.yaml"
targets: [ "pretrained_tokenizer", "model" ]
pp_kwargs:
output_dir: "./output_models/tiny_llama"
pp_debug: False
model_constructor_args: *model_constructor_args
tokenizer: &tokenizer !call:getitem [ *model_dict, 'pretrained_tokenizer' ]
model: &model !call:getitem [ *model_dict, 'model' ]
############### Datasets ###############
tokenizer_args: &tokenizer_args !dict
truncation: True
max_length: 512
.define: &dataset_dict !call:forgather:from_project
project_dir: "/home/dinalt/rust/forgather/examples/datasets/roneneldan"
config_template: "tinystories-abridged.yaml"
targets: [ "train_dataset", "eval_dataset" ]
preprocess_args: *tokenizer_args
tokenizer: *tokenizer
train_dataset: &train_dataset !call:getitem [ *dataset_dict, 'train_dataset' ]
eval_dataset: &eval_dataset !call:getitem [ *dataset_dict, 'eval_dataset' ]
############ Data Collator #############
# Data collator for causal model
# Batches are dynamically padded to longest sequence
# labels are set to input_ids, with pad tokens set to -100
data_collator: &data_collator !singleton:forgather.ml.data_collator:DataCollatorForCausalLM@DataCollatorForCausalLM
tokenizer: *tokenizer
return_tensors: pt
# Tiny Llama
truncation: True
max_length: 512
########## Trainer Callbacks ###########
# **Dependencies**
.define: &step_columns !dict
.define: &final_metrics !dict
.define: &peak_hardware_flops null
# Resumable TensorBoard SummaryWriter wrapper
.define: &summary_writer !singleton:forgather.ml.trainer.callbacks:ResumableSummaryWriter
log_dir: "./output_models/tiny_llama/runs/log_2026-04-04T01-20-47"
.define: &tb_scalars !dict
# Additional data to record to experiment loggers
experiment_info: &experiment_info !dict:@experiment_info
date: "2026-04-04T01:20:47+00:00"
name: "Tiny Llama"
description: "A demo of training a tiny llama model from scratch"
config: !var "pp_config"
versions: {'python': '3.12.3', 'torch': '2.10.0', 'transformers': '5.1.0', 'accelerate': '1.12.0'}
generation_config: &generation_config !dict:@generation_config
do_sample: True
top_k: 20
temperature: 0.7
repetition_penalty: 1.15
text_gen_callback_args: &text_gen_callback_args
summary_writer: *summary_writer
prompts: /home/dinalt/rust/forgather/prompts/tiny_stories.yaml
generation_config: *generation_config
max_new_tokens: 40
generation_steps: 2000
# **Callback List**
trainer_callbacks: &trainer_callbacks !dlist:@trainer_callbacks
default_metrics: !singleton:forgather.ml.trainer.callbacks:DefaultMetrics
peak_hardware_flops: *peak_hardware_flops
progress_callback: !singleton:forgather.ml.trainer.callbacks:ProgressCallback
use_tqdm: null # Optional[bool] : Use TQDM, Auto, if unspecified
output_stream: "stdout" #Literal["stderr", "stdout"]
step_columns: *step_columns
final_metrics: *final_metrics
info_callback: !singleton:forgather.ml.trainer.callbacks:InfoCallback
verbose: False
# ResumableSummaryWriter registered as callback for checkpoint state persistence
resumable_writer: *summary_writer
# Log all training output to JSON
json_logger: !singleton:forgather.ml.trainer.callbacks:JsonLogger
<<: *experiment_info
# Log configuration and metrics to Tensorboard file
tb_logger: !singleton:forgather.ml.trainer.callbacks:TBLogger
summary_writer: *summary_writer
scalars: *tb_scalars
experiment_info: *experiment_info
text_gen_callback: !singleton:forgather.ml.trainer.callbacks:TextgenCallback
<<: *text_gen_callback_args
# Allow remote control of the training process
trainer_control: !singleton:forgather.ml.trainer.callbacks:TrainerControlCallback
############## Optimizer ###############
optimizer: &optimizer !partial:torch:optim.AdamW
lr: 1.0e-3
############# LR Scheduler #############
# https://arxiv.org/html/2503.02844v1
lr_scheduler: &lr_scheduler !lambda:forgather.ml.optim.infinite_lr_scheduler:InfiniteLRScheduler@lr_scheduler
warmup_steps: 500
cooldown_steps: 50000
constant_lr: 1.0e-6
############# Trainer Args #############
trainer_args: &trainer_args !dict
save_strategy: "no"
max_steps: -1
output_dir: "./output_models/tiny_llama"
logging_dir: "./output_models/tiny_llama/runs/log_2026-04-04T01-20-47"
# Tiny Llama Project Overrides
eval_strategy: "steps"
save_strategy: "steps"
save_steps: 10000
# Safetensors can't handle tied parameters/buffers, so fallback to PyTorch format.
save_safetensors: False
seed: 42
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
logging_steps: 100
eval_steps: 500
num_train_epochs: 1
dataloader_num_workers: 1
# **Trainer Dependencies**
.define: &fused_loss_factory null
############### Trainer ################
# Name: Forgather Trainer
# Description: A lightweight, extensible trainer; does not support multiple GPUs
# Trainer Config Class: forgather.ml.trainer:TrainingArguments
# Trainer Class: forgather.ml.trainer:Trainer
# **Trainer Dependencies**
model_preprocessor: &model_preprocessor !partial:call
- *model
# **Trainer Constructor**
trainer: &trainer !singleton:forgather.ml.trainer:Trainer@trainer
args: *trainer_args
model_init: *model_preprocessor
data_collator: *data_collator
train_dataset: *train_dataset
eval_dataset: *eval_dataset
processing_class: *tokenizer
callbacks: *trainer_callbacks
# **Trainer**
compute_loss_func: !singleton:forgather.ml.loss:CausalLoss
distributed_env: *distributed_env
optimizer_factory: *optimizer
lr_scheduler_factory: *lr_scheduler
fused_loss_factory: *fused_loss_factory
# **Dynamic Args**
dynamic_args: !dlist
null: ~
max_steps:
names: "--max-steps"
type: "int"
help: "Set maximum training steps"
save_strategy:
names: [ "--save-strategy", "-S" ]
choices: [ "no", "steps", "epoch" ]
type: "str"
help: "When to save checkpoints"
no_accelerator:
names: "--no-accelerator"
action: "store_true"
help: "Disable use of accelerator, when available. e.g. 'don't use GPU'"
dist_backend:
names: "--dist-backend"
help: "The name of the torch-distributed backend to use"
output_dir:
names: "--output-dir"
type: path
help: "Overrides default model output directory path"
model_name:
names: "--model-name"
help: "Set custom model name"
log_name:
names: "--log-name"
help: "Set log name prefix"
attn_implementation:
names: "--attn-implementation"
type: "str"
choices: [ "eager", "sdpa", "flash_attention_2", "flex_attention" ]
help: "Attention implementation"
verbose_info:
names: "--verbose-info"
action: "store_true"
help: "Display verbose training info on startup"
#---------------------------------------
# Configuration Output
#---------------------------------------
meta: &meta_output !dict:@meta
config_name: "Tiny Llama"
config_description: "A demo of training a tiny llama model from scratch"
config_class: "type.training_script.causal_lm"
project_dir: "."
workspace_root: "/home/dinalt/rust/forgather"
forgather_dir: "/home/dinalt/rust/forgather"
models_dir: "./output_models"
tokenizers_dir: "/home/dinalt/rust/forgather/tokenizers"
datasets_dir: "/home/dinalt/rust/forgather/datasets"
output_dir: "./output_models/tiny_llama"
model_src_dir: "/home/dinalt/rust/forgather/model_src"
logging_dir: "./output_models/tiny_llama/runs/log_2026-04-04T01-20-47"
nproc_per_node: 1
main: !singleton:forgather.ml.training_script:TrainingScript@training_script
meta: *meta_output
do_train: True
do_save: False
do_eval: False
distributed_env: *distributed_env
trainer: *trainer
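The data collator comments in the preprocessed config describe two behaviours: batches are padded to the longest sequence, and labels copy input_ids with padding replaced by -100. The following is a minimal pure-Python sketch of that idea, assuming right-padding. It is not Forgather's DataCollatorForCausalLM; the function name `collate_causal_lm` and the pad token id used in the example are hypothetical.

```python
from typing import Dict, List

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def collate_causal_lm(
    sequences: List[List[int]], pad_token_id: int
) -> Dict[str, List[List[int]]]:
    """Right-pad a batch to its longest sequence and build causal-LM labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# Hypothetical token ids; pad_token_id=0 is an assumption for illustration.
batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])
print(batch["labels"])
```

Padding per batch rather than to max_length keeps short batches cheap, which matters for a dataset of short stories.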
Load Project¶
Load the default configuration.
from forgather.project import Project
import forgather.nb.notebooks as nb
# Load the default project, which is "train_tiny_llama.yaml"
proj = Project()
Start Tensorboard¶
This project has been configured to log training to Tensorboard (TB). To watch the model's training progress, start the TB server from a shell using either of the methods below.
Tensorboard can be started from a terminal like this:
# By default, Tensorboard binds only to localhost. To bind to all interfaces, add --bind_all
tensorboard --logdir "/path/to/model/log/directory" [--bind_all]
You can use the CLI to launch TB for you, where it will automatically determine the path to the log directory:
# --all : Watch all output model directories, otherwise just the one for the current configuration.
# -- : Any arguments after '--' are passed directly to tensorboard, for example "--bind_all"
cd PROJECT_DIR
forgather tb [--all] [-- <tensorboard-args>]
When TB starts, it should provide the URL to access it. e.g.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)
Train Model¶
You have a few options for training the model.
- Run it directly from the notebook. This should work fine with this example, although for projects using multiple GPUs you will want to use one of the other options. To train from the notebook, run the training cell below.
- You can generate a training script and run it from the shell. To do so, run the cell with "generate_trainingscript()," then run the generated shell script from a terminal.
- You can use the Forgather CLI.
# Open a shell in this project's directory, then run this command:
cd PROJECT_DIR
forgather train
# See forgather --help for more details.
Once training starts, switch to Tensorboard in your browser. One of the first things you will want to do is enable automatic refresh. To do so, click the gear in the upper-right corner and check "Reload Data."
Once training has started, take a look at the "Text" tab. You will see that the preprocessed configuration has been logged automatically, along with a dump of the primary training artifacts.
Next, switch to the "Scalars" tab. You will see a plot of train and evaluation loss which will automatically update every 30 seconds. If you are not familiar with Tensorboard, now would be a good time to play with the UI elements to see how they work.
When training completes, the model will be automatically saved to the output directory ("./output_models/tiny_llama").
# Train model in notebook.
# Construct the default target, "main," which is a training script.
training_script = proj()
# Start training the model.
training_script.run()
# Release resources
training_script = None
[Rank 0] 2026-04-04 01:21:00,852 - forgather.ml.distributed - INFO - DistributedEnvironment(rank=0, local_rank=0, world_size=1, local_world_size=1, master_addr=localhost, master_port=29501, backend=cuda:nccl,cpu:gloo)
[Rank 0] 2026-04-04 01:21:01,351 - forgather.ml.distributed - INFO - init_process_group(('cuda:nccl,cpu:gloo', 'cuda:0'))
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[INFO|info_logger]
total_examples: 317,984
total_train_samples: 317,984
per_device_train_batch_size: 32
actual_per_device_batch_size: 32
total_train_batch_size: 32
max_steps: 9,937
total_parameters: 4.43M
trainable_parameters: 4.43M
embedding_parameters: 1.02M
non_embedding_parameters: 3.41M
tied_embeddings: False
2026-04-04 01:21:09 step epoch loss grad lr tokens total_tok peak_mem
2026-04-04 01:21:09 100 0.01006 6.58380 0.7372 2.00e-04 977,120 977K 1.650 GiB
2026-04-04 01:21:12 step epoch loss grad lr tokens total_tok tok/s peak_mem
2026-04-04 01:21:12 200 0.02013 4.05593 0.5052 4.00e-04 944,134 1.92M 341,301 1.650 GiB
2026-04-04 01:21:14 300 0.03019 3.24215 0.5045 6.00e-04 1,028,617 2.95M 469,800 1.650 GiB
2026-04-04 01:21:16 400 0.04025 2.91114 0.4818 8.00e-04 1,024,475 3.97M 468,352 1.650 GiB
2026-04-04 01:21:18 500 0.05032 2.60094 0.4355 1.00e-03 1,008,771 4.98M 476,413 1.650 GiB
2026-04-04 01:21:19 500 0.05 eval-loss: 2.43377
2026-04-04 01:21:21 600 0.06038 2.42284 0.4414 1.00e-03 985,370 5.97M 360,763 1.650 GiB
2026-04-04 01:21:23 700 0.07044 2.25030 0.4296 1.00e-03 980,241 6.95M 454,486 1.650 GiB
2026-04-04 01:21:26 800 0.08051 2.20510 0.4227 1.00e-03 1,023,598 7.97M 456,808 1.650 GiB
2026-04-04 01:21:28 900 0.09057 2.08785 0.4087 1.00e-03 960,137 8.93M 450,417 1.650 GiB
2026-04-04 01:21:30 1,000 0.1006 1.95900 0.4120 1.00e-03 987,536 9.92M 464,842 1.650 GiB
2026-04-04 01:21:30 1,000 0.1 eval-loss: 1.86901
2026-04-04 01:21:32 1,100 0.1107 1.96341 0.4126 1.00e-03 1,044,611 11M 408,652 1.650 GiB
2026-04-04 01:21:35 1,200 0.1208 1.93876 0.4174 1.00e-03 1,013,457 12M 463,764 1.650 GiB
2026-04-04 01:21:37 1,300 0.1308 1.90622 0.3983 9.99e-04 979,830 13M 460,834 1.650 GiB
2026-04-04 01:21:39 1,400 0.1409 1.88140 0.3971 9.99e-04 948,560 13.9M 449,250 1.650 GiB
2026-04-04 01:21:41 1,500 0.151 1.85862 0.3996 9.99e-04 982,137 14.9M 453,993 1.650 GiB
2026-04-04 01:21:41 1,500 0.15 eval-loss: 1.66707
2026-04-04 01:21:44 1,600 0.161 1.82062 0.3938 9.99e-04 984,588 15.9M 393,126 1.650 GiB
2026-04-04 01:21:46 1,700 0.1711 1.78733 0.3900 9.99e-04 1,000,625 16.9M 454,457 1.650 GiB
2026-04-04 01:21:48 1,800 0.1811 1.74879 0.4143 9.98e-04 985,353 17.9M 456,754 1.650 GiB
2026-04-04 01:21:50 1,900 0.1912 1.73327 0.3907 9.98e-04 1,021,953 18.9M 457,547 1.650 GiB
2026-04-04 01:21:52 2,000 0.2013 1.77327 0.3819 9.98e-04 1,008,901 19.9M 461,391 1.650 GiB
2026-04-04 01:21:53 2,000 0.2 eval-loss: 1.58695
2026-04-04 01:21:55 step epoch loss grad lr tokens total_tok tok/s peak_mem
2026-04-04 01:21:55 2,100 0.2113 1.73090 0.3771 9.97e-04 1,051,113 20.9M 394,401 1.650 GiB
2026-04-04 01:21:57 2,200 0.2214 1.68457 0.3779 9.97e-04 963,855 21.9M 454,356 1.650 GiB
2026-04-04 01:21:59 2,300 0.2315 1.65317 0.3875 9.97e-04 945,462 22.9M 450,080 1.650 GiB
2026-04-04 01:22:01 2,400 0.2415 1.71809 0.3927 9.96e-04 999,813 23.9M 459,945 1.650 GiB
2026-04-04 01:22:04 2,500 0.2516 1.68925 0.3982 9.96e-04 954,586 24.8M 446,709 1.650 GiB
2026-04-04 01:22:04 2,500 0.25 eval-loss: 1.51825
2026-04-04 01:22:06 2,600 0.2616 1.69354 0.3827 9.96e-04 970,955 25.8M 396,764 1.650 GiB
2026-04-04 01:22:08 2,700 0.2717 1.63991 0.3784 9.95e-04 978,655 26.8M 454,486 1.650 GiB
2026-04-04 01:22:10 2,800 0.2818 1.68507 0.3710 9.95e-04 933,262 27.7M 441,964 1.650 GiB
2026-04-04 01:22:12 2,900 0.2918 1.60311 0.3822 9.94e-04 950,385 28.6M 453,637 1.650 GiB
2026-04-04 01:22:14 3,000 0.3019 1.51511 0.3624 9.94e-04 954,209 29.6M 461,409 1.650 GiB
2026-04-04 01:22:15 3,000 0.3 eval-loss: 1.43163
2026-04-04 01:22:17 3,100 0.312 1.59901 0.3798 9.93e-04 1,015,095 30.6M 417,453 1.650 GiB
2026-04-04 01:22:19 3,200 0.322 1.67280 0.3619 9.93e-04 945,116 31.6M 441,121 1.650 GiB
2026-04-04 01:22:21 3,300 0.3321 1.58023 0.3551 9.92e-04 1,002,202 32.6M 463,402 1.650 GiB
2026-04-04 01:22:23 3,400 0.3422 1.52237 0.3686 9.92e-04 1,042,299 33.6M 470,880 1.650 GiB
2026-04-04 01:22:26 3,500 0.3522 1.54660 0.3553 9.91e-04 1,098,521 34.7M 479,668 1.650 GiB
2026-04-04 01:22:26 3,500 0.35 eval-loss: 1.41447
2026-04-04 01:22:28 3,600 0.3623 1.60164 0.3595 9.91e-04 972,015 35.7M 405,238 1.650 GiB
2026-04-04 01:22:30 3,700 0.3723 1.52281 0.3566 9.90e-04 1,016,106 36.7M 461,445 1.650 GiB
2026-04-04 01:22:32 3,800 0.3824 1.53126 0.3587 9.89e-04 992,352 37.7M 461,904 1.650 GiB
2026-04-04 01:22:35 3,900 0.3925 1.57302 0.3533 9.89e-04 1,051,584 38.7M 465,097 1.650 GiB
2026-04-04 01:22:37 4,000 0.4025 1.60683 0.3590 9.88e-04 983,507 39.7M 455,258 1.650 GiB
2026-04-04 01:22:37 4,000 0.4 eval-loss: 1.38668
2026-04-04 01:22:39 step epoch loss grad lr tokens total_tok tok/s peak_mem
2026-04-04 01:22:39 4,100 0.4126 1.50775 0.3392 9.87e-04 1,007,990 40.7M 381,465 1.650 GiB
2026-04-04 01:22:42 4,200 0.4227 1.49784 0.3532 9.87e-04 1,055,407 41.8M 473,016 1.650 GiB
2026-04-04 01:22:44 4,300 0.4327 1.53139 0.3529 9.86e-04 1,035,389 42.8M 467,892 1.650 GiB
2026-04-04 01:22:46 4,400 0.4428 1.58109 0.3484 9.85e-04 976,315 43.8M 451,903 1.650 GiB
2026-04-04 01:22:48 4,500 0.4529 1.51753 0.3416 9.84e-04 990,970 44.8M 453,052 1.650 GiB
2026-04-04 01:22:49 4,500 0.45 eval-loss: 1.37496
2026-04-04 01:22:51 4,600 0.4629 1.45882 0.3379 9.84e-04 1,048,563 45.8M 412,760 1.650 GiB
2026-04-04 01:22:53 4,700 0.473 1.46760 0.3439 9.83e-04 991,809 46.8M 458,568 1.650 GiB
2026-04-04 01:22:55 4,800 0.483 1.49624 0.3354 9.82e-04 988,559 47.8M 462,623 1.650 GiB
2026-04-04 01:22:57 4,900 0.4931 1.49080 0.3225 9.81e-04 1,082,177 48.9M 468,770 1.650 GiB
2026-04-04 01:23:00 5,000 0.5032 1.50867 0.3436 9.80e-04 1,051,959 49.9M 466,356 1.650 GiB
2026-04-04 01:23:00 5,000 0.5 eval-loss: 1.36931
2026-04-04 01:23:02 5,100 0.5132 1.50543 0.3394 9.79e-04 1,012,680 51M 402,136 1.650 GiB
2026-04-04 01:23:04 5,200 0.5233 1.42941 0.3320 9.78e-04 1,032,022 52M 469,334 1.650 GiB
2026-04-04 01:23:07 5,300 0.5334 1.42544 0.3319 9.77e-04 1,154,141 53.1M 499,034 1.650 GiB
2026-04-04 01:23:09 5,400 0.5434 1.45799 0.3352 9.77e-04 962,468 54.1M 448,821 1.650 GiB
2026-04-04 01:23:11 5,500 0.5535 1.44030 0.3290 9.76e-04 1,059,498 55.2M 469,190 1.650 GiB
2026-04-04 01:23:11 5,500 0.55 eval-loss: 1.32449
2026-04-04 01:23:13 5,600 0.5636 1.47963 0.3336 9.75e-04 941,818 56.1M 405,129 1.650 GiB
2026-04-04 01:23:16 5,700 0.5736 1.49998 0.3277 9.74e-04 1,017,520 57.1M 464,383 1.650 GiB
2026-04-04 01:23:18 5,800 0.5837 1.46225 0.3317 9.73e-04 988,511 58.1M 446,926 1.650 GiB
2026-04-04 01:23:20 5,900 0.5937 1.47305 0.3187 9.72e-04 948,443 59.1M 447,260 1.650 GiB
2026-04-04 01:23:22 6,000 0.6038 1.41290 0.3386 9.70e-04 1,040,889 60.1M 472,924 1.650 GiB
2026-04-04 01:23:22 6,000 0.6 eval-loss: 1.33538
2026-04-04 01:23:25 step epoch loss grad lr tokens total_tok tok/s peak_mem
2026-04-04 01:23:25 6,100 0.6139 1.39521 0.3352 9.69e-04 1,056,800 61.2M 389,587 1.650 GiB
2026-04-04 01:23:27 6,200 0.6239 1.44381 0.3303 9.68e-04 991,685 62.1M 450,036 1.650 GiB
2026-04-04 01:23:29 6,300 0.634 1.41872 0.3254 9.67e-04 985,151 63.1M 459,805 1.650 GiB
2026-04-04 01:23:31 6,400 0.6441 1.41018 0.3288 9.66e-04 1,044,800 64.2M 470,438 1.650 GiB
2026-04-04 01:23:34 6,500 0.6541 1.42908 0.3282 9.65e-04 998,991 65.2M 459,600 1.650 GiB
2026-04-04 01:23:34 6,500 0.65 eval-loss: 1.29949
2026-04-04 01:23:36 6,600 0.6642 1.37279 0.3078 9.64e-04 1,016,934 66.2M 411,330 1.650 GiB
2026-04-04 01:23:38 6,700 0.6742 1.43771 0.3318 9.63e-04 1,010,788 67.2M 460,772 1.650 GiB
2026-04-04 01:23:40 6,800 0.6843 1.42978 0.3293 9.61e-04 971,246 68.2M 456,110 1.650 GiB
2026-04-04 01:23:43 6,900 0.6944 1.48542 0.3158 9.60e-04 984,441 69.2M 454,830 1.650 GiB
2026-04-04 01:23:45 7,000 0.7044 1.38269 0.3072 9.59e-04 957,179 70.1M 453,930 1.650 GiB
2026-04-04 01:23:45 7,000 0.7 eval-loss: 1.27091
2026-04-04 01:23:47 7,100 0.7145 1.41573 0.3055 9.58e-04 988,431 71.1M 402,458 1.650 GiB
2026-04-04 01:23:49 7,200 0.7246 1.38870 0.2973 9.56e-04 1,129,072 72.2M 485,870 1.650 GiB
2026-04-04 01:23:52 7,300 0.7346 1.37279 0.2987 9.55e-04 1,022,653 73.3M 458,701 1.650 GiB
2026-04-04 01:23:54 7,400 0.7447 1.45281 0.3209 9.54e-04 974,351 74.2M 448,924 1.650 GiB
2026-04-04 01:23:56 7,500 0.7548 1.39657 0.3067 9.52e-04 1,063,663 75.3M 474,791 1.650 GiB
2026-04-04 01:23:56 7,500 0.75 eval-loss: 1.27161
2026-04-04 01:23:59 7,600 0.7648 1.47150 0.3079 9.51e-04 1,020,960 76.3M 408,489 1.650 GiB
2026-04-04 01:24:01 7,700 0.7749 1.39058 0.3175 9.50e-04 981,170 77.3M 457,247 1.650 GiB
2026-04-04 01:24:03 7,800 0.7849 1.37954 0.2828 9.48e-04 1,151,773 78.4M 483,215 1.650 GiB
2026-04-04 01:24:05 7,900 0.795 1.36868 0.3028 9.47e-04 935,597 79.4M 452,097 1.650 GiB
2026-04-04 01:24:07 8,000 0.8051 1.40743 0.3113 9.46e-04 971,612 80.4M 455,085 1.650 GiB
2026-04-04 01:24:08 8,000 0.81 eval-loss: 1.2793
2026-04-04 01:24:10 step epoch loss grad lr tokens total_tok tok/s peak_mem
2026-04-04 01:24:10 8,100 0.8151 1.37860 0.3200 9.44e-04 1,029,545 81.4M 391,270 1.650 GiB
2026-04-04 01:24:12 8,200 0.8252 1.38861 0.3020 9.43e-04 1,037,570 82.4M 462,757 1.650 GiB
2026-04-04 01:24:14 8,300 0.8353 1.46832 0.3080 9.41e-04 990,171 83.4M 459,060 1.650 GiB
2026-04-04 01:24:16 8,400 0.8453 1.43619 0.3017 9.40e-04 932,435 84.3M 440,667 1.650 GiB
2026-04-04 01:24:19 8,500 0.8554 1.43465 0.3025 9.38e-04 1,013,888 85.4M 459,825 1.650 GiB
2026-04-04 01:24:19 8,500 0.86 eval-loss: 1.26052
2026-04-04 01:24:21 8,600 0.8655 1.37197 0.3108 9.37e-04 1,077,645 86.4M 431,321 1.650 GiB
2026-04-04 01:24:23 8,700 0.8755 1.40056 0.2922 9.35e-04 1,029,604 87.5M 456,465 1.650 GiB
2026-04-04 01:24:26 8,800 0.8856 1.45904 0.2902 9.34e-04 937,325 88.4M 439,829 1.650 GiB
2026-04-04 01:24:28 8,900 0.8956 1.41836 0.2927 9.32e-04 959,789 89.4M 444,074 1.650 GiB
2026-04-04 01:24:30 9,000 0.9057 1.42501 0.2959 9.30e-04 968,877 90.3M 448,206 1.650 GiB
2026-04-04 01:24:30 9,000 0.91 eval-loss: 1.25562
2026-04-04 01:24:32 9,100 0.9158 1.33308 0.2859 9.29e-04 1,073,274 91.4M 418,861 1.650 GiB
2026-04-04 01:24:35 9,200 0.9258 1.35947 0.2927 9.27e-04 984,851 92.4M 451,436 1.650 GiB
2026-04-04 01:24:37 9,300 0.9359 1.33587 0.3225 9.26e-04 963,270 93.4M 449,404 1.650 GiB
2026-04-04 01:24:39 9,400 0.946 1.35189 0.2908 9.24e-04 1,104,682 94.5M 474,695 1.650 GiB
2026-04-04 01:24:41 9,500 0.956 1.38012 0.2939 9.22e-04 967,028 95.4M 445,316 1.650 GiB
2026-04-04 01:24:41 9,500 0.96 eval-loss: 1.2317
2026-04-04 01:24:44 9,600 0.9661 1.36762 0.2951 9.21e-04 994,942 96.4M 413,732 1.650 GiB
2026-04-04 01:24:46 9,700 0.9761 1.33956 0.2964 9.19e-04 1,005,387 97.4M 460,445 1.650 GiB
2026-04-04 01:24:48 9,800 0.9862 1.36810 0.3001 9.17e-04 994,680 98.4M 461,010 1.650 GiB
2026-04-04 01:24:50 9,900 0.9963 1.42175 0.2850 9.15e-04 1,005,062 99.4M 453,116 1.650 GiB
2026-04-04 01:24:51,470 - forgather.ml.trainer.checkpoint_manager - INFO - Saving checkpoint at ./output_models/tiny_llama/checkpoints/checkpoint-9937
2026-04-04 01:24:51,553 - forgather.ml.trainer.checkpoint_coordinator - INFO - Saving checkpoint at ./output_models/tiny_llama/checkpoints/checkpoint-9937
2026-04-04 01:24:51,554 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving GLOBAL component 'optimizer' to ./output_models/tiny_llama/checkpoints/checkpoint-9937/optimizer_state.pt
2026-04-04 01:24:51,621 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving GLOBAL component 'scheduler' to ./output_models/tiny_llama/checkpoints/checkpoint-9937/scheduler_state.pt
2026-04-04 01:24:51,621 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving GLOBAL component 'trainer' to ./output_models/tiny_llama/checkpoints/checkpoint-9937/trainer_state.pt
2026-04-04 01:24:51,622 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving GLOBAL component 'dataset' to ./output_models/tiny_llama/checkpoints/checkpoint-9937/dataset_state.pt
2026-04-04 01:24:51,623 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving PER_RANK component 'rng' (rank 0) to ./output_models/tiny_llama/checkpoints/checkpoint-9937/rng_state_rank_0.pt
2026-04-04 01:24:51,624 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saved checkpoint manifest to ./output_models/tiny_llama/checkpoints/checkpoint-9937/checkpoint_manifest.json
2026-04-04 01:24:51,625 - forgather.ml.trainer.checkpoint_manager - DEBUG - Saved 2 callback states
2026-04-04 01:24:51,626 - forgather.ml.trainer.checkpoint_manager - DEBUG - Saved checkpoint metadata to ./output_models/tiny_llama/checkpoints/checkpoint-9937/checkpoint_metadata.pt
Server thread error: Event loop stopped before Future completed.
2026-04-04 01:24:51 Training complete:
Runtime: 224.03 s
Total steps: 9,936
Total samples: 317,952
Effective batch size: 32
Samples/sec: 1419.239
Steps/sec: 44.351
Epoch: 1
Total tokens: 99,425,556
Tokens/sec: 443,805
Total FLOPs: 2.034e+15
FLOPs/sec: 9.081e+12
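The lr column in the training log is consistent with a linear warmup over warmup_steps followed by a cosine decay toward constant_lr over cooldown_steps, using the values from the preprocessed config. The sketch below encodes that reading. It is inferred from the logged numbers only, not taken from Forgather's InfiniteLRScheduler, whose exact formula and step indexing may differ; the function name `sketch_lr` is hypothetical.

```python
import math


def sketch_lr(
    step: int,
    peak_lr: float = 1.0e-3,       # optimizer lr from the config
    warmup_steps: int = 500,       # from the config
    cooldown_steps: int = 50_000,  # from the config
    constant_lr: float = 1.0e-6,   # from the config
) -> float:
    """Assumed schedule shape: linear warmup, then cosine decay to constant_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the cooldown completed, clamped so the rate stays at constant_lr.
    progress = min((step - warmup_steps) / cooldown_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return constant_lr + (peak_lr - constant_lr) * cosine


# Compare against a few lr values copied from the training log.
logged = {100: "2.00e-04", 2000: "9.98e-04", 5000: "9.80e-04", 9900: "9.15e-04"}
for step, expected in logged.items():
    print(step, f"{sketch_lr(step):.2e}", "logged:", expected)
```

Because the cooldown spans 50,000 steps and this run stops just before step 10,000, the learning rate only falls by about 8% from its peak during the epoch.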
Test a Model¶
Forgather has a quick way of testing a model with a dataset definition's test_dataset target.
From the shell:
forgather eval test tinystories
...
========================================================================
Evaluation: tinystories (Eval: TinyStories)
------------------------------------------------------------------------
Model: /home/dinalt/rust/forgather/examples/tutorials/tiny_llama/output_models/tiny_llama
Dataset: /home/dinalt/rust/forgather/examples/datasets/roneneldan [tinystories-packed.yaml]
Target: test_dataset
Trainer: ddp (world_size=1)
Batch size: 16 max_length=2048
Dtype: bfloat16 attn=sdpa
========================================================================
...
========================================================================
eval_loss: 1.462601
perplexity: 4.3172
wall_time: 5.84 s
========================================================================
By default, it will test the model in the current configuration's output directory on the latest checkpoint, but you can point it to an arbitrary model directory with --model /path/to/model.
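The perplexity in the evaluation output is the exponential of eval_loss, assuming the loss is the mean per-token cross-entropy in nats. A quick check using the numbers printed above:

```python
import math

# eval_loss copied from the evaluation output; assumed to be mean cross-entropy in nats.
eval_loss = 1.462601
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.4f}")
```

This agrees with the reported perplexity to the printed precision.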
Load Trained Model¶
You can use the regular HF APIs to load the saved model and tokenizer.
from forgather.project import Project
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig, StoppingCriteria
from forgather.ml.sharded_checkpoint import create_pretrained_symlinks
import torch
model_path = "./output_models/tiny_llama"
# Create symlinks to latest checkpoint model output directory
# This is required for .from_pretrained() to find the latest checkpoint.
create_pretrained_symlinks(model_path)
# Set device to run inference on
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
The equivalent CLI for creating symbolic links to the latest checkpoint in the output directory is:
forgather checkpoint link
Text Generation¶
This loop will use the newly trained model to generate text, seeded with the prompts listed below.
import torch

def generate_text(model, tokenizer, prompts, gen_config, max_new_tokens, device):
    """Yield one sampled continuation per prompt."""
    model.to(device)
    model.eval()
    with torch.inference_mode():
        for prompt in prompts:
            tokenizer_outputs = tokenizer(
                [prompt],
                truncation=False,
                return_length=True,
                return_tensors="pt",
                return_attention_mask=True,
            )
            input_ids = tokenizer_outputs["input_ids"].to(device)
            attention_mask = tokenizer_outputs["attention_mask"].to(device)
            outputs = model.generate(
                input_ids,
                attention_mask=attention_mask,
                generation_config=gen_config,
                # Honor the function argument; previously it was accepted but unused.
                max_new_tokens=max_new_tokens,
                past_key_values=None,
            )
            output_text = tokenizer.decode(
                outputs.sequences[0],
                skip_special_tokens=True,
            )
            # Mark where the prompt ends and the generated text begins.
            yield prompt + " [START] " + output_text[len(prompt) + 1 :]

prompts = [
    'Alice was so tired when she got back home so she went',
    'Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was',
    'Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, "Look, Lily. A rainbow has',
    'Jack wanted to read a book, so he went to',
    '"Can cows fly?" Alice asked her mother.',
]

gen_config = GenerationConfig(
    pad_token_id=model.config.pad_token_id,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.7,
    # Fixed spelling: the misspelled keyword was not applied as a repetition penalty.
    repetition_penalty=1.15,
    max_new_tokens=100,
    return_dict_in_generate=True,
)

# Use the device selected when the model was loaded instead of hard-coding "cuda:0".
for s in generate_text(model, tokenizer, prompts, gen_config, 100, device):
    print(s)
    print(f"{'-' * 40}")
Alice was so tired when she got back home so she went [START] to bed. She looked around and saw a big table. It was tidy and she wanted to take it home. She was so excited and ran to her mom to show her. Mom got the table and asked, "What are you tidying?" Alice smiled and said, "I want to go to bed." Mom
----------------------------------------
Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was [START] full of different colors. One day, Jack was walking towards the moon when he saw a big, shiny, green cat. He wanted to play with it. He asked Lily if she wanted to play with it. Lily said yes and gave him a big hug. Jack smiled and said, "I want to play with the cat!"
----------------------------------------
Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, "Look, Lily. A rainbow has [START] bright blue hair." Lily's eyes lit up. "I want to put on the rainbow," she said. Jack said, "No, I want to put the rainbow in the sky. It might burn you." Lily and Jack were very sad. They did not want to go back to the rainbow. They kept w
----------------------------------------
Jack wanted to read a book, so he went to [START] the park. He saw a big, shiny pencil and he wanted it. He opened it, and he saw a big pencil. He was so excited! He wanted to take it home, but it was too high for him to reach. Jack tried to grab the pencil, but he was too small. He pulled and pushed, and the p
----------------------------------------
"Can cows fly?" Alice asked her mother. [START] "Yes," her mother replied. "Let's go to the market and get some coins." So they went to the market and started to chew it. The market was tall and tall. "Look, Mommy!" she said. "We can't catch it!" Her mother smiled and said, "Yes,
----------------------------------------
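To make the generation settings less opaque, here is a minimal pure-Python sketch of how a repetition penalty, temperature, and top-k filtering reshape next-token probabilities before sampling. It follows the commonly used order (penalty, then temperature, then top-k) and omits top_p for brevity. It is an illustration only, not the code path used by model.generate(); the function name `next_token_probs` and the toy logits are hypothetical.

```python
import math
import random
from typing import List, Sequence


def next_token_probs(
    logits: Sequence[float],
    generated: Sequence[int],
    temperature: float = 0.7,
    top_k: int = 20,
    repetition_penalty: float = 1.15,
) -> List[float]:
    """Return sampling probabilities after penalty, temperature, and top-k."""
    scores = list(logits)
    # Repetition penalty: push already-generated tokens toward lower scores.
    for tok in set(generated):
        if scores[tok] > 0:
            scores[tok] /= repetition_penalty
        else:
            scores[tok] *= repetition_penalty
    # Temperature below 1 sharpens the distribution.
    scores = [s / temperature for s in scores]
    # Top-k: keep the k highest scores (ties at the cutoff are all kept).
    k = min(top_k, len(scores))
    cutoff = sorted(scores, reverse=True)[k - 1]
    scores = [s if s >= cutoff else -math.inf for s in scores]
    # Numerically stable softmax over the surviving scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of four tokens; token 0 has already been generated.
probs = next_token_probs([2.0, 1.0, 0.5, -1.0], generated=[0], top_k=2)
rng = random.Random(0)
token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, token)
```

With top_k=2 only the two highest-scoring tokens keep non-zero probability, which is why sampled stories stay on topic even though sampling is random.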
Train Hugging Face Llama Model¶
Next, let's try training a Llama model using the Hugging Face implementation.
Train the model on the CLI
forgather -t train_hf_llama.yaml train
nb.display_config(config_template="train_hf_llama.yaml", show_pp_config=True, show_generated_code=False)
Let's See What Happens...¶
...if we replace the post-layer-norm implementation with a pre-layer-norm implementation. This configuration uses a custom model definition in the custom_models directory.
forgather -t experimental_llama.yaml train
Test Model With the Inference Server¶
There is a simple OpenAI-compatible inference server implementation in "tools/inference_server".
To host your newly trained model on the inference server:
# Manual start
forgather inf server -c -m ./output_models/tiny_llama/
# Config with YAML file
forgather inf server ./tiny_llama_server.yaml
From another session, you can perform text completion like this:
# Text completion request
forgather inf client --completion "Once upon a time"
# With manual generation settings
forgather inf client --temperature 0.7 --no-repeat-ngram-size 2 --repetition-penalty 1.2 --top-k 40 --completion "Once upon a time" --max-tokens 512
# From YAML config
forgather inf client ./tiny_llama_client.yaml --completion "Once upon a time"
As the model has not been trained on a chat format, it will not be very good at it, but you can try with:
forgather inf client ./tiny_llama_client.yaml
This server should work with other OpenAI-compatible clients as well.
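As a rough illustration of what another client would send, the sketch below builds a text-completion request using only the Python standard library. It assumes the server exposes the standard OpenAI path "/v1/completions"; the base URL, port, and model name are placeholders, not values taken from this project, so check the server's startup output for the real address.

```python
import json
import urllib.request
from typing import Any, Dict


def build_completion_request(
    base_url: str, prompt: str, max_tokens: int = 64, temperature: float = 0.7
) -> urllib.request.Request:
    """Build a POST request in the OpenAI text-completion format."""
    payload: Dict[str, Any] = {
        "model": "tiny_llama",  # placeholder model name (assumption)
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, prompt: str) -> str:
    """Send the request and return the first completion's text."""
    request = build_completion_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# The address below is hypothetical; nothing is sent until complete() is called.
request = build_completion_request("http://localhost:8000", "Once upon a time")
print(request.full_url)
```

With the server running, `complete("http://localhost:8000", "Once upon a time")` would return the generated text, provided the placeholder address matches your setup.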
Train on the Full Dataset¶
The examples so far have been limited to training on only the first 10% of the dataset. You can train on the complete dataset with this configuration:
forgather -t full_dataset.yaml train
Try the new v2 Configuration¶
A newer, roughly equivalent configuration is also available; it uses the new LM training template.
This version defaults to using DDP and makes use of token-packing to improve throughput by wasting less space on pad tokens. Note that DDP does not work in a notebook, so you will need to run it from the shell.
forgather -t v2.yaml train
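To show why token-packing reduces padding, the sketch below packs tokenized sequences into fixed-size blocks with a simple greedy first-fit strategy and compares the fraction of padded slots with and without packing. This is an illustration of the general idea only, not the packing used by the v2 configuration; the function names, block size, and toy sequences are hypothetical.

```python
from typing import List


def pack_sequences(sequences: List[List[int]], block_size: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing of sequences into blocks."""
    blocks: List[List[int]] = []
    for seq in sorted(sequences, key=len, reverse=True):
        seq = seq[:block_size]  # truncate anything longer than one block
        for block in blocks:
            if len(block) + len(seq) <= block_size:
                block.extend(seq)
                break
        else:
            # No existing block has room, so start a new one.
            blocks.append(list(seq))
    return blocks


def pad_fraction(rows: List[List[int]], block_size: int) -> float:
    """Fraction of slots that would be padding if every row is padded to block_size."""
    used = sum(len(row) for row in rows)
    return 1.0 - used / (len(rows) * block_size)


# Toy token sequences of lengths 6, 5, 3, and 2 with a block size of 8.
sequences = [[1] * 6, [2] * 5, [3] * 3, [4] * 2]
packed = pack_sequences(sequences, block_size=8)
print("unpacked pad fraction:", pad_fraction(sequences, 8))
print("packed pad fraction:", pad_fraction(packed, 8))
```

A real implementation also has to keep packed documents from attending to one another, for example with separator tokens or a block-diagonal attention mask, which this sketch leaves out.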