Project Index

In [1]:

import forgather.nb.notebooks as nb
nb.display_project_index(show_available_templates=True)

Tiny Llama

In this tutorial we will train a very small Llama model (about 4.4M parameters) on 10% of the Tiny Stories dataset. On a single RTX-4090, this takes about three minutes. Once training is complete, we will load the model and use it for text generation -- and the generation will be reasonably coherent for a three-minute-old model.

Project Directory: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama"

Meta Config

Meta Config: /home/dinalt/rust/forgather/examples/tutorials/tiny_llama/meta.yaml

meta.yaml
- meta_defaults.yaml
  - base_directories.yaml

Template Search Paths:

Available Configurations

Default Configuration: train_tiny_llama.yaml

Available Templates

This example makes extensive use of the Forgather templates library. Take a look at the various files which go into the configuration and compare these to the pre-processed output.

In [ ]:

nb.display_config(config_template="train_tiny_llama.yaml", show_pp_config=True, show_generated_code=False)

Included Templates

configs/train_tiny_llama.yaml
- project.yaml
  - project.logger_config
    - callbacks/loggers.yaml
      - callbacks/base_callbacks.yaml
        
        inc/formatting.jinja
  - training_script/causal_lm/causal_lm.yaml
    - trainers/trainer.yaml
      - trainers/base_trainer.yaml
    - training_script/training_script_type.yaml
      - config_type.yaml
        
        base_directories.yaml

Config Metadata:

{'config_class': 'type.training_script.causal_lm',
 'config_description': 'A demo of training a tiny llama model from scratch',
 'config_name': 'Tiny Llama',
 'datasets_dir': '/home/dinalt/rust/forgather/datasets',
 'forgather_dir': '/home/dinalt/rust/forgather',
 'logging_dir': './output_models/tiny_llama/runs/log_2026-04-04T01-20-47',
 'model_src_dir': '/home/dinalt/rust/forgather/model_src',
 'models_dir': './output_models',
 'nproc_per_node': 1,
 'output_dir': './output_models/tiny_llama',
 'project_dir': '.',
 'tokenizers_dir': '/home/dinalt/rust/forgather/tokenizers',
 'workspace_root': '/home/dinalt/rust/forgather'}

Modules

Output Targets

distributed_env
tokenizer
model
tokenizer_args
train_dataset
eval_dataset
data_collator
experiment_info
generation_config
text_gen_callback_args
trainer_callbacks
optimizer
lr_scheduler
trainer_args
model_preprocessor
trainer
dynamic_args
meta
main

Preprocessed Config

#---------------------------------------
#               Tiny Llama               
#---------------------------------------
# 2026-04-04T01:20:47+00:00
# Description: A demo of training a tiny llama model from scratch
# Project Dir: /home/dinalt/rust/forgather/examples/tutorials/tiny_llama
# Current Working Dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama"
# Forgather Config Dir: "/home/dinalt/.config/forgather"
# Model: tiny_llama
# Hostname: hal9000
# Versions:
#     python: 3.12.3
#     torch: 2.10.0
#     transformers: 5.1.0
#     accelerate: 1.12.0

############# Config Vars ##############

# ns.forgather_dir: "/home/dinalt/rust/forgather"
# ns.models_dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama/output_models"
# ns.project_model_src_dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama/model_src"
# ns.tokenizers_dir: "/home/dinalt/rust/forgather/tokenizers"
# ns.datasets_dir: "/home/dinalt/rust/forgather/datasets"
# ns.model_src_dir: "/home/dinalt/rust/forgather/model_src"
# ns.output_dir: "./output_models/tiny_llama"
# ns.logging_dir: "./output_models/tiny_llama/runs/log_2026-04-04T01-20-47"
# ns.nproc_per_node: 1

####### Distributed Environment ########

distributed_env: &distributed_env !singleton:forgather.ml.distributed:DistributedEnvironment@distributed_env
    backend: cuda:nccl,cpu:gloo
    no_accelerator: False

############# Dependencies #############



################ Model #################

# https://huggingface.co/docs/transformers/en/model_doc/auto
.define: &model_constructor_args
    # See: https://huggingface.co/docs/transformers/en/attention_interface
    attn_implementation: "sdpa"

.define: &model_dict !call:forgather:from_project
    project_dir: "/home/dinalt/rust/forgather/examples/models/llama"
    config_template: "4M.yaml"
    targets: [ "pretrained_tokenizer", "model" ] 
    pp_kwargs:
        output_dir: "./output_models/tiny_llama"
    pp_debug: False
    model_constructor_args: *model_constructor_args

tokenizer: &tokenizer !call:getitem [ *model_dict, 'pretrained_tokenizer' ]
model: &model !call:getitem [ *model_dict, 'model' ]

############### Datasets ###############

tokenizer_args: &tokenizer_args !dict
    truncation: True
    max_length: 512    

.define: &dataset_dict !call:forgather:from_project
    project_dir: "/home/dinalt/rust/forgather/examples/datasets/roneneldan"
    config_template: "tinystories-abridged.yaml"
    targets: [ "train_dataset", "eval_dataset" ] 
    preprocess_args: *tokenizer_args
    tokenizer: *tokenizer

train_dataset: &train_dataset !call:getitem [ *dataset_dict, 'train_dataset' ]
eval_dataset: &eval_dataset !call:getitem [ *dataset_dict, 'eval_dataset' ]

############ Data Collator #############

# Data collator for causal model
# Batches are dynamically padded to longest sequence
# labels are set to input_ids, with pad tokens set to -100
data_collator: &data_collator !singleton:forgather.ml.data_collator:DataCollatorForCausalLM@DataCollatorForCausalLM
    tokenizer: *tokenizer
    return_tensors: pt

    # Tiny Llama
    truncation: True
    max_length: 512

########## Trainer Callbacks ###########

# **Dependencies**

.define: &step_columns !dict

.define: &final_metrics !dict

.define: &peak_hardware_flops null

# Resumable TensorBoard SummaryWriter wrapper
.define: &summary_writer !singleton:forgather.ml.trainer.callbacks:ResumableSummaryWriter
    log_dir: "./output_models/tiny_llama/runs/log_2026-04-04T01-20-47"

.define: &tb_scalars !dict

# Additional data to record to experiment loggers
experiment_info: &experiment_info !dict:@experiment_info
    date: "2026-04-04T01:20:47+00:00"
    name: "Tiny Llama"
    description: "A demo of training a tiny llama model from scratch"
    config: !var "pp_config"
    versions: {'python': '3.12.3', 'torch': '2.10.0', 'transformers': '5.1.0', 'accelerate': '1.12.0'}





generation_config: &generation_config !dict:@generation_config
    do_sample: True
    top_k: 20
    temperature: 0.7
    repetition_penalty: 1.15

text_gen_callback_args: &text_gen_callback_args
    summary_writer: *summary_writer
    prompts: /home/dinalt/rust/forgather/prompts/tiny_stories.yaml
    generation_config: *generation_config
    max_new_tokens: 40
    generation_steps: 2000


# **Callback List**

trainer_callbacks: &trainer_callbacks !dlist:@trainer_callbacks
    default_metrics: !singleton:forgather.ml.trainer.callbacks:DefaultMetrics
        peak_hardware_flops: *peak_hardware_flops
    progress_callback: !singleton:forgather.ml.trainer.callbacks:ProgressCallback
        use_tqdm: null # Optional[bool] : Use TQDM, Auto, if unspecified
        output_stream: "stdout" #Literal["stderr", "stdout"]
        step_columns: *step_columns
        final_metrics: *final_metrics
    info_callback: !singleton:forgather.ml.trainer.callbacks:InfoCallback
        verbose: False
    # ResumableSummaryWriter registered as callback for checkpoint state persistence
    resumable_writer: *summary_writer

    # Log all training output to JSON
    json_logger: !singleton:forgather.ml.trainer.callbacks:JsonLogger
        <<: *experiment_info

    # Log configuration and metrics to Tensorboard file
    tb_logger: !singleton:forgather.ml.trainer.callbacks:TBLogger
        summary_writer: *summary_writer
        scalars: *tb_scalars
        experiment_info: *experiment_info

    text_gen_callback: !singleton:forgather.ml.trainer.callbacks:TextgenCallback
        <<: *text_gen_callback_args
    # Allow remote control of the training process
    trainer_control: !singleton:forgather.ml.trainer.callbacks:TrainerControlCallback

############## Optimizer ###############

optimizer: &optimizer !partial:torch:optim.AdamW
    lr: 1.0e-3

############# LR Scheduler #############

# https://arxiv.org/html/2503.02844v1
lr_scheduler: &lr_scheduler !lambda:forgather.ml.optim.infinite_lr_scheduler:InfiniteLRScheduler@lr_scheduler
    warmup_steps: 500
    cooldown_steps: 50000
    constant_lr: 1.0e-6

############# Trainer Args #############

trainer_args: &trainer_args !dict
    save_strategy: "no"
    max_steps: -1
    output_dir: "./output_models/tiny_llama"
    logging_dir: "./output_models/tiny_llama/runs/log_2026-04-04T01-20-47"

    # Tiny Llama Project Overrides
    eval_strategy: "steps"
    save_strategy: "steps"
    save_steps: 10000
    # Safetensors can't handle tied parameters/buffers, so fallback to PyTorch format.
    save_safetensors: False
    seed: 42
    per_device_train_batch_size: 32
    per_device_eval_batch_size: 64
    logging_steps: 100
    eval_steps: 500
    num_train_epochs: 1
    dataloader_num_workers: 1

# **Trainer Dependencies**

.define: &fused_loss_factory null    

############### Trainer ################

# Name: Forgather Trainer
# Description: A lightweight, extensible trainer; does not support multiple GPUs
# Trainer Config Class: forgather.ml.trainer:TrainingArguments
# Trainer Class: forgather.ml.trainer:Trainer

# **Trainer Dependencies**



model_preprocessor: &model_preprocessor !partial:call
    - *model

# **Trainer Constructor**

trainer: &trainer !singleton:forgather.ml.trainer:Trainer@trainer
    args: *trainer_args
    model_init: *model_preprocessor
    data_collator: *data_collator
    train_dataset: *train_dataset
    eval_dataset: *eval_dataset
    processing_class: *tokenizer
    callbacks: *trainer_callbacks
    
    # **Trainer**
    compute_loss_func: !singleton:forgather.ml.loss:CausalLoss
    distributed_env: *distributed_env
    optimizer_factory: *optimizer
    lr_scheduler_factory: *lr_scheduler
    fused_loss_factory: *fused_loss_factory

# **Dynamic Args**
dynamic_args: !dlist
    null: ~
    max_steps:
        names: "--max-steps"
        type: "int"
        help: "Set maximum training steps"
    save_strategy:
        names: [ "--save-strategy", "-S" ]
        choices: [ "no", "steps", "epoch" ]
        type: "str"
        help: "When to save checkpoints"
    no_accelerator:
        names: "--no-accelerator"
        action: "store_true"
        help: "Disable use of accelerator, when available. e.g. 'don't use GPU'"
    dist_backend:
        names: "--dist-backend"
        help: "The name of the torch-distributed backend to use"
    output_dir:
        names: "--output-dir"
        type: path
        help: "Overrides default model output directory path"
    model_name:
        names: "--model-name"
        help: "Set custom model name"
    log_name:
        names: "--log-name"
        help: "Set log name prefix"
    attn_implementation:
        names: "--attn-implementation"
        type: "str"
        choices: [ "eager", "sdpa", "flash_attention_2", "flex_attention" ]
        help: "Attention implementation"
    verbose_info:
        names: "--verbose-info"
        action: "store_true"
        help: "Display verbose training info on startup"

#---------------------------------------
#          Configuration Output          
#---------------------------------------
meta: &meta_output !dict:@meta
    config_name: "Tiny Llama"
    config_description: "A demo of training a tiny llama model from scratch"
    config_class: "type.training_script.causal_lm"
    project_dir: "."
    workspace_root: "/home/dinalt/rust/forgather"
    forgather_dir: "/home/dinalt/rust/forgather"
    models_dir: "./output_models"
    tokenizers_dir: "/home/dinalt/rust/forgather/tokenizers"
    datasets_dir: "/home/dinalt/rust/forgather/datasets"
    output_dir: "./output_models/tiny_llama"
    model_src_dir: "/home/dinalt/rust/forgather/model_src"
    logging_dir: "./output_models/tiny_llama/runs/log_2026-04-04T01-20-47"
    nproc_per_node: 1

main: !singleton:forgather.ml.training_script:TrainingScript@training_script
    meta: *meta_output
    do_train: True
    do_save: False
    do_eval: False
    distributed_env: *distributed_env
    trainer: *trainer
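The lr_scheduler settings in the preprocessed config (warmup_steps: 500, cooldown_steps: 50000, constant_lr: 1.0e-6) together with the optimizer's lr of 1.0e-3 can be turned into a rough picture of the learning-rate curve. The sketch below is a hypothetical reconstruction, not Forgather's InfiniteLRScheduler: the function name sketch_lr, the linear warmup, the cosine cooldown, and holding constant_lr after the cooldown are all assumptions. They were chosen because they agree, at the steps printed below, with the lr column logged during the training run in this tutorial.

```python
import math


def sketch_lr(step, peak_lr=1.0e-3, warmup_steps=500,
              cooldown_steps=50_000, constant_lr=1.0e-6):
    """Hypothetical schedule shape; NOT the actual InfiniteLRScheduler code."""
    if step < warmup_steps:
        # Assumed: linear warmup from 0 up to the optimizer's peak lr.
        return peak_lr * step / warmup_steps
    # Assumed: cosine cooldown from peak_lr toward constant_lr,
    # then hold constant_lr once the cooldown is over.
    progress = min((step - warmup_steps) / cooldown_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return constant_lr + (peak_lr - constant_lr) * cosine


# Print the sketched lr at a few of the steps that appear in the training log.
for step in (100, 500, 4000, 9900):
    print(f"{step:>6,}  {sketch_lr(step):.2e}")
```

Because the one-epoch run stops after fewer than 10,000 steps, it only covers the first fifth of the 50,000-step cooldown, which is why the learning rate stays close to its peak for the whole run.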

Load Project

Load the default configuration.

In [3]:

from forgather.project import Project
import forgather.nb.notebooks as nb

# Load the default project, which is "train_tiny_llama.yaml"
proj = Project()

Start Tensorboard

This project has been configured to log training to Tensorboard (TB). To watch the model's training progress, start a TB server from a shell using either of the methods below.

Tensorboard can be started from a terminal like this:

# By default, Tensorboard binds only to localhost. To bind to all interfaces, add --bind_all
tensorboard --logdir "/path/to/model/log/directory" [--bind_all]

You can use the CLI to launch TB for you, where it will automatically determine the path to the log directory:

# --all : Watch all output model directories, otherwise just the one for the current configuration.
# -- : Any arguments after '--' are passed directly to tensorboard, for example "--bind_all"
cd PROJECT_DIR
forgather tb [--all] [-- <tensorboard-args>]

When TB starts, it should print the URL to access it, e.g.:

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)
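If you would rather start Tensorboard from Python than type the command by hand, the same command line can be assembled programmatically. This is a minimal sketch: build_tensorboard_command is a hypothetical helper (not part of Forgather or Tensorboard), and the log directory is assumed to be the "runs" directory under this project's output_dir. Only the --logdir and --bind_all flags shown above are used.

```python
import shlex


def build_tensorboard_command(log_dir, bind_all=False):
    """Hypothetical helper: build the argument list for the tensorboard CLI."""
    cmd = ["tensorboard", "--logdir", log_dir]
    if bind_all:
        # Bind to all interfaces instead of only localhost.
        cmd.append("--bind_all")
    return cmd


# Assumed log location: the "runs" directory under this project's output_dir.
cmd = build_tensorboard_command("./output_models/tiny_llama/runs", bind_all=True)

# Print a copy-pasteable shell command.
print(shlex.join(cmd))

# To launch the server from Python instead of printing the command:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The launch call is left commented out so that the cell does not block the notebook; running Tensorboard in a separate terminal keeps the notebook kernel free for training.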

Train Model

You have a few options for training the model.

1. Run it directly from the notebook. This should work fine with this example, although for projects using multiple GPUs, you will want to use one of the other options. To train from the notebook, just run the training cell below.
2. Generate a training script and run it from the shell. To do so, run the cell with "generate_trainingscript()", then run the generated shell script from a terminal.
3. Use the Forgather CLI:

# Open a shell in this project's directory, then run this command:
cd PROJECT_DIR
forgather train

# See forgather --help for more details.

Once training starts, switch to Tensorboard in your browser. One of the first things you will want to do is enable automatic refresh. To do so, click the gear in the upper-right corner and check "Reload Data."

Once training has started, take a look at the "Text" tab. You will see that we have automatically logged the preprocessed configuration and dumped the primary training artifacts.

Next, switch to the "Scalars" tab. You will see a plot of train and evaluation loss which will automatically update every 30 seconds. If you are not familiar with Tensorboard, now would be a good time to play with the UI elements to see how they work.

When training completes, the model will be automatically saved to the output directory ("./output_models/tiny_llama").
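To get a feel for how long the run is before starting it, the step count can be worked out from the trainer arguments in the preprocessed config. This is a back-of-the-envelope sketch: the sample count of 317,984 is the value the trainer reports at startup for this dataset, the variable names are only illustrative, and rounding the last partial batch up is an assumption (it makes no difference here, because the division is exact).

```python
import math

# Values taken from the preprocessed config and the trainer's startup report.
total_train_samples = 317_984
per_device_train_batch_size = 32
nproc_per_node = 1
num_train_epochs = 1
eval_steps = 500
logging_steps = 100

# One optimizer step consumes one batch on each process.
total_train_batch_size = per_device_train_batch_size * nproc_per_node

# Assumed: a final partial batch, if any, still counts as a step.
steps_per_epoch = math.ceil(total_train_samples / total_train_batch_size)
max_steps = steps_per_epoch * num_train_epochs

# How many evaluation passes and logged rows to expect during the run.
num_evals = max_steps // eval_steps
num_log_rows = max_steps // logging_steps

print(f"max_steps:    {max_steps:,}")
print(f"evaluations:  {num_evals}")
print(f"logged rows:  {num_log_rows}")
```

The resulting 9,937 steps match the max_steps value the trainer prints when training starts.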

In [4]:

# Train model in notebook.

# Construct the default target, "main," which is a training script.
training_script = proj()

# Start training the model.
training_script.run()

# Release resources
training_script = None

[Rank 0] 2026-04-04 01:21:00,852 - forgather.ml.distributed - INFO - DistributedEnvironment(rank=0, local_rank=0, world_size=1, local_world_size=1, master_addr=localhost, master_port=29501, backend=cuda:nccl,cpu:gloo)
[Rank 0] 2026-04-04 01:21:01,351 - forgather.ml.distributed - INFO - init_process_group(('cuda:nccl,cpu:gloo', 'cuda:0'))

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0

[INFO|info_logger] 
total_examples: 317,984
total_train_samples: 317,984
per_device_train_batch_size: 32
actual_per_device_batch_size: 32
total_train_batch_size: 32
max_steps: 9,937
total_parameters: 4.43M
trainable_parameters: 4.43M
embedding_parameters: 1.02M
non_embedding_parameters: 3.41M
tied_embeddings: False

2026-04-04 01:21:09         step       epoch      loss      grad          lr      tokens   total_tok     peak_mem
2026-04-04 01:21:09          100     0.01006   6.58380    0.7372    2.00e-04     977,120        977K    1.650 GiB
2026-04-04 01:21:12         step       epoch      loss      grad          lr      tokens   total_tok       tok/s     peak_mem
2026-04-04 01:21:12          200     0.02013   4.05593    0.5052    4.00e-04     944,134       1.92M     341,301    1.650 GiB
2026-04-04 01:21:14          300     0.03019   3.24215    0.5045    6.00e-04   1,028,617       2.95M     469,800    1.650 GiB
2026-04-04 01:21:16          400     0.04025   2.91114    0.4818    8.00e-04   1,024,475       3.97M     468,352    1.650 GiB
2026-04-04 01:21:18          500     0.05032   2.60094    0.4355    1.00e-03   1,008,771       4.98M     476,413    1.650 GiB

2026-04-04 01:21:19          500  0.05   eval-loss: 2.43377
2026-04-04 01:21:21          600     0.06038   2.42284    0.4414    1.00e-03     985,370       5.97M     360,763    1.650 GiB
2026-04-04 01:21:23          700     0.07044   2.25030    0.4296    1.00e-03     980,241       6.95M     454,486    1.650 GiB
2026-04-04 01:21:26          800     0.08051   2.20510    0.4227    1.00e-03   1,023,598       7.97M     456,808    1.650 GiB
2026-04-04 01:21:28          900     0.09057   2.08785    0.4087    1.00e-03     960,137       8.93M     450,417    1.650 GiB
2026-04-04 01:21:30        1,000      0.1006   1.95900    0.4120    1.00e-03     987,536       9.92M     464,842    1.650 GiB

2026-04-04 01:21:30        1,000  0.1    eval-loss: 1.86901
2026-04-04 01:21:32        1,100      0.1107   1.96341    0.4126    1.00e-03   1,044,611         11M     408,652    1.650 GiB
2026-04-04 01:21:35        1,200      0.1208   1.93876    0.4174    1.00e-03   1,013,457         12M     463,764    1.650 GiB
2026-04-04 01:21:37        1,300      0.1308   1.90622    0.3983    9.99e-04     979,830         13M     460,834    1.650 GiB
2026-04-04 01:21:39        1,400      0.1409   1.88140    0.3971    9.99e-04     948,560       13.9M     449,250    1.650 GiB
2026-04-04 01:21:41        1,500       0.151   1.85862    0.3996    9.99e-04     982,137       14.9M     453,993    1.650 GiB

2026-04-04 01:21:41        1,500  0.15   eval-loss: 1.66707
2026-04-04 01:21:44        1,600       0.161   1.82062    0.3938    9.99e-04     984,588       15.9M     393,126    1.650 GiB
2026-04-04 01:21:46        1,700      0.1711   1.78733    0.3900    9.99e-04   1,000,625       16.9M     454,457    1.650 GiB
2026-04-04 01:21:48        1,800      0.1811   1.74879    0.4143    9.98e-04     985,353       17.9M     456,754    1.650 GiB
2026-04-04 01:21:50        1,900      0.1912   1.73327    0.3907    9.98e-04   1,021,953       18.9M     457,547    1.650 GiB
2026-04-04 01:21:52        2,000      0.2013   1.77327    0.3819    9.98e-04   1,008,901       19.9M     461,391    1.650 GiB

2026-04-04 01:21:53        2,000  0.2    eval-loss: 1.58695
2026-04-04 01:21:55         step       epoch      loss      grad          lr      tokens   total_tok       tok/s     peak_mem
2026-04-04 01:21:55        2,100      0.2113   1.73090    0.3771    9.97e-04   1,051,113       20.9M     394,401    1.650 GiB
2026-04-04 01:21:57        2,200      0.2214   1.68457    0.3779    9.97e-04     963,855       21.9M     454,356    1.650 GiB
2026-04-04 01:21:59        2,300      0.2315   1.65317    0.3875    9.97e-04     945,462       22.9M     450,080    1.650 GiB
2026-04-04 01:22:01        2,400      0.2415   1.71809    0.3927    9.96e-04     999,813       23.9M     459,945    1.650 GiB
2026-04-04 01:22:04        2,500      0.2516   1.68925    0.3982    9.96e-04     954,586       24.8M     446,709    1.650 GiB

2026-04-04 01:22:04        2,500  0.25   eval-loss: 1.51825
2026-04-04 01:22:06        2,600      0.2616   1.69354    0.3827    9.96e-04     970,955       25.8M     396,764    1.650 GiB
2026-04-04 01:22:08        2,700      0.2717   1.63991    0.3784    9.95e-04     978,655       26.8M     454,486    1.650 GiB
2026-04-04 01:22:10        2,800      0.2818   1.68507    0.3710    9.95e-04     933,262       27.7M     441,964    1.650 GiB
2026-04-04 01:22:12        2,900      0.2918   1.60311    0.3822    9.94e-04     950,385       28.6M     453,637    1.650 GiB
2026-04-04 01:22:14        3,000      0.3019   1.51511    0.3624    9.94e-04     954,209       29.6M     461,409    1.650 GiB

2026-04-04 01:22:15        3,000  0.3    eval-loss: 1.43163
2026-04-04 01:22:17        3,100       0.312   1.59901    0.3798    9.93e-04   1,015,095       30.6M     417,453    1.650 GiB
2026-04-04 01:22:19        3,200       0.322   1.67280    0.3619    9.93e-04     945,116       31.6M     441,121    1.650 GiB
2026-04-04 01:22:21        3,300      0.3321   1.58023    0.3551    9.92e-04   1,002,202       32.6M     463,402    1.650 GiB
2026-04-04 01:22:23        3,400      0.3422   1.52237    0.3686    9.92e-04   1,042,299       33.6M     470,880    1.650 GiB
2026-04-04 01:22:26        3,500      0.3522   1.54660    0.3553    9.91e-04   1,098,521       34.7M     479,668    1.650 GiB

2026-04-04 01:22:26        3,500  0.35   eval-loss: 1.41447
2026-04-04 01:22:28        3,600      0.3623   1.60164    0.3595    9.91e-04     972,015       35.7M     405,238    1.650 GiB
2026-04-04 01:22:30        3,700      0.3723   1.52281    0.3566    9.90e-04   1,016,106       36.7M     461,445    1.650 GiB
2026-04-04 01:22:32        3,800      0.3824   1.53126    0.3587    9.89e-04     992,352       37.7M     461,904    1.650 GiB
2026-04-04 01:22:35        3,900      0.3925   1.57302    0.3533    9.89e-04   1,051,584       38.7M     465,097    1.650 GiB
2026-04-04 01:22:37        4,000      0.4025   1.60683    0.3590    9.88e-04     983,507       39.7M     455,258    1.650 GiB

2026-04-04 01:22:37        4,000  0.4    eval-loss: 1.38668
2026-04-04 01:22:39         step       epoch      loss      grad          lr      tokens   total_tok       tok/s     peak_mem
2026-04-04 01:22:39        4,100      0.4126   1.50775    0.3392    9.87e-04   1,007,990       40.7M     381,465    1.650 GiB
2026-04-04 01:22:42        4,200      0.4227   1.49784    0.3532    9.87e-04   1,055,407       41.8M     473,016    1.650 GiB
2026-04-04 01:22:44        4,300      0.4327   1.53139    0.3529    9.86e-04   1,035,389       42.8M     467,892    1.650 GiB
2026-04-04 01:22:46        4,400      0.4428   1.58109    0.3484    9.85e-04     976,315       43.8M     451,903    1.650 GiB
2026-04-04 01:22:48        4,500      0.4529   1.51753    0.3416    9.84e-04     990,970       44.8M     453,052    1.650 GiB

2026-04-04 01:22:49        4,500  0.45   eval-loss: 1.37496
2026-04-04 01:22:51        4,600      0.4629   1.45882    0.3379    9.84e-04   1,048,563       45.8M     412,760    1.650 GiB
2026-04-04 01:22:53        4,700       0.473   1.46760    0.3439    9.83e-04     991,809       46.8M     458,568    1.650 GiB
2026-04-04 01:22:55        4,800       0.483   1.49624    0.3354    9.82e-04     988,559       47.8M     462,623    1.650 GiB
2026-04-04 01:22:57        4,900      0.4931   1.49080    0.3225    9.81e-04   1,082,177       48.9M     468,770    1.650 GiB
2026-04-04 01:23:00        5,000      0.5032   1.50867    0.3436    9.80e-04   1,051,959       49.9M     466,356    1.650 GiB

2026-04-04 01:23:00        5,000  0.5    eval-loss: 1.36931
2026-04-04 01:23:02        5,100      0.5132   1.50543    0.3394    9.79e-04   1,012,680         51M     402,136    1.650 GiB
2026-04-04 01:23:04        5,200      0.5233   1.42941    0.3320    9.78e-04   1,032,022         52M     469,334    1.650 GiB
2026-04-04 01:23:07        5,300      0.5334   1.42544    0.3319    9.77e-04   1,154,141       53.1M     499,034    1.650 GiB
2026-04-04 01:23:09        5,400      0.5434   1.45799    0.3352    9.77e-04     962,468       54.1M     448,821    1.650 GiB
2026-04-04 01:23:11        5,500      0.5535   1.44030    0.3290    9.76e-04   1,059,498       55.2M     469,190    1.650 GiB

2026-04-04 01:23:11        5,500  0.55   eval-loss: 1.32449
2026-04-04 01:23:13        5,600      0.5636   1.47963    0.3336    9.75e-04     941,818       56.1M     405,129    1.650 GiB
2026-04-04 01:23:16        5,700      0.5736   1.49998    0.3277    9.74e-04   1,017,520       57.1M     464,383    1.650 GiB
2026-04-04 01:23:18        5,800      0.5837   1.46225    0.3317    9.73e-04     988,511       58.1M     446,926    1.650 GiB
2026-04-04 01:23:20        5,900      0.5937   1.47305    0.3187    9.72e-04     948,443       59.1M     447,260    1.650 GiB
2026-04-04 01:23:22        6,000      0.6038   1.41290    0.3386    9.70e-04   1,040,889       60.1M     472,924    1.650 GiB

2026-04-04 01:23:22        6,000  0.6    eval-loss: 1.33538
2026-04-04 01:23:25         step       epoch      loss      grad          lr      tokens   total_tok       tok/s     peak_mem
2026-04-04 01:23:25        6,100      0.6139   1.39521    0.3352    9.69e-04   1,056,800       61.2M     389,587    1.650 GiB
2026-04-04 01:23:27        6,200      0.6239   1.44381    0.3303    9.68e-04     991,685       62.1M     450,036    1.650 GiB
2026-04-04 01:23:29        6,300       0.634   1.41872    0.3254    9.67e-04     985,151       63.1M     459,805    1.650 GiB
2026-04-04 01:23:31        6,400      0.6441   1.41018    0.3288    9.66e-04   1,044,800       64.2M     470,438    1.650 GiB
2026-04-04 01:23:34        6,500      0.6541   1.42908    0.3282    9.65e-04     998,991       65.2M     459,600    1.650 GiB

2026-04-04 01:23:34        6,500  0.65   eval-loss: 1.29949
2026-04-04 01:23:36        6,600      0.6642   1.37279    0.3078    9.64e-04   1,016,934       66.2M     411,330    1.650 GiB
2026-04-04 01:23:38        6,700      0.6742   1.43771    0.3318    9.63e-04   1,010,788       67.2M     460,772    1.650 GiB
2026-04-04 01:23:40        6,800      0.6843   1.42978    0.3293    9.61e-04     971,246       68.2M     456,110    1.650 GiB
2026-04-04 01:23:43        6,900      0.6944   1.48542    0.3158    9.60e-04     984,441       69.2M     454,830    1.650 GiB
2026-04-04 01:23:45        7,000      0.7044   1.38269    0.3072    9.59e-04     957,179       70.1M     453,930    1.650 GiB

2026-04-04 01:23:45        7,000  0.7    eval-loss: 1.27091
2026-04-04 01:23:47        7,100      0.7145   1.41573    0.3055    9.58e-04     988,431       71.1M     402,458    1.650 GiB
2026-04-04 01:23:49        7,200      0.7246   1.38870    0.2973    9.56e-04   1,129,072       72.2M     485,870    1.650 GiB
2026-04-04 01:23:52        7,300      0.7346   1.37279    0.2987    9.55e-04   1,022,653       73.3M     458,701    1.650 GiB
2026-04-04 01:23:54        7,400      0.7447   1.45281    0.3209    9.54e-04     974,351       74.2M     448,924    1.650 GiB
2026-04-04 01:23:56        7,500      0.7548   1.39657    0.3067    9.52e-04   1,063,663       75.3M     474,791    1.650 GiB

2026-04-04 01:23:56        7,500  0.75   eval-loss: 1.27161
2026-04-04 01:23:59        7,600      0.7648   1.47150    0.3079    9.51e-04   1,020,960       76.3M     408,489    1.650 GiB
2026-04-04 01:24:01        7,700      0.7749   1.39058    0.3175    9.50e-04     981,170       77.3M     457,247    1.650 GiB
2026-04-04 01:24:03        7,800      0.7849   1.37954    0.2828    9.48e-04   1,151,773       78.4M     483,215    1.650 GiB
2026-04-04 01:24:05        7,900       0.795   1.36868    0.3028    9.47e-04     935,597       79.4M     452,097    1.650 GiB
2026-04-04 01:24:07        8,000      0.8051   1.40743    0.3113    9.46e-04     971,612       80.4M     455,085    1.650 GiB

2026-04-04 01:24:08        8,000  0.81   eval-loss: 1.2793
2026-04-04 01:24:10         step       epoch      loss      grad          lr      tokens   total_tok       tok/s     peak_mem
2026-04-04 01:24:10        8,100      0.8151   1.37860    0.3200    9.44e-04   1,029,545       81.4M     391,270    1.650 GiB
2026-04-04 01:24:12        8,200      0.8252   1.38861    0.3020    9.43e-04   1,037,570       82.4M     462,757    1.650 GiB
2026-04-04 01:24:14        8,300      0.8353   1.46832    0.3080    9.41e-04     990,171       83.4M     459,060    1.650 GiB
2026-04-04 01:24:16        8,400      0.8453   1.43619    0.3017    9.40e-04     932,435       84.3M     440,667    1.650 GiB
2026-04-04 01:24:19        8,500      0.8554   1.43465    0.3025    9.38e-04   1,013,888       85.4M     459,825    1.650 GiB

2026-04-04 01:24:19        8,500  0.86   eval-loss: 1.26052
2026-04-04 01:24:21        8,600      0.8655   1.37197    0.3108    9.37e-04   1,077,645       86.4M     431,321    1.650 GiB
2026-04-04 01:24:23        8,700      0.8755   1.40056    0.2922    9.35e-04   1,029,604       87.5M     456,465    1.650 GiB
2026-04-04 01:24:26        8,800      0.8856   1.45904    0.2902    9.34e-04     937,325       88.4M     439,829    1.650 GiB
2026-04-04 01:24:28        8,900      0.8956   1.41836    0.2927    9.32e-04     959,789       89.4M     444,074    1.650 GiB
2026-04-04 01:24:30        9,000      0.9057   1.42501    0.2959    9.30e-04     968,877       90.3M     448,206    1.650 GiB

2026-04-04 01:24:30        9,000  0.91   eval-loss: 1.25562
2026-04-04 01:24:32        9,100      0.9158   1.33308    0.2859    9.29e-04   1,073,274       91.4M     418,861    1.650 GiB
2026-04-04 01:24:35        9,200      0.9258   1.35947    0.2927    9.27e-04     984,851       92.4M     451,436    1.650 GiB
2026-04-04 01:24:37        9,300      0.9359   1.33587    0.3225    9.26e-04     963,270       93.4M     449,404    1.650 GiB
2026-04-04 01:24:39        9,400       0.946   1.35189    0.2908    9.24e-04   1,104,682       94.5M     474,695    1.650 GiB
2026-04-04 01:24:41        9,500       0.956   1.38012    0.2939    9.22e-04     967,028       95.4M     445,316    1.650 GiB

2026-04-04 01:24:41        9,500  0.96   eval-loss: 1.2317
2026-04-04 01:24:44        9,600      0.9661   1.36762    0.2951    9.21e-04     994,942       96.4M     413,732    1.650 GiB
2026-04-04 01:24:46        9,700      0.9761   1.33956    0.2964    9.19e-04   1,005,387       97.4M     460,445    1.650 GiB
2026-04-04 01:24:48        9,800      0.9862   1.36810    0.3001    9.17e-04     994,680       98.4M     461,010    1.650 GiB
2026-04-04 01:24:50        9,900      0.9963   1.42175    0.2850    9.15e-04   1,005,062       99.4M     453,116    1.650 GiB

2026-04-04 01:24:51,470 - forgather.ml.trainer.checkpoint_manager - INFO - Saving checkpoint at ./output_models/tiny_llama/checkpoints/checkpoint-9937
2026-04-04 01:24:51,553 - forgather.ml.trainer.checkpoint_coordinator - INFO - Saving checkpoint at ./output_models/tiny_llama/checkpoints/checkpoint-9937
2026-04-04 01:24:51,554 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving GLOBAL component 'optimizer' to ./output_models/tiny_llama/checkpoints/checkpoint-9937/optimizer_state.pt
2026-04-04 01:24:51,621 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving GLOBAL component 'scheduler' to ./output_models/tiny_llama/checkpoints/checkpoint-9937/scheduler_state.pt
2026-04-04 01:24:51,621 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving GLOBAL component 'trainer' to ./output_models/tiny_llama/checkpoints/checkpoint-9937/trainer_state.pt
2026-04-04 01:24:51,622 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving GLOBAL component 'dataset' to ./output_models/tiny_llama/checkpoints/checkpoint-9937/dataset_state.pt
2026-04-04 01:24:51,623 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saving PER_RANK component 'rng' (rank 0) to ./output_models/tiny_llama/checkpoints/checkpoint-9937/rng_state_rank_0.pt
2026-04-04 01:24:51,624 - forgather.ml.trainer.checkpoint_coordinator - DEBUG - Saved checkpoint manifest to ./output_models/tiny_llama/checkpoints/checkpoint-9937/checkpoint_manifest.json
2026-04-04 01:24:51,625 - forgather.ml.trainer.checkpoint_manager - DEBUG - Saved 2 callback states
2026-04-04 01:24:51,626 - forgather.ml.trainer.checkpoint_manager - DEBUG - Saved checkpoint metadata to ./output_models/tiny_llama/checkpoints/checkpoint-9937/checkpoint_metadata.pt
Server thread error: Event loop stopped before Future completed.

2026-04-04 01:24:51   Training complete:
  Runtime:                     224.03 s
  Total steps:                 9,936
  Total samples:               317,952
  Effective batch size:        32
  Samples/sec:                 1419.239
  Steps/sec:                   44.351
  Epoch:                       1
  Total tokens:                99,425,556
  Tokens/sec:                  443,805
  Total FLOPs:                 2.034e+15
  FLOPs/sec:                   9.081e+12
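
As a sanity check, the rates in the summary above can be recomputed from the reported totals. The sketch below does only that; the variable names are illustrative and not part of Forgather, and because the reported runtime is rounded, the results should agree only to within rounding.

```python
# Values copied from the training summary above.
runtime_s = 224.03
total_steps = 9_936
batch_size = 32  # "Effective batch size"
total_tokens = 99_425_556

# Derived quantities, recomputed from the totals.
total_samples = total_steps * batch_size
steps_per_sec = total_steps / runtime_s
samples_per_sec = total_samples / runtime_s
tokens_per_sec = total_tokens / runtime_s

print(f"samples:     {total_samples:,}")
print(f"steps/sec:   {steps_per_sec:.3f}")
print(f"samples/sec: {samples_per_sec:.3f}")
print(f"tokens/sec:  {tokens_per_sec:,.0f}")
```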

Test a Model¶

Forgather has a quick way of testing a model with a dataset definition's test_dataset target.

From the shell:

forgather eval test tinystories
...
========================================================================
Evaluation: tinystories  (Eval: TinyStories)
------------------------------------------------------------------------
Model:            /home/dinalt/rust/forgather/examples/tutorials/tiny_llama/output_models/tiny_llama
Dataset:          /home/dinalt/rust/forgather/examples/datasets/roneneldan  [tinystories-packed.yaml]
Target:           test_dataset
Trainer:          ddp  (world_size=1)
Batch size:       16  max_length=2048
Dtype:            bfloat16  attn=sdpa
========================================================================
...
========================================================================
eval_loss:        1.462601
perplexity:       4.3172
wall_time:        5.84 s
========================================================================

By default, it tests the latest checkpoint of the model in the current configuration's output directory, but you can point it at an arbitrary model directory with --model /path/to/model.
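
The reported perplexity is consistent with the usual definition: the exponential of the mean cross-entropy loss, measured in nats. Whether Forgather computes it exactly this way is an assumption; the sketch below only checks the relationship against the numbers in the report.

```python
import math

eval_loss = 1.462601  # eval_loss from the report above

# Perplexity as exp(mean cross-entropy loss).
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.4f}")
```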

Load Trained Model¶

You can use the regular HF APIs to load the saved model and tokenizer.

In [5]:

from forgather.project import Project
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig, StoppingCriteria
from forgather.ml.sharded_checkpoint import create_pretrained_symlinks
import torch

model_path = "./output_models/tiny_llama"

# Create symlinks to latest checkpoint model output directory
# This is required for .from_pretrained() to find the latest checkpoint.
create_pretrained_symlinks(model_path)

# Set device to run inference on
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

Loading weights:   0%|          | 0/39 [00:00<?, ?it/s]

The equivalent CLI for creating symbolic links to the latest checkpoint in the output directory is:

forgather checkpoint link

Text Generation¶

The loop below uses the newly trained model to generate text, seeded with the prompts defined in the same cell.

In [9]:

import torch

def generate_text(model, tokenizer, prompts, gen_config, max_new_tokens, device):
    model.to(device)
    model.eval()
    
    with torch.inference_mode():
        for prompt in prompts:
            tokenizer_outputs = tokenizer(
                [prompt],
                truncation=False,
                return_length=True,
                return_tensors="pt",
                return_attention_mask=True,
            )
        
            input_ids = tokenizer_outputs["input_ids"].to(device)
            attention_mask = tokenizer_outputs["attention_mask"].to(device)
            outputs = model.generate(
                input_ids,
                attention_mask=attention_mask,
                generation_config=gen_config,
                past_key_values=None,
            )
    
            output_text = tokenizer.decode(
                outputs.sequences[0],
                skip_special_tokens=True,
            )
            yield prompt + " [START] " + output_text[len(prompt) + 1 :]

prompts = [
    'Alice was so tired when she got back home so she went',
    'Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was',
    'Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, "Look, Lily. A rainbow has',
    'Jack wanted to read a book, so he went to',
    '"Can cows fly?" Alice asked her mother.',
]

gen_config = GenerationConfig(
    pad_token_id=model.config.pad_token_id,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.15,
    max_new_tokens=100,
    return_dict_in_generate=True,
)

# Use the device selected when the model was loaded, rather than assuming a GPU.
for s in generate_text(model, tokenizer, prompts, gen_config, 100, device):
    print(s)
    print(f"{'-' * 40}")

Alice was so tired when she got back home so she went [START] to bed. She looked around and saw a big table. It was tidy and she wanted to take it home. She was so excited and ran to her mom to show her.

Mom got the table and asked, "What are you tidying?"

Alice smiled and said, "I want to go to bed."

Mom
----------------------------------------
Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was [START] full of different colors.

One day, Jack was walking towards the moon when he saw a big, shiny, green cat. He wanted to play with it. He asked Lily if she wanted to play with it. Lily said yes and gave him a big hug.

Jack smiled and said, "I want to play with the cat!"

----------------------------------------
Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, "Look, Lily. A rainbow has [START] bright blue hair."

Lily's eyes lit up. "I want to put on the rainbow," she said.

Jack said, "No, I want to put the rainbow in the sky. It might burn you."

Lily and Jack were very sad. They did not want to go back to the rainbow. They kept w
----------------------------------------
Jack wanted to read a book, so he went to [START] the park. He saw a big, shiny pencil and he wanted it. He opened it, and he saw a big pencil. He was so excited! He wanted to take it home, but it was too high for him to reach. Jack tried to grab the pencil, but he was too small. He pulled and pushed, and the p
----------------------------------------
"Can cows fly?" Alice asked her mother. [START] "Yes," her mother replied. "Let's go to the market and get some coins."

So they went to the market and started to chew it. The market was tall and tall.

"Look, Mommy!" she said. "We can't catch it!"

Her mother smiled and said, "Yes,
----------------------------------------

Train Hugging Face Llama Model¶

Next, let's try training a Llama model using the Hugging Face implementation.

Train the model from the CLI:

forgather -t train_hf_llama.yaml train

In [ ]:

nb.display_config(config_template="train_hf_llama.yaml", show_pp_config=True, show_generated_code=False)

Let's See What Happens...¶

...if we replace the post-layer-norm implementation with a pre-layer-norm implementation. This configuration uses a custom model definition in the custom_models directory.

forgather -t experimental_llama.yaml train

Test Model With the Inference Server¶

There is a simple OpenAI-compatible inference server implementation in "tools/inference_server".

To host your newly trained model on the inference server:

# Manual start
forgather inf server -c -m ./output_models/tiny_llama/

# Config with YAML file
forgather inf server ./tiny_llama_server.yaml

From another session, you can perform text completion like this:

# Text completion request
forgather inf client --completion "Once upon a time"

# With manual generation settings
forgather inf client --temperature 0.7 --no-repeat-ngram-size 2 --repetition-penalty 1.2 --top-k 40 --completion "Once upon a time" --max-tokens 512

# From YAML config
forgather inf client ./tiny_llama_client.yaml --completion "Once upon a time"

As the model has not been trained on a chat format, it will not be very good at it, but you can try with:

forgather inf client ./tiny_llama_client.yaml

This server should also work with other OpenAI-compatible clients.
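
As an illustration of what such a client might look like, here is a minimal standard-library sketch that builds a request in the OpenAI text-completion format. The base URL, port, and model name are placeholders chosen for the example, not values taken from Forgather; substitute whatever your server instance reports. The `/completions` path follows the OpenAI API convention, and it is an assumption that the server exposes it under `/v1`.

```python
import json
import urllib.request

# ASSUMPTION: host, port, and model name are hypothetical placeholders.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "tiny_llama"


def build_completion_request(prompt, max_tokens=64, temperature=0.7):
    """Build a POST request in the OpenAI text-completion format."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, calling `complete("Once upon a time")` should return a continuation of the prompt, provided the placeholders above match your setup.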

Train on the Full Dataset¶

The examples so far have been limited to training on only the first 10% of the dataset. You can train on the complete dataset with this configuration:

forgather -t full_dataset.yaml train

Try the new v2 Configuration¶

A new, roughly equivalent, configuration has been written, which uses the new LM training template.

This version defaults to using DDP and makes use of token-packing to improve throughput by wasting less space on pad tokens. Note that DDP does not work in a notebook, so you will need to run it from the shell.

forgather -t v2.yaml train
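
To see why packing reduces wasted pad tokens, consider the toy sketch below. It is not Forgather's packing implementation; it uses a simple first-fit strategy and made-up sequence lengths, and it assumes, for simplicity, that unpacked sequences are each padded to the full `max_length`.

```python
def pad_waste(lengths, max_length):
    """Pad tokens needed when every sequence occupies its own row."""
    return sum(max_length - n for n in lengths)


def pack_first_fit(lengths, max_length):
    """Place each sequence (longest first) in the first row with room for it.

    Returns the number of tokens used in each packed row.
    """
    rows = []
    for n in sorted(lengths, reverse=True):
        for i, used in enumerate(rows):
            if used + n <= max_length:
                rows[i] = used + n
                break
        else:
            # No existing row has enough free space; start a new one.
            rows.append(n)
    return rows


# Hypothetical token counts for eight short stories.
lengths = [120, 300, 75, 510, 64, 200, 410, 90]
max_length = 512

rows = pack_first_fit(lengths, max_length)
unpacked_pad = pad_waste(lengths, max_length)
packed_pad = sum(max_length - used for used in rows)

print(f"rows: {len(lengths)} unpacked vs {len(rows)} packed")
print(f"pad tokens: {unpacked_pad} unpacked vs {packed_pad} packed")
```

A real packing implementation also has to keep packed documents from attending to one another, or accept that they do, which this sketch does not address.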

In [ ]: