Dataset Projects¶

This guide covers how to create, configure, and use dataset projects in Forgather. Dataset projects are standalone Forgather projects that encapsulate dataset loading, splitting, preprocessing, and tokenization. They can be used directly from Python or referenced from training configurations via from_project.

Overview¶

A dataset project is a standard Forgather project whose configuration class is type.dataset. It typically exposes these materialization targets:

| Target | Description |
| --- | --- |
| `train_dataset_split` | Raw training split (before tokenization) |
| `validation_dataset_split` | Raw validation split |
| `test_dataset_split` | Raw test split |
| `train_dataset` | Preprocessed/tokenized training dataset |
| `eval_dataset` | Preprocessed/tokenized evaluation dataset |
| `test_dataset` | Preprocessed/tokenized test dataset |
| `meta` | Configuration metadata (config name, class, main feature) |

The train_dataset_split, validation_dataset_split, and test_dataset_split targets return untokenized data, while train_dataset, eval_dataset, and test_dataset return data that has been passed through preprocess_dataset, which handles tokenization, sharding, shuffling, and range selection.

Project Structure¶

A dataset project follows the standard Forgather project layout:

examples/datasets/roneneldan/
    meta.yaml                              # Project metadata
    templatelib/
        configs/
            tinystories.yaml               # Full TinyStories dataset
            tinystories-abridged.yaml      # 10% subset
            tinystories-packed.yaml        # Packed sequences variant

The meta.yaml declares the project name and default configuration:

-- extends "meta_defaults.yaml"

-- block configs
name: "roneneldan"
description: "Datasets published by Ronen Eldan"
config_prefix: "configs"
default_config: "tinystories-abridged.yaml"
<< endblock configs

Base Templates¶

Forgather provides base templates in templatelib/base/datasets/ that dataset configurations extend:

`datasets/load_dataset.yaml`¶

The most commonly used base template. It provides a complete pipeline for loading a HuggingFace dataset and preprocessing it. Key blocks to override:

[load_dataset_args] -- the path, name, revision, and split definitions
[train_dataset_split], [validation_dataset_split], [test_dataset_split] -- split selection
[map_function] -- custom tokenization/mapping function
[map_kwargs] -- additional arguments for the map call

`datasets/tokenized_dataset.yaml`¶

A more flexible base template with explicit blocks for every stage of the pipeline. Use this when you need full control over dataset loading (e.g., loading from disk instead of HuggingFace Hub).

`datasets/dataset_type.yaml`¶

The root base template for all dataset projects. Defines the type.dataset config class and the main output target.

Using Dataset Projects from Python¶

Basic Usage¶

from forgather import Project

# Load a dataset project
dataset_project = Project(
    "tinystories-abridged.yaml",
    "examples/datasets/roneneldan",
)

# Materialize the raw training split (no tokenization)
raw_train = dataset_project("train_dataset_split")

# Inspect examples
for example in raw_train:
    print(example["text"][:200])
    break

With Tokenization¶

To get tokenized datasets, pass a tokenizer and preprocessing args:

from transformers import AutoTokenizer
from forgather import Project

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")

# Load dataset project
dataset_project = Project(
    "tinystories-abridged.yaml",
    "examples/datasets/roneneldan",
)

# Materialize tokenized train and eval datasets
train_dataset, eval_dataset = dataset_project(
    "train_dataset",
    "eval_dataset",
    tokenizer=tokenizer,
    preprocess_args=dict(truncation=True, max_length=512),
)

# Each example now has 'input_ids'
for example in train_dataset:
    print(f"Token count: {len(example['input_ids'])}")
    break

With Dataset Sharding (Distributed Training)¶

For distributed training, pass shard_dataset to split data across ranks:

train_dataset, eval_dataset = dataset_project(
    "train_dataset",
    "eval_dataset",
    tokenizer=tokenizer,
    preprocess_args=dict(truncation=True, max_length=4096),
    shard_dataset=True,  # Auto-shards by WORLD_SIZE and RANK
)

See "The `shard_dataset` Parameter" below for the accepted forms and how sharding is resolved.

Using Dataset Projects in Training Configurations¶

Training configurations reference dataset projects using from_project, which loads a sub-project and materializes its targets:

[datasets_definition]
.define: &dataset_dict !call:forgather:from_project
    project_dir: "{{ ns.dataset_proj }}"
    config_template: "{{ ns.dataset_config }}"
    targets: [ "train_dataset", "eval_dataset" ]
    preprocess_args: *tokenizer_args
    tokenizer: *tokenizer

train_dataset: &train_dataset !call:getitem [ *dataset_dict, 'train_dataset' ]
eval_dataset: &eval_dataset !call:getitem [ *dataset_dict, 'eval_dataset' ]
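
In Python terms, the YAML above behaves roughly like the sketch below. `fake_from_project` is a hypothetical stand-in, not Forgather's function: it returns placeholder strings rather than datasets and only mimics the result shape, a dictionary keyed by target name. The sketch also assumes that `!call:getitem` corresponds to Python's `operator.getitem`.

```python
from operator import getitem


def fake_from_project(project_dir, config_template, targets, **kwargs):
    """Hypothetical stand-in for forgather's from_project.

    Returns placeholder strings instead of real datasets, but keeps the
    result shape: one dictionary entry per requested target.
    """
    return {
        target: f"<{target} from {project_dir}/{config_template}>"
        for target in targets
    }


dataset_dict = fake_from_project(
    project_dir="examples/datasets/roneneldan",
    config_template="tinystories-abridged.yaml",
    targets=["train_dataset", "eval_dataset"],
    tokenizer=None,      # placeholder; a real tokenizer object goes here
    preprocess_args={},  # placeholder for the tokenizer arguments
)

# Counterpart of `!call:getitem [ *dataset_dict, 'train_dataset' ]`
train_dataset = getitem(dataset_dict, "train_dataset")
eval_dataset = getitem(dataset_dict, "eval_dataset")
print(sorted(dataset_dict))
```

Requesting both targets in one call and then indexing the result keeps the sub-project loaded only once, which is presumably why the configuration defines `dataset_dict` first and derives the individual datasets from it.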

Passing Preprocessor Arguments via `pp_kwargs`¶

The pp_kwargs parameter passes arguments to the sub-project's Jinja2 preprocessor, where they are available as template variables. All other keyword arguments are passed at materialization time, and the sub-project reads them as runtime variables via !var:

.define: &dataset_dict !call:forgather:from_project
    project_dir: "{{ ns.dataset_proj }}"
    config_template: "{{ ns.dataset_config }}"
    targets: [ "train_dataset", "eval_dataset" ]
    # pp_kwargs are Jinja2 template variables (resolved at config preprocessing time)
    pp_kwargs: *dataset_project_pp_args
    # These are runtime variables (resolved when the config graph is materialized)
    preprocess_args: *tokenizer_args
    tokenizer: *tokenizer
    shard_dataset: True
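
The difference between the two kinds of variables is one of timing, which the following sketch tries to make concrete. It is an analogy only: `string.Template` stands in for Jinja2 (whose syntax is `{{ name }}`), `resolve_var` is a hypothetical analogue of `!var`, and the names `dataset_path` and `max_records` are made up for illustration.

```python
from string import Template

# Phase 1: preprocessing. pp_kwargs-style values are substituted into the
# template text before the configuration is parsed.
config_template = Template("path: $dataset_path\nmax_records: $max_records")
pp_kwargs = {"dataset_path": "roneneldan/TinyStories", "max_records": 1000}
config_text = config_template.substitute(pp_kwargs)


# Phase 2: materialization. Runtime variables are looked up, with an
# optional default, only when the targets are constructed.
def resolve_var(runtime_kwargs, name, default=None):
    """Hypothetical analogue of `!var [ name, default ]`."""
    return runtime_kwargs.get(name, default)


runtime_kwargs = {"tokenizer": "<tokenizer object>", "shard_dataset": True}
shard_dataset = resolve_var(runtime_kwargs, "shard_dataset", False)
select_range = resolve_var(runtime_kwargs, "select_range")  # not supplied

print(config_text)
print(shard_dataset, select_range)
```

A practical consequence is that template variables can change the structure of the configuration itself, while runtime variables can carry live Python objects such as a tokenizer, which cannot be written into template text.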

Dataset Sharding for Distributed Training¶

When using multi-GPU training without batch dispatching (i.e., dispatch_batches: False), each rank needs its own shard of the dataset. Pass shard_dataset: True to enable automatic sharding:

.define: &dataset_dict !call:forgather:from_project
    project_dir: "{{ ns.dataset_proj }}"
    config_template: "{{ ns.dataset_config }}"
    targets: [ "train_dataset", "eval_dataset" ]
    preprocess_args: *tokenizer_args
    tokenizer: *tokenizer
    shard_dataset: {{ ns.dispatch_batches == False }}

The `shard_dataset` Parameter¶

The shard_dataset parameter on preprocess_dataset controls how datasets are split for distributed training. It accepts two forms.

Boolean Form¶

train_dataset = dataset_project(
    "train_dataset",
    tokenizer=tokenizer,
    shard_dataset=True,
)

When True, the dataset is automatically sharded using the current distributed environment:

- num_shards defaults to WORLD_SIZE (total number of processes)
- index defaults to RANK (current process rank)

When False or None, no sharding is performed.

Dictionary Form¶

train_dataset = dataset_project(
    "train_dataset",
    tokenizer=tokenizer,
    shard_dataset=dict(
        num_shards=4,
        index=1,  # This process gets shard 1 of 4
    ),
)

This gives explicit control over the number of shards and which shard the current process receives. This is useful for:

- Testing a specific shard locally
- Custom parallelism strategies where the shard count differs from world size
- DiLoCo distributed training where workers are independent processes (not torch distributed ranks)

How Sharding Works Internally¶

The implementation in preprocess_dataset (src/forgather/ml/datasets/preprocess.py) works as follows:

If shard_dataset is True, it resolves to {"num_shards": WORLD_SIZE, "index": RANK}.
If shard_dataset is False, it becomes None (no sharding).
For HuggingFace Dataset or IterableDataset, it uses split_dataset_by_node from the datasets library.
For other dataset types (e.g., SimpleArrowIterableDataset), it calls the .shard() method.
When sharding is enabled, main_process_first() is not used, because each rank processes its own independent shard. When sharding is disabled, main_process_first() ensures rank 0 preprocesses and caches the dataset before other ranks load the cache.
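
The resolution rules listed above can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the code in preprocess.py: `resolve_shard_spec` and `strided_shard` are hypothetical names, the fallback to a single shard when WORLD_SIZE and RANK are unset is an assumption, and the strided split is a toy. The real assignment of records to ranks is decided by `split_dataset_by_node` or the dataset's `.shard()` method and may be contiguous or shard-based instead.

```python
import os


def resolve_shard_spec(shard_dataset, environ=None):
    """Return None when sharding is off, else a num_shards/index dict."""
    environ = os.environ if environ is None else environ
    if shard_dataset is None or shard_dataset is False:
        return None
    if shard_dataset is True:
        # Assumption: fall back to a single shard when the launcher did not
        # set WORLD_SIZE / RANK (for example, a plain `python` run).
        return {
            "num_shards": int(environ.get("WORLD_SIZE", 1)),
            "index": int(environ.get("RANK", 0)),
        }
    spec = dict(shard_dataset)
    if not 0 <= spec["index"] < spec["num_shards"]:
        raise ValueError(f"index {spec['index']} out of range")
    return spec


def strided_shard(records, spec):
    """Toy split: every num_shards-th record, starting at `index`."""
    if spec is None:
        return list(records)
    return list(records)[spec["index"]::spec["num_shards"]]


records = list(range(10))
fake_env = {"WORLD_SIZE": "4", "RANK": "1"}  # what a launcher might export

auto_spec = resolve_shard_spec(True, fake_env)
print(auto_spec)                          # {'num_shards': 4, 'index': 1}
print(strided_shard(records, auto_spec))  # [1, 5, 9]
```

Whatever the splitting strategy, the property that matters is that the shards are disjoint and together cover the dataset, so no rank repeats another rank's records.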

In YAML Configurations¶

The load_dataset.yaml base template exposes shard_dataset as a runtime variable with a default of False:

[dataset_args]
.define: &dataset_args
    shard_dataset: !var [ "shard_dataset", False ]

This means the calling configuration (or Python code) controls whether sharding is active.

CLI Usage¶

The forgather dataset command also supports sharding for testing:

# Test shard 0 of 4
forgather -p examples/datasets/roneneldan dataset --num-shards 4 --shard-index 0 -n 5

# Test shard 3 of 4
forgather -p examples/datasets/roneneldan dataset --num-shards 4 --shard-index 3 -n 5

The `select_range` Parameter¶

The preprocess_dataset function accepts a select_range parameter that allows subsetting the dataset before processing. It supports several formats:

| Format | Example | Description |
| --- | --- | --- |
| `int` | `500` | First 500 records |
| `float` | `0.25` | First 25% of records |
| `str` (slice) | `"100:500"` | Records 100 through 499 |
| `str` (percent) | `"10%:80%"` | Records from 10% to 80% |
| `str` (open) | `"100:"` | Records from 100 to end |
| `Sequence` | `[100, 900]` | Records 100 through 899 |
| `range` | `range(10, 100)` | Direct range object |
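
To make the formats concrete, here is a minimal sketch that maps each one onto a Python `range` for a dataset of known size. `resolve_select_range` is a hypothetical helper written for this guide, not Forgather's parser. It assumes fractional bounds are truncated toward zero and that sequence entries are plain integers; the library may round or validate differently.

```python
from collections.abc import Sequence


def _bound(text, total, default):
    """Turn '100', '10%', or '' into an absolute record index."""
    text = text.strip()
    if not text:
        return default
    if text.endswith("%"):
        return int(total * float(text[:-1]) / 100)
    return int(text)


def resolve_select_range(select_range, total):
    """Map each documented select_range form onto a range of indices."""
    if isinstance(select_range, range):
        return select_range
    if isinstance(select_range, bool):
        raise TypeError("bool is not a valid select_range")
    if isinstance(select_range, int):
        return range(0, min(select_range, total))
    if isinstance(select_range, float):
        return range(0, int(total * select_range))
    if isinstance(select_range, str):
        if ":" not in select_range:
            raise ValueError(f"expected 'start:stop', got {select_range!r}")
        start, _, stop = select_range.partition(":")
        return range(_bound(start, total, 0), _bound(stop, total, total))
    if isinstance(select_range, Sequence) and len(select_range) == 2:
        return range(select_range[0], select_range[1])
    raise TypeError(f"unsupported select_range: {select_range!r}")


total = 1000  # pretend the dataset has 1000 records
for spec in (500, 0.25, "100:500", "10%:80%", "100:", [100, 900], range(10, 100)):
    print(repr(spec), "->", resolve_select_range(spec, total))
```

Every form ends up as a half-open interval, which is why `"100:500"` and `[100, 900]` stop at records 499 and 899 respectively.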

CLI: Testing and Inspecting Datasets¶

The forgather dataset command provides tools for inspecting dataset projects without writing Python code.

Viewing Raw Examples¶

# Show 5 raw (untokenized) examples from the default target (train_dataset_split)
forgather -p examples/datasets/roneneldan dataset -n 5

# Show examples from a specific config
forgather -p examples/datasets/roneneldan -t tinystories.yaml dataset -n 3

Viewing Tokenized Examples¶

# Tokenize and show examples (requires a tokenizer)
forgather -p examples/datasets/roneneldan dataset \
    -T path/to/tokenizer \
    --target train_dataset \
    -s -n 5

Generating Token Length Histograms¶

forgather -p examples/datasets/roneneldan dataset \
    -T path/to/tokenizer \
    -H --histogram-samples 2000
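
For a rough idea of what a token-length histogram summarizes, the sketch below buckets sequence lengths using only the standard library. Whitespace splitting stands in for a real tokenizer, and the bucket width and sample texts are arbitrary. None of this reflects how the `-H` option is implemented.

```python
from collections import Counter

BUCKET_WIDTH = 4  # arbitrary bucket size for this illustration


def length_histogram(texts, tokenize=str.split, bucket_width=BUCKET_WIDTH):
    """Count how many texts fall into each token-length bucket.

    Returns a dict mapping each bucket's lower bound to a count.
    """
    counts = Counter()
    for text in texts:
        n_tokens = len(tokenize(text))
        counts[(n_tokens // bucket_width) * bucket_width] += 1
    return dict(sorted(counts.items()))


samples = [
    "Once upon a time there was a tiny robot.",
    "The robot liked to paint.",
    "One day it painted the whole sky bright green and everyone laughed.",
]
histogram = length_histogram(samples)
for lower, count in histogram.items():
    upper = lower + BUCKET_WIDTH - 1
    print(f"{lower:>3}-{upper:<3} {'#' * count}")
```

A histogram like this is mainly useful for choosing max_length: if most examples fall well below the limit, truncation rarely applies, whereas a long tail suggests packing or a larger limit.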

Example Dataset Projects¶

Forgather ships with several example dataset projects under examples/datasets/:

| Project | Path | Description |
| --- | --- | --- |
| TinyStories | `examples/datasets/roneneldan/` | Small synthetic stories dataset, good for quick experiments |
| SmolLM Corpus | `examples/datasets/HuggingFaceTB/` | FineWeb-Edu, Cosmopedia-v2, and interleaved variants |
| Wikitext | `examples/datasets/EleutherAI/` | Document-level Wikipedia text |
| Local Dataset | `examples/datasets/local_dataset/` | Template for loading datasets from disk |

Key Relationships¶

Dataset projects are regular Forgather projects with config_class: type.dataset.
Training projects reference dataset projects using !call:forgather:from_project.
from_project creates a Project instance for the dataset sub-project and calls it with the given targets and keyword arguments. The result is a dictionary keyed by target name.
Runtime variables in dataset configs (e.g., !var "tokenizer", !var "shard_dataset") are resolved from the kwargs passed by the calling project or Python code.
preprocess_dataset is the central preprocessing function that handles tokenization, sharding, shuffling, and range selection.