Sequence Packing Quick Reference

TL;DR - Which Strategy?

| Your Scenario | Use This |
| --- | --- |
| Truncating long documents (`overflow=False`) | `packing_strategy="best_fit"` + `shuffle_output=True` |
| Variable-length documents, need all content | `packing_strategy="first_fit"` + `shuffle_output=True`, or `packing_strategy="greedy"` |
| Consistent document lengths | `packing_strategy="greedy"` |
| Very long sequences (>16k tokens) | `packing_strategy="greedy"` |
| Want speed > efficiency | `packing_strategy="greedy"` |
| Want efficiency > speed | `packing_strategy="best_fit"` + `shuffle_output=True` |

Quick Start

Pretraining (Most Common)

[map_function]
.define: &map_function !partial:forgather.ml.datasets:block_tokenize_fn
    max_length: 4096
    overflow: False
    packed: True
    packing_strategy: "best_fit"
    shuffle_output: True
    min_len: 64
    add_bos: True
    add_eos: True

Finetuning

[map_function]
.define: &map_function !partial:forgather.ml.datasets:block_tokenize_fn
    max_length: 2048
    overflow: True
    packed: True
    packing_strategy: "greedy"
    min_len: 32
    add_bos: True
    add_eos: True

Parameter Cheat Sheet

| Parameter | Values | Default | When to Change |
| --- | --- | --- | --- |
| `max_length` | 512-131072 | 512 | Match your model's context window |
| `overflow` | True/False | True | Set False to truncate long docs (pretraining) |
| `packed` | True/False | False | Set True to enable packing |
| `packing_strategy` | "greedy"/"best_fit"/"first_fit" | "greedy" | Use "best_fit" with overflow=False |
| `shuffle_output` | True/False | False | Set True with best_fit/first_fit |
| `seed` | int or None | None | Set during dev for reproducibility |
| `min_len` | 1-max_length | 1 | Set to ~max_length/64 to filter short seqs |
| `add_bos` | True/False | True | Usually keep True |
| `add_eos` | True/False | True | Usually keep True |
| `stride` | 0-512 | 0 | Set 64-256 for document overlap |
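
To make `stride` concrete, here is a minimal sketch of overlapping chunking for a document longer than `max_length`. It assumes `stride` is the number of tokens shared between consecutive chunks (the convention used by Hugging Face tokenizers). `chunk_with_stride` is a hypothetical helper written for illustration, not part of `block_tokenize_fn`, whose actual behavior may differ.

```python
def chunk_with_stride(tokens, max_length, stride=0):
    """Split tokens into chunks of at most max_length that overlap by stride tokens.

    Illustrative only: assumes stride means "tokens shared by adjacent chunks".
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not tokens:
        return []
    step = max_length - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # this chunk reached the end of the document
        start += step
    return chunks


doc = list(range(10))  # stand-in for ten token IDs
chunks = chunk_with_stride(doc, max_length=4, stride=1)
print(chunks)  # each chunk starts with the last token of the previous one
```

With `stride=0` the chunks are disjoint; a larger `stride` repeats more context at the start of each chunk at the cost of more total tokens.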

Performance Impact

Best Fit vs Greedy (overflow=False, 200 docs, max_length=128)

| Metric | Greedy | Best Fit | Improvement |
| --- | --- | --- | --- |
| Output blocks | 200 | 100 | 50% fewer |
| Avg utilization | 49.8% | 98.4% | +48.6 pp |
| Speed | Fastest | Fast | Minimal difference |

Best Fit vs Greedy (overflow=True, 500 docs, max_length=4096)

| Metric | Greedy | Best Fit | Improvement |
| --- | --- | --- | --- |
| Output blocks | 12 | 12 | None |
| Avg utilization | 95.4% | 95.4% | None |
| Speed | Fastest | Fast | Minimal difference |

Takeaway: best_fit pays off when overflow=False. With overflow=True, greedy already reaches about 95% utilization in this benchmark, so best_fit adds nothing.
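
The toy comparison below may help build intuition for why the strategy matters when documents are truncated rather than split. It uses textbook next-fit and best-fit bin packing as stand-ins for `"greedy"` and `"best_fit"`; this is an assumption, and the library's actual implementations may differ in detail. The function names and the four document lengths are made up for illustration.

```python
def pack_greedy(lengths, max_length):
    """Next-fit: fill the current block, open a new one when the next doc does not fit."""
    blocks = []  # tokens used in each output block
    for n in lengths:
        n = min(n, max_length)  # overflow=False: truncate long documents
        if blocks and blocks[-1] + n <= max_length:
            blocks[-1] += n
        else:
            blocks.append(n)
    return blocks


def pack_best_fit(lengths, max_length):
    """Best-fit: put each doc in the fullest existing block that still has room."""
    blocks = []
    for n in lengths:
        n = min(n, max_length)
        candidates = [i for i, used in enumerate(blocks) if used + n <= max_length]
        if candidates:
            best = max(candidates, key=lambda i: blocks[i])  # least room left over
            blocks[best] += n
        else:
            blocks.append(n)
    return blocks


def utilization(blocks, max_length):
    """Percentage of block capacity filled with real tokens."""
    return 100.0 * sum(blocks) / (len(blocks) * max_length)


lengths = [100, 90, 28, 38]  # hypothetical document lengths in tokens
greedy = pack_greedy(lengths, max_length=128)
best = pack_best_fit(lengths, max_length=128)
print(f"greedy:   {len(greedy)} blocks, {utilization(greedy, 128):.1f}% utilization")
print(f"best_fit: {len(best)} blocks, {utilization(best, 128):.1f}% utilization")
```

In this example next-fit strands the space left in earlier blocks, while best-fit goes back and fills it, which is the effect the first benchmark table reports at larger scale.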

Common Mistakes

❌ Don't Do This

# Small batch size - loses data at batch boundaries
map_fn: *map_function
batched: True
batch_size: 10  # BAD!

# Using best_fit without shuffle - causes training bias
packing_strategy: "best_fit"
shuffle_output: False  # BAD!

# Optimized strategy without packing enabled
packed: False
packing_strategy: "best_fit"  # Ignored! Needs packed=True

✅ Do This

# Large batch size - minimal data loss at batch boundaries
map_fn: *map_function
batched: True
batch_size: 1000  # GOOD!

# Shuffling with optimized strategies
packing_strategy: "best_fit"
shuffle_output: True  # GOOD!

# Enable packing with strategy
packed: True
packing_strategy: "best_fit"  # Works!
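
The three pitfalls above can be turned into a small pre-flight check. This is only a sketch: `check_packing_config` is a hypothetical helper, the defaults are taken from the parameter cheat sheet, and the `min_batch_size=100` threshold is an arbitrary value chosen for illustration rather than a documented limit.

```python
def check_packing_config(map_args, batch_size, min_batch_size=100):
    """Return warnings for the common packing mistakes listed above.

    map_args: dict of keyword arguments intended for block_tokenize_fn.
    min_batch_size: illustrative threshold, not a library constant.
    """
    warnings = []
    packed = map_args.get("packed", False)
    strategy = map_args.get("packing_strategy", "greedy")
    shuffle = map_args.get("shuffle_output", False)

    if strategy != "greedy" and not packed:
        warnings.append("packing_strategy is ignored unless packed=True")
    if packed and strategy in ("best_fit", "first_fit") and not shuffle:
        warnings.append("set shuffle_output=True with best_fit/first_fit")
    if packed and batch_size < min_batch_size:
        warnings.append("small batch_size loses data at batch boundaries")
    return warnings


bad = {"packed": True, "packing_strategy": "best_fit", "shuffle_output": False}
good = {"packed": True, "packing_strategy": "best_fit", "shuffle_output": True}
for problem in check_packing_config(bad, batch_size=10):
    print("WARNING:", problem)
print(check_packing_config(good, batch_size=1000))  # no warnings expected
```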

Decision Tree

Start: Do you need sequence packing?
│
├─ No  → Use standard tokenization
│
└─ Yes → Are you truncating long documents (overflow=False)?
    │
    ├─ Yes → Use packing_strategy="best_fit", shuffle_output=True
    │        (in the 200-doc benchmark: 50% fewer blocks, ~98% utilization)
    │
    └─ No (overflow=True) → Are documents highly variable in length?
        │
        ├─ Yes → Use packing_strategy="first_fit", shuffle_output=True
        │        (Good balance of speed and efficiency)
        │
        └─ No → Use packing_strategy="greedy"
                 (Fast, simple, already efficient)
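
The same tree can be written as a small function, which is convenient when a script generates configs. `choose_packing` is a hypothetical helper; it only returns the keyword arguments the tree recommends and does not call the library.

```python
def choose_packing(need_packing, overflow=True, highly_variable_lengths=False):
    """Map the decision tree onto suggested block_tokenize_fn arguments."""
    if not need_packing:
        return {"packed": False}  # standard tokenization
    if not overflow:
        # Truncating long documents: best_fit gives the largest gain
        return {
            "packed": True,
            "overflow": False,
            "packing_strategy": "best_fit",
            "shuffle_output": True,
        }
    if highly_variable_lengths:
        return {
            "packed": True,
            "overflow": True,
            "packing_strategy": "first_fit",
            "shuffle_output": True,
        }
    # Consistent lengths with overflow=True: greedy is already efficient
    return {"packed": True, "overflow": True, "packing_strategy": "greedy"}


print(choose_packing(need_packing=True, overflow=False))
print(choose_packing(need_packing=True, overflow=True, highly_variable_lengths=True))
```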

Testing Your Config

After configuring packing, verify it's working:

# Check utilization
# Assumes `dataset` is the packed split; set max_length to the value in your config
max_length = 4096
lengths = [len(seq) for seq in dataset["input_ids"]]
avg = sum(lengths) / len(lengths)
utilization = (avg / max_length) * 100
print(f"Utilization: {utilization:.1f}%")

# Target: >90% for good efficiency

# Check sequence count
# original_dataset_size is the number of documents before packing
print(f"Output sequences: {len(dataset)}")
print(f"Input documents: {original_dataset_size}")
print(f"Packing ratio: {original_dataset_size / len(dataset):.1f}x")

# Good packing ratio: 2-10x depending on document lengths

Complete Example

# Pretraining configuration with optimized packing
[tokenizer]
.define: &tokenizer !singleton:transformers:AutoTokenizer.from_pretrained
    pretrained_model_name_or_path: "meta-llama/Llama-3.2-1B"

[map_function]
.define: &map_function !partial:forgather.ml.datasets:block_tokenize_fn
    tokenizer: *tokenizer
    feature: "text"
    max_length: 4096
    overflow: False           # Truncate long docs
    packed: True              # Enable packing
    packing_strategy: "best_fit"  # Optimal for overflow=False
    shuffle_output: True      # Randomize output order
    seed: null                # Random (use 42 for reproducible dev)
    stride: 0                 # No overlap
    min_len: 64               # Filter very short sequences
    add_bos: True             # Add BOS tokens
    add_eos: True             # Add EOS tokens

[train_dataset]
    == super()
    map_fn: *map_function
    batched: True
    batch_size: 1000          # Large batch critical!
    num_proc: 4               # Parallel processing
    remove_columns: ["text"]

[data_collator]
.define: &data_collator !singleton:forgather.ml.data_collator:DataCollatorForCausalLM
    tokenizer: *tokenizer
    packed_sequences: True    # Generate position IDs
    padding: "longest"
    max_length: 4096
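
The `packed_sequences: True` comment mentions position IDs. As a rough illustration of what per-document position IDs look like in a packed block, the sketch below restarts the count at every BOS token. This is an assumption about the general technique, not a description of how `DataCollatorForCausalLM` computes them; `position_ids_from_bos` and the token IDs are hypothetical.

```python
def position_ids_from_bos(input_ids, bos_token_id):
    """Restart position IDs at 0 whenever a BOS token marks a new document."""
    position_ids = []
    pos = 0
    for tok in input_ids:
        if tok == bos_token_id:
            pos = 0  # a new document starts here
        position_ids.append(pos)
        pos += 1
    return position_ids


# Three documents packed into one block; token ID 1 plays the role of BOS
packed = [1, 11, 12, 13, 1, 21, 22, 1, 31]
print(position_ids_from_bos(packed, bos_token_id=1))
```

Restarting positions this way lets each packed document see the same positional signal it would have had as a standalone sequence, which is why the config above keeps `add_bos: True`.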

One-Liners

# Inspect the raw documents: count and average length in characters (adjust the split name as needed)
python -c "from datasets import load_dataset; ds=load_dataset('your_dataset', split='train'); print(f'Docs: {len(ds)}, Avg length: {sum(len(x) for x in ds[\"text\"])/len(ds):.1f}')"

# Quick benchmark
python tests/test_packing_comparison.py

# Check packed dataset (4096 is max_length; adjust it and the split name to match your config)
python -c "from datasets import load_dataset; ds=load_dataset('path/to/packed', split='train'); lengths=[len(x) for x in ds['input_ids']]; print(f'Utilization: {sum(lengths)/len(lengths)/4096*100:.1f}%')"
