Sequence Packing Quick Reference
TL;DR - Which Strategy?
| Your Scenario | Use This |
|---|---|
| Truncating long documents (overflow=False) | packing_strategy="best_fit" + shuffle_output=True |
| Variable-length documents, need all content | packing_strategy="greedy" or "first_fit" |
| Consistent document lengths | packing_strategy="greedy" |
| Very long sequences (>16k tokens) | packing_strategy="greedy" |
| Want speed > efficiency | packing_strategy="greedy" |
| Want efficiency > speed | packing_strategy="best_fit" + shuffle_output=True |
Quick Start
Pretraining (Most Common)
[map_function]
.define: &map_function !partial:forgather.ml.datasets:block_tokenize_fn
    max_length: 4096
    overflow: False
    packed: True
    packing_strategy: "best_fit"
    shuffle_output: True
    min_len: 64
    add_bos: True
    add_eos: True
Finetuning
[map_function]
.define: &map_function !partial:forgather.ml.datasets:block_tokenize_fn
    max_length: 2048
    overflow: True
    packed: True
    packing_strategy: "greedy"
    min_len: 32
    add_bos: True
    add_eos: True
Parameter Cheat Sheet
| Parameter | Values | Default | When to Change |
|---|---|---|---|
| max_length | 512-131072 | 512 | Match your model's context window |
| overflow | True/False | True | Set False to truncate long docs (pretraining) |
| packed | True/False | False | Set True to enable packing |
| packing_strategy | "greedy"/"best_fit"/"first_fit" | "greedy" | Use "best_fit" with overflow=False |
| shuffle_output | True/False | False | Set True with best_fit/first_fit |
| seed | int or None | None | Set during dev for reproducibility |
| min_len | 1-max_length | 1 | Set to ~max_length/64 to filter short seqs |
| add_bos | True/False | True | Usually keep True |
| add_eos | True/False | True | Usually keep True |
| stride | 0-512 | 0 | Set 64-256 for document overlap |
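
The `overflow` and `stride` rows interact: with overflow=True, a document longer than max_length is split into several blocks, and a non-zero stride makes consecutive blocks overlap. The sketch below assumes stride is the number of tokens repeated from the end of one block at the start of the next, which is the convention Hugging Face tokenizers use. `split_with_stride` is a hypothetical helper for illustration, not part of `block_tokenize_fn`; check the source for its exact behavior.

```python
def split_with_stride(tokens, max_length, stride=0):
    """Split tokens into blocks of at most max_length that overlap by stride tokens.

    Illustrative only: assumes stride means "tokens shared between neighbours".
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not tokens:
        return []
    step = max_length - stride  # how far each new block advances
    blocks = []
    start = 0
    while True:
        blocks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the last block reached the end of the document
        start += step
    return blocks


doc = list(range(10))  # stand-in for 10 token ids
print(split_with_stride(doc, max_length=4, stride=0))
print(split_with_stride(doc, max_length=4, stride=2))
```

A larger stride preserves more context across block boundaries but produces more blocks for the same document, so it trades training throughput for continuity.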
Best Fit vs Greedy (overflow=False, 200 docs, max_length=128)
| Metric | Greedy | Best Fit | Improvement |
|---|---|---|---|
| Output blocks | 200 | 100 | 50% fewer |
| Avg utilization | 49.8% | 98.4% | +48.6 pp |
| Speed | Fastest | Fast | Minimal difference |
Best Fit vs Greedy (overflow=True, 500 docs, max_length=4096)
| Metric | Greedy | Best Fit | Improvement |
|---|---|---|---|
| Output blocks | 12 | 12 | None |
| Avg utilization | 95.4% | 95.4% | None |
| Speed | Fastest | Fast | Minimal difference |
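Takeaway: best_fit pays off with overflow=False, where it halved the block count and roughly doubled utilization in this benchmark. With overflow=True both strategies produced identical output, so greedy is sufficient there.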
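
To see why the two strategies can diverge when documents are kept whole, here is a toy comparison. It is a sketch of the general greedy and best-fit ideas, not Forgather's implementation: it works on document lengths only, ignores BOS/EOS tokens, and assumes every document already fits in one block (as after truncation with overflow=False). The function names and the example lengths are made up for illustration.

```python
def pack_greedy(lengths, max_length):
    """Fill one block at a time in input order; open a new block when a doc does not fit."""
    blocks, current = [], []
    for n in lengths:
        if current and sum(current) + n > max_length:
            blocks.append(current)
            current = []
        current.append(n)
    if current:
        blocks.append(current)
    return blocks


def pack_best_fit(lengths, max_length):
    """Put each doc into the fullest existing block that still has room for it."""
    blocks = []
    for n in lengths:
        candidates = [b for b in blocks if sum(b) + n <= max_length]
        if candidates:
            # Tightest fit: the candidate block that is already the fullest.
            max(candidates, key=sum).append(n)
        else:
            blocks.append([n])
    return blocks


def utilization(blocks, max_length):
    """Percentage of total block capacity occupied by document tokens."""
    return 100 * sum(map(sum, blocks)) / (len(blocks) * max_length)


lengths = [70, 60, 58, 68]  # hypothetical document lengths in tokens
for name, pack in [("greedy", pack_greedy), ("best_fit", pack_best_fit)]:
    blocks = pack(lengths, max_length=128)
    print(f"{name}: {len(blocks)} blocks, {utilization(blocks, 128):.1f}% utilization")
```

Greedy only ever considers the block it is currently filling, so a document that does not fit closes that block for good. Best fit can return to earlier blocks, which is also why its output order no longer follows the input order and shuffling is recommended.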
Common Mistakes
❌ Don't Do This
# Small batch size - loses data at boundaries
map_fn: *tokenizer
batched: True
batch_size: 10 # BAD!

# Using best_fit without shuffle - causes training bias
packing_strategy: "best_fit"
shuffle_output: False # BAD!

# Optimized strategy without packing enabled
packed: False
packing_strategy: "best_fit" # Ignored! Needs packed=True
✅ Do This
# Large batch size - minimal data loss
map_fn: *tokenizer
batched: True
batch_size: 1000 # GOOD!

# Shuffling with optimized strategies
packing_strategy: "best_fit"
shuffle_output: True # GOOD!

# Enable packing with strategy
packed: True
packing_strategy: "best_fit" # Works!
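
The three mistakes above are mechanical enough to check automatically. The following is a minimal sketch of such a check over a plain dict of settings; `check_packing_config` is a hypothetical helper, not part of Forgather, the defaults mirror the Parameter Cheat Sheet, and the `min_batch_size=1000` threshold is an assumption taken from the "good" example rather than a documented limit.

```python
def check_packing_config(cfg, batch_size=None, min_batch_size=1000):
    """Return warnings for the common packing mistakes; an empty list means none found.

    cfg is a dict of block_tokenize_fn-style settings. min_batch_size is an
    assumed threshold, not a value defined by the library.
    """
    warnings = []
    packed = cfg.get("packed", False)
    strategy = cfg.get("packing_strategy", "greedy")
    shuffled = cfg.get("shuffle_output", False)

    if not packed and strategy != "greedy":
        warnings.append(f'packing_strategy="{strategy}" is ignored unless packed=True')
    if packed and strategy in ("best_fit", "first_fit") and not shuffled:
        warnings.append(f"set shuffle_output=True when using {strategy}")
    if batch_size is not None and batch_size < min_batch_size:
        warnings.append(f"batch_size={batch_size} is small; data is lost at batch boundaries")
    return warnings


bad = {"packed": True, "packing_strategy": "best_fit", "shuffle_output": False}
for message in check_packing_config(bad, batch_size=10):
    print("WARNING:", message)
```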
Decision Tree
Start: Do you need sequence packing?
│
├─ No → Use standard tokenization
│
└─ Yes → Are you truncating long documents (overflow=False)?
    │
    ├─ Yes → Use packing_strategy="best_fit", shuffle_output=True
    │        (50% fewer blocks, 98% utilization)
    │
    └─ No (overflow=True) → Are documents highly variable in length?
        │
        ├─ Yes → Use packing_strategy="first_fit", shuffle_output=True
        │        (Good balance of speed and efficiency)
        │
        └─ No → Use packing_strategy="greedy"
                (Fast, simple, already efficient)
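
The same decision tree can be written as a small function, which is handy when the settings are generated by a script. This is only a transcription of the tree above; `choose_packing` and its argument names are hypothetical and not part of Forgather.

```python
def choose_packing(needs_packing, overflow=True, highly_variable=False):
    """Return suggested block_tokenize_fn settings, following the decision tree."""
    if not needs_packing:
        return {"packed": False}  # standard tokenization
    if not overflow:
        # Truncating long documents: best_fit gives the largest gain.
        return {"packed": True, "overflow": False,
                "packing_strategy": "best_fit", "shuffle_output": True}
    if highly_variable:
        return {"packed": True, "overflow": True,
                "packing_strategy": "first_fit", "shuffle_output": True}
    return {"packed": True, "overflow": True, "packing_strategy": "greedy"}


print(choose_packing(needs_packing=True, overflow=False))
print(choose_packing(needs_packing=True, overflow=True, highly_variable=True))
```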
Testing Your Config
After configuring packing, verify it's working:
# Assumes `dataset` is the packed dataset and `original_dataset_size`
# is the number of documents before packing.
max_length = 4096  # must match max_length in your config
# Check utilization
lengths = [len(seq) for seq in dataset["input_ids"]]
avg = sum(lengths) / len(lengths)
utilization = (avg / max_length) * 100
print(f"Utilization: {utilization:.1f}%")
# Target: >90% for good efficiency
# Check sequence count
print(f"Output sequences: {len(dataset)}")
print(f"Input documents: {original_dataset_size}")
print(f"Packing ratio: {original_dataset_size / len(dataset):.1f}x")
# Good packing ratio: 2-10x depending on document lengths
Complete Example
# Pretraining configuration with optimized packing
[tokenizer]
.define: &tokenizer !singleton:transformers:AutoTokenizer.from_pretrained
    pretrained_model_name_or_path: "meta-llama/Llama-3.2-1B"
[map_function]
.define: &map_function !partial:forgather.ml.datasets:block_tokenize_fn
    tokenizer: *tokenizer
    feature: "text"
    max_length: 4096
    overflow: False # Truncate long docs
    packed: True # Enable packing
    packing_strategy: "best_fit" # Optimal for overflow=False
    shuffle_output: True # Randomize output order
    seed: null # Random (use 42 for reproducible dev)
    stride: 0 # No overlap
    min_len: 64 # Filter very short sequences
    add_bos: True # Add BOS tokens
    add_eos: True # Add EOS tokens
[train_dataset]
== super()
map_fn: *map_function
batched: True
batch_size: 1000 # Large batch critical!
num_proc: 4 # Parallel processing
remove_columns: ["text"]
[data_collator]
.define: &data_collator !singleton:forgather.ml.data_collator:DataCollatorForCausalLM
    tokenizer: *tokenizer
    packed_sequences: True # Generate position IDs
    padding: "longest"
    max_length: 4096
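
The `packed_sequences: True` setting is commented as generating position IDs. One common convention for packed blocks is to restart positions at 0 for every document, so that each document sees the same positions it would have seen unpacked. The sketch below illustrates that convention under the assumption that every document starts with a BOS token (add_bos=True) and that BOS marks the boundary. How `DataCollatorForCausalLM` actually detects boundaries is not stated here, and both `packed_position_ids` and the token ids used are hypothetical.

```python
def packed_position_ids(input_ids, bos_token_id):
    """Return per-token positions that restart at 0 on every BOS token.

    Illustrative only: assumes BOS marks the start of each packed document.
    """
    position_ids = []
    pos = 0
    for tok in input_ids:
        if tok == bos_token_id:
            pos = 0  # a new document begins here
        position_ids.append(pos)
        pos += 1
    return position_ids


# Two packed documents; 1 stands in for BOS and 2 for EOS (arbitrary ids).
block = [1, 5, 6, 2, 1, 7, 2]
print(packed_position_ids(block, bos_token_id=1))
```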
One-Liners
# Inspect raw documents (count and average length in characters, not tokens)
python -c "from datasets import load_dataset; ds=load_dataset('your_dataset', split='train'); print(f'Docs: {len(ds)}, Avg chars: {sum(len(x) for x in ds[\"text\"])/len(ds):.0f}')"
# Quick benchmark
python tests/test_packing_comparison.py
# Check packed dataset (assumes a 'train' split and max_length=4096)
python -c "from datasets import load_dataset; ds=load_dataset('path/to/packed', split='train'); lengths=[len(x) for x in ds['input_ids']]; print(f'Utilization: {sum(lengths)/len(lengths)/4096*100:.1f}%')"
Further Reading
- Full guide: docs/datasets/sequence-packing.md
- Source code: src/forgather/ml/datasets/block_tokenizer.py
- Tests: tests/test_packing_comparison.py
- HF Blog: https://huggingface.co/blog/sirluk/llm-sequence-packing