Explicit Document Boundary Tracking¶
Overview¶
Forgather's document packing system now supports explicit document boundary tracking, which lets packed sequences work with any tokenizer configuration, including tokenizers that lack a BOS or EOS token (e.g., Qwen3, which defines no BOS token).
The Problem¶
Traditional Approach (Token-Based)¶
Previously, the system used special tokens (BOS/EOS) to mark document boundaries:
# Document packing with BOS/EOS markers
[BOS, doc1_tok1, doc1_tok2, EOS, BOS, doc2_tok1, doc2_tok2, EOS, ...]
The data collator would detect these special tokens and reset position IDs at each boundary, ensuring correct attention masking.
Issues with Token-Based Approach¶
- Missing Tokens: Models like Qwen3 don't define BOS tokens
- Token Aliasing: Some tokenizers alias PAD==EOS (e.g., Qwen3's <|endoftext|>), causing every padding token to be treated as a document boundary
- Semantic Mismatch: Tokens like <|im_end|> mark message boundaries, not document boundaries
- Broken Conversations: In chat datasets, each message would be treated as a separate document, preventing cross-message attention
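The aliasing problem can be reproduced in a few lines. The following is a minimal sketch, not Forgather's actual collator code: `position_ids_from_eos` is a hypothetical helper and the token IDs are made up for illustration.

```python
def position_ids_from_eos(input_ids, eos_token_id):
    """Token-based detection: restart positions after every EOS token."""
    position_ids = []
    pos = 0
    for tok in input_ids:
        position_ids.append(pos)
        # The token that follows an EOS is assumed to start a new document.
        pos = 0 if tok == eos_token_id else pos + 1
    return position_ids


EOS = PAD = 0  # hypothetical IDs, with PAD aliased to EOS

# Two packed documents followed by three padding tokens.
input_ids = [11, 12, EOS, 21, 22, 23, EOS, PAD, PAD, PAD]
print(position_ids_from_eos(input_ids, EOS))
# [0, 1, 2, 0, 1, 2, 3, 0, 0, 0]
```

Each padding token is indistinguishable from an EOS, so it is treated as its own one-token document.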
The Solution¶
Explicit Boundary Metadata¶
The new system stores document start positions as metadata alongside input IDs:
{
    "input_ids": [tok1, tok2, tok3, tok4, tok5, tok6, tok7, tok8],
    "document_starts": [0, 4]  # Doc1 starts at 0, Doc2 starts at 4
}
This approach:
- Works with any tokenizer (no special tokens required)
- Is space-efficient (stores 2-5 integers instead of a full position_ids array)
- Has clear semantics (boundaries are explicit, not inferred)
- Maintains backward compatibility (falls back to token-based detection)
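Because the boundaries are explicit, the original documents can be recovered from a packed example by slicing. A minimal sketch, assuming `document_starts` is sorted; `split_documents` is a hypothetical helper, not part of Forgather, and the token IDs are placeholders.

```python
def split_documents(input_ids, document_starts):
    """Slice a packed sequence back into per-document token lists."""
    # Each document ends where the next one starts; the last ends at the sequence end.
    ends = list(document_starts[1:]) + [len(input_ids)]
    return [input_ids[start:end] for start, end in zip(document_starts, ends)]


packed = {
    "input_ids": [101, 102, 103, 104, 201, 202, 203, 204],
    "document_starts": [0, 4],
}
docs = split_documents(packed["input_ids"], packed["document_starts"])
print(docs)
# [[101, 102, 103, 104], [201, 202, 203, 204]]
```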
Usage¶
Basic Example¶
from transformers import AutoTokenizer
from forgather.ml.datasets.block_tokenizer import block_tokenize_fn
from forgather.ml.data_collator import DataCollatorForCausalLM

# Load Qwen3 tokenizer (no BOS token)
tokenizer = AutoTokenizer.from_pretrained("qwen3-1.7b-base")

# Pack documents
features = {
    "text": [
        "Document 1 text...",
        "Document 2 text...",
        "Document 3 text...",
    ]
}
result = block_tokenize_fn(
    features=features,
    tokenizer=tokenizer,
    feature="text",
    max_length=512,
    packed=True,
    packing_strategy="greedy",  # or "best_fit", "first_fit"
    add_eos=True,
)

# Result contains both input_ids and document_starts
print(result["input_ids"])        # Packed token sequences
print(result["document_starts"])  # Document boundary positions
Training with Packed Sequences¶
from torch.utils.data import DataLoader
from forgather.ml.data_collator import DataCollatorForCausalLM

# Create collator - packed sequences are auto-detected from document_starts!
collator = DataCollatorForCausalLM(
    tokenizer=tokenizer,
    # packed_sequences parameter is optional - auto-detects from document_starts field
    padding="longest",
    return_tensors="pt",
)

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=4,
    collate_fn=collator,
)

# Training loop
for batch in dataloader:
    # batch contains:
    # - input_ids: Token IDs
    # - labels: Labels for causal LM
    # - position_ids: Position IDs that reset at document boundaries
    outputs = model(
        input_ids=batch["input_ids"],
        position_ids=batch["position_ids"],  # Critical for proper RoPE embeddings
        labels=batch["labels"],
    )
Auto-Detection¶
The DataCollatorForCausalLM automatically detects when to generate position IDs based on the presence of the document_starts field:
Behavior:
- packed_sequences=None (default): Auto-detects from document_starts field
- packed_sequences=True: Always generates position IDs (uses token-based detection if document_starts missing)
- packed_sequences=False: Never generates position IDs (explicit disable)
Examples:
# Auto-detection (recommended)
collator = DataCollatorForCausalLM(tokenizer=tokenizer)
# → Generates position_ids if document_starts present, otherwise doesn't
# Explicit enable
collator = DataCollatorForCausalLM(tokenizer=tokenizer, packed_sequences=True)
# → Always generates position_ids (falls back to token-based if needed)
# Explicit disable
collator = DataCollatorForCausalLM(tokenizer=tokenizer, packed_sequences=False)
# → Never generates position_ids, even if document_starts present
This means you don't need to configure anything - the system automatically works correctly whether you're using packed sequences or not!
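The three modes reduce to a small decision function. The sketch below only mirrors the behavior documented above; `should_generate_position_ids` is a hypothetical name, and checking the field with `any(...)` over the batch is an assumption about how the collator inspects its inputs.

```python
def should_generate_position_ids(packed_sequences, features):
    """Decide whether position IDs are generated for a batch of examples."""
    if packed_sequences is None:
        # Auto-detect: generate only if explicit boundaries are present.
        return any("document_starts" in example for example in features)
    # Explicit True/False overrides auto-detection.
    return packed_sequences


packed = [{"input_ids": [1, 2, 3, 4], "document_starts": [0, 2]}]
plain = [{"input_ids": [1, 2, 3, 4]}]

for mode in (None, True, False):
    print(mode,
          should_generate_position_ids(mode, packed),
          should_generate_position_ids(mode, plain))
# None True False
# True True True
# False False False
```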
How It Works¶
1. Tokenization Phase¶
The block_tokenize_fn tracks where each document starts as it packs them:
# Internal tracking during packing
bin.documents = [
    (doc_idx=0, tokens_added=50),  # First doc: 50 tokens
    (doc_idx=1, tokens_added=30),  # Second doc: 30 tokens
    (doc_idx=2, tokens_added=20),  # Third doc: 20 tokens
]

# Generates document_starts
document_starts = [0, 50, 80]  # Start positions
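The start positions are a running sum of the token counts of the preceding documents in the bin. A minimal sketch of that arithmetic; `starts_from_lengths` is a hypothetical helper, and the real `block_tokenize_fn` may track this differently.

```python
def starts_from_lengths(tokens_added):
    """Turn per-document token counts into document start offsets."""
    starts = []
    offset = 0
    for count in tokens_added:
        starts.append(offset)  # this document begins where the previous one ended
        offset += count
    return starts


print(starts_from_lengths([50, 30, 20]))
# [0, 50, 80]
```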
2. Collation Phase¶
The DataCollatorForCausalLM converts boundaries to position IDs:
# Input
input_ids = [tok0, ..., tok49, tok50, ..., tok79, tok80, ..., tok99]
document_starts = [0, 50, 80]

# Generated position IDs
position_ids = [0, 1, ..., 49, 0, 1, ..., 29, 0, 1, ..., 19]
#               ↑              ↑              ↑
#               Reset at each document boundary
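The conversion can be sketched in plain Python. This is an illustration under the assumption that `document_starts` holds valid indices into the sequence; `position_ids_from_starts` is a hypothetical name, and the collator's real implementation (which returns tensors) may differ.

```python
def position_ids_from_starts(seq_len, document_starts):
    """Build position IDs that restart at 0 at every document start."""
    starts = set(document_starts)
    position_ids = []
    pos = 0
    for i in range(seq_len):
        if i in starts:
            pos = 0  # a new document begins here
        position_ids.append(pos)
        pos += 1
    return position_ids


ids = position_ids_from_starts(100, [0, 50, 80])
print(ids[48:52], ids[78:82], ids[-1])
# [48, 49, 0, 1] [28, 29, 0, 1] 19
```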
3. Model Forward Pass¶
Position IDs ensure correct RoPE embeddings and attention masking:
# RoPE uses position IDs to encode positional information
q_rotated = apply_rope(queries, position_ids)
k_rotated = apply_rope(keys, position_ids)
# Attention mask prevents cross-document attention
# (a token can only attend to earlier tokens in its own document)
attention_scores = q_rotated @ k_rotated.T  # With causal mask based on position_ids
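One way to picture the resulting mask is as a block-diagonal causal matrix: a query may attend to a key only if both tokens belong to the same document and the key is not in the future. The sketch below derives it from document start offsets. Both helper names are hypothetical, and how Forgather's models actually build the mask is not shown here.

```python
def document_ids(seq_len, document_starts):
    """Label every token with the index of the document it belongs to."""
    starts = set(document_starts)
    ids = []
    doc = -1
    for i in range(seq_len):
        if i in starts:
            doc += 1
        ids.append(doc)
    return ids


def packed_causal_mask(seq_len, document_starts):
    """mask[q][k] is True when query q is allowed to attend to key k."""
    doc = document_ids(seq_len, document_starts)
    return [
        [doc[q] == doc[k] and k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = packed_causal_mask(5, [0, 3])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 10000
# 11000
# 11100
# 00010
# 00011
```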
Configuration Examples¶
Forgather Project Configuration¶
-- extends "types/training_script/causal_lm/causal_lm.yaml"

-- block datasets_preprocessor_args
datasets_preprocessor_args: !dict
    max_length: 4096
    packed: True                  # Enable packing
    packing_strategy: "best_fit"  # Optimized bin packing
    overflow: False               # Truncate long documents
    add_eos: True                 # Add EOS tokens (for compatibility)
    # Note: add_bos will be auto-disabled if tokenizer has no BOS
-- endblock datasets_preprocessor_args

-- block datacollator
datacollator: !partial:forgather.ml.data_collator:DataCollatorForCausalLM
    # packed_sequences: auto-detects from document_starts (no configuration needed!)
    max_length: 4096
    return_tensors: "pt"
-- endblock datacollator
Dataset Inspection¶
# View packed dataset with document counts
forgather -t config.yaml dataset --tokenized --examples 5
# Output:
# ------------- 0 Tokens: 512, Documents: 3, Features: dict_keys(['input_ids', 'document_starts']) -------------
# Document 1 text... Document 2 text... Document 3 text...
Backward Compatibility¶
The system automatically maintains backward compatibility:
- New Datasets: Automatically generate document_starts when using block_tokenize_fn
- Old Datasets: Fall back to token-based boundary detection if document_starts is missing
- Legacy Code: Existing configurations continue to work without changes
# Collator automatically detects available method
if "document_starts" in features:
    # Use explicit boundaries (preferred)
    position_ids = generate_from_boundaries(input_ids, document_starts)
else:
    # Fall back to token-based detection (legacy)
    position_ids = generate_from_tokens(input_ids, eos_token_id)
Performance Considerations¶
Storage Overhead¶
- Document Starts: ~2-5 integers per sequence (minimal)
- Full Position IDs: ~512-4096 integers per sequence (roughly 100x to 2000x larger)
Computation¶
- Boundary-Based: O(num_docs × seq_length) - faster for few documents
- Token-Based: O(seq_length) - faster for many documents per sequence
In practice, the boundary-based approach is faster for typical packing ratios (2-5 documents per sequence).
Testing¶
Comprehensive tests are available in:
- tests/test_document_boundaries.py - Unit tests for all components
- tests/test_qwen3_packing.py - Integration tests with real Qwen3 tokenizer
Run tests:
Troubleshooting¶
Issue: Position IDs not resetting¶
Symptom: Model fails to train, attention spans across documents
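Solution: Make sure the dataset provides document_starts, or set packed_sequences=True in DataCollatorForCausalLM. With the default auto-detection, position IDs are only generated when document_starts is present.
collator = DataCollatorForCausalLM(
    tokenizer=tokenizer,
    packed_sequences=True,  # Force position ID generation
    return_tensors="pt",
)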
Issue: Document starts field missing¶
Symptom: KeyError: 'document_starts' when using old dataset
Solution: Re-tokenize dataset with updated block_tokenize_fn, or use token-based fallback
# Option 1: Re-tokenize with new code (recommended)
dataset = dataset.map(
    lambda features: block_tokenize_fn(...),
    batched=True,
)
# Option 2: Collator will automatically fall back to token-based detection
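Before training, it can help to check a few tokenized examples for this field. The following is a hedged sketch of such a check: `check_document_starts` is a hypothetical helper, not a Forgather API, and the rules it enforces (strictly increasing, in-range offsets) are assumptions about what well-formed metadata looks like.

```python
def check_document_starts(example):
    """Return a list of problems found in one tokenized example (empty if OK)."""
    starts = example.get("document_starts")
    if starts is None:
        return ["document_starts missing"]
    problems = []
    seq_len = len(example["input_ids"])
    if any(b <= a for a, b in zip(starts, starts[1:])):
        problems.append("document_starts is not strictly increasing")
    if any(s < 0 or s >= seq_len for s in starts):
        problems.append("document_starts has an offset outside the sequence")
    return problems


print(check_document_starts({"input_ids": [1, 2, 3, 4], "document_starts": [0, 2]}))
# []
print(check_document_starts({"input_ids": [1, 2, 3, 4]}))
# ['document_starts missing']
```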
Issue: PAD tokens treated as boundaries¶
Symptom: Position IDs reset at every padding token
Solution: This is a sign of PAD==EOS aliasing (like Qwen3's <|endoftext|>). The new system solves this by using explicit boundaries instead of token-based detection.
References¶
- Original sequence packing blog: https://huggingface.co/blog/sirluk/llm-sequence-packing
- Forgather block tokenizer: src/forgather/ml/datasets/block_tokenizer.py
- Forgather data collator: src/forgather/ml/data_collator.py