Explicit Document Boundary Tracking¶

Overview¶

Forgather's document packing system now supports explicit document boundary tracking, which allows packed sequences to work with any tokenizer configuration, including tokenizers that lack a BOS token or reuse EOS as PAD (e.g., Qwen3).

The Problem¶

Traditional Approach (Token-Based)¶

Previously, the system used special tokens (BOS/EOS) to mark document boundaries:

# Document packing with BOS/EOS markers
[BOS, doc1_tok1, doc1_tok2, EOS, BOS, doc2_tok1, doc2_tok2, EOS, ...]

The data collator would detect these special tokens and reset position IDs at each boundary, ensuring correct attention masking.
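As a rough illustration of token-based detection, the sketch below resets a position counter after every EOS token. The function name `position_ids_from_eos` and the token IDs are hypothetical; this is not Forgather's implementation.

```python
def position_ids_from_eos(input_ids, eos_token_id):
    """Reset the position counter after every EOS token."""
    position_ids = []
    pos = 0
    for tok in input_ids:
        position_ids.append(pos)
        # The token after an EOS starts a new document at position 0.
        pos = 0 if tok == eos_token_id else pos + 1
    return position_ids


EOS = 2  # hypothetical EOS token ID
packed = [11, 12, EOS, 21, 22, 23, EOS]
print(position_ids_from_eos(packed, EOS))  # [0, 1, 2, 0, 1, 2, 3]
```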

Issues with Token-Based Approach¶

Missing Tokens: Models like Qwen3 don't define BOS tokens
Token Aliasing: Some tokenizers alias PAD==EOS (e.g., Qwen3's <|endoftext|>), causing every padding token to be treated as a document boundary
Semantic Mismatch: Tokens like <|im_end|> mark message boundaries, not document boundaries
Broken Conversations: In chat datasets, each message would be treated as a separate document, preventing cross-message attention
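The aliasing problem can be reproduced in a few lines. This is a sketch with a made-up token ID shared by EOS and PAD; `detect_boundaries` is an illustrative helper, not part of Forgather.

```python
def detect_boundaries(input_ids, eos_token_id):
    """Return the indices a token-based detector treats as document ends."""
    return [i for i, tok in enumerate(input_ids) if tok == eos_token_id]


EOS = PAD = 0  # hypothetical ID shared by EOS and PAD
# One real document followed by three padding tokens.
padded = [11, 12, 13, EOS, PAD, PAD, PAD]
print(detect_boundaries(padded, EOS))  # [3, 4, 5, 6]
```

Only index 3 is a real document end; the other three hits are padding tokens mistaken for boundaries.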

The Solution¶

Explicit Boundary Metadata¶

The new system stores document start positions as metadata alongside input IDs:

{
    "input_ids": [tok1, tok2, tok3, tok4, tok5, tok6, tok7, tok8],
    "document_starts": [0, 4]  # Doc1 starts at 0, Doc2 starts at 4
}

This approach:

- Works with any tokenizer (no special tokens required)
- Is space-efficient (stores 2-5 integers instead of a full position_ids array)
- Has clear semantics (explicit boundaries, not inferred)
- Maintains backward compatibility (falls back to token-based detection)
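To show how little the metadata needs to carry, the sketch below recovers the individual documents from a packed example using only the start offsets. `split_documents` is a hypothetical helper, and the token IDs are placeholders.

```python
def split_documents(input_ids, document_starts):
    """Slice a packed sequence into its documents using explicit start offsets."""
    # Each document ends where the next one starts; the last ends with the sequence.
    ends = list(document_starts[1:]) + [len(input_ids)]
    return [input_ids[s:e] for s, e in zip(document_starts, ends)]


example = {
    "input_ids": [101, 102, 103, 104, 201, 202, 203, 204],
    "document_starts": [0, 4],
}
docs = split_documents(example["input_ids"], example["document_starts"])
print(docs)  # [[101, 102, 103, 104], [201, 202, 203, 204]]
```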

Usage¶

Basic Example¶

from transformers import AutoTokenizer
from forgather.ml.datasets.block_tokenizer import block_tokenize_fn
from forgather.ml.data_collator import DataCollatorForCausalLM

# Load Qwen3 tokenizer (no BOS token)
tokenizer = AutoTokenizer.from_pretrained("qwen3-1.7b-base")

# Pack documents
features = {
    "text": [
        "Document 1 text...",
        "Document 2 text...",
        "Document 3 text...",
    ]
}

result = block_tokenize_fn(
    features=features,
    tokenizer=tokenizer,
    feature="text",
    max_length=512,
    packed=True,
    packing_strategy="greedy",  # or "best_fit", "first_fit"
    add_eos=True,
)

# Result contains both input_ids and document_starts
print(result["input_ids"])        # Packed token sequences
print(result["document_starts"])  # Document boundary positions

Training with Packed Sequences¶

from torch.utils.data import DataLoader
from forgather.ml.data_collator import DataCollatorForCausalLM

# Create collator - packed sequences are auto-detected from document_starts!
collator = DataCollatorForCausalLM(
    tokenizer=tokenizer,
    # packed_sequences parameter is optional - auto-detects from document_starts field
    padding="longest",
    return_tensors="pt",
)

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=4,
    collate_fn=collator,
)

# Training loop
for batch in dataloader:
    # batch contains:
    # - input_ids: Token IDs
    # - labels: Labels for causal LM
    # - position_ids: Position IDs that reset at document boundaries

    outputs = model(
        input_ids=batch["input_ids"],
        position_ids=batch["position_ids"],  # Critical for proper RoPE embeddings
        labels=batch["labels"],
    )

Auto-Detection¶

The DataCollatorForCausalLM automatically detects when to generate position IDs based on the presence of the document_starts field:

Behavior:

- packed_sequences=None (default): Auto-detects from the document_starts field
- packed_sequences=True: Always generates position IDs (uses token-based detection if document_starts is missing)
- packed_sequences=False: Never generates position IDs (explicit disable)

Examples:

# Auto-detection (recommended)
collator = DataCollatorForCausalLM(tokenizer=tokenizer)
# → Generates position_ids if document_starts present, otherwise doesn't

# Explicit enable
collator = DataCollatorForCausalLM(tokenizer=tokenizer, packed_sequences=True)
# → Always generates position_ids (falls back to token-based if needed)

# Explicit disable
collator = DataCollatorForCausalLM(tokenizer=tokenizer, packed_sequences=False)
# → Never generates position_ids, even if document_starts present

This means the default needs no configuration: the collator generates position IDs when the data is packed and skips them otherwise.
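The three settings can be summarized as a small decision function. This is a sketch of the documented behavior only; `should_generate_position_ids` is a hypothetical name, and it assumes the collator receives a list of per-example dicts.

```python
def should_generate_position_ids(packed_sequences, features):
    """Tri-state switch: None auto-detects, True/False force the decision."""
    if packed_sequences is None:
        # Auto-detect: generate only when an example carries document_starts.
        return any("document_starts" in f for f in features)
    return bool(packed_sequences)


packed = [{"input_ids": [1, 2, 3], "document_starts": [0, 2]}]
plain = [{"input_ids": [1, 2, 3]}]

print(should_generate_position_ids(None, packed))   # True
print(should_generate_position_ids(None, plain))    # False
print(should_generate_position_ids(True, plain))    # True
print(should_generate_position_ids(False, packed))  # False
```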

How It Works¶

1. Tokenization Phase¶

The block_tokenize_fn tracks where each document starts as it packs them:

# Internal tracking during packing
bin.documents = [
    (doc_idx=0, tokens_added=50),   # First doc: 50 tokens
    (doc_idx=1, tokens_added=30),   # Second doc: 30 tokens
    (doc_idx=2, tokens_added=20),   # Third doc: 20 tokens
]

# Generates document_starts
document_starts = [0, 50, 80]  # Start positions
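The start positions are an exclusive prefix sum of the document lengths. A minimal sketch, where `starts_from_lengths` is an illustrative helper rather than the packer's actual code:

```python
from itertools import accumulate


def starts_from_lengths(doc_lengths):
    """Each document starts where the previous ones end (exclusive prefix sum)."""
    if not doc_lengths:
        return []
    # accumulate gives the running end offsets; drop the last and prepend 0.
    return [0] + list(accumulate(doc_lengths))[:-1]


print(starts_from_lengths([50, 30, 20]))  # [0, 50, 80]
```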

2. Collation Phase¶

The DataCollatorForCausalLM converts boundaries to position IDs:

# Input
input_ids = [tok0, ..., tok49, tok50, ..., tok79, tok80, ..., tok99]
document_starts = [0, 50, 80]

# Generated position IDs
position_ids = [0, 1, ..., 49, 0, 1, ..., 29, 0, 1, ..., 19]
                 ↑              ↑              ↑
            Reset at each document boundary
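The conversion can be sketched in plain Python as follows. `position_ids_from_starts` is a hypothetical helper; the real collator presumably works on tensors and also handles padding, which this sketch ignores.

```python
def position_ids_from_starts(seq_length, document_starts):
    """Count from 0 within each document delimited by document_starts."""
    ends = list(document_starts[1:]) + [seq_length]
    position_ids = []
    for start, end in zip(document_starts, ends):
        position_ids.extend(range(end - start))
    return position_ids


ids = position_ids_from_starts(100, [0, 50, 80])
print(ids[48:52])  # [48, 49, 0, 1]
print(ids[78:82])  # [28, 29, 0, 1]
```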

3. Model Forward Pass¶

Position IDs ensure correct RoPE embeddings and attention masking:

# RoPE uses position IDs to encode positional information
q_rotated = apply_rope(queries, position_ids)
k_rotated = apply_rope(keys, position_ids)

# Attention mask prevents cross-document attention
# (a token may only attend to earlier tokens in its own document)
attention_scores = q_rotated @ k_rotated.T  # With causal mask based on position_ids
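One way a document-aware causal mask could be derived from position IDs is sketched below, assuming every document starts at position 0. `packed_causal_mask` is a hypothetical helper; real models typically build an equivalent mask inside their attention implementation.

```python
def packed_causal_mask(position_ids):
    """mask[i][j] is True when query token i may attend to key token j."""
    doc_ids, doc = [], -1
    for pos in position_ids:
        if pos == 0:
            doc += 1  # position counter reset marks a new document
        doc_ids.append(doc)
    n = len(position_ids)
    # Allowed only if the key is not in the future and is in the same document.
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)] for i in range(n)]


for row in packed_causal_mask([0, 1, 2, 0, 1]):
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# ...x.
# ...xx
```

The mask is block-diagonal: each block is an ordinary causal mask for one document.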

Configuration Examples¶

Forgather Project Configuration¶

-- extends "types/training_script/causal_lm/causal_lm.yaml"

-- block datasets_preprocessor_args
datasets_preprocessor_args: !dict
    max_length: 4096
    packed: True                    # Enable packing
    packing_strategy: "best_fit"   # Optimized bin packing
    overflow: False                 # Truncate long documents
    add_eos: True                   # Add EOS tokens (for compatibility)
    # Note: add_bos will be auto-disabled if tokenizer has no BOS
-- endblock datasets_preprocessor_args

-- block datacollator
datacollator: !partial:forgather.ml.data_collator:DataCollatorForCausalLM
    # packed_sequences: auto-detects from document_starts (no configuration needed!)
    max_length: 4096
    return_tensors: "pt"
-- endblock datacollator

Dataset Inspection¶

# View packed dataset with document counts
forgather -t config.yaml dataset --tokenized --examples 5

# Output:
# ------------- 0 Tokens: 512, Documents: 3, Features: dict_keys(['input_ids', 'document_starts']) -------------
# Document 1 text... Document 2 text... Document 3 text...

Backward Compatibility¶

The system automatically maintains backward compatibility:

New Datasets: Automatically generate document_starts when using block_tokenize_fn
Old Datasets: Fall back to token-based boundary detection if document_starts is missing
Legacy Code: Existing configurations continue to work without changes

# Collator automatically detects available method
if "document_starts" in features:
    # Use explicit boundaries (preferred)
    position_ids = generate_from_boundaries(input_ids, document_starts)
else:
    # Fall back to token-based detection (legacy)
    position_ids = generate_from_tokens(input_ids, eos_token_id)

Performance Considerations¶

Storage Overhead¶

Document Starts: ~2-5 integers per sequence (minimal)
Full Position IDs: ~512-4096 integers per sequence (roughly 100-2000x larger)

Computation¶

Boundary-Based: O(num_docs × seq_length) - faster for few documents
Token-Based: O(seq_length) - faster for many documents per sequence

In practice, the boundary-based approach is faster for typical packing ratios (2-5 documents per sequence).

Testing¶

Comprehensive tests are available in:

- tests/test_document_boundaries.py - Unit tests for all components
- tests/test_qwen3_packing.py - Integration tests with real Qwen3 tokenizer

Run tests:

pytest tests/test_document_boundaries.py -v
pytest tests/test_qwen3_packing.py -v

Troubleshooting¶

Issue: Position IDs not resetting¶

Symptom: Model fails to train, attention spans across documents

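Solution: Check that the dataset includes the document_starts field so auto-detection applies, or set packed_sequences=True explicitly in DataCollatorForCausalLM

collator = DataCollatorForCausalLM(
    tokenizer=tokenizer,
    packed_sequences=True,  # Force position ID generation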
    return_tensors="pt",
)

Issue: Document starts field missing¶

Symptom: KeyError: 'document_starts' when using an old dataset

Solution: Re-tokenize the dataset with the updated block_tokenize_fn, or rely on the token-based fallback

# Option 1: Re-tokenize with new code (recommended)
dataset = dataset.map(
    lambda features: block_tokenize_fn(...),
    batched=True,
)

# Option 2: Collator will automatically fall back to token-based detection
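Before training, a quick sanity check on a few examples can catch a missing or malformed field early. The sketch below is a hypothetical helper, not part of Forgather; the invariants it checks (first start at 0, strictly increasing offsets) are assumptions inferred from the examples in this document.

```python
def validate_document_starts(example):
    """Raise ValueError if document_starts is missing or inconsistent with input_ids."""
    starts = example.get("document_starts")
    if starts is None:
        raise ValueError("document_starts missing: re-tokenize or use the token-based fallback")
    if not starts or starts[0] != 0:
        raise ValueError("first document is expected to start at offset 0")
    if any(b <= a for a, b in zip(starts, starts[1:])):
        raise ValueError("document_starts must be strictly increasing")
    if starts[-1] >= len(example["input_ids"]):
        raise ValueError("a document start lies beyond the end of input_ids")


example = {"input_ids": [1, 2, 3, 4, 5, 6, 7, 8], "document_starts": [0, 4]}
validate_document_starts(example)  # passes silently
```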

Issue: PAD tokens treated as boundaries¶

Symptom: Position IDs reset at every padding token

Solution: This is a sign of PAD==EOS aliasing (like Qwen3's <|endoftext|>). The new system solves this by using explicit boundaries instead of token-based detection.
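To confirm whether a tokenizer is affected, compare its pad_token_id and eos_token_id attributes. The sketch below uses a stand-in object with a made-up ID so it runs without downloading a model; `pad_aliases_eos` is an illustrative helper.

```python
from types import SimpleNamespace


def pad_aliases_eos(tokenizer):
    """True when PAD and EOS share one token ID, which breaks token-based detection."""
    pad_id = tokenizer.pad_token_id
    return pad_id is not None and pad_id == tokenizer.eos_token_id


# Stand-in for a real tokenizer; the ID value is hypothetical.
aliased = SimpleNamespace(pad_token_id=7, eos_token_id=7)
print(pad_aliases_eos(aliased))  # True
```

With a real tokenizer loaded through AutoTokenizer, the same function can be called directly on the tokenizer object.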

References¶

Original sequence packing blog: https://huggingface.co/blog/sirluk/llm-sequence-packing
Forgather block tokenizer: src/forgather/ml/datasets/block_tokenizer.py
Forgather data collator: src/forgather/ml/data_collator.py