OpenAssistant Dataset¶
OpenAssistant is a conversational AI dataset created by the LAION AI community. This implementation provides a high-performance, configurable dataset generator for training chat-based language models with the Forgather framework.
We take the conversation trees from the dataset and dynamically generate examples on-the-fly by performing a random tree-walk, where the branch probabilities are determined by the quality metrics.
Overview¶
The OpenAssistant dataset consists of conversation trees - multi-turn dialogues between users and assistants where each message can have multiple reply branches. Our implementation:
- Randomly samples conversation threads from trees using quality-weighted branching
- Applies chat templates to format conversations for model training
- Supports sequence packing to maximize GPU utilization
- Provides extensive filtering by language, quality, and message attributes
- Guarantees deterministic output with configurable random seeds
- Optimized for performance with lazy dataset creation and custom fingerprinting
Dataset Structure¶
Source Format: JSONL files containing conversation trees - Each tree has a root prompt with nested reply branches - Messages include quality scores, language tags, and metadata - Trees are split deterministically into train/validation/test sets
Output Format: - Basic mode: One conversation per example - Packed mode: Multiple conversations packed into fixed-length sequences
Key Features¶
Quality-Weighted Sampling¶
When traversing conversation trees, replies are sampled using quality scores with temperature-controlled softmax:
- Higher quality messages have higher probability of selection
- branch_temperature parameter controls randomness (1.0 = proportional, > 0 = more uniform probabilities, < 0 = more deterministic)
- Missing quality scores are imputed with median values
Filtering Options¶
Language filtering: Select specific languages (e.g., ['en', 'es', 'de'])
Quality threshold: Minimum quality score for messages
Thread length: Min/max conversation turns (default: 2-7)
Content filtering: Exclude deleted or synthetic messages
Deterministic Generation¶
All randomness uses configurable seeds for reproducibility: - Tree selection and thread generation use deterministic hashing (CRC32) - Identical configurations always produce identical datasets - Each split (train/val/test) gets a unique derived seed
Configuration Examples¶
Basic Configuration¶
The [openassistant.yaml](templatelib/configs/openassistant.yaml) configuration provides single conversations per example:
Usage:
# Dump first five examples from "train" split
forgather -t openassistant.yaml dataset -T path/to/tokenizer \
--target train_dataset_split -n 5
# Generate dataset statistics and sequence length (tokens)
forgather -t openassistant.yaml dataset \
-T path/to/tokenizer --target train_dataset_split -H
# With Mistral 7B tokenizer...
sample size: 1000
min: 16
max: 2140
mean: 297.5119934082031
median: 234.0
std: 251.50820922851562
Packed Configuration¶
The [openassistant_packed.yaml](templatelib/configs/openassistant_packed.yaml) extends the base config to pack multiple conversations into each example:
Usage:
# Dump first three packed examples
forgather -t openassistant_packed.yaml dataset -T /path/to/tokenizer \
--target train_dataset --max-length 2048 -s -n 3
Direct Python Usage¶
Basic Usage¶
# Add src directory to path for imports
# This example assumes we are in the "examples" directory
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from openassistant import OpenAssistantDatasetDict, OpenAssistantConfig
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")
# Create configuration
config = OpenAssistantConfig(
languages=['en'],
min_quality=0.5,
min_thread_length=2,
max_thread_length=7,
exclude_deleted=True,
exclude_synthetic=True,
branch_temperature=1.0,
seed=42,
val_split=10,
test_split=10
)
# Create dataset dict
dataset_dict = OpenAssistantDatasetDict(
tokenizer=tokenizer,
chat_template="", # Use tokenizer's chat template
**config.__dict__
)
# Access splits
train_dataset = dataset_dict['train']
val_dataset = dataset_dict['validation']
test_dataset = dataset_dict['test']
# Iterate through examples
for example in train_dataset:
print(example['text'])
break
Advanced: Custom Chat Template¶
from pathlib import Path
# Define custom template
custom_template = """{% for message in messages %}
<|{{ message['role'] }}|>
{{ message['content'] }}
<|end|>
{% endfor %}"""
# Save template to file
template_path = Path("custom_template.jinja")
template_path.write_text(custom_template)
# Create dataset with custom template
dataset_dict = OpenAssistantDatasetDict(
tokenizer=tokenizer,
chat_template=str(template_path),
languages=['en', 'es'], # Multiple languages
min_quality=0.6, # Higher quality threshold
branch_temperature=0.5, # More deterministic branching
seed=123
)
Advanced: Parsing Conversations¶
To parse the ChatML formatted text back into structured messages:
import re
def parse_conversation(text):
"""Parse ChatML formatted text into messages."""
pattern = r'<\|im_start\|>(user|assistant)\n(.*?)<\|im_end\|>'
matches = re.findall(pattern, text, re.DOTALL)
messages = []
for role, content in matches:
messages.append({
'role': role,
'content': content.strip()
})
return messages
# Create dataset
dataset_dict = OpenAssistantDatasetDict(
tokenizer=None,
chat_template="", # Uses default ChatML template
languages=['en'],
)
# Parse examples
for example in dataset_dict['train']:
messages = parse_conversation(example['text'])
for msg in messages:
print(f"{msg['role']}: {msg['content'][:100]}...")
break
See examples/conversation_parsing.py for a complete working example.
Implementation Details¶
Performance Optimization¶
The implementation uses several optimizations for high performance:
- Custom fingerprinting: Avoids expensive dill pickling by hashing configuration parameters
- Lazy split creation: Datasets are created only when accessed
- Pre-indexed trees: Tree lookups use O(1) indexing instead of linear search
- Streaming datasets: Uses
IterableDatasetfor memory-efficient iteration
Performance metrics: - Dataset initialization: ~4 seconds (loads 13,854 trees) - Split creation: ~0.002 seconds (17,670x faster than naive implementation) - Example generation: <0.001 seconds per example
Tree Database¶
The TreeDatabase class provides efficient access to conversation trees:
Thread Generator¶
The ThreadGenerator class implements deterministic conversation sampling:
Deterministic Hashing¶
Python's built-in hash() is randomized for security. We use CRC32 for fast, deterministic hashing:
import zlib
def deterministic_hash(s: str) -> int:
"""Create deterministic hash using CRC32."""
return zlib.crc32(s.encode()) & 0xFFFFFFFF
Configuration Parameters¶
OpenAssistantConfig¶
| Parameter | Type | Default | Description |
|---|---|---|---|
input_file_path |
str | Auto-detected | Path to trees JSONL file |
cache_dir |
str | None | Cache directory for downloads |
languages |
List[str] | ['en'] |
Languages to include |
min_quality |
float | None | Minimum quality threshold |
min_thread_length |
int | 2 | Minimum conversation turns |
max_thread_length |
int | 7 | Maximum conversation turns |
exclude_deleted |
bool | True | Exclude deleted messages |
exclude_synthetic |
bool | True | Exclude synthetic messages |
branch_temperature |
float | 1.0 | Branching randomness (higher = more random) |
seed |
int | 42 | Random seed for reproducibility |
val_split |
int | 10 | Validation split percentage |
test_split |
int | 10 | Test split percentage |
OpenAssistantDatasetDict¶
| Parameter | Type | Default | Description |
|---|---|---|---|
tokenizer |
PreTrainedTokenizer | None | HuggingFace tokenizer (for template args) |
chat_template |
str | "" | Path to template file or empty for tokenizer's template |
**config_params |
- | - | All OpenAssistantConfig parameters |
Dataset Statistics¶
Source: OpenAssistant 2023-11-05 release Total trees: 13,854 conversation trees Languages: 35+ languages (English is most common) Quality scores: Community-labeled quality ratings Average tree depth: ~3-5 turns Max tree depth: 15+ turns in some cases
Default split distribution (with 10% val, 10% test): - Train: ~11,083 trees (80%) - Validation: ~1,385 trees (10%) - Test: ~1,386 trees (10%)
Common Use Cases¶
Multi-language Training¶
# Train on English, Spanish, and German
forgather -t openassistant.yaml dataset \
--languages en,es,de \
--dataset-length 100000
High-Quality Subset¶
# Use only high-quality conversations
forgather -t openassistant.yaml dataset \
--min-quality 0.7 \
--min-thread-length 3
Experimentation with Small Datasets¶
# Quick testing with small dataset
forgather -t openassistant.yaml dataset \
--dataset-length 1000 \
--seed 42
References¶
- OpenAssistant Project: https://github.com/LAION-AI/Open-Assistant
- Dataset: https://huggingface.co/datasets/OpenAssistant/oasst2
- Paper: "OpenAssistant Conversations - Democratizing Large Language Model Alignment (2023)"
License¶
The OpenAssistant dataset is released under Apache 2.0 license. This implementation is part of the Forgather framework.