Forgather Dataset Definitions
Datasets
- roneneldan - TinyStories: synthetically generated short stories (GPT-3.5/4) using a small vocabulary. The default dataset for quick model validation and the tiny_experiments collection.
- allenai - C4 (Colossal Clean Crawled Corpus): large-scale web text from AllenAI, available in standard and packed variants.
- EleutherAI - Wikitext-103: document-level Wikipedia text from EleutherAI, commonly used as a language modeling benchmark.
- HuggingFaceTB - SmolLM-Corpus: curated high-quality educational and synthetic data designed for training small language models.
- wikipedia - Wikimedia Wikipedia dumps (English, November 2023 snapshot).
- Open-Orca - Open-Orca: augmented FLAN examples distilled through GPT-4/3.5, heavy on chain-of-thought and structured reasoning prompts.
- open-thoughts - OpenThoughts3-1.2M: 1.2M reasoning conversations (math, code, science) annotated by QwQ-32B, in ShareGPT-style multi-turn format.
- OpenAssistant - OpenAssistant: conversational AI dataset from the LAION AI community, constructed from human-annotated conversation trees.
- QuixiAI - Samantha: Eric Hartford's conversational persona dataset. QuixiAI was formerly Cognitive Computations.
- ajibawa-2023 - General-Stories-Collection and related datasets by Feynman Innovations.
- local_dataset - Template project for loading local datasets via datasets.load_from_disk.
For the dataset project CLI reference (targets, inspection commands, histogram generation, and the forgather dataset subcommand), see docs/datasets/dataset-cli.md.