Forgather Dataset Definitions

Datasets

  • roneneldan - TinyStories: short stories synthetically generated by GPT-3.5 and GPT-4 using a small vocabulary. The default dataset for quick model validation and the tiny_experiments collection.
  • allenai - C4 (Colossal Clean Crawled Corpus): large-scale web text from AllenAI, available in standard and packed variants.
  • EleutherAI - WikiText-103: document-level Wikipedia text hosted by EleutherAI, commonly used as a language modeling benchmark.
  • HuggingFaceTB - SmolLM-Corpus: curated high-quality educational and synthetic data designed for training small language models.
  • wikipedia - Wikimedia Wikipedia dumps (English, November 2023 snapshot).
  • Open-Orca - Open-Orca: augmented FLAN examples distilled through GPT-4/3.5, heavy on chain-of-thought and structured reasoning prompts.
  • open-thoughts - OpenThoughts3-1.2M: 1.2M reasoning conversations (math, code, science) annotated by QwQ-32B, in ShareGPT-style multi-turn format.
  • OpenAssistant - OpenAssistant: conversational AI dataset from the LAION AI community, constructed from human-annotated conversation trees.
  • QuixiAI - Samantha: Eric Hartford's conversational persona dataset, published under QuixiAI (formerly Cognitive Computations).
  • ajibawa-2023 - General-Stories-Collection and related datasets by Feynman Innovations.
  • local_dataset - Template project for loading local datasets via datasets.load_from_disk.

For the dataset project CLI reference (targets, inspection commands, histogram generation, and the forgather dataset subcommand), see docs/datasets/dataset-cli.md.