Skip to content

Getting started

CuratorKIT builds post-training datasets as a gated pipeline: ingest from any source, clean and deduplicate, generate synthetic data with an LLM, verify every generated sample against its source, recover salvageable rejects, and export in the format your trainer expects.

Install

pip install "curatorkit[all]"        # connectors + generation + embedding + hygiene
pip install curatorkit               # core only: cleaning and dedup

Requires Python 3.11+. The installation page maps every extra to what it unlocks.

First run: clean a dataset

No LLM or API key needed. Reads any supported format, deduplicates, cleans, exports:

from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset        = {"name": "tatsu-lab/alpaca", "max_samples": 2000},
    dedup          = "minhash",
    clean          = True,
    export_formats = ["alpaca", "sharegpt"],
    output_dir     = "output/clean",
)).run()

result.print_summary()

HuggingFace Hub sources need the connectors extra (included in all); local JSONL, JSON, and CSV files run on the core install.

Second run: generate gated synthetic data

Set an API key (any LiteLLM provider, or point at local Ollama/vLLM), then:

result = Curator(CuratorConfig(
    dataset                 = "handbook.pdf",      # needs the [pdf] extra
    llm_model               = "openai/gpt-4o-mini",
    generation_task         = "qa",
    hallucination_threshold = 0.7,                 # verify answers against the source
    reward_threshold        = 0.7,                 # LLM-judge quality gate
    export_formats          = ["alpaca"],
    output_dir              = "output/qa",
)).run()

The quickstart extends this with adaptive recovery and the CLI.

What a run produces

Every run writes the export files plus four provenance artifacts:

output/
  sft_alpaca.jsonl     exported training data (one file per requested format)
  manifest.json        config hash, per-stage counts, rejection breakdown
  rejected.jsonl       every rejected sample with a structured reason
  dataset_card.md      human-readable run summary
  checksums.txt        SHA-256 for all output files

result.passed, result.rejected, and result.stage_counts expose the same information in code. Reading the output walks through each file.

Go deeper

Each pipeline stage in depth Guides
Every CuratorConfig parameter Configuration reference
YAML pipelines and CLI flags CLI reference
Runnable notebooks Tutorials
Classes and functions API reference