Getting started¶
CuratorKIT builds post-training datasets as a gated pipeline: ingest from any source, clean and deduplicate, generate synthetic data with an LLM, verify every generated sample against its source, recover salvageable rejects, and export in the format your trainer expects.
Install¶
pip install "curatorkit[all]" # connectors + generation + embedding + hygiene
pip install curatorkit # core only: cleaning and dedup
Requires Python 3.11+. The installation page maps every extra to what it unlocks.
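Before the first run, it can help to confirm that the interpreter meets the version floor and that the package is importable. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of CuratorKIT.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 11), package="curatorkit"):
    """Report whether the interpreter and the install look usable.

    `check_environment` is an illustrative helper, not a CuratorKIT API.
    """
    return {
        # sys.version_info compares element-wise against a plain tuple
        "python_ok": sys.version_info >= min_version,
        # find_spec returns None when a top-level package is not installed
        "package_installed": importlib.util.find_spec(package) is not None,
    }


print(check_environment())
```

If `package_installed` is false, one of the `pip install` commands above has not been run in the active environment.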
First run: clean a dataset¶
No LLM or API key is needed. This run reads a dataset in any supported format, deduplicates it, cleans it, and exports the result:
from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset        = {"name": "tatsu-lab/alpaca", "max_samples": 2000},
    dedup          = "minhash",
    clean          = True,
    export_formats = ["alpaca", "sharegpt"],
    output_dir     = "output/clean",
)).run()
result.print_summary()
Hugging Face Hub sources need the connectors extra (included in the all extra);
local JSONL, JSON, and CSV files run on the core install.
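To try the core install without any Hub access, a small local JSONL file is enough. The sketch below writes one with the standard library. The `instruction` / `input` / `output` keys follow the widely used Alpaca convention; whether CuratorKIT expects exactly these keys for local files is an assumption, and `write_alpaca_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path

# Illustrative records in the common Alpaca layout (assumed, not confirmed
# by the CuratorKIT docs, to be what a local JSONL source should contain).
SAMPLE_RECORDS = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Name the capital of Japan.", "input": "", "output": "Tokyo"},
]


def write_alpaca_jsonl(records, path):
    """Write one JSON object per line and return the resulting Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Calling `write_alpaca_jsonl(SAMPLE_RECORDS, "data/sample.jsonl")` produces a file whose path could then be tried as the `dataset` value. Passing a bare path string is modeled on the `"handbook.pdf"` example later on this page, so treat it as likely rather than confirmed for JSONL.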
Second run: generate gated synthetic data¶
Set an API key (any LiteLLM provider, or point at local Ollama/vLLM), then:
from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset                 = "handbook.pdf",        # needs the [pdf] extra
    llm_model               = "openai/gpt-4o-mini",
    generation_task         = "qa",
    hallucination_threshold = 0.7,                   # verify answers against the source
    reward_threshold        = 0.7,                   # LLM-judge quality gate
    export_formats          = ["alpaca"],
    output_dir              = "output/qa",
)).run()
The quickstart extends this with adaptive recovery and the CLI.
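The two thresholds above act as gates: a generated sample is kept only if it clears both. The sketch below illustrates that idea in plain Python and is not CuratorKIT's implementation. It assumes that both scores are "higher is better", that a sample passes when its score is at least the threshold, and that the grounding check runs before the quality check. The `Scored` class, the `gate` function, and the reason strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    text: str
    grounding: float  # hypothetical: how well the answer is supported by the source
    reward: float     # hypothetical: LLM-judge quality score


def gate(samples, hallucination_threshold=0.7, reward_threshold=0.7):
    """Split samples into (passed, rejected); each reject carries a reason."""
    passed, rejected = [], []
    for s in samples:
        if s.grounding < hallucination_threshold:
            rejected.append({"text": s.text, "reason": "hallucination", "score": s.grounding})
        elif s.reward < reward_threshold:
            rejected.append({"text": s.text, "reason": "low_reward", "score": s.reward})
        else:
            passed.append(s)
    return passed, rejected


samples = [
    Scored("well grounded, good quality", 0.9, 0.8),
    Scored("not supported by the source", 0.5, 0.9),
    Scored("grounded but low quality", 0.8, 0.6),
]
passed, rejected = gate(samples)
print(len(passed), [r["reason"] for r in rejected])
```

Recording a reason for every reject, rather than silently dropping it, is what makes a later recovery step possible: salvageable samples can be picked out by reason and retried.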
What a run produces¶
Every run writes the export files plus four provenance artifacts:
output/
    sft_alpaca.jsonl    exported training data (one file per requested format)
    manifest.json       config hash, per-stage counts, rejection breakdown
    rejected.jsonl      every rejected sample with a structured reason
    dataset_card.md     human-readable run summary
    checksums.txt       SHA-256 for all output files
result.passed, result.rejected, and result.stage_counts expose the same
information in code. The "Reading the output" page walks through each file.
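The provenance files can also be inspected with the standard library alone. The sketch below tallies rejection reasons and re-checks file hashes. It rests on two assumptions not confirmed on this page: that each line of rejected.jsonl is a JSON object with a `reason` field, and that checksums.txt uses the common `sha256sum` layout of `<hex digest>  <file name>` per line. Both helper names are hypothetical.

```python
import hashlib
import json
from collections import Counter
from pathlib import Path


def rejection_breakdown(rejected_path, reason_key="reason"):
    """Count rejected samples per reason (the field name is an assumption)."""
    counts = Counter()
    with Path(rejected_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            reason = json.loads(line).get(reason_key, "unknown")
            if not isinstance(reason, str):
                # A structured reason may be an object; serialize it so it can be a key
                reason = json.dumps(reason, sort_keys=True)
            counts[reason] += 1
    return counts


def verify_checksums(output_dir, checksum_file="checksums.txt"):
    """Return the names of files whose SHA-256 differs from the recorded digest."""
    output_dir = Path(output_dir)
    mismatched = []
    for line in (output_dir / checksum_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode names with '*'
        actual = hashlib.sha256((output_dir / name).read_bytes()).hexdigest()
        if actual != expected.lower():
            mismatched.append(name)
    return mismatched
```

For example, `rejection_breakdown("output/qa/rejected.jsonl")` returns a `Counter` keyed by reason, and an empty list from `verify_checksums("output/qa")` means every listed file still matches its recorded hash.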
Go deeper¶
| Each pipeline stage in depth | Guides |
| Every CuratorConfig parameter | Configuration reference |
| YAML pipelines and CLI flags | CLI reference |
| Runnable notebooks | Tutorials |
| Classes and functions | API reference |