Quickstart¶
CuratorConfig holds all configuration; Curator.run() executes the pipeline.
Three patterns cover most uses.
Clean and deduplicate an existing dataset¶
No LLM or API key required. Reads any supported format, deduplicates, cleans, exports.
from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset        = {"name": "tatsu-lab/alpaca",   # HF Hub name, local file, or list
                      "max_samples": 2000},         # cap how much is read
    dedup          = "minhash",                     # "exact" | "minhash" | "none"
    clean          = True,
    export_formats = ["alpaca", "sharegpt"],
    output_dir     = "output/clean",
)).run()
result.print_summary()
Hugging Face Hub sources need the connectors extra (also included in the all
extra); local JSONL, JSON, and CSV files work with the core install. Remove
max_samples to process the full dataset.
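To make the dedup options concrete, here is a rough, stdlib-only sketch of the difference between exact and MinHash-style deduplication. It is not curatorkit's implementation: the function names, the 3-word shingles, the 64-value signature, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles of the lowercased, whitespace-normalized text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed selects one of many hash functions."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(text, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup_exact(texts):
    """Drop only byte-for-byte identical strings, keeping first occurrences."""
    seen, kept = set(), []
    for t in texts:
        if t not in seen:
            seen.add(t)
            kept.append(t)
    return kept

def dedup_minhash(texts, threshold=0.8):
    """Drop texts whose estimated similarity to a kept text reaches threshold."""
    kept, kept_sigs = [], []
    for t in texts:
        sig = minhash_signature(t)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(t)
            kept_sigs.append(sig)
    return kept

samples = [
    "Give three tips for staying healthy.",
    "give three tips  for staying healthy.",   # differs only in case and spacing
    "Describe the structure of an atom.",
]
print(len(dedup_exact(samples)), len(dedup_minhash(samples)))
```

Exact matching keeps all three strings because the first two differ in case and spacing, while the MinHash variant treats them as the same sample. This sketch compares every signature pairwise; a production deduplicator would normally add locality-sensitive hashing to avoid that quadratic cost.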
Generate synthetic data from a document¶
Reads a PDF, generates QA pairs, and verifies each answer is grounded in the source
chunk it was generated from. Requires the generation and pdf extras, plus an API
key in the provider's standard environment variable (OPENAI_API_KEY here); any
LiteLLM backend or local Ollama/vLLM server works.
from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset                 = "docs/handbook.pdf",
    llm_model               = "openai/gpt-4o-mini",
    generation_task         = "qa",
    num_questions           = 3,
    hallucination_threshold = 0.7,   # drop answers that aren't grounded
    export_formats          = ["alpaca"],
    output_dir              = "output/qa",
)).run()
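The role of hallucination_threshold can be pictured as a score-and-filter step. The sketch below uses a deliberately crude token-overlap score as a stand-in and assumes the threshold is the minimum grounding score an answer must reach; how curatorkit actually scores grounding is not described here. grounding_score and filter_grounded are hypothetical names, not library functions.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(answer, chunk):
    """Fraction of answer tokens that also occur in the source chunk."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(chunk)) / len(answer_tokens)

def filter_grounded(pairs, chunk, threshold=0.7):
    """Keep only QA pairs whose answer scores at or above the threshold."""
    return [p for p in pairs if grounding_score(p["answer"], chunk) >= threshold]

# Illustrative data, not taken from any real handbook.
chunk = "Employees accrue 1.5 vacation days per month, up to 18 days per year."
pairs = [
    {"question": "How many vacation days accrue per month?",
     "answer": "1.5 vacation days per month"},
    {"question": "Is unused leave paid out?",
     "answer": "Yes, unused leave is paid out in December"},
]
kept = filter_grounded(pairs, chunk)
print([p["question"] for p in kept])
```

The first answer is fully supported by the chunk and survives; the second shares no tokens with it and is dropped.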
Generate, filter, and recover failures¶
Adds a reward quality gate plus the two recovery mechanisms: the diagnostic probe classifies each rejection into a failure mode and retries the fixable ones during the run, and the reward refiner re-scores borderline rejects afterwards. The adaptive recovery guide explains both.
from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset                 = "docs/handbook.pdf",
    llm_model               = "openai/gpt-4o-mini",
    judge_llm_model         = "openai/gpt-4o",   # separate judge avoids self-leniency
    generation_task         = "qa",
    hallucination_threshold = 0.7,
    reward_threshold        = 0.7,
    enable_diagnostic_probe = True,
    enable_reward_refiner   = True,
    export_formats          = ["alpaca", "sharegpt"],
    output_dir              = "output/qa_full",
)).run()
print(result.diagnostics.to_dict())
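The control flow described above (gate, probe and retry during the run, re-score borderline rejects afterwards) can be sketched in plain Python. Everything here is hypothetical: the function name, the callback signatures, the failure-mode labels, and the 0.1 borderline margin are assumptions for illustration, not curatorkit internals.

```python
def run_quality_gate(samples, score, diagnose, regenerate, rescore,
                     threshold=0.7, fixable=("too_short",), margin=0.1):
    """Return (accepted, rejected) after a gate, a probe-and-retry step, and a refiner pass."""
    accepted, rejected = [], []
    for sample in samples:
        if score(sample) >= threshold:
            accepted.append(sample)
            continue
        mode = diagnose(sample)                  # probe: label the failure mode
        if mode in fixable:
            retry = regenerate(sample, mode)     # retry only the fixable modes
            if score(retry) >= threshold:
                accepted.append(retry)
                continue
        rejected.append(sample)

    # Refiner pass, after the main loop: re-score rejects close to the threshold.
    still_rejected = []
    for sample in rejected:
        borderline = threshold - score(sample) <= margin
        if borderline and rescore(sample) >= threshold:
            accepted.append(sample)
        else:
            still_rejected.append(sample)
    return accepted, still_rejected

# Toy callbacks standing in for the judge model and the generator.
def score(sample):
    return sample["score"]

def diagnose(sample):
    return sample["issue"]

def regenerate(sample, mode):
    return {**sample, "score": 0.8}          # pretend the retry fixed the issue

def rescore(sample):
    return sample["score"] + 0.1             # pretend a second opinion is more lenient

samples = [
    {"id": "a", "score": 0.9},                        # passes the gate
    {"id": "b", "score": 0.3, "issue": "too_short"},  # fixable, recovered by retry
    {"id": "c", "score": 0.65, "issue": "off_topic"}, # borderline, recovered by re-score
    {"id": "d", "score": 0.2, "issue": "off_topic"},  # stays rejected
]
accepted, rejected = run_quality_gate(samples, score, diagnose, regenerate, rescore)
print([s["id"] for s in accepted], [s["id"] for s in rejected])
```

Keeping the retry inside the loop and the re-scoring in a separate pass mirrors the "during the run" versus "afterwards" split in the description.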
From the command line¶
The same pipelines run declaratively from YAML.
The repository ships a runnable example that needs no API key: examples/quickstart/.
See the CLI reference for the YAML schema.
Next: Reading the output