Curator

The single entry point: configure with CuratorConfig, execute with Curator.run().

curatorkit.curator.CuratorConfig `dataclass`

CuratorConfig(dataset: str | dict | list = '', split: str = 'train', subset: str | None = None, streaming: bool = False, hf_token: str | None = None, hf_subset: str | None = None, hf_columns: list[str] | None = None, max_samples: int | None = None, format: str = 'auto', field_mapping: dict[str, str] = dict(), preprocessing_fn: Callable | list[Callable | None] | None = None, min_tokens: int = 10, max_tokens: int = 2048, use_tiktoken: bool = False, schema_use_tiktoken: bool = False, schema_enforce_task_types: list[str] = list(), schema_gate: bool = True, schema_required_fields: list[str] = list(), dedup: str = 'exact', minhash_threshold: float = 0.85, minhash_ngram: int = 3, minhash_num_perm: int = 128, minhash_seed: int = 42, clean: bool = True, clean_transforms: dict[str, bool] = dict(), clean_fields: list[str] = list(), output_dir: str | Path = 'output', export_formats: list[str] = (lambda: ['alpaca', 'sharegpt', 'dpo'])(), output_split: dict[str, float] | None = None, output_split_seed: int = 42, name: str = 'curatorkit_run', detection_sample_size: int = 10, resample: bool = False, target_distribution: dict[str, float] = dict(), resample_field: str = 'source_dataset', resample_seed: int = 42, llm_model: str | None = None, llm_temperature: float = 0.7, llm_max_tokens: int = 1024, llm_api_key: str | None = None, llm_api_base: str | None = None, llm_concurrency: int = 10, llm_timeout: float = 120.0, llm_max_retries: int = 3, llm_drop_params: bool = True, llm_extra_body: dict = dict(), judge_llm_model: str | None = None, judge_llm_api_base: str | None = None, judge_llm_temperature: float = 0.1, judge_llm_max_tokens: int = 512, judge_llm_timeout: float = 120.0, judge_llm_max_retries: int = 3, judge_llm_extra_body: dict = dict(), generation_task: str | None = None, num_questions: int = 3, num_responses: int = 4, num_turns: int = 3, num_evolutions: int = 1, difficulty: str = 'medium', score_responses: bool = True, generate_answers: bool = True, cot_mode: str = 
'generate', preference_mode: str = 'single_call', generation_concurrency: int | None = None, judge_concurrency: int | None = None, grpo_temperature_spread: float = 0.0, grpo_temperatures: list[float] | None = None, grpo_scoring_llm_model: str | None = None, qa_prompt_template: str | None = None, qa_table_prompt_template: str | None = None, evol_prompt_template: str | None = None, evol_answer_prompt_template: str | None = None, preference_prompt_template: str | None = None, preference_chosen_prompt: str | None = None, preference_rejected_prompt: str | None = None, grpo_prompt_template: str | None = None, multiturn_prompt_template: str | None = None, cot_prompt_template: str | None = None, cot_marker: str | None = None, adversarial_prompt_template: str | None = None, injection_rate: float = 0.5, injection_types: list[str] = list(), injection_seed: int = 42, high_temp: float = 1.4, hallucination_threshold: float | None = None, hallucination_prompt_template: str | None = None, reward_threshold: float | None = None, reward_dimensions: list[str] = (lambda: ['helpfulness', 'honesty', 'instruction_following'])(), reward_prompt_template: str | None = None, reward_store_score: bool = True, diversity_threshold: float | None = None, enable_reward_refiner: bool = False, reward_refine_prompt_template: str | None = None, reward_instruction_refine_template: str | None = None, embedding_model: str = 'sentence-transformers/all-MiniLM-L6-v2', diversity_embedding_model: str | None = None, embedding_dedup_model: str | None = None, embedding_device: str | None = None, embedding_batch_size: int = 64, embedding_dedup: bool = False, embedding_index_dir: str = 'output/embedding_index', embedding_dedup_threshold: float = 0.92, embedding_reset_index: bool = False, pdf_output_mode: str = 'chunk', pdf_chunk_strategy: str = 'heading', pdf_chunk_max_tokens: int = 512, pdf_chunk_overlap_tokens: int = 50, pdf_extract_tables: bool = False, pdf_ocr: bool = False, pdf_min_section_tokens: int = 30, 
enable_diagnostic_probe: bool = False, probe_temperatures: list[float] = (lambda: [0.3, 0.5])(), probe_generator_model: str | None = None, probe_score_split: float = 0.5, probe_extra_templates: dict[str, str] = dict(), llm_prompt_template: str | None = None, secrets_gate: bool = False, secrets_code_corpus_mode: bool = False, secrets_fields: list[str] = list(), secrets_hex_limit: float = 3.0, secrets_base64_limit: float = 4.5, pii_pseudonymize: bool = False, pii_entity_types: list[str] = list(), pii_fields: list[str] = list(), pii_score_threshold: float = 0.7, pii_spacy_model: str = 'en_core_web_lg', pii_faker_seed: int = 42, pii_language: str = 'en', toxicity_gate: bool = False, toxicity_classifier_pass_threshold: float = 0.1, toxicity_classifier_reject_threshold: float = 0.5, toxicity_detoxify_model: str = 'unbiased', toxicity_llm_judge: bool = False, toxicity_llm_reject_threshold: float = 0.5, toxicity_text_field: str = 'auto')

All configuration for a CuratorKIT pipeline in one place.

Minimal usage (just the dataset name):

    config = CuratorConfig(dataset="Anthropic/hh-rlhf")

LLM generation usage

    config = CuratorConfig(
        dataset="docs/handbook.pdf",
        llm_model="openai/gpt-4o-mini",
        generation_task="qa",
    )

The rest has sensible defaults. Override only what you need.

── Source ──────────────────────────────────────────────────────────────────
dataset    HF Hub name, local file path, or a list of either.
           Supported file types: .jsonl .json .csv .tsv .parquet .pdf
           Examples:
             "Anthropic/hh-rlhf"
             "data/my_data.jsonl"
             "docs/handbook.pdf"
             ["tatsu-lab/alpaca", "data/extra.jsonl"]

── LLM configuration ───────────────────────────────────────────────────────
llm_model          LiteLLM model string. Default None (no generation).
                   Examples: "openai/gpt-4o-mini", "anthropic/claude-sonnet-4-20250514", "ollama/llama3"
llm_temperature    Temperature for generation. Default 0.7.
llm_max_tokens     Max tokens per LLM call. Default 1024.
llm_api_key        API key override. Falls back to the environment variable.
llm_api_base       Custom API base URL (for vLLM, Ollama, etc.).
llm_concurrency    Async worker pool size. Default 10.

── LLM generation ──────────────────────────────────────────────────────────
generation_task    Task type: "qa" | "preference" | "grpo" | "multiturn" | "evol" | "cot" | None (no generation).
num_questions      QA: questions per chunk. Default 3.
num_responses      GRPO: responses per prompt. Default 4.
num_turns          MultiTurn: turn pairs. Default 3.
num_evolutions     EvolInstruct: evolutions per instruction. Default 1.
difficulty         QA: "easy" | "medium" | "hard". Default "medium".

── Quality gates ───────────────────────────────────────────────────────────
hallucination_threshold    Grounding score threshold. None = gate off (default).
                           Set a float (e.g. 0.7) to enable the hallucination gate.
reward_threshold           Drop samples below this quality score. Default None (off).
reward_dimensions          Quality dimensions to evaluate.
                           Default ['helpfulness', 'honesty', 'instruction_following'].
diversity_threshold        Drop samples above this similarity. Default None (off).
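The "None means off" convention for the two drop thresholds can be sketched as follows. This is a minimal illustration, not CuratorKIT's gate code: `passes_gates` is a hypothetical helper, the scores are assumed to be precomputed floats, and the behaviour at exact equality with a threshold (kept here) is an assumption.

```python
from typing import Optional


def passes_gates(
    reward_score: float,
    max_similarity: float,
    reward_threshold: Optional[float] = None,
    diversity_threshold: Optional[float] = None,
) -> bool:
    """Return True when a sample clears every enabled gate (hypothetical helper)."""
    # A threshold of None means the gate is off, so it can never reject.
    if reward_threshold is not None and reward_score < reward_threshold:
        return False  # below the quality bar
    if diversity_threshold is not None and max_similarity > diversity_threshold:
        return False  # too similar to a sample that was already kept
    return True
```

With both thresholds left at None every sample passes, which matches the documented defaults.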

── Cross-run dedup ─────────────────────────────────────────────────────────
embedding_dedup              Enable cross-run embedding dedup. Default False.
embedding_index_dir          Directory for persistent index. Default "output/embedding_index".
embedding_dedup_threshold    Similarity threshold. Default 0.92.
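To make the similarity threshold concrete, here is a rough sketch of threshold-based dedup against an index of previously seen vectors. It assumes cosine similarity and an in-memory list standing in for the persistent index; the similarity measure and storage format CuratorKIT actually uses are not stated in this reference, and both function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup_against_index(
    new_vectors: list[list[float]],
    index: list[list[float]],
    threshold: float = 0.92,
) -> list[int]:
    """Return positions of the new vectors that were kept.

    Kept vectors are appended to `index`, so later items (and later runs that
    reuse the same index) are compared against them as well.
    """
    kept = []
    for i, vec in enumerate(new_vectors):
        # Assumption: similarity at or above the threshold counts as a duplicate.
        if any(cosine(vec, seen) >= threshold for seen in index):
            continue
        index.append(vec)
        kept.append(i)
    return kept
```

Lowering the threshold makes the filter more aggressive, since more near-matches count as duplicates.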

apply_patch

apply_patch(patch: dict) -> CuratorConfig

Return a copy of this config with patch fields applied.

Supported patch keys

llm_temperature → self.llm_temperature
prompt_template → self.llm_prompt_template

Returns a new CuratorConfig — the original is unchanged.
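The copy-with-patch behaviour can be illustrated with a small stand-in dataclass. `MiniConfig` and `PATCH_KEYS` are hypothetical names used only for this sketch; how CuratorConfig implements the copy, and how it treats unsupported keys (rejected here), is not specified in this reference.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Patch key -> attribute name, following the mapping listed above.
PATCH_KEYS = {
    "llm_temperature": "llm_temperature",
    "prompt_template": "llm_prompt_template",
}


@dataclass
class MiniConfig:
    """Stand-in for CuratorConfig with only the two patchable fields."""

    llm_temperature: float = 0.7
    llm_prompt_template: Optional[str] = None

    def apply_patch(self, patch: dict) -> "MiniConfig":
        # Assumption: unsupported keys raise instead of being silently ignored.
        unknown = set(patch) - set(PATCH_KEYS)
        if unknown:
            raise KeyError(f"unsupported patch keys: {sorted(unknown)}")
        updates = {PATCH_KEYS[key]: value for key, value in patch.items()}
        # dataclasses.replace builds a new instance; `self` is left untouched.
        return replace(self, **updates)


base = MiniConfig()
patched = base.apply_patch({"llm_temperature": 0.2, "prompt_template": "Q: {text}"})
```

After this runs, `patched` carries the new temperature and template while `base` still holds the defaults.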

curatorkit.curator.Curator

Curator(config: CuratorConfig)

CuratorKIT pipeline runner.

Curation-only usage (no LLM):

    curator = Curator(CuratorConfig(dataset="Anthropic/hh-rlhf"))
    result = curator.run()

LLM generation usage

    curator = Curator(CuratorConfig(
        dataset="docs/handbook.pdf",
        llm_model="openai/gpt-4o-mini",
        generation_task="qa",
        hallucination_threshold=0.7,
        reward_threshold=0.7,
    ))
    result = curator.run()              # sync
    result = await curator.run_async()  # async (faster for many LLM calls)

dry_run

dry_run() -> list[dict[str, str]]

Build the step list from CuratorConfig and print the plan without running.

Returns the same plan list that Pipeline.dry_run() returns so callers can diff plans across config changes.
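One way to diff two plans is to compare the returned lists entry by entry. The sketch below assumes only the documented return type, list[dict[str, str]]; the keys "step" and "detail" in the example data are placeholders, since the real plan keys are not listed in this reference, and `diff_plans` is a hypothetical helper.

```python
def diff_plans(
    before: list[dict[str, str]],
    after: list[dict[str, str]],
) -> dict[str, list[dict[str, str]]]:
    """Report plan entries that appear in only one of the two plans."""
    return {
        "removed": [step for step in before if step not in after],
        "added": [step for step in after if step not in before],
    }


# Placeholder plans; real ones would come from two dry_run() calls.
plan_a = [
    {"step": "load", "detail": "hub"},
    {"step": "dedup", "detail": "exact"},
]
plan_b = [
    {"step": "load", "detail": "hub"},
    {"step": "dedup", "detail": "minhash"},
    {"step": "export", "detail": "alpaca"},
]
changes = diff_plans(plan_a, plan_b)
```

A step whose settings changed shows up once under "removed" and once under "added", which is usually enough to spot the effect of a config edit.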

run

run() -> CuratorResult

Synchronous pipeline execution.

Automatically uses async execution (and therefore concurrent LLM calls) when a generation task is configured. Callers never need to reach for run_async() themselves.
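A synchronous wrapper that switches to async execution only when a generation task is set could look roughly like this. `MiniRunner` is a hypothetical stand-in, not the Curator implementation: the fake LLM call, the semaphore-based concurrency cap, and the `items` argument are all assumptions made for illustration.

```python
import asyncio
from typing import Optional


class MiniRunner:
    """Stand-in runner showing a sync entry point over an async pipeline."""

    def __init__(self, generation_task: Optional[str] = None, concurrency: int = 10):
        self.generation_task = generation_task
        self.concurrency = concurrency

    async def _fake_llm_call(self, item: int, gate: asyncio.Semaphore) -> int:
        async with gate:            # at most `concurrency` calls in flight
            await asyncio.sleep(0)  # placeholder for network I/O
            return item * 2

    async def run_async(self, items: list[int]) -> list[int]:
        gate = asyncio.Semaphore(self.concurrency)
        # gather preserves input order even though calls overlap in time.
        return await asyncio.gather(*(self._fake_llm_call(i, gate) for i in items))

    def run(self, items: list[int]) -> list[int]:
        if self.generation_task is None:
            return list(items)      # curation-only path, no event loop needed
        return asyncio.run(self.run_async(items))
```

In this sketch `asyncio.run()` raises RuntimeError when called from a thread that already has a running event loop (for example inside a notebook or an async web handler), so code in that situation would await `run_async()` directly.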

run_async `async`

run_async() -> CuratorResult

Async pipeline execution — faster for generation-heavy pipelines.

curatorkit.curator.CuratorResult `dataclass`

CuratorResult(passed: list, rejected: list, stage_counts: dict[str, dict[str, int]], output_dir: Path, wall_clock_seconds: float, diagnostics: Any = None)

Everything the pipeline produced, in one object.

passed                list of DataSample objects that cleared every filter
rejected              list of RejectedSample objects with structured reasons
stage_counts          per-step input/output/rejected counts
output_dir            where the files were written
wall_clock_seconds    total run time
diagnostics           PipelineDiagnostics when enable_diagnostic_probe=True, else None
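The per-step counts lend themselves to a quick rejection-rate summary. The sketch below works on a plain dict of the documented type, dict[str, dict[str, int]]; the inner key names "input", "output" and "rejected" are assumed from the field description and may differ in practice, and `summarize_stage_counts` is a hypothetical helper.

```python
def summarize_stage_counts(stage_counts: dict[str, dict[str, int]]) -> list[str]:
    """Format one line per step with its counts and rejection rate."""
    lines = []
    for step, counts in stage_counts.items():
        n_in = counts.get("input", 0)      # assumed key name
        n_out = counts.get("output", 0)    # assumed key name
        n_rej = counts.get("rejected", 0)  # assumed key name
        rate = n_rej / n_in if n_in else 0.0  # avoid dividing by zero on empty steps
        lines.append(f"{step}: {n_in} in, {n_out} out, {n_rej} rejected ({rate:.1%})")
    return lines


example = {"dedup": {"input": 200, "output": 150, "rejected": 50}}
summary = summarize_stage_counts(example)
# summary == ["dedup: 200 in, 150 out, 50 rejected (25.0%)"]
```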

sample

sample(n: int = 3) -> None

Print the first n passed samples.