CuratorConfig Parameter Reference

All parameters for the CuratorConfig dataclass, grouped by category. Every field has a default, so none are required. A minimal working config for a generation run sets just three fields: dataset, llm_model, and generation_task.

from curatorkit import Curator, CuratorConfig
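
As a rough illustration of the minimal config described above, the three fields can be collected as keyword arguments. The dataset path is a placeholder, and the construction call is left as a comment so the snippet runs without curatorkit installed.

```python
# Minimal configuration sketch: the three fields named above.
# "data/articles.jsonl" is a placeholder path, not a file shipped with the library.
config_kwargs = {
    "dataset": "data/articles.jsonl",
    "llm_model": "openai/gpt-4o-mini",  # any LiteLLM model string
    "generation_task": "qa",
}

# With curatorkit installed, these would be passed straight to the dataclass:
# config = CuratorConfig(**config_kwargs)
```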

Data source

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | str \| dict \| list | "" | Input path, HuggingFace dataset name, dict with reader options, or list for multi-source |
| split | str | "train" | Dataset split when loading from HuggingFace ("train", "test", etc.) |
| subset | str \| None | None | HuggingFace dataset configuration name (e.g. "en" for multilingual datasets) |
| streaming | bool | False | Stream the dataset instead of loading into memory (HuggingFace only) |
| hf_token | str \| None | None | HuggingFace API token for gated datasets |
| hf_subset | str \| None | None | Alias for subset |
| hf_columns | list[str] \| None | None | Load only these columns (reduces memory for large HF datasets) |
| max_samples | int \| None | None | Hard cap on the final sample count, applied late (after gates, dedup, and resampling) so distributions are preserved. To cap how much is read from a large source, use the per-reader form: dataset={"name": "...", "max_samples": N}. |

Column mapping

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| format | str | "auto" | Force a specific input format: "alpaca", "sharegpt", "preference", "implicit_preference", "unpaired_preference", "grpo", "prompt_only", "pretrain". "auto" detects from column names. |
| field_mapping | dict[str, str] | {} | Rename source columns to DataSample fields, as {source_column: datasample_field}; keys are your columns. E.g. {"user_query": "instruction", "response": "output"}. Keys may use dot notation for nested dict values: {"meta.prompt": "instruction"}. |
| preprocessing_fn | Callable \| list \| None | None | Function (dict) -> dict \| None applied to every raw row. Return None to drop. |
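
The two row-level hooks above can be pictured with a small standalone sketch. `drop_empty_responses` has the `(dict) -> dict | None` shape expected of a preprocessing_fn. `resolve_path` and `apply_field_mapping` are hypothetical helpers written for illustration only; they show one way dot-notation keys could be resolved and are not the library's internals.

```python
from typing import Any, Optional


def resolve_path(row: dict, dotted_key: str) -> Any:
    """Follow a dotted key such as "meta.prompt" through nested dicts."""
    value: Any = row
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def apply_field_mapping(row: dict, field_mapping: dict) -> dict:
    """Hypothetical helper: copy each mapped source column to its target field."""
    mapped = {}
    for source_column, target_field in field_mapping.items():
        value = resolve_path(row, source_column)
        if value is not None:
            mapped[target_field] = value
    return mapped


def drop_empty_responses(row: dict) -> Optional[dict]:
    """Example preprocessing_fn: trim the response, drop rows where it is empty."""
    response = (row.get("response") or "").strip()
    if not response:
        return None  # returning None drops the row
    return {**row, "response": response}


raw = {"meta": {"prompt": "Summarise the report."}, "response": "  It covers Q3.  "}
cleaned = drop_empty_responses(raw)
sample = apply_field_mapping(cleaned, {"meta.prompt": "instruction", "response": "output"})
```

Here `sample` ends up with an `instruction` and an `output` key, which mirrors the `{source_column: datasample_field}` direction described in the table.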

Quality filters (pre-generation)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_tokens | int | 10 | Minimum token count. Samples below this are rejected by SchemaGate. |
| max_tokens | int | 2048 | Maximum token count. Samples above this are rejected by SchemaGate. |
| use_tiktoken | bool | False | Use tiktoken for token counting instead of whitespace-based estimation |
| schema_use_tiktoken | bool | False | Use tiktoken specifically for SchemaGate (overrides use_tiktoken for gate only) |
| schema_enforce_task_types | list[str] | [] | Only pass samples with these task_type values. Empty = pass all. |
| schema_gate | bool | True | Enable SchemaGate. Set False to skip all schema checks. |
| schema_required_fields | list[str] | [] | Additional DataSample fields that must be non-empty to pass SchemaGate |
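
To make the token bounds concrete, here is a minimal sketch of a whitespace-based length check using the default min_tokens and max_tokens. Which fields SchemaGate actually counts, and how it splits text, are not stated here, so the function names and the choice to count a single string are assumptions.

```python
def whitespace_token_count(text: str) -> int:
    """Rough token estimate: split on any run of whitespace."""
    return len(text.split())


def passes_token_bounds(text: str, min_tokens: int = 10, max_tokens: int = 2048) -> bool:
    """Keep text whose estimated length is within [min_tokens, max_tokens]."""
    count = whitespace_token_count(text)
    return min_tokens <= count <= max_tokens


short = "Too short to keep."
longer = "This sample has comfortably more than ten whitespace separated tokens in it."
```

A whitespace estimate usually undercounts relative to a subword tokenizer, which is one reason to switch on use_tiktoken when the limits need to track a model's real context budget.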

Deduplication

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dedup | str | "exact" | Deduplication strategy: "exact", "minhash", "none" |
| minhash_threshold | float | 0.85 | Jaccard similarity threshold for MinHash dedup. Higher = a closer match is required, so fewer samples are removed; lower = more aggressive. |
| minhash_ngram | int | 3 | N-gram size for MinHash shingle construction |
| minhash_num_perm | int | 128 | Number of hash permutations. Higher = more accurate but slower. |
| minhash_seed | int | 42 | Random seed for MinHash permutations |
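
MinHash approximates the Jaccard similarity between shingle sets, so the effect of minhash_threshold and minhash_ngram can be seen by computing that similarity exactly on a small pair. The sketch below assumes word-level shingles and a "similarity at or above the threshold means duplicate" rule; both are illustrative assumptions, and the library may shingle differently.

```python
def word_shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; texts shorter than n words yield one shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, the quantity MinHash estimates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the sleepy dog"
similarity = jaccard(word_shingles(doc_a), word_shingles(doc_b))

# One changed word breaks three of the seven trigrams, so the pair shares 5 of 9.
is_duplicate_at_default = similarity >= 0.85
is_duplicate_at_half = similarity >= 0.5
```

The pair scores 5/9, so it survives the default 0.85 threshold but would be removed at 0.5.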

Text cleaning

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| clean | bool | True | Enable text cleaning |
| clean_transforms | dict[str, bool] | {} | Toggle specific transforms. Keys: "normalise_unicode", "fix_encoding_artifacts", "collapse_whitespace", "remove_control_chars", "strip_html". Default: all enabled. |
| clean_fields | list[str] | [] | Apply cleaning to these DataSample fields. Empty = applies to instruction, input, and output. |

Data hygiene (pre-generation)

Hygiene steps run after text cleaning and before any generation task. Execution order is fixed: SecretsGate → PIIPseudonymizer → ToxicityGate. See Data hygiene for full usage examples.

SecretsGate

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| secrets_gate | bool | False | Enable SecretsGate. Rejects samples containing API keys, tokens, private keys, or high-entropy secrets detected by the detect-secrets battery. |
| secrets_code_corpus_mode | bool | False | Enable KeywordDetector in addition to entropy-based detectors. Set True for code datasets; leave False for prose (too noisy). |

PIIPseudonymizer

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pii_pseudonymize | bool | False | Enable PIIPseudonymizer. Replaces detected PII entities with consistent Faker-generated values. Modifies samples in-place; does not reject. |
| pii_entity_types | list[str] | [] | Presidio entity types to detect. Empty list = default set (PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, IBAN_CODE, IP_ADDRESS). Use ENTITY_TYPES_CLINICAL from curatorkit.hygiene.pii for medical/legal corpora. |
| pii_score_threshold | float | 0.7 | Presidio confidence threshold. Detections below this score are ignored. Lower values catch more PII at the cost of higher false-positive rate. |
| pii_spacy_model | str | "en_core_web_lg" | spaCy model for NER. "en_core_web_lg" for production; "en_core_web_sm" for dev/CI. |
| pii_faker_seed | int | 42 | Seed for Faker replacements. Same seed = same fake entities across runs for reproducibility. |

ToxicityGate

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| toxicity_gate | bool | False | Enable ToxicityGate. Two-stage: Stage 1 (local Detoxify classifier) first, then optional Stage 2 (LLM judge) for borderline samples. |
| toxicity_classifier_pass_threshold | float | 0.1 | Max-dimension Detoxify score below this → immediate pass (no LLM call). |
| toxicity_classifier_reject_threshold | float | 0.5 | Max-dimension Detoxify score above this → immediate reject (no LLM call). Scores between the two thresholds are escalated to the LLM judge if toxicity_llm_judge=True. |
| toxicity_detoxify_model | str | "unbiased" | Detoxify model variant. "unbiased" for general corpora; "original" for informal web text; "multilingual" for non-English. |
| toxicity_llm_judge | bool | False | Enable LLM second opinion for borderline samples. Requires llm_model to be set. |
| toxicity_llm_reject_threshold | float | 0.5 | LLM judge score at or above which a borderline sample is rejected (only used when toxicity_llm_judge=True). |
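
The two classifier thresholds split scores into three bands, which the following sketch spells out. `route_toxicity` is a hypothetical function, and the "borderline" label returned when the LLM judge is disabled is a placeholder: this reference does not say what the gate does with borderline samples in that case.

```python
def route_toxicity(
    score: float,
    pass_threshold: float = 0.1,
    reject_threshold: float = 0.5,
    llm_judge: bool = False,
) -> str:
    """Classify a max-dimension Detoxify score into a Stage 1 outcome."""
    if score < pass_threshold:
        return "pass"      # clearly clean, no LLM call
    if score > reject_threshold:
        return "reject"    # clearly toxic, no LLM call
    # Scores between the two thresholds go to the judge only when it is enabled.
    return "escalate" if llm_judge else "borderline"


outcomes = [route_toxicity(s, llm_judge=True) for s in (0.05, 0.3, 0.9)]
```

Widening the gap between the two thresholds sends more samples to the judge, which costs more LLM calls but leaves fewer decisions to the local classifier alone.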

Export

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| output_dir | str \| Path | "output" | Directory for all output files |
| export_formats | list[str] | ["alpaca", "sharegpt", "dpo"] | Which exporters to run. Options: "alpaca", "sharegpt", "dpo", "grpo", "ppo", "corpus" |
| output_split | dict[str, float] \| None | None | Split accepted samples into subdirectories. E.g. {"train": 0.8, "val": 0.1, "test": 0.1}. Must sum to 1.0. |
| output_split_seed | int | 42 | Seed for the pre-split shuffle. Set the same value across runs to get identical train/val/test assignments. |
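
The interaction between output_split and output_split_seed can be sketched as a seeded shuffle followed by proportional slicing. `split_samples` is a hypothetical helper, and giving the rounding remainder to the last split is an assumption made here, not documented behaviour.

```python
import random


def split_samples(samples: list, ratios: dict, seed: int = 42) -> dict:
    """Shuffle with a fixed seed, then slice into named splits by ratio."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1.0")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    splits, start = {}, 0
    names = list(ratios)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)  # last split takes the remainder
        else:
            end = start + round(len(shuffled) * ratios[name])
        splits[name] = shuffled[start:end]
        start = end
    return splits


ratios = {"train": 0.8, "val": 0.1, "test": 0.1}
splits = split_samples(list(range(100)), ratios)
```

Because the shuffle uses its own `random.Random(seed)` instance, calling the function again with the same inputs reproduces the same assignment.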

Misc

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | "curatorkit_run" | Pipeline name, written to manifest.json and dataset_card.md |
| detection_sample_size | int | 10 | Number of rows to inspect for auto-format detection |

Resampling

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| resample | bool | False | Enable stratified resampling to match target_distribution |
| target_distribution | dict[str, float] | {} | Target class proportions. E.g. {"finance": 0.5, "legal": 0.3, "general": 0.2}. Must sum to 1.0. |
| resample_field | str | "source_dataset" | DataSample metadata field used as the stratification key |
| resample_seed | int | 42 | Random seed for resampling |
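
One way to picture stratified resampling is as downsampling each class to its target share. The sketch below is an illustration under that assumption: `resample_to_distribution` is hypothetical, samples are plain dicts keyed by the stratification field, and the library may instead upsample or handle missing classes differently.

```python
import random
from collections import defaultdict


def resample_to_distribution(samples, target_distribution, field="source_dataset", seed=42):
    """Downsample so class proportions match target_distribution."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[field]].append(sample)

    # Largest total size for which every class can still supply its share.
    total = min(
        int(len(groups[key]) / share + 1e-9)
        for key, share in target_distribution.items()
        if share > 0
    )

    rng = random.Random(seed)
    resampled = []
    for key, share in target_distribution.items():
        quota = min(len(groups[key]), round(total * share))
        resampled.extend(rng.sample(groups[key], quota))
    return resampled


pool = (
    [{"source_dataset": "finance"}] * 60
    + [{"source_dataset": "legal"}] * 30
    + [{"source_dataset": "general"}] * 50
)
target = {"finance": 0.5, "legal": 0.3, "general": 0.2}
balanced = resample_to_distribution(pool, target)
counts = {k: sum(s["source_dataset"] == k for s in balanced) for k in target}
```

In this example the legal class is the bottleneck: its 30 samples at a 0.3 share cap the output at 100 samples.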

Generator LLM

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| llm_model | str \| None | None | LiteLLM model string. E.g. "openai/gpt-4o-mini", "ollama/llama3.1:8b". Required for generation tasks. |
| llm_temperature | float | 0.7 | Sampling temperature for generation |
| llm_max_tokens | int | 1024 | Max tokens per LLM response |
| llm_api_key | str \| None | None | API key. Falls back to environment variable (e.g. OPENAI_API_KEY) if None. |
| llm_api_base | str \| None | None | Custom base URL for OpenAI-compatible endpoints (vLLM, Ollama, etc.) |
| llm_concurrency | int | 10 | Max parallel LLM requests for generation |
| llm_timeout | float | 120.0 | Request timeout in seconds |
| llm_max_retries | int | 3 | Retry attempts on transient errors |
| llm_drop_params | bool | True | Silently drop unsupported params rather than raising (LiteLLM behaviour) |
| llm_extra_body | dict | {} | Extra fields forwarded verbatim in the API request body (e.g. {"chat_template_kwargs": {"enable_thinking": True}}) |

Judge LLM

Used by HallucinationGate and RewardGate. Defaults to the generator LLM when not set.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| judge_llm_model | str \| None | None | Model for quality scoring. None = use llm_model. Recommended: set a separate model to avoid self-leniency bias. |
| judge_llm_api_base | str \| None | None | Base URL for judge model endpoint |
| judge_llm_temperature | float | 0.1 | Low temperature for deterministic scoring |
| judge_llm_max_tokens | int | 512 | Max tokens for judge responses (scores are short) |
| judge_llm_timeout | float | 120.0 | Judge request timeout in seconds |
| judge_llm_max_retries | int | 3 | Retry attempts for judge requests |
| judge_llm_extra_body | dict | {} | Extra body params for judge model (e.g. disable thinking mode for structured output) |

Generation task

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| generation_task | str \| None | None | Which task to run: "qa", "preference", "grpo", "multiturn", "evol", "cot", "adversarial_preference", "adversarial_qa". None = no generation, cleaning/filtering only. |
| num_questions | int | 3 | Number of QA pairs or questions to generate per source chunk (qa, adversarial_qa) |
| num_responses | int | 4 | Number of rollout responses per prompt (grpo) |
| num_turns | int | 3 | Number of turns in a multi-turn conversation (multiturn) |
| num_evolutions | int | 1 | Number of evolution rounds per instruction (evol) |
| difficulty | str | "medium" | Question difficulty level: "easy", "medium", "hard" (qa, adversarial_qa) |
| score_responses | bool | True | Score each rollout response after generation (grpo) |
| generate_answers | bool | True | Generate answers for evolved instructions (evol) |
| cot_mode | str | "generate" | Chain-of-thought mode: "generate" (LLM produces CoT from scratch) or "wrap" (wrap an existing answer with reasoning) |
| cot_marker | str \| None | None | Separator inserted between the reasoning block and the final answer in cot output. None = the built-in separator ("\n\n## Answer\n"). |
| preference_mode | str | "single_call" | Preference pair generation: "single_call" (one call for both chosen/rejected) or "two_pass" (separate calls) |
| generation_concurrency | int \| None | None | Concurrency for the generation task. None = use llm_concurrency. |
| judge_concurrency | int \| None | None | Concurrency for scoring/judging calls. None = use llm_concurrency. |

GRPO-specific

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| grpo_temperature_spread | float | 0.0 | Spread applied around llm_temperature to create a range of rollout temperatures. 0.0 = all rollouts use the same temperature. |
| grpo_temperatures | list[float] \| None | None | Explicit per-rollout temperatures. Overrides grpo_temperature_spread. Cycled if shorter than num_responses. |
| grpo_scoring_llm_model | str \| None | None | Model for scoring GRPO responses. None = use llm_model (the generator). |
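
The precedence between the two temperature settings can be sketched as follows. Cycling an explicit list is the behaviour described above. The evenly spaced range from `base - spread` to `base + spread` is only an assumed reading of "spread", and `rollout_temperatures` is a hypothetical helper.

```python
from itertools import cycle, islice


def rollout_temperatures(num_responses, base_temperature=0.7, spread=0.0, explicit=None):
    """Return one sampling temperature per rollout."""
    if explicit:
        # Explicit list wins and is cycled when shorter than num_responses.
        return list(islice(cycle(explicit), num_responses))
    if spread == 0.0 or num_responses == 1:
        return [base_temperature] * num_responses
    # Assumed interpretation: evenly spaced values centred on the base temperature.
    step = 2 * spread / (num_responses - 1)
    return [
        round(max(0.0, base_temperature - spread + i * step), 4)
        for i in range(num_responses)
    ]


cycled = rollout_temperatures(4, explicit=[0.5, 1.0])
uniform = rollout_temperatures(4)
spaced = rollout_temperatures(3, base_temperature=0.7, spread=0.2)
```

Varying the temperature across rollouts is a common way to get more diverse responses per prompt, which tends to give the scoring step a wider range to rank.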

Prompt templates

Override the default LLM prompt for each task. If you provide a template, all required placeholder variables must be present or construction raises ValueError.

| Parameter | Required variables | Task |
| --- | --- | --- |
| qa_prompt_template | {context}, {num_questions} (optional: {difficulty}; if absent and difficulty != "medium", a difficulty hint is appended as a suffix) | qa |
| qa_table_prompt_template | {context}, {num_questions} (optional: {difficulty}) | qa (separate prompt used for table-derived chunks) |
| evol_prompt_template | {instruction}, {strategy}, {context} | evol |
| evol_answer_prompt_template | {instruction} | evol; answer pass for evolved instructions without source context (requires generate_answers=True) |
| preference_prompt_template | {instruction}, {context_section} | preference (single_call mode) |
| preference_chosen_prompt | {instruction}, {context_section} | preference (two_pass mode), chosen-response generation |
| preference_rejected_prompt | {instruction}, {context_section} | preference (two_pass mode), rejected-response generation |
| grpo_prompt_template | {instruction} | grpo |
| multiturn_prompt_template | {num_turns}, {context_section}, {initial_question} | multiturn (only used in single_call mode, which is not active via CuratorConfig because it always uses turn_by_turn; see the customisation guide) |
| cot_prompt_template | {instruction} (generate mode); {instruction}, {answer} (wrap mode) | cot |
| adversarial_prompt_template | {context}, {question}, {injection_type} | adversarial_preference (template for the adversarially-corrupted rejected response) |
| llm_prompt_template | task-dependent | fallback for unlisted tasks |

All default to None (built-in template used).
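
The placeholder requirement can be checked before a run with a few lines of standard-library Python. This is a sketch of the idea, assuming templates use `str.format`-style fields; `validate_template` is a hypothetical helper and not the library's own validator.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Names of all {field} placeholders that appear in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def validate_template(template: str, required: set) -> str:
    """Raise ValueError if any required placeholder is absent."""
    missing = set(required) - placeholders(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")
    return template


good = "Write {num_questions} questions about the passage below.\n\n{context}"
bad = "Write some questions about the passage below.\n\n{context}"

validate_template(good, {"context", "num_questions"})  # passes silently
try:
    validate_template(bad, {"context", "num_questions"})
    error = None
except ValueError as exc:
    error = str(exc)
```

Under the same `str.format` assumption, literal braces in a template (for instance a JSON example in the prompt) would need to be doubled as `{{` and `}}` so they are not parsed as placeholders.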


Adversarial generation

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| injection_rate | float | 0.5 | Fraction of questions to inject adversarial hallucinations into (0.0–1.0) |
| injection_types | list[str] | [] | Which injection strategies to use. For adversarial_preference: "contradicts_source", "parametric_drift", "domain_mismatch", "instruction_quality". For adversarial_qa: same four plus "high_temperature_drift". Empty = all types for the active task. |
| injection_seed | int | 42 | Random seed for injection sampling |
| high_temp | float | 1.4 | Sampling temperature used for the high_temperature_drift injection type (adversarial_qa only) |

Quality gates (post-generation)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hallucination_threshold | float \| None | None | Enable HallucinationGate. Samples below this score are rejected. Range 0.0–1.0. None = gate disabled. |
| hallucination_prompt_template | str \| None | None | Custom hallucination judge prompt. Required variables: {source}, {claim}. |
| reward_threshold | float \| None | None | Enable RewardGate. Samples below this score are rejected. Range 0.0–1.0. None = gate disabled. |
| reward_dimensions | list[str] | ["helpfulness", "honesty", "instruction_following"] | Built-in scoring axes. Valid values: "helpfulness", "honesty", "instruction_following", "truthfulness", "depth", "creativity", "coherence". |
| reward_prompt_template | str \| None | None | Replace the entire reward judge prompt with a custom rubric. Required variables: {instruction}, {response}. Must return {"score": 0.XX, "reasoning": "..."}. |
| reward_store_score | bool | True | Write the overall reward score to each sample's label field (used by the reward refiner and downstream filtering) |
| diversity_threshold | float \| None | None | Enable DiversityGate. Samples with cosine similarity above this to any accepted sample are rejected. Range 0.0–1.0. None = gate disabled. |
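
The diversity_threshold rule (reject a sample whose cosine similarity to any already accepted sample exceeds the threshold) amounts to a greedy filter. The sketch below uses hand-written 2-D vectors in place of real sentence embeddings, and `diversity_filter` is a hypothetical name.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diversity_filter(embeddings, threshold: float) -> list:
    """Greedy pass: keep an item only if it is not too close to any kept item."""
    accepted = []  # indices of kept embeddings
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= threshold for j in accepted):
            accepted.append(i)
    return accepted


vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept = diversity_filter(vectors, threshold=0.9)
```

The second vector is nearly parallel to the first and is dropped, while the orthogonal third vector is kept. Because the pass is greedy, the outcome depends on input order.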

Inline recovery (adaptive probe)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_diagnostic_probe | bool | False | Enable the inline probe. It fires after each gate rejects samples and before the next gate runs. |
| probe_temperatures | list[float] | [0.3, 0.5] | Temperature values to try during the temperature sweep recovery path |
| probe_generator_model | str \| None | None | Model for probe re-generation. None = use llm_model. |
| probe_score_split | float | 0.5 | Score boundary for routing. Samples above this go to the temperature path; samples below go to the strict grounding path. |
| probe_extra_templates | dict[str, str] | {} | Override built-in probe templates ("default", "strict_grounding", "domain_specific") or add new keys, selected per sample via metadata["domain_prompt_key"]. See Customisation for routing details. |
| enable_reward_refiner | bool | False | Enable the post-pipeline reward refiner (rewrites answers targeting the weakest quality dimension) |
| reward_refine_prompt_template | str \| None | None | Custom refiner prompt. None = built-in template. |
| reward_instruction_refine_template | str \| None | None | Custom template for instruction rewrites (used when instruction_quality failure mode is detected). None = built-in. |

Embedding (diversity gate + dedup)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| embedding_model | str | "sentence-transformers/all-MiniLM-L6-v2" | SentenceTransformer model for diversity gate and embedding dedup |
| diversity_embedding_model | str \| None | None | Override embedding_model for the diversity gate only |
| embedding_dedup_model | str \| None | None | Override embedding_model for the embedding dedup normalizer only |
| embedding_device | str \| None | None | Device for embedding inference: "cpu", "cuda", "mps". None = auto-detect. |
| embedding_batch_size | int | 64 | Batch size for embedding model inference |

Cross-run embedding dedup

Persists a FAISS index across runs to reject samples already seen in previous runs.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| embedding_dedup | bool | False | Enable cross-run embedding dedup |
| embedding_index_dir | str | "output/embedding_index" | Directory where the FAISS index is stored and loaded |
| embedding_dedup_threshold | float | 0.92 | Cosine similarity threshold. Samples above this are rejected as near-duplicates of previously indexed samples. |
| embedding_reset_index | bool | False | Delete and rebuild the index at the start of this run |

PDF extraction

Controls how PDFs are chunked when dataset points to a .pdf file.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pdf_output_mode | str | "chunk" | How to yield content: "chunk" (raw source chunks), "qa", "preference", "grpo", or "multiturn" (LLM generation runs inline in the reader). Use "chunk" and set generation_task separately; inline modes are legacy. |
| pdf_chunk_strategy | str | "heading" | Chunking strategy: "heading" (split on heading boundaries), "sentence" (sentence boundaries), or "fixed" (fixed token size) |
| pdf_chunk_max_tokens | int | 512 | Target max tokens per chunk |
| pdf_chunk_overlap_tokens | int | 50 | Token overlap between adjacent chunks |
| pdf_extract_tables | bool | False | Include table cells as text in chunks |
| pdf_ocr | bool | False | Run OCR on scanned pages (requires MinerU GPU setup) |
| pdf_min_section_tokens | int | 30 | Discard sections shorter than this (headers, footers, captions) |
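
For the "fixed" strategy, the relationship between pdf_chunk_max_tokens and pdf_chunk_overlap_tokens can be sketched as a sliding window. `fixed_chunks` is a hypothetical helper operating on a pre-tokenised list; the reader's actual tokenisation and boundary handling are not described in this reference.

```python
def fixed_chunks(tokens: list, max_tokens: int = 512, overlap_tokens: int = 50) -> list:
    """Slide a window of max_tokens, stepping by max_tokens - overlap_tokens."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap must be smaller than the chunk size")
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = fixed_chunks(tokens)
```

With the defaults, each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.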

See also