All parameters for the CuratorConfig dataclass, grouped by category. Every field has a default, so none is strictly required. In practice, the minimal working config sets only dataset, llm_model, and generation_task.
Input path, HuggingFace dataset name, dict with reader options, or list for multi-source

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split | str | "train" | Dataset split when loading from HuggingFace ("train", "test", etc.) |
| subset | str \| None | None | HuggingFace dataset configuration name (e.g. "en" for multilingual datasets) |
| streaming | bool | False | Stream the dataset instead of loading into memory (HuggingFace only) |
| hf_token | str \| None | None | HuggingFace API token for gated datasets |
| hf_subset | str \| None | None | Alias for subset |
| hf_columns | list[str] \| None | None | Load only these columns (reduces memory for large HF datasets) |
| max_samples | int \| None | None | Hard cap on the final sample count, applied late (after gates, dedup, and resampling) so distributions are preserved. To cap how much is read from a large source, use the per-reader form: dataset={"name": "...", "max_samples": N}. |

Force a specific input format: "alpaca", "sharegpt", "preference", "implicit_preference", "unpaired_preference", "grpo", "prompt_only", "pretrain". "auto" detects from column names.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| field_mapping | dict[str, str] | {} | Rename source columns to DataSample fields, as {source_column: datasample_field}; the keys are your source columns. E.g. {"user_query": "instruction", "response": "output"}. Keys may use dot notation for nested dict values: {"meta.prompt": "instruction"}. |
| preprocessing_fn | Callable \| list \| None | None | Function (dict) -> dict \| None applied to every raw row. Return None to drop. |

Hygiene steps run after text cleaning and before any generation task. Execution order is
fixed: SecretsGate → PIIPseudonymizer → ToxicityGate. See Data hygiene
for full usage examples.
Enable PIIPseudonymizer. Replaces detected PII entities with consistent Faker-generated values. Modifies samples in-place; does not reject.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pii_entity_types | list[str] | [] | Presidio entity types to detect. Empty list = default set (PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, IBAN_CODE, IP_ADDRESS). Use ENTITY_TYPES_CLINICAL from curatorkit.hygiene.pii for medical/legal corpora. |
| pii_score_threshold | float | 0.7 | Presidio confidence threshold. Detections below this score are ignored. Lower values catch more PII at the cost of a higher false-positive rate. |
| pii_spacy_model | str | "en_core_web_lg" | spaCy model for NER. "en_core_web_lg" for production; "en_core_web_sm" for dev/CI. |
| pii_faker_seed | int | 42 | Seed for Faker replacements. Same seed = same fake entities across runs for reproducibility. |

Enable ToxicityGate. Two-stage: Stage 1 (local Detoxify classifier) first, then optional Stage 2 (LLM judge) for borderline samples.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| toxicity_classifier_pass_threshold | float | 0.1 | Max-dimension Detoxify score below this → immediate pass (no LLM call). |
| toxicity_classifier_reject_threshold | float | 0.5 | Max-dimension Detoxify score above this → immediate reject (no LLM call). Scores between the two thresholds are escalated to the LLM judge if toxicity_llm_judge=True. |
| toxicity_detoxify_model | str | "unbiased" | Detoxify model variant. "unbiased" for general corpora; "original" for informal web text; "multilingual" for non-English. |
| toxicity_llm_judge | bool | False | Enable LLM second opinion for borderline samples. Requires llm_model to be set. |
| toxicity_llm_reject_threshold | float | 0.5 | LLM judge score at or above which a borderline sample is rejected (only used when toxicity_llm_judge=True). |

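
The two-stage decision described by these thresholds can be sketched as a plain function. This is an illustration of the documented rules, not the library's code: `route_toxicity` and the `"borderline"` return value are hypothetical, and what the gate does with a borderline sample when the judge is disabled is not stated here, so the sketch only reports it.

```python
from typing import Callable, Dict, Optional


def route_toxicity(
    detoxify_scores: Dict[str, float],
    pass_threshold: float = 0.1,
    reject_threshold: float = 0.5,
    llm_judge: Optional[Callable[[], float]] = None,
    llm_reject_threshold: float = 0.5,
) -> str:
    """Return "pass", "reject", or "borderline" for one sample."""
    # Stage 1: the decision uses the highest score across all Detoxify dimensions.
    max_score = max(detoxify_scores.values())
    if max_score < pass_threshold:
        return "pass"      # clearly clean, no LLM call
    if max_score > reject_threshold:
        return "reject"    # clearly toxic, no LLM call

    # Stage 2: only borderline samples reach the optional LLM judge.
    if llm_judge is None:
        # Assumption: the outcome without a judge is unspecified in this reference.
        return "borderline"
    return "reject" if llm_judge() >= llm_reject_threshold else "pass"


decision = route_toxicity({"toxicity": 0.3, "insult": 0.12}, llm_judge=lambda: 0.2)
```

Passing the judge as a zero-argument callable keeps the expensive LLM call lazy, so it runs only for samples that fall between the two classifier thresholds.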
Override the default LLM prompt for each task. If you provide a template, all required placeholder variables must be present or construction raises ValueError.
Parameter
Required variables
Task
qa_prompt_template
{context}, {num_questions} (optional: {difficulty} — if absent and difficulty != "medium", a difficulty hint is appended as a suffix)
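
A custom template can be checked for its required placeholders before it is handed to the config. The sketch below uses only the standard library; `placeholder_names` and `validate_template` are hypothetical helpers that mimic the documented ValueError behaviour, not the library's own validator.

```python
from string import Formatter


def placeholder_names(template: str) -> set:
    """Collect the {name} placeholders that appear in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def validate_template(template: str, required: set) -> None:
    """Raise ValueError if any required placeholder is absent from the template."""
    missing = required - placeholder_names(template)
    if missing:
        raise ValueError(f"template is missing required placeholders: {sorted(missing)}")


# Required variables for qa_prompt_template, as listed above; {difficulty} is optional.
QA_REQUIRED = {"context", "num_questions"}

qa_template = (
    "Read the passage:\n{context}\n\n"
    "Write {num_questions} questions at {difficulty} difficulty."
)
validate_template(qa_template, QA_REQUIRED)  # passes silently
```

Note that `Formatter().parse` treats every brace pair as a placeholder, so literal braces in a template (for example a JSON snippet) would need to be doubled as `{{` and `}}` for this check to work.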
Fraction of questions to inject adversarial hallucinations into (0.0–1.0)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| injection_types | list[str] | [] | Which injection strategies to use. For adversarial_preference: "contradicts_source", "parametric_drift", "domain_mismatch", "instruction_quality". For adversarial_qa: same four plus "high_temperature_drift". Empty = all types for the active task. |
| injection_seed | int | 42 | Random seed for injection sampling |
| high_temp | float | 1.4 | Sampling temperature used for the high_temperature_drift injection type (adversarial_qa only) |

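
To make the interaction between the injection fraction, the type list, and the seed concrete, here is a standalone sketch of seeded injection planning. It is a guess at the general shape, not the library's sampler: `plan_injections` and its `fraction` argument are hypothetical names, and picking an exact count of questions (rather than flipping a coin per question) is an assumption.

```python
import random
from typing import Dict, List

# Injection types listed above for the adversarial_qa task.
ADVERSARIAL_QA_TYPES = [
    "contradicts_source",
    "parametric_drift",
    "domain_mismatch",
    "instruction_quality",
    "high_temperature_drift",
]


def plan_injections(
    num_questions: int,
    fraction: float,
    injection_types: List[str],
    seed: int = 42,
) -> Dict[int, str]:
    """Map the indices of questions chosen for injection to an injection type."""
    rng = random.Random(seed)                        # private RNG: same seed, same plan
    types = injection_types or ADVERSARIAL_QA_TYPES  # empty list means all types
    n_inject = round(num_questions * fraction)
    chosen = sorted(rng.sample(range(num_questions), n_inject))
    return {index: rng.choice(types) for index in chosen}


plan = plan_injections(num_questions=10, fraction=0.3, injection_types=[])
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the plan reproducible even if other code draws random numbers in between.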
Replace the entire reward judge prompt with a custom rubric. Required variables: {instruction}, {response}. Must return {"score": 0.XX, "reasoning": "..."}.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| reward_store_score | bool | True | Write the overall reward score to each sample's label field (used by the reward refiner and downstream filtering) |
| diversity_threshold | float \| None | None | Enable DiversityGate. A sample is rejected if its cosine similarity to any already-accepted sample exceeds this value. Range 0.0–1.0. None = gate disabled. |

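
The DiversityGate rule can be sketched as a greedy filter over precomputed embeddings. This is a minimal illustration in pure Python; `diversity_filter` is a hypothetical helper, and how the library computes embeddings or orders samples is not covered here.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def diversity_filter(
    embeddings: List[Sequence[float]], threshold: Optional[float]
) -> List[int]:
    """Return the indices of samples kept by a greedy similarity gate."""
    if threshold is None:  # None disables the gate
        return list(range(len(embeddings)))
    accepted: List[int] = []
    for index, vector in enumerate(embeddings):
        # Keep the sample only if it is not too close to anything already accepted.
        if all(cosine_similarity(vector, embeddings[k]) <= threshold for k in accepted):
            accepted.append(index)
    return accepted


embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept = diversity_filter(embeddings, threshold=0.9)  # the near-duplicate at index 1 is rejected
```

Because the gate is greedy, the result depends on sample order: the first of two near-duplicates is the one that survives.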
Enable the inline probe. Fires after each gate's rejects, before the next gate sees samples.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| probe_temperatures | list[float] | [0.3, 0.5] | Temperature values to try during the temperature sweep recovery path |
| probe_generator_model | str \| None | None | Model for probe re-generation. None = use llm_model. |
| probe_score_split | float | 0.5 | Score boundary for routing. Samples scoring above this go to the temperature sweep path; samples scoring below it go to the strict grounding path. |
| probe_extra_templates | dict[str, str] | {} | Override built-in probe templates ("default", "strict_grounding", "domain_specific") or add new keys, selected per sample via metadata["domain_prompt_key"]. See Customisation for routing details. |
| enable_reward_refiner | bool | False | Enable the post-pipeline reward refiner (rewrites answers targeting the weakest quality dimension) |
| reward_refine_prompt_template | str \| None | None | Custom refiner prompt. None = built-in template. |
| reward_instruction_refine_template | str \| None | None | Custom template for instruction rewrites (used when the instruction_quality failure mode is detected). None = built-in. |

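
The probe routing controlled by probe_score_split, probe_temperatures, and probe_extra_templates can be sketched as follows. Both functions are hypothetical illustrations. The template bodies are placeholders, sending a score exactly equal to the split to the strict grounding path is an assumption, and so is the fallback to "default" for an unknown key.

```python
from typing import Dict, List, Sequence

# Placeholder bodies; only the three built-in key names come from the reference above.
BUILTIN_TEMPLATES: Dict[str, str] = {
    "default": "<built-in default probe prompt>",
    "strict_grounding": "<built-in strict grounding prompt>",
    "domain_specific": "<built-in domain-specific prompt>",
}


def route_rejected_sample(
    score: float,
    score_split: float = 0.5,
    temperatures: Sequence[float] = (0.3, 0.5),
) -> Dict[str, object]:
    """Choose the recovery path for a rejected sample based on its score."""
    if score > score_split:
        # Near misses: retry generation at each configured temperature.
        return {"path": "temperature_sweep", "temperatures": list(temperatures)}
    # Assumption: a score equal to the split is treated like a low score.
    return {"path": "strict_grounding", "temperatures": []}


def select_template(metadata: Dict[str, str], extra_templates: Dict[str, str]) -> str:
    """Pick a probe template via metadata["domain_prompt_key"]; extras override built-ins."""
    templates = {**BUILTIN_TEMPLATES, **extra_templates}
    key = metadata.get("domain_prompt_key", "default")
    return templates.get(key, templates["default"])  # assumed fallback for unknown keys


route = route_rejected_sample(0.7)
template = select_template({"domain_prompt_key": "legal"}, {"legal": "<custom legal prompt>"})
```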
Controls how PDFs are chunked when dataset points to a .pdf file.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pdf_output_mode | str | "chunk" | How to yield content: "chunk" (raw source chunks), "qa", "preference", "grpo", or "multiturn" (LLM generation runs inline in the reader). Use "chunk" and set generation_task separately; the inline modes are legacy. |
| pdf_chunk_strategy | str | "heading" | Chunking strategy: "heading" (split on heading boundaries), "sentence" (sentence boundaries), or "fixed" (fixed token size) |
| pdf_chunk_max_tokens | int | 512 | Target max tokens per chunk |
| pdf_chunk_overlap_tokens | int | 50 | Token overlap between adjacent chunks |
| pdf_extract_tables | bool | False | Include table cells as text in chunks |
| pdf_ocr | bool | False | Run OCR on scanned pages (requires MinerU GPU setup) |
| pdf_min_section_tokens | int | 30 | Discard sections shorter than this (headers, footers, captions) |

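
To show how pdf_chunk_max_tokens, pdf_chunk_overlap_tokens, and pdf_min_section_tokens interact, here is a minimal sketch of the "fixed" strategy. It is not the reader's implementation: `drop_short_sections` and `chunk_fixed` are hypothetical helpers, and whitespace splitting stands in for whatever tokenizer the library actually uses.

```python
from typing import List


def drop_short_sections(sections: List[str], min_tokens: int = 30) -> List[str]:
    """Discard sections with fewer than min_tokens whitespace-separated tokens."""
    return [s for s in sections if len(s.split()) >= min_tokens]


def chunk_fixed(tokens: List[str], max_tokens: int = 512, overlap_tokens: int = 50) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens that share overlap_tokens."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap_tokens  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_fixed(tokens, max_tokens=512, overlap_tokens=50)
chunk_lengths = [len(c) for c in chunks]
```

With the defaults, each window advances by 462 tokens, so the last 50 tokens of one chunk reappear as the first 50 of the next.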