CLI and YAML Pipeline
CuratorKIT has two entry points:
| Channel | When to use |
|---|---|
| Python API (CuratorConfig + Curator) | Scripting, notebooks, programmatic control |
| CLI + YAML (curatorkit run pipeline.yaml) | Reproducible runs, CI/CD, no-code configuration |
Both produce identical output. The YAML config maps to the same pipeline steps as CuratorConfig.
CLI commands
run: execute a pipeline
curatorkit run pipeline.yaml
curatorkit run pipeline.yaml --output-dir output/run1
curatorkit run pipeline.yaml --dry-run # validate config and print plan, no execution
curatorkit run pipeline.yaml --async # async runner (faster for generation tasks)
curatorkit run pipeline.yaml --verbose # show per-stage counts during run
curatorkit run pipeline.yaml --reset-index # clear persistent embedding index before run
--dry-run is useful before an expensive generation run — it validates the YAML, prints every step in order with its config, and exits. No LLM calls are made.
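The validate-first workflow described above can be scripted. The following is a minimal sketch that only assembles the command line from the flags listed on this page and shells out to the CLI. The helper names (build_run_command, validate_then_run) are hypothetical, and it assumes a failed --dry-run exits with a non-zero status, which this page does not state explicitly.

```python
import subprocess
from typing import List, Optional


def build_run_command(
    config_path: str,
    output_dir: Optional[str] = None,
    dry_run: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Assemble a `curatorkit run` invocation from flags documented above."""
    cmd = ["curatorkit", "run", config_path]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if dry_run:
        cmd.append("--dry-run")
    if verbose:
        cmd.append("--verbose")
    return cmd


def validate_then_run(config_path: str, output_dir: str) -> None:
    """Dry-run first; start the real run only if the dry-run succeeds."""
    # check=True raises CalledProcessError on a non-zero exit status.
    # Assumption: an invalid config makes --dry-run exit non-zero.
    subprocess.run(build_run_command(config_path, dry_run=True), check=True)
    subprocess.run(
        build_run_command(config_path, output_dir=output_dir, verbose=True),
        check=True,
    )
```

In a CI job, calling validate_then_run("pipeline.yaml", "output/run1") would stop before any LLM call if the dry-run step fails.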
setup-pdf: verify the PDF parsing setup
PDF parsing uses MinerU, installed via the pdf extra. Model weights download automatically on the first PDF parse — there is no manual download step.
pip install "curatorkit[pdf]"
curatorkit setup-pdf # verify MinerU is installed and importable
curatorkit setup-pdf --check # same verification; --check is the only flag
If MinerU is missing, the command exits non-zero and prints the install instructions above.
YAML pipeline config
The YAML file is parsed and validated by Pydantic before any step runs. A malformed config produces a clear error immediately — not a crash mid-pipeline.
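To illustrate the fail-fast idea, here is a standard-library stand-in for that validation step. It is not CuratorKIT's Pydantic schema: it only checks that every stage's type is one of the names listed in the reference tables on this page, and the function name find_config_errors is hypothetical. The real schema may accept more types and checks far more than this.

```python
from typing import Any, Dict, List, Set

# Type names copied from the reference tables on this page.
# Assumption: the real schema may accept additional types.
KNOWN_TYPES: Dict[str, Set[str]] = {
    "readers": {"jsonl", "json", "csv", "parquet", "huggingface", "pdf"},
    "gates": {"schema", "hallucination", "reward", "diversity",
              "secrets", "toxicity"},
    "normalizers": {"exact_dedup", "minhash_dedup", "text_cleaner",
                    "embedding_dedup", "stratified_sampler",
                    "pii_pseudonymizer"},
    "generators": {"qa", "preference", "grpo", "multiturn", "evol_instruct",
                   "cot", "adversarial_preference", "adversarial_qa"},
}


def find_config_errors(config: Dict[str, Any]) -> List[str]:
    """Collect every problem up front so nothing runs on a bad config."""
    errors: List[str] = []
    for section, known in KNOWN_TYPES.items():
        entries = config.get(section, [])
        if not isinstance(entries, list):
            errors.append(f"{section}: expected a list")
            continue
        for index, entry in enumerate(entries):
            stage_type = entry.get("type") if isinstance(entry, dict) else None
            if stage_type not in known:
                errors.append(f"{section}[{index}]: unknown type {stage_type!r}")
    return errors
```

A runner built this way would refuse to start if the returned list is non-empty, which is the behaviour the paragraph above describes: a typo such as type: halucination is reported before the first reader opens a file.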
Minimal example: clean and deduplicate
name: clean_alpaca
version: "1.0"
readers:
  - type: huggingface
    path: tatsu-lab/alpaca
    hf_split: train
gates:
  - type: schema
    min_tokens: 10
    max_tokens: 2048
normalizers:
  - type: exact_dedup
  - type: text_cleaner
exporters:
  - type: alpaca
  - type: sharegpt
Full example: QA generation with filtering and recovery
name: qa_handbook
version: "1.0"
# Global LLM config: shared by all generation tasks and gates unless overridden
llm:
  model: openai/gpt-4o-mini
  temperature: 0.7
  max_tokens: 1024
  concurrency: 10
  api_base: null  # null = use provider default
readers:
  - type: pdf
    path: docs/handbook.pdf
    chunk_strategy: heading
    chunk_max_tokens: 512
    chunk_overlap_tokens: 50
gates:
  - type: schema
    min_tokens: 20
    max_tokens: 4096
  - type: hallucination
    hallucination_threshold: 0.7
    hallucination_llm_model: null  # null = use global llm.model
  - type: reward
    reward_threshold: 0.7
    reward_dimensions: [helpfulness, honesty, instruction_following, depth]
    reward_llm_model: null
normalizers:
  - type: exact_dedup
  - type: minhash_dedup
    minhash_threshold: 0.85
  - type: text_cleaner
generators:
  - type: qa
    num_questions: 3
    difficulty: medium
# Adaptive recovery: attaches to HallucinationGate
diagnostic:
  enable_probe: true
  probe_temperatures: [0.3, 0.5]
  score_split: 0.5
exporters:
  - type: alpaca
  - type: sharegpt
output_split:
  train: 0.8
  val: 0.1
  test: 0.1
DPO preference pair example
name: dpo_from_pdf
version: "1.0"
llm:
  model: openai/gpt-4o-mini
  temperature: 0.7
  max_tokens: 2048
  concurrency: 16
readers:
  - type: pdf
    path: docs/policy.pdf
    chunk_strategy: heading
# One gates list: the schema gate runs before generators, the others after
gates:
  - type: schema
    min_tokens: 30
    max_tokens: 4096
  - type: hallucination
    hallucination_threshold: 0.7
  - type: reward
    reward_threshold: 0.75
    reward_dimensions: [helpfulness, honesty, instruction_following, depth]
normalizers:
  - type: exact_dedup
  - type: text_cleaner
generators:
  - type: preference
    preference_mode: single_call
exporters:
  - type: dpo
YAML for a local vLLM endpoint
llm:
  model: openai/Qwen/Qwen3-8B
  api_base: http://localhost:8000/v1
  api_key: token-abc123
  temperature: 0.7
  concurrency: 32
  extra_body:
    chat_template_kwargs:
      enable_thinking: false
YAML schema reference
Top-level keys
| Key | Type | Default | Description |
|---|---|---|---|
| name | string | "curatorkit_pipeline" | Pipeline name (written to manifest) |
| version | string | "0.2.0" | Config version (written to manifest) |
| readers | list[ReaderConfig] | [] | Input sources |
| gates | list[GateConfig] | [] | Quality filters (schema runs before generators; others run after) |
| normalizers | list[NormalizerConfig] | [] | Dedup, cleaning, sampling |
| generators | list[GenerationConfig] | [] | LLM generation tasks |
| exporters | list[ExporterConfig] | [] | Output formats |
| llm | LLMConfig | null | Global LLM config |
| diagnostic | DiagnosticConfig | null | Inline probe config |
| max_samples | int | null | Cap total samples after all readers |
| output_split | dict[str, float] | null | Train/val/test split fractions (must sum to 1.0) |
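The "must sum to 1.0" constraint on output_split is easy to get wrong with floating-point fractions, so a tolerance-based check is the usual approach. The sketch below shows one way to express it. The function name check_output_split is hypothetical, and both the tolerance and the rejection of negative fractions are assumptions, not documented CuratorKIT behaviour.

```python
import math
from typing import Dict


def check_output_split(split: Dict[str, float], tol: float = 1e-9) -> None:
    """Raise ValueError unless the split fractions sum to 1.0 within tol."""
    for name, fraction in split.items():
        # Assumption: a negative fraction is never meaningful.
        if fraction < 0.0:
            raise ValueError(f"split {name!r} has negative fraction {fraction}")
    # fsum avoids accumulating rounding error across the fractions.
    total = math.fsum(split.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"split fractions sum to {total}, expected 1.0")


# The split used in the full example on this page passes the check.
check_output_split({"train": 0.8, "val": 0.1, "test": 0.1})
```

Comparing with a tolerance rather than == matters because fractions such as 0.7 + 0.2 + 0.1 do not always add up to exactly 1.0 in binary floating point.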
Reader types
| type | Required fields | Notes |
|---|---|---|
| jsonl | path | none |
| json | path | Use json_data_key if records are nested |
| csv | path | csv_delimiter auto-detected for TSV |
| parquet | path | Requires [connectors] extra |
| huggingface | path (dataset name) | Use hf_split, hf_subset, hf_streaming, hf_token |
| pdf | path | Use chunk_strategy, chunk_max_tokens, output_mode |
All readers accept:
- format: "auto" (default) or explicit format name
- field_mapping: dict remapping source column names to DataSample fields
- detection_sample_size: how many rows to inspect for auto-detection
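As a rough illustration of what field_mapping does, the sketch below renames the keys of one source record. The target names instruction and output are hypothetical placeholders for DataSample fields, and passing unmapped columns through unchanged is an assumption about the behaviour, not something this page confirms.

```python
from typing import Any, Dict


def apply_field_mapping(
    record: Dict[str, Any], field_mapping: Dict[str, str]
) -> Dict[str, Any]:
    """Rename source columns to target field names.

    Keys are source column names, values are target field names.
    Assumption: columns missing from the mapping keep their original name.
    """
    return {field_mapping.get(key, key): value for key, value in record.items()}


row = {"question": "What is 2 + 2?", "answer": "4", "id": 7}
mapping = {"question": "instruction", "answer": "output"}  # hypothetical names
mapped = apply_field_mapping(row, mapping)
```

Here mapped carries the same values under the keys instruction, output, and id.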
Gate types
| type | Key fields |
|---|---|
| schema | min_tokens, max_tokens, required_fields, enforce_task_types |
| hallucination | hallucination_threshold (0.0–1.0), hallucination_llm_model |
| reward | reward_threshold, reward_dimensions, reward_llm_model |
| diversity | similarity_threshold, embedding_model, embedding_device |
Normalizer types
| type | Key fields |
|---|---|
| exact_dedup | none |
| minhash_dedup | minhash_threshold, minhash_ngram, minhash_num_perm |
| text_cleaner | transforms dict, clean_fields list |
| embedding_dedup | embedding_index_dir, embedding_model, embedding_threshold |
| stratified_sampler | category_field, target_distribution, sampler_seed |
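For readers unfamiliar with MinHash, the sketch below shows how the three minhash_dedup fields typically interact: the n-gram size controls shingling, the number of permutations controls estimate precision, and the threshold decides what counts as a near-duplicate. This is a generic standard-library illustration, not CuratorKIT's implementation. Word-level shingles, the blake2b hash, and the defaults of 3 and 128 are all assumptions; only the 0.85 threshold appears in the examples on this page.

```python
import hashlib
from typing import List, Set


def shingles(text: str, ngram: int = 3) -> Set[str]:
    """Word n-grams of the text; texts shorter than ngram give one shingle."""
    words = text.lower().split()
    if len(words) < ngram:
        return {" ".join(words)}
    return {" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)}


def minhash_signature(items: Set[str], num_perm: int = 128) -> List[int]:
    """Keep the minimum hash per seed; each seed stands in for a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching slots, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(a: str, b: str, threshold: float = 0.85,
                      ngram: int = 3, num_perm: int = 128) -> bool:
    """True when the estimated similarity reaches the threshold."""
    sig_a = minhash_signature(shingles(a, ngram), num_perm)
    sig_b = minhash_signature(shingles(b, ngram), num_perm)
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

Raising num_perm tightens the estimate at the cost of more hashing, while lowering the threshold removes more aggressively and risks dropping samples that are merely similar.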
Generator types
| type | Key fields |
|---|---|
| qa | num_questions, difficulty, prompt_template |
| preference | preference_mode (single_call/two_pass), prompt_template |
| grpo | num_responses, score_responses, temperature_spread (default 0.6) |
| multiturn | num_turns, include_context, prompt_template |
| evol_instruct | num_evolutions, strategies, generate_answers |
| cot | cot_mode (generate/wrap), prompt_template |
| adversarial_preference | injection_rate, injection_types, injection_seed |
| adversarial_qa | injection_rate, injection_types, injection_seed, high_temp |
In YAML, temperature_spread defaults to 0.6 — GRPO rollouts are sampled at temperatures spread around the base LLM temperature. Note that the Python API equivalent (grpo_temperature_spread on CuratorConfig) defaults to 0.0, so set it explicitly if you need identical behaviour across both channels.
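To see why the differing defaults matter, consider one plausible reading of "spread around the base temperature": rollouts evenly spaced across an interval of width temperature_spread centred on the base value. The exact formula CuratorKIT uses is not given on this page, so the function below (rollout_temperatures, a hypothetical name) is an assumption for illustration only.

```python
from typing import List


def rollout_temperatures(
    base: float, spread: float, num_responses: int
) -> List[float]:
    """Evenly space temperatures over [base - spread/2, base + spread/2].

    Assumption: this centred, evenly spaced layout is illustrative and is
    not confirmed as the formula CuratorKIT applies.
    """
    if num_responses < 1:
        raise ValueError("num_responses must be at least 1")
    if num_responses == 1 or spread == 0.0:
        # A spread of 0.0 means every rollout uses the base temperature.
        return [base] * num_responses
    low = base - spread / 2
    step = spread / (num_responses - 1)
    # Temperatures cannot be negative, so clamp at zero.
    return [max(0.0, low + i * step) for i in range(num_responses)]


yaml_default = rollout_temperatures(0.7, 0.6, 3)    # spread from the YAML default
python_default = rollout_temperatures(0.7, 0.0, 3)  # spread from the Python default
```

Under this reading, the YAML default samples three rollouts at roughly 0.4, 0.7, and 1.0, while the Python default samples all three at 0.7, so the two channels would produce differently diverse rollouts unless the value is set explicitly.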
DiagnosticConfig
diagnostic:
  enable_probe: true
  probe_temperatures: [0.3, 0.5]
  score_split: 0.5
  probe_generator_model: null  # null = use global llm.model
  extra_templates:
    strict_grounding: "Answer using only the passage.\n\nPassage:\n{source}\n\nQuestion:\n{question}"
    domain_specific: "You are a legal analyst. Answer from the passage only.\n\nPassage:\n{source}\n\nQuestion:\n{question}"
Python API vs YAML: feature parity
Most features available in CuratorConfig (Python) are also available in the YAML config. Key differences:
- preprocessing_fn: the Python API accepts a callable; YAML accepts a dotted module path string ("mymodule.my_fn") that is imported at runtime.
- Data hygiene (secrets_gate, pii_pseudonymize, toxicity_gate): these flag names exist only in the Python API. In YAML, use type: secrets and type: toxicity in gates, and type: pii_pseudonymizer in normalizers.
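Resolving a dotted path string such as "mymodule.my_fn" into a callable is commonly done with importlib. The sketch below shows the general technique; resolve_dotted_path is a hypothetical helper, not CuratorKIT's loader, and the error handling is an assumption.

```python
import importlib
from typing import Any, Callable


def resolve_dotted_path(dotted: str) -> Callable[..., Any]:
    """Import 'package.module.attr' and return the attribute as a callable."""
    module_name, sep, attr = dotted.rpartition(".")
    if not sep:
        raise ValueError(f"expected 'module.function', got {dotted!r}")
    module = importlib.import_module(module_name)
    fn = getattr(module, attr)
    if not callable(fn):
        raise TypeError(f"{dotted!r} does not refer to a callable")
    return fn


# A standard-library function stands in for a user-defined preprocessing_fn.
basename = resolve_dotted_path("os.path.basename")
```

For a lookup like this to work, the module has to be importable from the environment where the pipeline runs, for example installed in the same virtual environment or present on PYTHONPATH.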
For full control, use the Python API. For reproducible scheduled runs and CI, use the YAML CLI.