CLI and YAML Pipeline

CuratorKIT has two entry points:

Channel	When to use
Python API (`CuratorConfig` + `Curator`)	Scripting, notebooks, programmatic control
CLI + YAML (`curatorkit run pipeline.yaml`)	Reproducible runs, CI/CD, no-code configuration

Both produce identical output. The YAML config maps to the same pipeline steps as CuratorConfig.

CLI commands

curatorkit --help

`run`: execute a pipeline

curatorkit run pipeline.yaml
curatorkit run pipeline.yaml --output-dir output/run1
curatorkit run pipeline.yaml --dry-run          # validate config and print plan, no execution
curatorkit run pipeline.yaml --async            # async runner (faster for generation tasks)
curatorkit run pipeline.yaml --verbose          # show per-stage counts during run
curatorkit run pipeline.yaml --reset-index      # clear persistent embedding index before run

--dry-run is useful before an expensive generation run — it validates the YAML, prints every step in order with its config, and exits. No LLM calls are made.
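In CI, the dry run can serve as a cheap gate in front of the real run. The sketch below wraps the commands shown above with Python's `subprocess` module. It assumes `curatorkit` is on `PATH` and that a failed validation exits non-zero, which this page does not state explicitly for `run`. The helper names `build_run_command` and `validate_then_run` are hypothetical.

```python
import subprocess


def build_run_command(config_path, output_dir=None, dry_run=False):
    """Assemble a `curatorkit run` invocation using only flags documented above."""
    cmd = ["curatorkit", "run", config_path]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def validate_then_run(config_path, output_dir=None):
    """Dry-run first; start the real run only if validation succeeded.

    Assumption: a failed --dry-run returns a non-zero exit code.
    """
    check = subprocess.run(build_run_command(config_path, dry_run=True))
    if check.returncode != 0:
        return check.returncode
    real = subprocess.run(build_run_command(config_path, output_dir=output_dir))
    return real.returncode
```

Calling `validate_then_run("pipeline.yaml", "output/run1")` from a CI step would then skip the expensive run whenever the config does not validate.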

`setup-pdf`: verify the PDF parsing setup

PDF parsing uses MinerU, installed via the pdf extra. Model weights download automatically on the first PDF parse — there is no manual download step.

pip install "curatorkit[pdf]"
curatorkit setup-pdf            # verify MinerU is installed and importable
curatorkit setup-pdf --check    # same verification; --check is the only flag

If MinerU is missing, the command exits non-zero and prints the install instructions above.

YAML pipeline config

The YAML file is parsed and validated by Pydantic before any step runs. A malformed config produces a clear error immediately — not a crash mid-pipeline.

Minimal example: clean and deduplicate

name: clean_alpaca
version: "1.0"

readers:
  - type: huggingface
    path: tatsu-lab/alpaca
    hf_split: train

gates:
  - type: schema
    min_tokens: 10
    max_tokens: 2048

normalizers:
  - type: exact_dedup
  - type: text_cleaner

exporters:
  - type: alpaca
  - type: sharegpt

Full example: QA generation with filtering and recovery

name: qa_handbook
version: "1.0"

# Global LLM config — shared by all generation tasks and gates unless overridden
llm:
  model: openai/gpt-4o-mini
  temperature: 0.7
  max_tokens: 1024
  concurrency: 10
  api_base: null                            # null = use provider default

readers:
  - type: pdf
    path: docs/handbook.pdf
    chunk_strategy: heading
    chunk_max_tokens: 512
    chunk_overlap_tokens: 50

gates:
  - type: schema
    min_tokens: 20
    max_tokens: 4096
  - type: hallucination
    hallucination_threshold: 0.7
    hallucination_llm_model: null           # null = use global llm.model
  - type: reward
    reward_threshold: 0.7
    reward_dimensions: [helpfulness, honesty, instruction_following, depth]
    reward_llm_model: null

normalizers:
  - type: exact_dedup
  - type: minhash_dedup
    minhash_threshold: 0.85
  - type: text_cleaner

generators:
  - type: qa
    num_questions: 3
    difficulty: medium

# Adaptive recovery — attaches to HallucinationGate
diagnostic:
  enable_probe: true
  probe_temperatures: [0.3, 0.5]
  score_split: 0.5

exporters:
  - type: alpaca
  - type: sharegpt

output_split:
  train: 0.8
  val: 0.1
  test: 0.1

DPO preference pair example

name: dpo_from_pdf
version: "1.0"

llm:
  model: openai/gpt-4o-mini
  temperature: 0.7
  max_tokens: 2048
  concurrency: 16

readers:
  - type: pdf
    path: docs/policy.pdf
    chunk_strategy: heading

# One gates list — the schema gate runs before generators, the others after
gates:
  - type: schema
    min_tokens: 30
    max_tokens: 4096
  - type: hallucination
    hallucination_threshold: 0.7
  - type: reward
    reward_threshold: 0.75
    reward_dimensions: [helpfulness, honesty, instruction_following, depth]

normalizers:
  - type: exact_dedup
  - type: text_cleaner

generators:
  - type: preference
    preference_mode: single_call

exporters:
  - type: dpo
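
The ordering rule noted in the config comment (the schema gate runs before generators, the remaining gates after) can be pictured as a partition of the single `gates` list. This is an illustrative sketch only, not CuratorKIT's internal code: `partition_gates` is a hypothetical helper, and it assumes gates keep their relative order within each phase.

```python
def partition_gates(gates):
    """Split gate configs into pre-generation and post-generation phases."""
    pre = [g for g in gates if g["type"] == "schema"]
    post = [g for g in gates if g["type"] != "schema"]
    return pre, post


# Gate entries copied from the DPO example above.
gates = [
    {"type": "schema", "min_tokens": 30, "max_tokens": 4096},
    {"type": "hallucination", "hallucination_threshold": 0.7},
    {"type": "reward", "reward_threshold": 0.75},
]

pre, post = partition_gates(gates)
print([g["type"] for g in pre])   # ['schema']
print([g["type"] for g in post])  # ['hallucination', 'reward']
```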

YAML for a local vLLM endpoint

llm:
  model: openai/Qwen/Qwen3-8B
  api_base: http://localhost:8000/v1
  api_key: token-abc123
  temperature: 0.7
  concurrency: 32
  extra_body:
    chat_template_kwargs:
      enable_thinking: false

YAML schema reference

Top-level keys

Key	Type	Default	Description
`name`	string	`"curatorkit_pipeline"`	Pipeline name (written to manifest)
`version`	string	`"0.2.0"`	Config version (written to manifest)
`readers`	list[ReaderConfig]	`[]`	Input sources
`gates`	list[GateConfig]	`[]`	Quality filters (schema runs before generators; others run after)
`normalizers`	list[NormalizerConfig]	`[]`	Dedup, cleaning, sampling
`generators`	list[GenerationConfig]	`[]`	LLM generation tasks
`exporters`	list[ExporterConfig]	`[]`	Output formats
`llm`	LLMConfig	null	Global LLM config
`diagnostic`	DiagnosticConfig	null	Inline probe config
`max_samples`	int	null	Cap total samples after all readers
`output_split`	dict[str, float]	null	Train/val/test split fractions (must sum to 1.0)
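
To make the `output_split` constraint concrete, the sketch below checks that the fractions sum to 1.0 and converts them into per-split sample counts. The rounding policy is an assumption (each split is floored and any leftover samples go to the first split); this page does not say how CuratorKIT rounds. `split_counts` is a hypothetical helper.

```python
import math


def split_counts(total, fractions):
    """Convert split fractions into sample counts for `total` samples.

    Raises ValueError if the fractions do not sum to 1.0.
    Assumption: floor each split, then give the remainder to the first split.
    """
    if not math.isclose(sum(fractions.values()), 1.0, abs_tol=1e-9):
        raise ValueError(f"split fractions must sum to 1.0, got {sum(fractions.values())}")
    # The small epsilon guards against products like 28.999999999999996.
    counts = {name: math.floor(total * frac + 1e-9) for name, frac in fractions.items()}
    remainder = total - sum(counts.values())
    first = next(iter(counts))
    counts[first] += remainder
    return counts


print(split_counts(1000, {"train": 0.8, "val": 0.1, "test": 0.1}))
# {'train': 800, 'val': 100, 'test': 100}
```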

Reader types

`type`	Required fields	Notes
`jsonl`	`path`	—
`json`	`path`	Use `json_data_key` if records are nested
`csv`	`path`	`csv_delimiter` auto-detected for TSV
`parquet`	`path`	Requires `[connectors]` extra
`huggingface`	`path` (dataset name)	Use `hf_split`, `hf_subset`, `hf_streaming`, `hf_token`
`pdf`	`path`	Use `chunk_strategy`, `chunk_max_tokens`, `output_mode`

All readers accept:

- format: "auto" (default) or an explicit format name
- field_mapping: dict remapping source column names to DataSample fields
- detection_sample_size: how many rows to inspect for auto-detection
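
As an illustration of what `field_mapping` does, the sketch below renames the keys of one source row. It assumes the mapping's keys are source column names and its values are target field names, and that unmapped columns pass through unchanged. The target names `instruction` and `output` are placeholders in the Alpaca style, not confirmed DataSample fields, and `apply_field_mapping` is a hypothetical helper.

```python
def apply_field_mapping(row, field_mapping):
    """Rename source columns to target field names; unmapped columns pass through."""
    return {field_mapping.get(key, key): value for key, value in row.items()}


row = {"question": "What is 2 + 2?", "answer": "4", "source_id": 17}
mapping = {"question": "instruction", "answer": "output"}  # assumed target names

print(apply_field_mapping(row, mapping))
# {'instruction': 'What is 2 + 2?', 'output': '4', 'source_id': 17}
```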

Gate types

`type`	Key fields
`schema`	`min_tokens`, `max_tokens`, `required_fields`, `enforce_task_types`
`hallucination`	`hallucination_threshold` (0.0–1.0), `hallucination_llm_model`
`reward`	`reward_threshold`, `reward_dimensions`, `reward_llm_model`
`diversity`	`similarity_threshold`, `embedding_model`, `embedding_device`

Normalizer types

`type`	Key fields
`exact_dedup`	—
`minhash_dedup`	`minhash_threshold`, `minhash_ngram`, `minhash_num_perm`
`text_cleaner`	`transforms` dict, `clean_fields` list
`embedding_dedup`	`embedding_index_dir`, `embedding_model`, `embedding_threshold`
`stratified_sampler`	`category_field`, `target_distribution`, `sampler_seed`
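
For intuition about `minhash_threshold`: MinHash estimates the Jaccard similarity between the n-gram sets of two texts. The sketch below computes that similarity exactly rather than with MinHash, so it shows what the threshold is compared against, not how CuratorKIT computes it. Word-level trigrams are an assumption, since this page does not say whether `minhash_ngram` counts words or characters.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=3):
    """Exact Jaccard similarity of two texts' n-gram sets."""
    set_a, set_b = word_ngrams(a, n), word_ngrams(b, n)
    if not set_a and not set_b:
        return 1.0  # both too short to shingle; treat as identical
    return len(set_a & set_b) / len(set_a | set_b)


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the lazy cat"
print(jaccard(a, b))  # 0.75
```

With a threshold of 0.85, as in the full example above, this pair scores below the threshold. Short texts are sensitive to single-word changes, which is worth keeping in mind when tuning the value.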

Generator types

`type`	Key fields
`qa`	`num_questions`, `difficulty`, `prompt_template`
`preference`	`preference_mode` (`single_call`/`two_pass`), `prompt_template`
`grpo`	`num_responses`, `score_responses`, `temperature_spread` (default `0.6`)
`multiturn`	`num_turns`, `include_context`, `prompt_template`
`evol_instruct`	`num_evolutions`, `strategies`, `generate_answers`
`cot`	`cot_mode` (`generate`/`wrap`), `prompt_template`
`adversarial_preference`	`injection_rate`, `injection_types`, `injection_seed`
`adversarial_qa`	`injection_rate`, `injection_types`, `injection_seed`, `high_temp`

In YAML, temperature_spread defaults to 0.6 — GRPO rollouts are sampled at temperatures spread around the base LLM temperature. Note that the Python API equivalent (grpo_temperature_spread on CuratorConfig) defaults to 0.0, so set it explicitly if you need identical behaviour across both channels.
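
To see why the two defaults matter, the sketch below shows one plausible way to spread rollout temperatures around a base value. The exact formula CuratorKIT uses is not documented on this page; evenly spacing values across `base ± spread / 2` is an assumption, and `rollout_temperatures` is a hypothetical helper.

```python
def rollout_temperatures(base, spread, num_responses):
    """Evenly space sampling temperatures across [base - spread/2, base + spread/2].

    Assumed formula, for illustration only. Values are clamped at 0.0.
    """
    if num_responses == 1 or spread == 0:
        return [base] * num_responses
    low = base - spread / 2
    step = spread / (num_responses - 1)
    return [round(max(0.0, low + i * step), 4) for i in range(num_responses)]


print(rollout_temperatures(0.7, 0.6, 4))  # [0.4, 0.6, 0.8, 1.0]
print(rollout_temperatures(0.7, 0.0, 4))  # [0.7, 0.7, 0.7, 0.7]
```

Under this assumption, a spread of 0.6 samples the rollouts at four different temperatures, while a spread of 0.0 samples them all at the base temperature.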

DiagnosticConfig

diagnostic:
  enable_probe: true
  probe_temperatures: [0.3, 0.5]
  score_split: 0.5
  probe_generator_model: null       # null = use global llm.model
  extra_templates:
    strict_grounding: "Answer using only the passage.\n\nPassage:\n{source}\n\nQuestion:\n{question}"
    domain_specific:  "You are a legal analyst. Answer from the passage only.\n\nPassage:\n{source}\n\nQuestion:\n{question}"

Python API vs YAML: feature parity

Most features available in CuratorConfig (Python) are also available in the YAML config. Key differences:

preprocessing_fn: Python API accepts a callable; YAML accepts a dotted module path string ("mymodule.my_fn") that is imported at runtime.
Data hygiene (secrets_gate, pii_pseudonymize, toxicity_gate): these names are Python API flags. In YAML, the equivalents are type: secrets and type: toxicity entries in gates, and a type: pii_pseudonymizer entry in normalizers.
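
The dotted-path form of preprocessing_fn relies on a standard Python pattern: split the string at the last dot, import the module, and look up the attribute. The sketch below shows that pattern with `importlib`. It is not CuratorKIT's own loader, and `resolve_dotted_path` is a hypothetical name.

```python
import importlib


def resolve_dotted_path(dotted_path):
    """Import 'package.module.attr' and return the callable it names."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    fn = getattr(module, attr_name)
    if not callable(fn):
        raise TypeError(f"{dotted_path!r} is not callable")
    return fn


# Standard-library function used as a stand-in for "mymodule.my_fn".
capwords = resolve_dotted_path("string.capwords")
print(capwords("hello world"))  # Hello World
```

For a path such as "mymodule.my_fn" to resolve this way, the module has to be importable from the environment where the CLI runs, for example by being installed or on `PYTHONPATH`.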

For full control, use the Python API. For reproducible scheduled runs and CI, use the CLI with a YAML config.