Guides

Each guide covers one stage of the pipeline in depth. All examples are runnable as written; configuration shown in Python applies equally to YAML pipelines.

Data sources

Readers for JSONL, JSON, CSV, Parquet, HuggingFace Hub, and PDF. Field mapping, format detection, preprocessing functions, and multi-source runs.
Generation

The eight generation tasks (QA, preference, GRPO, multi-turn, Evol-Instruct, chain-of-thought, and the adversarial variants) and their prompt structures.
Quality gates

The schema, hallucination, reward, and diversity gates: what each checks, rejection reasons, and threshold tuning.
Adaptive recovery

How rejected samples are diagnosed and repaired: the inline diagnostic probe, the failure-mode taxonomy, and the reward refiner.
Data hygiene

Secrets detection, PII pseudonymisation, and toxicity filtering as pipeline stages, configurable from Python, YAML, or the CLI.
Exporters

Alpaca, ShareGPT, DPO, GRPO, PPO, and corpus formats; trainer compatibility; train/val/test splits.
Train with AlignTune

The curate-then-train workflow: export with CuratorKIT, publish to the Hub, fine-tune with AlignTune.
Customisation

Custom prompt templates, LLM backends, reward rubrics, preprocessing functions, and extension points.