Guides

Each guide covers one stage of the pipeline in depth. All examples are runnable as written; configuration shown in Python applies equally to YAML pipelines.

  • Data sources

    Readers for JSONL, JSON, CSV, Parquet, HuggingFace Hub, and PDF. Field mapping, format detection, preprocessing functions, multi-source runs.

  • Generation

    The eight generation tasks (QA, preference, GRPO, multi-turn, Evol-Instruct, chain-of-thought, and the adversarial variants) and their prompt structures.

  • Quality gates

    The schema, hallucination, reward, and diversity gates: what each checks, rejection reasons, and threshold tuning.

  • Adaptive recovery

    How rejected samples are diagnosed and repaired: the inline diagnostic probe, the failure-mode taxonomy, and the reward refiner.

  • Data hygiene

    Secrets detection, PII pseudonymisation, and toxicity filtering as pipeline stages, in Python, YAML, and the CLI.

  • Exporters

    Alpaca, ShareGPT, DPO, GRPO, PPO, and corpus formats; trainer compatibility; train/val/test splits.

  • Train with AlignTune

    The curate-then-train workflow: export with CuratorKIT, publish to the Hub, fine-tune with AlignTune.

  • Customisation

    Custom prompt templates, LLM backends, reward rubrics, preprocessing functions, and extension points.