Tutorials¶

Hands-on notebooks covering every part of CuratorKIT, from ingestion and cleaning, through LLM generation and adaptive recovery, to data hygiene. Each notebook runs standalone: open it on GitHub, or launch it directly in Google Colab.

#	Tutorial	What you'll learn	Links
01	Generate an SFT dataset from a PDF	Turn any PDF into instruction-following (Alpaca-format) training data, as QA, evolved-instruction, or chain-of-thought tasks, with hallucination and reward gating built in.	GitHub ·
02	Generate DPO preference pairs	Build chosen/rejected preference pairs from a PDF, with dual-scored gating that enforces quality contrast between the two answers.	GitHub ·
03	Generate GRPO rollouts	Two-stage pipeline: turn a PDF into question prompts, then generate multiple reward-scored rollouts per question for GRPO training.	GitHub ·
04	Ingest multiple sources	Merge three heterogeneous datasets (Alpaca, hh-rlhf, GSM8K) with per-source preprocessing functions, sample caps, deduplication, and stratified resampling.	GitHub ·
05	Clean and deduplicate a dataset	The simplest pipeline: read one dataset with format auto-detection, then deduplicate, clean, filter, and export. No LLM required.	GitHub ·
06	Adaptive recovery	Recover gate-rejected samples instead of discarding them, using inline diagnostic probes and a post-pipeline reward refiner.	GitHub ·
07	Adversarial generation	Use custom prompt templates to generate deliberately contaminated data (credentials, PII, toxic content) for stress-testing the hygiene gates.	GitHub ·
08	Data hygiene pipeline	Run `SecretsGate`, `PIIPseudonymizer`, and `ToxicityGate` over a contaminated dataset to catch secrets, pseudonymise PII, and reject toxic content with no LLM calls.	GitHub ·

Prefer plain scripts?¶

Script versions of these workflows live in examples/ — one file per workflow, with the required extras in each docstring.

What you'll need¶

No LLM required: notebooks 04, 05, and 08 run entirely locally and are the best place to start.
LLM endpoint required: notebooks 01-03, 06, and 07 need an OpenAI-compatible endpoint (a local vLLM or Ollama server, or any hosted API). Each notebook includes backend setup instructions.

Suggested learning path: start with 05 (cleaning and deduplication), move to 04 (multi-source ingestion), then work through generation (01-03), recovery (06), and hygiene (07-08).