Skip to content

Tutorials

Hands-on notebooks covering every part of CuratorKIT, from ingestion and cleaning, through LLM generation and adaptive recovery, to data hygiene. Each notebook runs standalone: open it on GitHub, or launch it directly in Google Colab.

# Tutorial What you'll learn Links
01 Generate an SFT dataset from a PDF Turn any PDF into instruction-following (Alpaca-format) training data, as QA, evolved-instruction, or chain-of-thought tasks, with hallucination and reward gating built in. GitHub · Open In Colab
02 Generate DPO preference pairs Build chosen/rejected preference pairs from a PDF, with dual-scored gating that enforces quality contrast between the two answers. GitHub · Open In Colab
03 Generate GRPO rollouts Two-stage pipeline: turn a PDF into question prompts, then generate multiple reward-scored rollouts per question for GRPO training. GitHub · Open In Colab
04 Ingest multiple sources Merge three heterogeneous datasets (Alpaca, hh-rlhf, GSM8K) with per-source preprocessing functions, sample caps, deduplication, and stratified resampling. GitHub · Open In Colab
05 Clean and deduplicate a dataset The simplest pipeline: read one dataset with format auto-detection, then deduplicate, clean, filter, and export. No LLM required. GitHub · Open In Colab
06 Adaptive recovery Recover gate-rejected samples instead of discarding them, using inline diagnostic probes and a post-pipeline reward refiner. GitHub · Open In Colab
07 Adversarial generation Use custom prompt templates to generate deliberately contaminated data (credentials, PII, toxic content) for stress-testing the hygiene gates. GitHub · Open In Colab
08 Data hygiene pipeline Run SecretsGate, PIIPseudonymizer, and ToxicityGate over a contaminated dataset to catch secrets, pseudonymise PII, and reject toxic content with no LLM calls. GitHub · Open In Colab

Prefer plain scripts?

Script versions of these workflows live in examples/ — one file per workflow, with the required extras in each docstring.

What you'll need

  • No LLM required: notebooks 04, 05, and 08 run entirely locally and are the best place to start.
  • LLM endpoint required: notebooks 01-03, 06, and 07 need an OpenAI-compatible endpoint (a local vLLM or Ollama server, or any hosted API). Each notebook includes backend setup instructions.

Suggested learning path: start with 05 (cleaning and deduplication), move to 04 (multi-source ingestion), then work through generation (01-03), recovery (06), and hygiene (07-08).