Tutorials¶
Hands-on notebooks covering every part of CuratorKIT, from ingestion and cleaning, through LLM generation and adaptive recovery, to data hygiene. Each notebook runs standalone: open it on GitHub, or launch it directly in Google Colab.
| # | Tutorial | What you'll learn | Links |
|---|---|---|---|
| 01 | Generate an SFT dataset from a PDF | Turn any PDF into instruction-following (Alpaca-format) training data, as QA, evolved-instruction, or chain-of-thought tasks, with hallucination and reward gating built in. | GitHub · |
| 02 | Generate DPO preference pairs | Build chosen/rejected preference pairs from a PDF, with dual-scored gating that enforces quality contrast between the two answers. | GitHub · |
| 03 | Generate GRPO rollouts | Two-stage pipeline: turn a PDF into question prompts, then generate multiple reward-scored rollouts per question for GRPO training. | GitHub · |
| 04 | Ingest multiple sources | Merge three heterogeneous datasets (Alpaca, hh-rlhf, GSM8K) with per-source preprocessing functions, sample caps, deduplication, and stratified resampling. | GitHub · |
| 05 | Clean and deduplicate a dataset | The simplest pipeline: read one dataset with format auto-detection, then deduplicate, clean, filter, and export. No LLM required. | GitHub · |
| 06 | Adaptive recovery | Recover gate-rejected samples instead of discarding them, using inline diagnostic probes and a post-pipeline reward refiner. | GitHub · |
| 07 | Adversarial generation | Use custom prompt templates to generate deliberately contaminated data (credentials, PII, toxic content) for stress-testing the hygiene gates. | GitHub · |
| 08 | Data hygiene pipeline | Run SecretsGate, PIIPseudonymizer, and ToxicityGate over a contaminated dataset to catch secrets, pseudonymise PII, and reject toxic content with no LLM calls. |
GitHub · |
Prefer plain scripts?¶
Script versions of these workflows live in
examples/ — one
file per workflow, with the required extras in each docstring.
What you'll need¶
- No LLM required: notebooks 04, 05, and 08 run entirely locally and are the best place to start.
- LLM endpoint required: notebooks 01-03, 06, and 07 need an OpenAI-compatible endpoint (a local vLLM or Ollama server, or any hosted API). Each notebook includes backend setup instructions.
Suggested learning path: start with 05 (cleaning and deduplication), move to 04 (multi-source ingestion), then work through generation (01-03), recovery (06), and hygiene (07-08).