Skip to content

Changelog

All notable changes to CuratorKIT are documented here. The format follows Keep a Changelog, and the project adheres to Semantic Versioning.

Unreleased

Added

  • SFT exporters (Alpaca, ShareGPT) warn when exported rows have empty instruction/output, catching task-type/format mismatches that previously produced silent empty datasets.
  • LiteLLMBackend raises a helpful ImportError at construction when the generation extra is missing, instead of rejecting every sample mid-run.

1.0.0 - 2026-06-12

First public release.

Added

  • Data hygiene gates: SecretsGate (credential/API-key detection), ToxicityGate (local classifier with optional LLM-judge escalation), and PIIPseudonymizer (Presidio-based entity replacement), available in Python, YAML, and CLI channels.
  • Adversarial generation tasks: adversarial_qa and adversarial_preference.
  • Layout-aware PDF ingestion via the MinerU 3.x SDK (pdf extra).
  • Tutorial notebooks covering generation, ingestion, cleaning, recovery, adversarial data, and hygiene, each runnable in Colab.
  • Streaming ingestion support for large HuggingFace datasets.
  • Documentation site with guides, config reference, architecture notes, tutorials, and FAQ.
  • CI (lint, test matrix on Python 3.11-3.13, wheel-build validation, quickstart e2e, docs build), docs deployment, and PyPI publishing workflows.
  • py.typed marker: the package ships its type annotations (PEP 561).

Fixed

  • Async event-loop handling in notebooks/Jupyter; missing exporter imports in split exports.

0.2.0 - 2026-04

Added

  • LLM generation tasks: QA, preference pairs, GRPO rollouts, multi-turn, Evol-Instruct, and chain-of-thought, via LiteLLM-compatible APIs and local Ollama.
  • Quality gates: provenance-grounded hallucination gate, multi-dimension reward gate, and embedding-based diversity gate.
  • Adaptive recovery: inline diagnostic probe, failure-mode taxonomy, and reward refiner.
  • Trainer-ready exporters: Alpaca, ShareGPT, DPO, GRPO, PPO with train/val/test splits.
  • Declarative YAML pipelines and the curatorkit CLI.
  • Provenance manifest, dataset card, rejection log, and checksums on every run.

0.1.0 - 2026-03

Added

  • Core ingestion connectors: JSONL, JSON, CSV, Parquet, HuggingFace datasets, PDF.
  • Cleaning and deduplication: text cleaner, exact and MinHash dedup, stratified sampling.
  • Schema gate and the DataSample / provenance record data model.