Changelog¶
All notable changes to CuratorKIT are documented here. The format follows Keep a Changelog, and the project adheres to Semantic Versioning.
Unreleased¶
Added¶
- SFT exporters (Alpaca, ShareGPT) warn when exported rows have empty instruction/output, catching task-type/format mismatches that previously produced silent empty datasets.
LiteLLMBackendraises a helpfulImportErrorat construction when thegenerationextra is missing, instead of rejecting every sample mid-run.
1.0.0 - 2026-06-12¶
First public release.
Added¶
- Data hygiene gates:
SecretsGate(credential/API-key detection),ToxicityGate(local classifier with optional LLM-judge escalation), andPIIPseudonymizer(Presidio-based entity replacement), available in Python, YAML, and CLI channels. - Adversarial generation tasks:
adversarial_qaandadversarial_preference. - Layout-aware PDF ingestion via the MinerU 3.x SDK (
pdfextra). - Tutorial notebooks covering generation, ingestion, cleaning, recovery, adversarial data, and hygiene, each runnable in Colab.
- Streaming ingestion support for large HuggingFace datasets.
- Documentation site with guides, config reference, architecture notes, tutorials, and FAQ.
- CI (lint, test matrix on Python 3.11-3.13, wheel-build validation, quickstart e2e, docs build), docs deployment, and PyPI publishing workflows.
py.typedmarker: the package ships its type annotations (PEP 561).
Fixed¶
- Async event-loop handling in notebooks/Jupyter; missing exporter imports in split exports.
0.2.0 - 2026-04¶
Added¶
- LLM generation tasks: QA, preference pairs, GRPO rollouts, multi-turn, Evol-Instruct, and chain-of-thought, via LiteLLM-compatible APIs and local Ollama.
- Quality gates: provenance-grounded hallucination gate, multi-dimension reward gate, and embedding-based diversity gate.
- Adaptive recovery: inline diagnostic probe, failure-mode taxonomy, and reward refiner.
- Trainer-ready exporters: Alpaca, ShareGPT, DPO, GRPO, PPO with train/val/test splits.
- Declarative YAML pipelines and the
curatorkitCLI. - Provenance manifest, dataset card, rejection log, and checksums on every run.
0.1.0 - 2026-03¶
Added¶
- Core ingestion connectors: JSONL, JSON, CSV, Parquet, HuggingFace datasets, PDF.
- Cleaning and deduplication: text cleaner, exact and MinHash dedup, stratified sampling.
- Schema gate and the
DataSample/ provenance record data model.