![]()
Post-training data, curated with proof.¶
CuratorKIT builds LLM training datasets as a gated pipeline: ingest from any source, generate with any LLM, verify every sample against its source, recover what fails, and export trainer-ready formats, with a provenance manifest on every run.
v1.0 MIT Python 3.11+ core runs CPU-only any LiteLLM backend
flowchart LR
A[Ingest] --> B[Clean + dedup]
B --> C[Hygiene]
C --> D[Generate]
D --> E{Quality gates}
E -->|pass| F[Export]
E -->|reject| G[Adaptive recovery]
G -->|recovered| E
G -->|unrecoverable| H[rejected.jsonl]
F --> I[manifest · dataset card · checksums]
-
Grounded hallucination gate
Each generated answer is verified against the exact source chunk it was generated from, not against the judge model's general knowledge.
-
Adaptive recovery
Rejections are diagnosed against a failure-mode taxonomy; the recoverable ones are repaired and re-gated instead of discarded.
-
Data hygiene
Secrets detection, PII pseudonymisation, and toxicity filtering run as pipeline stages, before any sample reaches a training file.
-
Any source
JSONL, JSON, CSV, Parquet, HuggingFace Hub, and layout-parsed PDFs. Multi-source runs support per-source field mapping.
-
Eight generation tasks
QA, preference pairs, GRPO rollouts, multi-turn, Evol-Instruct, chain-of-thought, and adversarial variants, on any LiteLLM backend or local Ollama/vLLM.
-
Trainer-ready exports
Alpaca, ShareGPT, DPO, GRPO, and PPO with train/val/test splits, consumed directly by TRL and AlignTune.
Sixty seconds¶
from curatorkit import Curator, CuratorConfig
result = Curator(CuratorConfig(
dataset = {"name": "tatsu-lab/alpaca", "max_samples": 2000},
dedup = "minhash",
clean = True,
export_formats = ["alpaca", "sharegpt"],
output_dir = "output/clean",
)).run()
result.print_summary()
No LLM or API key needed for cleaning and dedup. Add llm_model and
generation_task for gated synthetic generation; the
generation guide covers it.
Pipelines are declarative YAML, validated before anything runs. A runnable
no-API-key example ships in
examples/quickstart/;
the schema is in the CLI reference.
Every run writes manifest.json, rejected.jsonl, dataset_card.md, and
checksums.txt alongside the export files.
Where next¶
| Getting started | Install, the three usage patterns, reading output |
| Guides | Each pipeline stage in depth |
| Configuration | Every CuratorConfig parameter |
| API reference | Generated from the source docstrings |
| Tutorials | Nine notebooks, each runnable in Colab |
| Roadmap | Where 1.0 goes from here |
Run the tutorials in Colab¶
New to the library? Start with 05, then 04, then the generation notebooks. The tutorials index has full descriptions.
Cite¶
If you use CuratorKIT in your research, please cite the library and the relevant paper(s):
@software{curatorkit2026,
author = {Bhattacharjee, Soham and Sharma, Karun and Sankarapu, Vinay Kumar and Seth, Pratinav},
title = {CuratorKIT: Data Curation and Synthetic Data Generation for LLM Post-Training},
year = {2026},
publisher = {Lexsi Labs},
url = {https://github.com/Lexsi-Labs/CuratorKIT}
}
@misc{bhattacharjee2026curatorkitdatacuration,
title={CuratorKIT : Data Curation and Synthetic Data Generation for LLM Post-Training},
author={Soham Bhattacharjee and Karun Sharma and Vinay Kumar Sankarapu and Pratinav Seth},
year={2026},
eprint={2606.21631},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.21631},
}
@misc{bhattacharjee2026provenancegroundedgatingadaptiverecovery,
title={Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation},
author={Soham Bhattacharjee and Karun Sharma and Vinay Kumar Sankarapu and Pratinav Seth},
year={2026},
eprint={2606.11127},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.11127},
}
CuratorKIT is built by Lexsi Labs alongside AlignTune, which consumes its exports natively: curate here, train there.