Post-training data, curated with proof.¶

CuratorKIT builds LLM training datasets as a gated pipeline: ingest from any source, generate with any LLM, verify every sample against its source, recover what fails, and export trainer-ready formats, with a provenance manifest on every run.

Get started View on GitHub

v1.0 MIT Python 3.11+ core runs CPU-only any LiteLLM backend

flowchart LR
    A[Ingest] --> B[Clean + dedup]
    B --> C[Hygiene]
    C --> D[Generate]
    D --> E{Quality gates}
    E -->|pass| F[Export]
    E -->|reject| G[Adaptive recovery]
    G -->|recovered| E
    G -->|unrecoverable| H[rejected.jsonl]
    F --> I[manifest · dataset card · checksums]

Grounded hallucination gate

Each generated answer is verified against the exact source chunk it was generated from, not against the judge model's general knowledge.
Adaptive recovery

Rejections are diagnosed against a failure-mode taxonomy; the recoverable ones are repaired and re-gated instead of discarded.
Data hygiene

Secrets detection, PII pseudonymisation, and toxicity filtering run as pipeline stages, before any sample reaches a training file.
Any source

JSONL, JSON, CSV, Parquet, HuggingFace Hub, and layout-parsed PDFs. Multi-source runs support per-source field mapping.
Eight generation tasks

QA, preference pairs, GRPO rollouts, multi-turn, Evol-Instruct, chain-of-thought, and adversarial variants, on any LiteLLM backend or local Ollama/vLLM.
Trainer-ready exports

Alpaca, ShareGPT, DPO, GRPO, and PPO with train/val/test splits, consumed directly by TRL and AlignTune.

Sixty seconds¶

PythonCLI

from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset        = {"name": "tatsu-lab/alpaca", "max_samples": 2000},
    dedup          = "minhash",
    clean          = True,
    export_formats = ["alpaca", "sharegpt"],
    output_dir     = "output/clean",
)).run()

result.print_summary()

No LLM or API key needed for cleaning and dedup. Add llm_model and generation_task for gated synthetic generation; the generation guide covers it.

pip install "curatorkit[all]"
curatorkit run pipeline.yaml --output-dir output/

Pipelines are declarative YAML, validated before anything runs. A runnable no-API-key example ships in examples/quickstart/; the schema is in the CLI reference.

Every run writes manifest.json, rejected.jsonl, dataset_card.md, and checksums.txt alongside the export files.

Where next¶


Getting started	Install, the three usage patterns, reading output
Guides	Each pipeline stage in depth
Configuration	Every `CuratorConfig` parameter
API reference	Generated from the source docstrings
Tutorials	Nine notebooks, each runnable in Colab
Roadmap	Where 1.0 goes from here

Run the tutorials in Colab¶


01 Generate an SFT dataset from a PDF		LLM endpoint
02 Generate DPO preference pairs		LLM endpoint
03 Generate GRPO rollouts		LLM endpoint
04 Ingest multiple sources		no LLM
05 Clean and deduplicate a dataset		no LLM
06 Adaptive recovery		LLM endpoint
07 Adversarial generation		LLM endpoint
08 Data hygiene pipeline		no LLM
09 Filtered vs unfiltered fine-tuning		LLM endpoint + GPU

New to the library? Start with 05, then 04, then the generation notebooks. The tutorials index has full descriptions.

Cite¶

If you use CuratorKIT in your research, please cite the library and the relevant paper(s):

@software{curatorkit2026,
  author    = {Bhattacharjee, Soham and Sharma, Karun and Sankarapu, Vinay Kumar and Seth, Pratinav},
  title     = {CuratorKIT: Data Curation and Synthetic Data Generation for LLM Post-Training},
  year      = {2026},
  publisher = {Lexsi Labs},
  url       = {https://github.com/Lexsi-Labs/CuratorKIT}
}

@misc{bhattacharjee2026curatorkitdatacuration,
      title={CuratorKIT : Data Curation and Synthetic Data Generation for LLM Post-Training},
      author={Soham Bhattacharjee and Karun Sharma and Vinay Kumar Sankarapu and Pratinav Seth},
      year={2026},
      eprint={2606.21631},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.21631},
}

@misc{bhattacharjee2026provenancegroundedgatingadaptiverecovery,
      title={Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation},
      author={Soham Bhattacharjee and Karun Sharma and Vinay Kumar Sankarapu and Pratinav Seth},
      year={2026},
      eprint={2606.11127},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.11127},
}

CuratorKIT is built by Lexsi Labs alongside AlignTune, which consumes its exports natively: curate here, train there.