Skip to content

Post-training data, curated with proof.

CuratorKIT builds LLM training datasets as a gated pipeline: ingest from any source, generate with any LLM, verify every sample against its source, recover what fails, and export trainer-ready formats, with a provenance manifest on every run.

Get started View on GitHub

v1.0 MIT Python 3.11+ core runs CPU-only any LiteLLM backend

flowchart LR
    A[Ingest] --> B[Clean + dedup]
    B --> C[Hygiene]
    C --> D[Generate]
    D --> E{Quality gates}
    E -->|pass| F[Export]
    E -->|reject| G[Adaptive recovery]
    G -->|recovered| E
    G -->|unrecoverable| H[rejected.jsonl]
    F --> I[manifest · dataset card · checksums]
  • Grounded hallucination gate


    Each generated answer is verified against the exact source chunk it was generated from, not against the judge model's general knowledge.

  • Adaptive recovery


    Rejections are diagnosed against a failure-mode taxonomy; the recoverable ones are repaired and re-gated instead of discarded.

  • Data hygiene


    Secrets detection, PII pseudonymisation, and toxicity filtering run as pipeline stages, before any sample reaches a training file.

  • Any source


    JSONL, JSON, CSV, Parquet, HuggingFace Hub, and layout-parsed PDFs. Multi-source runs support per-source field mapping.

  • Eight generation tasks


    QA, preference pairs, GRPO rollouts, multi-turn, Evol-Instruct, chain-of-thought, and adversarial variants, on any LiteLLM backend or local Ollama/vLLM.

  • Trainer-ready exports


    Alpaca, ShareGPT, DPO, GRPO, and PPO with train/val/test splits, consumed directly by TRL and AlignTune.

Sixty seconds

from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset        = {"name": "tatsu-lab/alpaca", "max_samples": 2000},
    dedup          = "minhash",
    clean          = True,
    export_formats = ["alpaca", "sharegpt"],
    output_dir     = "output/clean",
)).run()

result.print_summary()

No LLM or API key needed for cleaning and dedup. Add llm_model and generation_task for gated synthetic generation; the generation guide covers it.

pip install "curatorkit[all]"
curatorkit run pipeline.yaml --output-dir output/

Pipelines are declarative YAML, validated before anything runs. A runnable no-API-key example ships in examples/quickstart/; the schema is in the CLI reference.

Every run writes manifest.json, rejected.jsonl, dataset_card.md, and checksums.txt alongside the export files.

Where next

Getting started Install, the three usage patterns, reading output
Guides Each pipeline stage in depth
Configuration Every CuratorConfig parameter
API reference Generated from the source docstrings
Tutorials Nine notebooks, each runnable in Colab
Roadmap Where 1.0 goes from here

Run the tutorials in Colab

01 Generate an SFT dataset from a PDF Colab LLM endpoint
02 Generate DPO preference pairs Colab LLM endpoint
03 Generate GRPO rollouts Colab LLM endpoint
04 Ingest multiple sources Colab no LLM
05 Clean and deduplicate a dataset Colab no LLM
06 Adaptive recovery Colab LLM endpoint
07 Adversarial generation Colab LLM endpoint
08 Data hygiene pipeline Colab no LLM
09 Filtered vs unfiltered fine-tuning Colab LLM endpoint + GPU

New to the library? Start with 05, then 04, then the generation notebooks. The tutorials index has full descriptions.


Cite

If you use CuratorKIT in your research, please cite the library and the relevant paper(s):

@software{curatorkit2026,
  author    = {Bhattacharjee, Soham and Sharma, Karun and Sankarapu, Vinay Kumar and Seth, Pratinav},
  title     = {CuratorKIT: Data Curation and Synthetic Data Generation for LLM Post-Training},
  year      = {2026},
  publisher = {Lexsi Labs},
  url       = {https://github.com/Lexsi-Labs/CuratorKIT}
}

@misc{bhattacharjee2026curatorkitdatacuration,
      title={CuratorKIT : Data Curation and Synthetic Data Generation for LLM Post-Training},
      author={Soham Bhattacharjee and Karun Sharma and Vinay Kumar Sankarapu and Pratinav Seth},
      year={2026},
      eprint={2606.21631},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.21631},
}

@misc{bhattacharjee2026provenancegroundedgatingadaptiverecovery,
      title={Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation},
      author={Soham Bhattacharjee and Karun Sharma and Vinay Kumar Sankarapu and Pratinav Seth},
      year={2026},
      eprint={2606.11127},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.11127},
}

CuratorKIT is built by Lexsi Labs alongside AlignTune, which consumes its exports natively: curate here, train there.