Skip to content

Installation

CuratorKIT requires Python 3.11 or newer and runs on Linux, macOS, and Windows. The hygiene and pdf extras pull large model stacks (torch, MinerU) with their own platform notes; the core package and connectors are pure Python.

pip install "curatorkit[all]"

The all extra installs connectors, LLM generation, embedding, and the data hygiene gates. The core package alone covers cleaning and deduplication:

pip install curatorkit

Selecting extras

Install only what you need. Extras compose: pip install "curatorkit[generation,hf]".

Extra Adds Install when you need
hf datasets, huggingface_hub HuggingFace Hub datasets
parquet pyarrow Parquet files
connectors hf + parquet All file/Hub readers in one extra
tiktoken tiktoken Exact LLM token counts in the schema gate
generation litellm, tenacity, nest-asyncio Synthetic data generation with any LLM API
embedding sentence-transformers, numpy Diversity gate, cross-run dedup
embedding-faiss embedding + faiss-cpu Fast ANN for large dedup indexes
generation-full generation + embedding-faiss Generation with all gates
hygiene detect-secrets, presidio, detoxify, spacy, faker Secrets, PII, and toxicity gates
pdf mineru Layout-aware PDF parsing
all connectors + tiktoken + generation-full + hygiene The full pipeline (excludes pdf and trl)
docs, dev, trl site/tooling/integration-test deps Contributing

The pdf extra is excluded from all because it pulls a large model stack. It runs on CPU anywhere; for CUDA acceleration install a CUDA build of torch first. MinerU is licensed AGPL-3.0, so confirm that suits your use before installing.

From source

pip install "curatorkit[all] @ git+https://github.com/Lexsi-Labs/CuratorKIT.git"

Verify the install

curatorkit --version

Next: Quickstart