Hygiene

Secrets detection, PII pseudonymisation, and toxicity filtering. All require the hygiene extra.

curatorkit.hygiene

CuratorKIT hygiene — data safety components.

Three lightweight, infrastructure-reusing components that scrub training data before LLM generation begins:

- ToxicityGate: two-stage filter that uses a Detoxify classifier for clear cases and an LLM judge for borderline samples.
- SecretsGate: rejects samples containing credentials, API keys, or high-entropy secrets (yelp/detect-secrets).
- PIIPseudonymizer: replaces PII entities with consistent fake values per document (clinical NLP de-identification style).

All three plug into existing extension points (BaseGate / BaseNormalizer) and write structured notes to the provenance chain.

Install dependencies

pip install "curatorkit[hygiene]"

PIIPseudonymizer

PIIPseudonymizer(entity_types: list[str] | None = None, fields: list[str] | None = None, score_threshold: float = 0.7, faker_seed: int = 42, language: str = 'en', spacy_model: str = 'en_core_web_lg')

Bases: BaseNormalizer

Replace PII entities with consistent fake values (per-sample scope).

Task-aware: selects fields based on sample.task_type so preference pairs (chosen/rejected) and GRPO rollouts (responses) are always pseudonymized when relevant.
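
The task-aware selection and the explicit override can be pictured with a small lookup. This is a minimal sketch, not CuratorKIT's implementation: the task_type names and every field name other than chosen, rejected, and responses are hypothetical and only illustrate the rule that an explicit fields list replaces the task-based default.

```python
from __future__ import annotations

# Hypothetical mapping. Only "chosen", "rejected" and "responses" come from
# the description above; the task_type keys and other fields are assumptions.
TASK_FIELDS: dict[str, list[str]] = {
    "sft": ["instruction", "response"],
    "preference": ["prompt", "chosen", "rejected"],
    "grpo": ["prompt", "responses"],
}


def select_fields(task_type: str, fields: list[str] | None = None) -> list[str]:
    """Return the fields to process for one sample."""
    if fields is not None:
        # An explicit list replaces the task-aware selection entirely.
        return list(fields)
    # Otherwise fall back to the fields relevant to this task type.
    return list(TASK_FIELDS.get(task_type, []))
```

With this sketch, select_fields("preference") covers both sides of a preference pair, while select_fields("grpo", ["prompt"]) processes only the prompt because the explicit list wins.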

Parameters

entity_types : list[str] | None
    Presidio entity types to detect and replace. Use ENTITY_TYPES_CLINICAL for clinical corpora that need DATE_TIME and location replacement.
fields : list[str] | None
    DataSample fields to process. Defaults to all task-relevant fields. Explicitly setting this overrides the task-aware selection entirely.
score_threshold : float
    Presidio confidence threshold. Lower values make detection more aggressive.
faker_seed : int
    Seed for Faker (reproducible replacements across runs with the same seed).
language : str
    Presidio analysis language.
spacy_model : str
    spaCy model name: "en_core_web_lg" (default, ~800 MB, highest accuracy) or "en_core_web_sm" (~12 MB, adequate for standard PII types).
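
To illustrate what "consistent fake values (per-sample scope)" and a fixed seed imply, here is a standard-library sketch. It is an assumption-laden stand-in: a regular expression replaces Presidio's detection, random.Random replaces Faker, and pseudonymize_emails is a hypothetical helper rather than part of curatorkit.

```python
from __future__ import annotations

import random
import re

# Simplified email pattern, standing in for Presidio entity detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize_emails(text: str, seed: int = 42) -> str:
    """Replace each distinct email with a stable fake address within one text."""
    rng = random.Random(seed)      # seeded, so reruns produce the same fakes
    mapping: dict[str, str] = {}   # per-sample scope: rebuilt on every call

    def replace(match: re.Match) -> str:
        original = match.group(0)
        if original not in mapping:
            fake = f"user{rng.randrange(10_000):04d}@example.com"
            # Redraw on collision so distinct originals stay distinct.
            while fake in mapping.values():
                fake = f"user{rng.randrange(10_000):04d}@example.com"
            mapping[original] = fake
        return mapping[original]

    return EMAIL_RE.sub(replace, text)
```

Repeated mentions of the same address map to the same replacement inside one sample, which keeps references within the text coherent, and reusing the seed reproduces the output across runs.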

SecretsGate

SecretsGate(fields: list[str] | None = None, code_corpus_mode: bool = False, plugins: list[dict] | None = None)

Bases: BaseGate

Reject samples containing credentials, API keys, or high-entropy secrets.

Task-aware: selects fields based on sample.task_type so preference pairs (chosen/rejected) and GRPO rollouts (responses) are always scanned when relevant.

Parameters

fields : list[str] | None
    DataSample fields to scan. Defaults to all task-relevant fields. Explicitly setting this overrides the task-aware selection entirely.
code_corpus_mode : bool
    If True, enables KeywordDetector (entropy + keyword proximity scanning). Disable for prose corpora (legal, medical, academic) to avoid false positives on legitimate mentions of "key", "password", "secret", etc.
plugins : list[dict] | None
    Full detect-secrets plugin config. Overrides code_corpus_mode if set.
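
"High-entropy secrets" refers to tokens whose characters look close to random. The sketch below shows the general idea using Shannon entropy. It is not the detect-secrets implementation; looks_like_secret, its length cutoff, and its threshold are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(token: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not token:
        return 0.0
    total = len(token)
    counts = Counter(token)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token: str, min_length: int = 20, threshold: float = 4.0) -> bool:
    """Flag long tokens whose character distribution is close to random.

    min_length and threshold are assumed values chosen for this example.
    """
    return len(token) >= min_length and shannon_entropy(token) > threshold
```

The check is meant for single tokens such as quoted strings. Ordinary prose containing spaces can also score high per character, which is one reason keyword-based scanning is kept optional for prose corpora.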

ToxicityGate

ToxicityGate(classifier_pass_threshold: float = 0.1, classifier_reject_threshold: float = 0.5, llm: BaseLLM | None = None, llm_reject_threshold: float = 0.5, detoxify_model: str = 'unbiased', text_field: str = 'auto')

Bases: BaseGate

Two-stage toxicity filter.

Stage 1: Detoxify classifier (fast, deterministic, no API cost).
Stage 2: LLM judge for borderline samples only (optional).

Parameters

classifier_pass_threshold : float
    Max toxicity score below which a sample passes without LLM escalation. Default 0.1; raise to ~0.2 for academic/legal/medical corpora.
classifier_reject_threshold : float
    Max toxicity score at or above which a sample is rejected without consulting the LLM. Default 0.5. If llm=None, samples in the band [classifier_pass_threshold, classifier_reject_threshold) are also rejected, because there is no escalation path.
llm : BaseLLM | None
    LLM backend for borderline samples. If None, every sample scoring at or above classifier_pass_threshold is rejected.
llm_reject_threshold : float
    LLM toxicity score at or above which a borderline sample is rejected. Default 0.5.
detoxify_model : str
    "unbiased" (default, recommended), "original", or "multilingual".
text_field : str
    "auto" (default) scores the fields most relevant to the sample's task_type; alternatively, pass a specific DataSample field name.
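
The threshold semantics above reduce to a three-way decision on the stage 1 score. The following is a minimal sketch of that decision only, derived from the parameter descriptions; route and its return labels are hypothetical names, not the gate's actual code.

```python
def route(
    score: float,
    pass_threshold: float = 0.1,
    reject_threshold: float = 0.5,
    has_llm: bool = False,
) -> str:
    """Decide what happens to a sample given its stage 1 classifier score."""
    if score < pass_threshold:
        return "pass"        # clearly clean: no LLM call needed
    if score >= reject_threshold:
        return "reject"      # clearly toxic: no LLM call needed
    # Borderline band [pass_threshold, reject_threshold):
    # escalate to the LLM judge if one is configured, otherwise reject.
    return "escalate" if has_llm else "reject"
```

An escalated sample would then be rejected when the LLM judge's score is at or above llm_reject_threshold. Without an LLM, the gate behaves like a single-threshold filter at classifier_pass_threshold.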