Hygiene
Secrets detection, PII pseudonymisation, and toxicity filtering. All require the
hygiene extra.
curatorkit.hygiene
CuratorKIT hygiene — data safety components.
Three lightweight, infrastructure-reusing components that scrub training data before LLM generation begins:
ToxicityGate: two-stage filter. Detoxify classifier for clear cases, LLM judge for borderline samples.
SecretsGate: rejects samples containing credentials, API keys, or high-entropy secrets (yelp/detect-secrets).
PIIPseudonymizer: replaces PII entities with consistent fake values per document (clinical NLP de-identification style).
All three plug into existing extension points (BaseGate / BaseNormalizer) and write structured notes to the provenance chain.
Install dependencies
pip install "curatorkit[hygiene]"
PIIPseudonymizer
PIIPseudonymizer(entity_types: list[str] | None = None, fields: list[str] | None = None, score_threshold: float = 0.7, faker_seed: int = 42, language: str = 'en', spacy_model: str = 'en_core_web_lg')
Bases: BaseNormalizer
Replace PII entities with consistent fake values (per-sample scope).
Task-aware: selects fields based on sample.task_type so preference pairs (chosen/rejected) and GRPO rollouts (responses) are always pseudonymized when relevant.
Parameters
entity_types : list[str] | None
    Presidio entity types to detect and replace. Use ENTITY_TYPES_CLINICAL for clinical corpora that need DATE_TIME and location replacement.
fields : list[str] | None
    DataSample fields to process. Defaults to all task-relevant fields. Setting this explicitly overrides the task-aware selection entirely.
score_threshold : float
    Presidio confidence threshold. Lower values mean more aggressive detection.
faker_seed : int
    Seed for Faker (reproducible replacements across runs with the same seed).
language : str
    Presidio analysis language.
spacy_model : str
    spaCy model name. "en_core_web_lg" (default, ~800 MB, highest accuracy) or "en_core_web_sm" (~12 MB, adequate for standard PII types).
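The "consistent fake values" behaviour can be illustrated with a minimal sketch. It is not CuratorKIT's implementation: a toy regex stands in for Presidio, a small hard-coded pool stands in for Faker, and the names `pseudonymize`, `FAKE_NAMES`, and `NAME_PATTERN` are hypothetical. The point is only that one mapping is kept per sample, so a repeated entity always receives the same replacement, and that a fixed seed makes the output reproducible.

```python
import random
import re

# Hypothetical stand-ins for Presidio (detection) and Faker (replacement),
# so the sketch runs with the standard library only.
FAKE_NAMES = ["Alex Morgan", "Sam Rivera", "Jordan Lee", "Casey Kim"]
NAME_PATTERN = re.compile(r"\b(?:Alice Smith|Bob Jones)\b")  # toy detector


def pseudonymize(text: str, seed: int = 42) -> str:
    """Replace each detected name with a fake one, consistently within `text`."""
    rng = random.Random(seed)  # seeded, so reruns give the same output
    pool = FAKE_NAMES.copy()
    rng.shuffle(pool)
    mapping = {}  # built fresh per call: one mapping per sample

    def replace(match: re.Match) -> str:
        original = match.group(0)
        if original not in mapping:
            # First sighting: assign the next unused fake name from the pool.
            mapping[original] = pool[len(mapping) % len(pool)]
        return mapping[original]

    return NAME_PATTERN.sub(replace, text)


sample = "Alice Smith emailed Bob Jones. Later, Alice Smith called back."
result = pseudonymize(sample)
print(result)
```

Both mentions of the first name come out as the same fake name, so coreference within the sample survives the replacement.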
SecretsGate
SecretsGate(fields: list[str] | None = None, code_corpus_mode: bool = False, plugins: list[dict] | None = None)
Bases: BaseGate
Reject samples containing credentials, API keys, or high-entropy secrets.
Task-aware: selects fields based on sample.task_type so preference pairs (chosen/rejected) and GRPO rollouts (responses) are always scanned when relevant.
Parameters
fields : list[str] | None
    DataSample fields to scan. Defaults to all task-relevant fields. Setting this explicitly overrides the task-aware selection entirely.
code_corpus_mode : bool
    If True, enables KeywordDetector (entropy + keyword proximity scanning). Disable for prose corpora (legal, medical, academic) to avoid false positives on legitimate mentions of "key", "password", "secret", etc.
plugins : list[dict] | None
    Full detect-secrets plugin config. Overrides code_corpus_mode if set.
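To give a feel for what "high-entropy secrets" means, here is a minimal sketch of entropy-based flagging. It is written in the spirit of entropy detectors generally and is not the detect-secrets or CuratorKIT code; the helper names and the `limit` and `min_length` values are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of `s` in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def looks_high_entropy(token: str, limit: float = 4.5, min_length: int = 20) -> bool:
    # `limit` and `min_length` are illustrative values, not library defaults.
    return len(token) >= min_length and shannon_entropy(token) > limit


prose = "the password policy is described in section four"
token = "q8Zr2LxV0pYb7NcT4sWm9KdF1hGj3AeU"  # 32 distinct characters

print(any(looks_high_entropy(w) for w in prose.split()))  # ordinary words
print(looks_high_entropy(token))                          # random-looking token
```

Ordinary prose words are short and repetitive, so they stay under the limit even when they include a word like "password"; that is why keyword-based scanning, rather than entropy alone, is the part that tends to misfire on prose corpora.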
ToxicityGate
ToxicityGate(classifier_pass_threshold: float = 0.1, classifier_reject_threshold: float = 0.5, llm: BaseLLM | None = None, llm_reject_threshold: float = 0.5, detoxify_model: str = 'unbiased', text_field: str = 'auto')
Bases: BaseGate
Two-stage toxicity filter.
Stage 1: Detoxify classifier (fast, deterministic, no API cost).
Stage 2: LLM judge for borderline samples only (optional).
Parameters
classifier_pass_threshold : float
    Toxicity score below which a sample passes without LLM escalation. Default 0.1; raise to ~0.2 for academic/legal/medical corpora.
classifier_reject_threshold : float
    Toxicity score at or above which a sample is rejected without consulting the LLM. Default 0.5. If llm=None, the band [classifier_pass_threshold, classifier_reject_threshold) is also rejected (no escalation path).
llm : BaseLLM | None
    LLM backend for borderline samples. If None, samples in the borderline band are rejected, so every sample scoring at or above classifier_pass_threshold is rejected.
llm_reject_threshold : float
    LLM toxicity score at or above which a borderline sample is rejected.
detoxify_model : str
    "unbiased" (default, recommended), "original", or "multilingual".
text_field : str
    "auto" scores the fields most relevant to the task_type; or pass a specific DataSample field name.
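The threshold bands described above can be sketched as a small routing function. This is an illustration of the documented decision rule only, not the library's code: `route` and `llm_judge` are hypothetical names, and the LLM stage is modelled as a plain callable that returns a toxicity score.

```python
from typing import Callable, Optional


def route(
    classifier_score: float,
    pass_threshold: float = 0.1,
    reject_threshold: float = 0.5,
    llm_judge: Optional[Callable[[], float]] = None,  # hypothetical stand-in for the LLM stage
    llm_reject_threshold: float = 0.5,
) -> str:
    """Return "pass" or "reject" for one sample, following the documented bands."""
    if classifier_score < pass_threshold:
        return "pass"    # clearly clean: no escalation needed
    if classifier_score >= reject_threshold:
        return "reject"  # clearly toxic: no escalation needed
    if llm_judge is None:
        return "reject"  # borderline band with no escalation path
    # Borderline band: the LLM judge decides.
    return "reject" if llm_judge() >= llm_reject_threshold else "pass"


print(route(0.05))                        # below the pass threshold
print(route(0.30))                        # borderline, no LLM configured
print(route(0.30, llm_judge=lambda: 0.2)) # borderline, judged by the LLM
print(route(0.70))                        # at or above the reject threshold
```

Without an LLM, the gate effectively becomes a single-threshold filter at classifier_pass_threshold, which is worth keeping in mind when tuning that value for a strict or lenient corpus.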