Data Hygiene¶

Data hygiene components run during ingestion and cleaning — before any LLM call. This means:

Credentials and API keys never reach an external API.
PII in source documents is pseudonymized before a model generates continuations, so the output inherits clean entities rather than leaking real ones.
Toxic source material is discarded before you pay for generation.

Three components are available:

Component	Type	What it does
`SecretsGate`	Gate (rejects)	Rejects samples containing credentials, API keys, or high-entropy secrets
`PIIPseudonymizer`	Normalizer (modifies)	Replaces PII entities with consistent fake values (per-sample scope)
`ToxicityGate`	Gate (rejects)	Rejects toxic content in two stages: Stage 1 (local classifier), Stage 2 (optional LLM judge)

All three are task-aware: they automatically select the correct fields to scan or pseudonymize based on sample.task_type — checking chosen/rejected for preference data, responses for GRPO rollouts, and output for source chunks.

Install¶

pip install "curatorkit[hygiene]"

This installs: detoxify, detect-secrets, presidio-analyzer, presidio-anonymizer, spacy, and faker.

For PIIPseudonymizer, also download the spaCy model:

python -m spacy download en_core_web_lg   # default (~800 MB, highest accuracy)
# or for dev/CI:
python -m spacy download en_core_web_sm   # ~12 MB, adequate for standard PII types

Channel 1 — `CuratorConfig` Python API¶

Set flags directly on CuratorConfig. The hygiene steps are inserted automatically in the right order.

SecretsGate¶

from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset             = "data/raw.jsonl",
    secrets_gate        = True,
    # Enable KeywordDetector for code corpora (off by default — too noisy for prose)
    secrets_code_corpus_mode = False,
)).run()

PIIPseudonymizer¶

from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset             = "data/medical_notes.jsonl",
    pii_pseudonymize    = True,
    # pii_entity_types  = []       # empty = default types (no DATE_TIME)
    # pii_score_threshold = 0.7    # Presidio confidence threshold
    # pii_spacy_model   = "en_core_web_lg"
    # pii_faker_seed    = 42       # reproducible replacements
)).run()

For clinical or legal corpora where dates and locations are also PII:

from curatorkit.hygiene.pii import ENTITY_TYPES_CLINICAL

result = Curator(CuratorConfig(
    dataset          = "data/clinical_notes.jsonl",
    pii_pseudonymize = True,
    pii_entity_types = ENTITY_TYPES_CLINICAL,   # adds DATE_TIME, LOCATION, MEDICAL_LICENSE
    pii_spacy_model  = "en_core_web_lg",
)).run()

ToxicityGate¶

from curatorkit import Curator, CuratorConfig

result = Curator(CuratorConfig(
    dataset                              = "data/raw.jsonl",
    toxicity_gate                        = True,
    # Stage 1 (local classifier) thresholds:
    #   score < pass_threshold  → pass immediately (no LLM call)
    #   score > reject_threshold → reject immediately (no LLM call)
    #   in between               → Stage 2, LLM judge (only when toxicity_llm_judge=True)
    toxicity_classifier_pass_threshold   = 0.1,
    toxicity_classifier_reject_threshold = 0.5,
    toxicity_detoxify_model              = "unbiased",  # or "original" / "multilingual"
    # Enable LLM second-opinion for borderline samples:
    toxicity_llm_judge                   = False,
    toxicity_llm_reject_threshold        = 0.5,   # LLM judge rejects at or above this score
    llm_model                            = "openai/gpt-4o-mini",  # required when llm_judge=True
)).run()

All three together¶

result = Curator(CuratorConfig(
    dataset                              = "data/raw.jsonl",
    secrets_gate                         = True,
    pii_pseudonymize                     = True,
    toxicity_gate                        = True,
    toxicity_classifier_pass_threshold   = 0.15,  # raised for academic corpora
    toxicity_classifier_reject_threshold = 0.5,
    llm_model                            = "openai/gpt-4o-mini",
    generation_task                      = "qa",
)).run()

Execution order (fixed): SecretsGate → PIIPseudonymizer → ToxicityGate → [generation]

Channel 2 — YAML pipeline config (CLI)¶

Running a pipeline¶

curatorkit run pipeline.yaml
curatorkit run pipeline.yaml --dry-run   # validate config, print step plan, exit

SecretsGate in YAML¶

Add a gate with type: secrets to the gates list. It runs before generators.

name: hygiene_pipeline
version: "0.1"

readers:
  - type: jsonl
    path: data/raw.jsonl

gates:
  - type: secrets
    secrets_code_corpus_mode: false   # set true for datasets that include source code

generators:
  - type: qa
    num_questions: 3

exporters:
  - type: alpaca

PIIPseudonymizer in YAML¶

Add a normalizer with type: pii_pseudonymizer. It runs before generators, after dedup and text cleaning.

normalizers:
  - type: exact_dedup
  - type: text_cleaner
  - type: pii_pseudonymizer
    pii_score_threshold: 0.7
    pii_spacy_model: en_core_web_lg
    pii_faker_seed: 42
    # pii_entity_types: []   # empty = default types; add DATE_TIME, LOCATION for clinical corpora

For clinical corpora, list entity types explicitly:

normalizers:
  - type: pii_pseudonymizer
    pii_entity_types:
      - PERSON
      - EMAIL_ADDRESS
      - PHONE_NUMBER
      - US_SSN
      - CREDIT_CARD
      - IP_ADDRESS
      - DATE_TIME
      - MEDICAL_LICENSE
      - LOCATION

ToxicityGate in YAML¶

Add a gate with type: toxicity. Runs before generators.

gates:
  - type: schema
  - type: secrets
  - type: toxicity
    toxicity_classifier_pass_threshold: 0.1
    toxicity_classifier_reject_threshold: 0.5
    toxicity_detoxify_model: unbiased
    # toxicity_llm_model: openai/gpt-4o-mini   # enable LLM judge for borderline band

Full hygiene pipeline YAML example¶

name: full_hygiene_pipeline
version: "0.1"

llm:
  model: openai/gpt-4o-mini
  temperature: 0.7
  concurrency: 10

readers:
  - type: jsonl
    path: data/raw_instructions.jsonl

gates:
  - type: schema
    min_tokens: 10
    max_tokens: 2048
  - type: secrets
    secrets_code_corpus_mode: false
  - type: toxicity
    toxicity_classifier_pass_threshold: 0.1
    toxicity_classifier_reject_threshold: 0.5
    toxicity_detoxify_model: unbiased

normalizers:
  - type: exact_dedup
  - type: text_cleaner
  - type: pii_pseudonymizer
    pii_score_threshold: 0.7
    pii_spacy_model: en_core_web_sm   # use sm for faster CI runs

generators:
  - type: qa
    num_questions: 3

exporters:
  - type: alpaca
  - type: sharegpt

Channel 3 — Direct module import¶

For use in custom pipelines, scripts, or when you need to compose steps manually outside of Curator.

SecretsGate¶

from curatorkit.hygiene.secrets import SecretsGate
from curatorkit.schema import DataSample

gate = SecretsGate(
    code_corpus_mode=False,
    # fields=None  → task-aware auto-selection (recommended)
    # fields=["instruction", "output"]  → explicit override
)

samples: list[DataSample] = [...]
passed, rejected = gate.run(samples)

for r in rejected:
    print(r.rejection_reason)   # e.g. "secret_detected:AWS Access Key,PrivateKeyDetector"

PIIPseudonymizer¶

from curatorkit.hygiene.pii import PIIPseudonymizer, ENTITY_TYPES_CLINICAL

pseudonymizer = PIIPseudonymizer(
    entity_types=None,           # None = default types
    score_threshold=0.7,
    spacy_model="en_core_web_lg",
    faker_seed=42,
)

samples: list[DataSample] = [...]
samples = pseudonymizer.run(samples)  # modifies in-place, returns same list

Cross-field consistency is guaranteed within each sample: if "John Smith" appears in both instruction and output, both get the same fake name.

ToxicityGate¶

from curatorkit.hygiene.toxicity import ToxicityGate

# Classifier-only (no LLM)
gate = ToxicityGate(
    classifier_pass_threshold=0.1,
    classifier_reject_threshold=0.5,
    detoxify_model="unbiased",
)

# With LLM judge for borderline samples
from curatorkit.llm.litellm import LiteLLMBackend

llm = LiteLLMBackend(model="openai/gpt-4o-mini")
gate = ToxicityGate(
    classifier_pass_threshold=0.1,
    classifier_reject_threshold=0.5,
    llm=llm,
    llm_reject_threshold=0.5,
)

passed, rejected = gate.run(samples)

Composing in a custom pipeline¶

from curatorkit.pipeline import Pipeline
from curatorkit.hygiene.secrets import SecretsGate
from curatorkit.hygiene.pii import PIIPseudonymizer
from curatorkit.hygiene.toxicity import ToxicityGate
from curatorkit.connectors.jsonl import JSONLReader
from curatorkit.exporters.alpaca import AlpacaExporter
from pathlib import Path

steps = [
    JSONLReader("data/raw.jsonl"),
    SecretsGate(),
    PIIPseudonymizer(spacy_model="en_core_web_sm"),
    ToxicityGate(),
    AlpacaExporter(),
]

pipeline = Pipeline(steps, output_dir=Path("output/"))
result = pipeline.run()

Task awareness¶

All three components automatically select the right fields per task_type. You never need to tell them about DPO pairs or GRPO rollouts — they detect the task type from the sample.

`task_type`	Fields checked / pseudonymized
`preference`, `implicit_preference`	`instruction`, `input`, `chosen`, `rejected`
`grpo`	`instruction`, `input`, `responses` (each rollout scanned independently)
`language_modeling`, `source_chunk`	`output`
`prompt_only`	`instruction`, `input`
`unpaired_preference`	`instruction`, `input`, `output`
`conversational`, `instruction_following`	`instruction`, `input`, `output`
Unknown / None	All configured fields

Override the field list for any component by passing an explicit fields= argument. When an explicit list is provided, task-aware selection is disabled entirely.

# Only scan the output field, regardless of task type
gate = SecretsGate(fields=["output"])

Rejection reasons and provenance¶

SecretsGate¶

Rejected samples have rejection_reason = "secret_detected:{type_list}" where type_list is a sorted comma-separated list of detected secret types.

secret_detected:AWSKeyDetector,GitHubTokenDetector
secret_detected:Base64HighEntropyString
secret_detected:PrivateKeyDetector

The provenance record on each sample (passed or rejected) includes:

{
  "passed": false,
  "secret_type_counts": {"AWSKeyDetector": 1},
  "fields_scanned": ["instruction", "input", "output"],
  "total_findings": 1
}

PIIPseudonymizer¶

Provenance records log entity type counts — never the original or replaced values.

{
  "entities_replaced": {"PERSON": 3, "EMAIL_ADDRESS": 1},
  "fields_processed": ["instruction", "input", "output"],
  "total_replacements": 4
}

ToxicityGate¶

toxic_content:classifier:0.621          # rejected at Stage 1 (local classifier), score=0.621
toxic_content:llm_judge:0.710           # borderline at Stage 1, escalated to Stage 2 (LLM judge), rejected

Provenance on passing samples (the phase key records which stage decided — "classifier" or "llm_judge"):

{
  "passed": true,
  "max_score": 0.042,
  "worst_field": "instruction",
  "phase": "classifier"
}

Tuning guide¶

SecretsGate false positives¶

code_corpus_mode=False disables KeywordDetector by default. If you still see false positives in prose corpora, check which plugin is triggering using result.rejected[i].provenance_chain[-1].notes["secret_type_counts"]. High entropy strings that aren't secrets (base64 images, encoded payloads) can be addressed by raising the entropy thresholds:

SecretsGate(plugins=[
    {"name": "AWSKeyDetector"},
    {"name": "GitHubTokenDetector"},
    {"name": "PrivateKeyDetector"},
    {"name": "Base64HighEntropyString", "base64_limit": 5.5},  # raised from 4.5
    {"name": "HexHighEntropyString",    "hex_limit":    4.0},  # raised from 3.0
])

PIIPseudonymizer over-redaction¶

Lower score_threshold (e.g. 0.5) catches more PII but also mislabels common nouns as entities. Start at 0.7 and lower only if real PII is slipping through. Use en_core_web_lg over en_core_web_sm for higher precision.

ToxicityGate thresholds for academic corpora¶

Academic text discussing crime, medication, or social issues typically scores 0.1–0.25 on toxicity even when completely clean. If you see excessive LLM escalations, raise classifier_pass_threshold to 0.2:

CuratorConfig(
    toxicity_gate                      = True,
    toxicity_classifier_pass_threshold = 0.2,   # raised for academic/legal/medical corpora
    toxicity_classifier_reject_threshold = 0.6,
)

Use detoxify_model="multilingual" for non-English corpora. Use "unbiased" (default) over "original" for any corpus with legitimate discussion of sensitive topics — the unbiased model suppresses false positives.

Parameter reference¶

`CuratorConfig` hygiene fields¶

Field	Default	Description
`secrets_gate`	`False`	Enable SecretsGate
`secrets_code_corpus_mode`	`False`	Enable KeywordDetector (for code datasets)
`pii_pseudonymize`	`False`	Enable PIIPseudonymizer
`pii_entity_types`	`[]`	Presidio entity types; `[]` = default set (no DATE_TIME)
`pii_score_threshold`	`0.7`	Presidio detection confidence threshold
`pii_spacy_model`	`"en_core_web_lg"`	spaCy model name
`pii_faker_seed`	`42`	Faker seed for reproducible replacements
`toxicity_gate`	`False`	Enable ToxicityGate
`toxicity_classifier_pass_threshold`	`0.1`	Score below this → immediate pass
`toxicity_classifier_reject_threshold`	`0.5`	Score above this → immediate reject
`toxicity_detoxify_model`	`"unbiased"`	`"unbiased"` \| `"original"` \| `"multilingual"`
`toxicity_llm_judge`	`False`	Use LLM for borderline band (requires `llm_model`)
`toxicity_llm_reject_threshold`	`0.5`	LLM judge score at or above which a borderline sample is rejected (only when `toxicity_llm_judge=True`)

YAML `GateConfig` fields (`type: toxicity`, `type: secrets`)¶

Field	Default	Description
`toxicity_classifier_pass_threshold`	`0.1`	Classifier pass threshold
`toxicity_classifier_reject_threshold`	`0.5`	Classifier reject threshold
`toxicity_detoxify_model`	`"unbiased"`	Detoxify model variant
`toxicity_llm_model`	`null`	LLM model for Stage 2 (LLM judge); `null` = no judge
`secrets_code_corpus_mode`	`false`	Enable KeywordDetector
`secrets_fields`	`[]`	Fields to scan; `[]` = task-aware auto-selection

YAML `NormalizerConfig` fields (`type: pii_pseudonymizer`)¶

Field	Default	Description
`pii_entity_types`	`[]`	Presidio entity types; `[]` = default set
`pii_score_threshold`	`0.7`	Presidio confidence threshold
`pii_spacy_model`	`"en_core_web_lg"`	spaCy model name
`pii_faker_seed`	`42`	Faker seed
`pii_language`	`"en"`	Analysis language
`pii_fields`	`[]`	Fields to process; `[]` = task-aware auto-selection

This is the last guide. For the full parameter listing, see the Config reference.

Data Hygiene¶

Install¶

Channel 1 — CuratorConfig Python API¶

SecretsGate¶

PIIPseudonymizer¶

ToxicityGate¶

All three together¶

Channel 2 — YAML pipeline config (CLI)¶

Running a pipeline¶

SecretsGate in YAML¶

PIIPseudonymizer in YAML¶

ToxicityGate in YAML¶

Full hygiene pipeline YAML example¶

Channel 3 — Direct module import¶

SecretsGate¶

PIIPseudonymizer¶

ToxicityGate¶

Composing in a custom pipeline¶

Task awareness¶

Rejection reasons and provenance¶

SecretsGate¶

PIIPseudonymizer¶

ToxicityGate¶

Tuning guide¶

SecretsGate false positives¶

PIIPseudonymizer over-redaction¶

ToxicityGate thresholds for academic corpora¶

Parameter reference¶

CuratorConfig hygiene fields¶

YAML GateConfig fields (type: toxicity, type: secrets)¶

YAML NormalizerConfig fields (type: pii_pseudonymizer)¶

Channel 1 — `CuratorConfig` Python API¶

`CuratorConfig` hygiene fields¶

YAML `GateConfig` fields (`type: toxicity`, `type: secrets`)¶

YAML `NormalizerConfig` fields (`type: pii_pseudonymizer`)¶