Skip to content

Normalizers

In-place transforms: cleaning, deduplication, sampling, truncation.

curatorkit.normalizers

TextCleaner

TextCleaner(transforms: dict[str, bool] | None = None, fields: list[str] | None = None)

Bases: BaseNormalizer

Apply a configurable chain of text-cleaning transforms to each DataSample.

Parameters:

Name Type Description Default
transforms dict[str, bool] | None

Dict of transform name -> bool. Missing keys default to True. Keys: strip_html, normalise_unicode, fix_encoding_artifacts, collapse_whitespace, remove_control_chars.

None
fields list[str] | None

DataSample fields to clean (default: instruction, input, output).

None

ExactDeduplicator

Bases: BaseNormalizer

Remove exact duplicates by SHA-256 - hash key is task-type-aware.

preference / implicit_preference → instruction + chosen + rejected grpo → instruction + all responses language_modeling → output everything else (SFT) → instruction + output

MinHashDeduplicator

MinHashDeduplicator(threshold: float = 0.85, ngram: int = 3, num_perm: int = 128, seed: int = 42, bands: int | None = None, rows: int | None = None)

Bases: BaseNormalizer

Remove near-duplicate samples using MinHash + LSH banding.

Algorithm
  1. Tokenise each instruction into character n-grams (default n=3).
  2. Compute a k-dimensional MinHash signature per sample using k independent universal hash functions (default k=128).
  3. Split each signature into bands rows of rows ints; bands × rows = k. The (bands, rows) pair is auto-tuned so the LSH s-curve's 0.5 inflection sits near threshold.
  4. Two signatures collide in a band iff their rows-int sub-vector is equal. We only verify Jaccard for collision candidates — turning the O(n²) brute-force into ~O(n) under typical thresholds.
  5. If estimated Jaccard >= threshold, the later sample is a near-dup.

The threshold is a tunable config value — not a code constant. A threshold of 0.85 is a starting point, not a universal answer.

StratifiedSampler

StratifiedSampler(category_field: str = 'source_dataset', target_distribution: dict[str, float] | None = None, seed: int = 42)

Bases: BaseNormalizer

Downsample over-represented categories; warn on under-represented ones.

Parameters:

Name Type Description Default
category_field str

Key in DataSample.metadata (or top-level field) that holds the category label. Defaults to 'task_type' (top-level field).

'source_dataset'
target_distribution dict[str, float] | None

Mapping of category -> max fraction of total output. e.g. {'coding': 0.25, 'general': 0.40} Categories not listed are passed through without downsampling.

None
seed int

Random seed for reproducible sampling.

42

MaxSamplesTruncator

MaxSamplesTruncator(max_samples: int)

Bases: BaseNormalizer

Keep only the first max_samples samples, dropping the rest.

Useful for smoke tests and cost-bounded runs. Applied after readers (CLI pipelines) or after resampling (Curator).

run

run(samples: list[DataSample]) -> list[DataSample]

Return the first max_samples entries of samples.

__getattr__

__getattr__(name: str)

Lazy import for EmbeddingDeduplicator (requires the [embedding] extra).