Normalizers¶

In-place transforms: cleaning, deduplication, sampling, truncation.

curatorkit.normalizers ¶

TextCleaner ¶

TextCleaner(transforms: dict[str, bool] | None = None, fields: list[str] | None = None)

Bases: BaseNormalizer

Apply a configurable chain of text-cleaning transforms to each DataSample.

Parameters:

Name	Type	Description	Default
`transforms`	`dict[str, bool] \| None`	Dict of transform name -> bool. Missing keys default to True. Keys: strip_html, normalise_unicode, fix_encoding_artifacts, collapse_whitespace, remove_control_chars.	`None`
`fields`	`list[str] \| None`	DataSample fields to clean (default: instruction, input, output).	`None`

ExactDeduplicator ¶

Bases: BaseNormalizer

Remove exact duplicates by SHA-256 - hash key is task-type-aware.

preference / implicit_preference → instruction + chosen + rejected grpo → instruction + all responses language_modeling → output everything else (SFT) → instruction + output

MinHashDeduplicator ¶

MinHashDeduplicator(threshold: float = 0.85, ngram: int = 3, num_perm: int = 128, seed: int = 42, bands: int | None = None, rows: int | None = None)

Bases: BaseNormalizer

Remove near-duplicate samples using MinHash + LSH banding.

Algorithm

Tokenise each instruction into character n-grams (default n=3).
Compute a k-dimensional MinHash signature per sample using k independent universal hash functions (default k=128).
Split each signature into bands rows of rows ints; bands × rows = k. The (bands, rows) pair is auto-tuned so the LSH s-curve's 0.5 inflection sits near threshold.
Two signatures collide in a band iff their rows-int sub-vector is equal. We only verify Jaccard for collision candidates — turning the O(n²) brute-force into ~O(n) under typical thresholds.
If estimated Jaccard >= threshold, the later sample is a near-dup.

The threshold is a tunable config value — not a code constant. A threshold of 0.85 is a starting point, not a universal answer.

StratifiedSampler ¶

StratifiedSampler(category_field: str = 'source_dataset', target_distribution: dict[str, float] | None = None, seed: int = 42)

Bases: BaseNormalizer

Downsample over-represented categories; warn on under-represented ones.

Parameters:

Name	Type	Description	Default
`category_field`	`str`	Key in DataSample.metadata (or top-level field) that holds the category label. Defaults to 'task_type' (top-level field).	`'source_dataset'`
`target_distribution`	`dict[str, float] \| None`	Mapping of category -> max fraction of total output. e.g. {'coding': 0.25, 'general': 0.40} Categories not listed are passed through without downsampling.	`None`
`seed`	`int`	Random seed for reproducible sampling.	`42`

MaxSamplesTruncator ¶

MaxSamplesTruncator(max_samples: int)

Bases: BaseNormalizer

Keep only the first max_samples samples, dropping the rest.

Useful for smoke tests and cost-bounded runs. Applied after readers (CLI pipelines) or after resampling (Curator).

run ¶

run(samples: list[DataSample]) -> list[DataSample]

Return the first max_samples entries of samples.

getattr ¶

__getattr__(name: str)

Lazy import for EmbeddingDeduplicator (requires the [embedding] extra).

Normalizers¶

curatorkit.normalizers ¶

TextCleaner ¶

ExactDeduplicator ¶

MinHashDeduplicator ¶

StratifiedSampler ¶

MaxSamplesTruncator ¶

run ¶

__getattr__ ¶

getattr ¶