Normalizers¶
In-place transforms: cleaning, deduplication, sampling, truncation.
curatorkit.normalizers ¶
TextCleaner ¶
Bases: BaseNormalizer
Apply a configurable chain of text-cleaning transforms to each DataSample.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
transforms
|
dict[str, bool] | None
|
Dict of transform name -> bool. Missing keys default to True. Keys: strip_html, normalise_unicode, fix_encoding_artifacts, collapse_whitespace, remove_control_chars. |
None
|
fields
|
list[str] | None
|
DataSample fields to clean (default: instruction, input, output). |
None
|
ExactDeduplicator ¶
Bases: BaseNormalizer
Remove exact duplicates by SHA-256 - hash key is task-type-aware.
preference / implicit_preference → instruction + chosen + rejected grpo → instruction + all responses language_modeling → output everything else (SFT) → instruction + output
MinHashDeduplicator ¶
MinHashDeduplicator(threshold: float = 0.85, ngram: int = 3, num_perm: int = 128, seed: int = 42, bands: int | None = None, rows: int | None = None)
Bases: BaseNormalizer
Remove near-duplicate samples using MinHash + LSH banding.
Algorithm
- Tokenise each instruction into character n-grams (default n=3).
- Compute a k-dimensional MinHash signature per sample using k independent universal hash functions (default k=128).
- Split each signature into
bandsrows ofrowsints; bands × rows = k. The (bands, rows) pair is auto-tuned so the LSH s-curve's 0.5 inflection sits nearthreshold. - Two signatures collide in a band iff their
rows-int sub-vector is equal. We only verify Jaccard for collision candidates — turning the O(n²) brute-force into ~O(n) under typical thresholds. - If estimated Jaccard >= threshold, the later sample is a near-dup.
The threshold is a tunable config value — not a code constant. A threshold of 0.85 is a starting point, not a universal answer.
StratifiedSampler ¶
StratifiedSampler(category_field: str = 'source_dataset', target_distribution: dict[str, float] | None = None, seed: int = 42)
Bases: BaseNormalizer
Downsample over-represented categories; warn on under-represented ones.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
category_field
|
str
|
Key in DataSample.metadata (or top-level field) that holds the category label. Defaults to 'task_type' (top-level field). |
'source_dataset'
|
target_distribution
|
dict[str, float] | None
|
Mapping of category -> max fraction of total output. e.g. {'coding': 0.25, 'general': 0.40} Categories not listed are passed through without downsampling. |
None
|
seed
|
int
|
Random seed for reproducible sampling. |
42
|
MaxSamplesTruncator ¶
Bases: BaseNormalizer
Keep only the first max_samples samples, dropping the rest.
Useful for smoke tests and cost-bounded runs. Applied after readers (CLI pipelines) or after resampling (Curator).
run ¶
Return the first max_samples entries of samples.
__getattr__ ¶
Lazy import for EmbeddingDeduplicator (requires the [embedding] extra).