Gates
Quality gates: pass samples through or reject them with a structured reason.
curatorkit.gates
DiversityGate
DiversityGate(embedding_model: str = 'sentence-transformers/all-MiniLM-L6-v2', similarity_threshold: float = 0.92, text_field: str = 'auto', coverage_field: str | None = None, batch_size: int = 64, device: str | None = None)
Bases: BaseGate
Reject samples that are semantically too similar to existing ones.
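Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `embedding_model` | `str` | Sentence-transformers model name. | `'sentence-transformers/all-MiniLM-L6-v2'` |
| `similarity_threshold` | `float` | Samples whose cosine similarity to an existing sample exceeds this value are rejected as near-duplicates. | `0.92` |
| `text_field` | `str` | DataSample field to embed. `"auto"` selects the field based on `task_type`. | `'auto'` |
| `coverage_field` | `str \| None` | Metadata or DataSample field to check for category coverage gaps. | `None` |
| `batch_size` | `int` | Encoding batch size for the embedding model. | `64` |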
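The near-duplicate rule behind `similarity_threshold` can be illustrated with a minimal sketch. It is not the library's implementation: the helper names `cosine_similarity` and `filter_near_duplicates` are hypothetical, embeddings are plain lists of floats, and the sketch assumes each sample is compared only against samples already kept.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_near_duplicates(embeddings, similarity_threshold=0.92):
    """Return one bool per embedding: True = keep, False = reject as near-duplicate.

    Assumption: a sample is rejected when its similarity to any previously
    kept sample is strictly above the threshold.
    """
    kept_indices = []
    decisions = []
    for index, vector in enumerate(embeddings):
        is_duplicate = any(
            cosine_similarity(vector, embeddings[kept]) > similarity_threshold
            for kept in kept_indices
        )
        decisions.append(not is_duplicate)
        if not is_duplicate:
            kept_indices.append(index)
    return decisions
```

With the vectors `[1.0, 0.0]`, `[0.99, 0.05]`, and `[0.0, 1.0]`, the second is nearly parallel to the first and is rejected at the default threshold, while the third is orthogonal and kept.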
HallucinationGate
HallucinationGate(llm: BaseLLM, threshold: float = 0.7, prompt_template: str | None = None, skip_if_no_context: bool = True, concurrency: int = 16)
Bases: BaseGate
Verify generated answers are grounded in their source text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm` | `BaseLLM` | LLM backend for grounding judgement. | required |
| `threshold` | `float` | Minimum grounding score (0-1). Samples below this are rejected. | `0.7` |
| `prompt_template` | `str \| None` | Custom grounding evaluation prompt. | `None` |
| `skip_if_no_context` | `bool` | If True, samples without source context pass through. If False, samples without source context are rejected. | `True` |
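How `threshold` and `skip_if_no_context` interact can be sketched as a small decision function. This is an illustration under stated assumptions, not the gate's actual code: `GateDecision` and `grounding_decision` are hypothetical names, and the grounding score is assumed to have been produced already by the LLM judge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateDecision:
    """Hypothetical result type: pass/reject plus a structured reason."""
    passed: bool
    reason: Optional[str] = None


def grounding_decision(score, has_context, threshold=0.7, skip_if_no_context=True):
    """Apply the documented rules to one sample.

    score: grounding score in [0, 1] from the judge (ignored without context).
    has_context: whether the sample carries source text to ground against.
    """
    if not has_context:
        if skip_if_no_context:
            # No source text to check against: let the sample through.
            return GateDecision(passed=True)
        return GateDecision(passed=False, reason="no source context")
    if score < threshold:
        return GateDecision(passed=False, reason=f"grounding score {score:.2f} < {threshold:.2f}")
    return GateDecision(passed=True)
```

A sample scoring exactly at the threshold passes in this sketch, since only scores below it are rejected.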
run_async (async)
Asynchronous execution: uses agenerate() with semaphore-bounded concurrency.
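Semaphore-bounded concurrency is a standard asyncio pattern, and a minimal sketch of it follows. The names `run_bounded` and `FakeJudge` are hypothetical; `FakeJudge.agenerate` only stands in for an LLM backend's `agenerate()` and records how many calls are in flight so the bound can be observed.

```python
import asyncio


async def run_bounded(items, worker, concurrency=16):
    """Run worker(item) for every item with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


class FakeJudge:
    """Stand-in for an LLM backend; tracks peak concurrent calls."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def agenerate(self, prompt):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulate network latency
        self.in_flight -= 1
        return f"judged: {prompt}"
```

Calling `asyncio.run(run_bounded(prompts, judge.agenerate, concurrency=2))` processes every prompt while `judge.peak` never exceeds 2.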
RewardGate
RewardGate(llm: BaseLLM, threshold: float = 0.7, dimensions: list[str] | None = None, prompt_template: str | None = None, store_score_in_label: bool = True, concurrency: int = 16)
Bases: BaseGate
Score sample quality with an LLM judge and reject samples that fall below the threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `llm` | `BaseLLM` | LLM backend for quality judgement. | required |
| `threshold` | `float` | Minimum quality score (0-1). Samples below this are rejected. | `0.7` |
| `dimensions` | `list[str] \| None` | Quality dimensions to evaluate. Defaults to the core UltraFeedback set. | `None` |
| `prompt_template` | `str \| None` | Custom reward evaluation prompt. | `None` |
| `store_score_in_label` | `bool` | If True, store the overall score in DataSample.label. | `True` |
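A rough sketch of how per-dimension scores might turn into a pass/reject decision is given below. Several things here are assumptions rather than documented behaviour: the overall score is taken as the unweighted mean of the dimension scores, the score is stored whether or not the sample passes, the sample is a plain dict rather than a DataSample, and `apply_reward_threshold` is a hypothetical helper name.

```python
from statistics import mean


def apply_reward_threshold(sample, dimension_scores, threshold=0.7, store_score_in_label=True):
    """Return True if the sample passes the quality threshold.

    sample: plain dict standing in for a DataSample (assumption).
    dimension_scores: mapping of dimension name -> score in [0, 1].
    """
    # Assumption: the overall score is the unweighted mean across dimensions.
    overall = mean(dimension_scores.values())
    if store_score_in_label:
        # Mirrors the documented option of keeping the score on the sample.
        sample["label"] = overall
    return overall >= threshold
```

Keeping the score on the sample lets downstream steps sort or re-threshold without calling the judge again.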
run_async
async
¶
Async execution — uses agenerate() with semaphore-bounded concurrency.
SchemaGate
SchemaGate(required_fields: list[str] | None = None, min_tokens: int = 10, max_tokens: int = 2048, use_tiktoken: bool = False, enforce_task_types: list[str] | None = None)
Bases: BaseGate
Validate samples against field, token-length, and encoding constraints.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `required_fields` | `list[str] \| None` | Fields that must be non-empty. Default behaviour is auto-derived from the sample's task_type. Explicitly setting this overrides the automatic per-task-type check entirely. | `None` |
| `min_tokens` | `int` | Minimum token count for the primary text fields. | `10` |
| `max_tokens` | `int` | Maximum token count for the primary text fields. | `2048` |
| `use_tiktoken` | `bool` | Use tiktoken cl100k_base instead of whitespace tokenizer. | `False` |
| `enforce_task_types` | `list[str] \| None` | If non-empty, only samples with these task_type values pass. Useful for single-paradigm pipelines (e.g. pure DPO). | `None` |
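The checks these parameters describe can be sketched as follows. This is an illustration, not the gate's implementation: `count_tokens` and `check_schema` are hypothetical helpers, the sample is a plain dict, and the `text_field` argument is an assumption standing in for the "primary text fields", whose selection the reference does not spell out. The tiktoken branch needs the optional `tiktoken` package.

```python
def count_tokens(text, use_tiktoken=False):
    """Count tokens by whitespace split, or with tiktoken's cl100k_base encoding."""
    if use_tiktoken:
        import tiktoken  # optional third-party dependency, imported lazily
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    return len(text.split())


def check_schema(sample, required_fields, text_field, min_tokens=10,
                 max_tokens=2048, use_tiktoken=False, enforce_task_types=None):
    """Return (passed, reason); reason is None when the sample passes."""
    if enforce_task_types and sample.get("task_type") not in enforce_task_types:
        return False, f"task_type {sample.get('task_type')!r} not allowed"
    for field in required_fields:
        value = sample.get(field)
        # A field counts as empty if it is missing, None, or whitespace only.
        if value is None or not str(value).strip():
            return False, f"required field {field!r} is empty"
    token_count = count_tokens(str(sample.get(text_field, "")), use_tiktoken)
    if token_count < min_tokens:
        return False, f"{token_count} tokens < min_tokens={min_tokens}"
    if token_count > max_tokens:
        return False, f"{token_count} tokens > max_tokens={max_tokens}"
    return True, None
```

Returning the reason alongside the verdict matches the page's framing of gates that reject samples with a structured reason.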