Gates

Quality gates: pass samples through or reject them with a structured reason.

curatorkit.gates

DiversityGate

DiversityGate(embedding_model: str = 'sentence-transformers/all-MiniLM-L6-v2', similarity_threshold: float = 0.92, text_field: str = 'auto', coverage_field: str | None = None, batch_size: int = 64, device: str | None = None)

Bases: BaseGate

Reject samples that are semantically too similar to existing ones.

Parameters:

Name	Type	Description	Default
`embedding_model`	`str`	Sentence-transformers model name.	`'sentence-transformers/all-MiniLM-L6-v2'`
`similarity_threshold`	`float`	Samples whose cosine similarity to an existing sample exceeds this value are rejected as near-duplicates.	`0.92`
`text_field`	`str`	Which DataSample field to embed. "auto" picks a field based on task_type.	`'auto'`
`coverage_field`	`str \| None`	Metadata or DataSample field to check for category coverage gaps.	`None`
`batch_size`	`int`	Encoding batch size for the embedding model.	`64`
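
The threshold rule can be illustrated without loading an embedding model. The following is a minimal sketch, not the library's implementation: the helper names `cosine` and `filter_near_duplicates` are hypothetical, the vectors are toy stand-ins for sentence embeddings, and the greedy "compare against everything kept so far" strategy is an assumption about how "existing" samples are tracked.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_near_duplicates(embeddings, similarity_threshold=0.92):
    """Greedy filter: keep a vector unless it is too similar to one already kept.

    Hypothetical helper; returns (kept_indices, rejected_indices).
    """
    kept, rejected = [], []
    for idx, vec in enumerate(embeddings):
        # Reject when similarity to any kept vector is strictly above the threshold.
        if any(cosine(vec, embeddings[k]) > similarity_threshold for k in kept):
            rejected.append(idx)
        else:
            kept.append(idx)
    return kept, rejected


# Toy 2-D "embeddings": the second vector points almost the same way as the first.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept, rejected = filter_near_duplicates(vectors)
```

Lowering `similarity_threshold` makes the filter stricter, since more pairs count as near-duplicates; raising it toward 1.0 rejects only almost identical samples.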

HallucinationGate

HallucinationGate(llm: BaseLLM, threshold: float = 0.7, prompt_template: str | None = None, skip_if_no_context: bool = True, concurrency: int = 16)

Bases: BaseGate

Verify generated answers are grounded in their source text.

Parameters:

Name	Type	Description	Default
`llm`	`BaseLLM`	LLM backend for grounding judgement.	*required*
`threshold`	`float`	Minimum grounding score (0-1). Samples below this are rejected.	`0.7`
`prompt_template`	`str \| None`	Custom grounding evaluation prompt.	`None`
`skip_if_no_context`	`bool`	If True, samples without source context pass through. If False, samples without source context are rejected.	`True`
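
How `threshold` and `skip_if_no_context` interact can be sketched as a small decision function. This is an illustration under stated assumptions, not the gate's actual code: `grounding_decision` and its reason strings are hypothetical, the score is taken as already produced by the LLM judge, and a score exactly equal to the threshold is assumed to pass because the description only rejects samples below it.

```python
from typing import Optional, Tuple


def grounding_decision(
    context: Optional[str],
    score: Optional[float],
    threshold: float = 0.7,
    skip_if_no_context: bool = True,
) -> Tuple[bool, str]:
    """Return (passed, reason) for one sample. Hypothetical helper."""
    # Samples with no source context never reach the judge.
    if context is None or not context.strip():
        if skip_if_no_context:
            return True, "skipped_no_context"
        return False, "no_context"
    # Assumption: the judge's grounding score lies in the range 0-1.
    if score is None or score < threshold:
        return False, "below_threshold"
    return True, "grounded"


decision = grounding_decision("Paris is the capital of France.", 0.9)
```

With `skip_if_no_context=False`, a pipeline that mixes grounded and context-free samples would drop the latter, so that setting fits datasets where every sample is expected to carry source text.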

run_async `async`

run_async(samples: list[DataSample]) -> tuple[list[DataSample], list[RejectedSample]]

Asynchronous execution: calls agenerate() with semaphore-bounded concurrency.
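
Semaphore-bounded concurrency is a standard asyncio pattern, sketched below with the standard library only. The names `bounded_gather` and `fake_agenerate` are hypothetical; `fake_agenerate` stands in for an LLM backend's `agenerate()`, whose real signature is not shown in this reference.

```python
import asyncio

in_flight = 0  # calls currently running
peak = 0       # highest number of simultaneous calls observed


async def fake_agenerate(prompt: str) -> str:
    """Stand-in for an LLM call; records how many calls overlap."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate network latency
    in_flight -= 1
    return prompt.upper()


async def bounded_gather(items, worker, concurrency: int = 16):
    """Run worker(item) for every item, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with semaphore:  # waits here while `concurrency` calls are active
            return await worker(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in items))


results = asyncio.run(
    bounded_gather(["a", "b", "c", "d", "e"], fake_agenerate, concurrency=2)
)
```

All five tasks are scheduled immediately, but only two hold the semaphore at any moment, which is what keeps a large batch from exceeding a provider's rate limits.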

RewardGate

RewardGate(llm: BaseLLM, threshold: float = 0.7, dimensions: list[str] | None = None, prompt_template: str | None = None, store_score_in_label: bool = True, concurrency: int = 16)

Bases: BaseGate

Quality-score samples using an LLM judge and reject below threshold.

Parameters:

Name	Type	Description	Default
`llm`	`BaseLLM`	LLM backend for quality judgement.	*required*
`threshold`	`float`	Minimum quality score (0-1). Samples below this are rejected.	`0.7`
`dimensions`	`list[str] \| None`	Quality dimensions to evaluate. Defaults to the core UltraFeedback set.	`None`
`prompt_template`	`str \| None`	Custom reward evaluation prompt.	`None`
`store_score_in_label`	`bool`	If True, store the overall score in DataSample.label.	`True`
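
The effect of `threshold` and `store_score_in_label` can be sketched as follows. Several things here are assumptions rather than documented behaviour: the `Sample` dataclass is a hypothetical stand-in for DataSample, the overall score is taken to be the plain mean of the per-dimension scores, the score is stored whether or not the sample passes, and the dimension names are placeholders rather than the library's default set.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Sample:
    """Hypothetical stand-in for DataSample with only the fields used here."""
    text: str
    label: Optional[float] = None


def apply_reward(
    sample: Sample,
    dimension_scores: Dict[str, float],
    threshold: float = 0.7,
    store_score_in_label: bool = True,
) -> bool:
    """Return True if the sample passes. Assumes the overall score is the mean."""
    overall = sum(dimension_scores.values()) / len(dimension_scores)
    if store_score_in_label:
        sample.label = overall  # keep the score for downstream filtering or sorting
    return overall >= threshold


good = Sample("A clear, correct answer.")
good_passed = apply_reward(good, {"helpfulness": 0.9, "truthfulness": 0.7})

weak = Sample("A vague answer.")
weak_passed = apply_reward(weak, {"helpfulness": 0.5, "truthfulness": 0.6})
```

Storing the score in the label lets a later stage rank or re-threshold the accepted samples without calling the judge again.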

run_async `async`

run_async(samples: list[DataSample]) -> tuple[list[DataSample], list[RejectedSample]]

Asynchronous execution: calls agenerate() with semaphore-bounded concurrency.

SchemaGate

SchemaGate(required_fields: list[str] | None = None, min_tokens: int = 10, max_tokens: int = 2048, use_tiktoken: bool = False, enforce_task_types: list[str] | None = None)

Bases: BaseGate

Validate samples against field, token-length, and encoding constraints.

Parameters:

Name	Type	Description	Default
`required_fields`	`list[str] \| None`	Fields that must be non-empty. Default behaviour is auto-derived from the sample's task_type. Explicitly setting this overrides the automatic per-task-type check entirely.	`None`
`min_tokens`	`int`	Minimum token count for the primary text fields.	`10`
`max_tokens`	`int`	Maximum token count for the primary text fields.	`2048`
`use_tiktoken`	`bool`	Use tiktoken cl100k_base instead of whitespace tokenizer.	`False`
`enforce_task_types`	`list[str] \| None`	If non-empty, only samples with these task_type values pass. Useful for single-paradigm pipelines (e.g. pure DPO).	`None`
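
The checks in the table can be approximated with a few lines of plain Python. This is a sketch under assumptions, not SchemaGate itself: samples are modelled as dictionaries, `check_schema`, its `text_field` argument, and the reason strings are hypothetical, the token bounds are treated as inclusive, and the task type names are illustrative only.

```python
from typing import List, Optional, Tuple


def check_schema(
    sample: dict,
    required_fields: List[str],
    min_tokens: int = 10,
    max_tokens: int = 2048,
    enforce_task_types: Optional[List[str]] = None,
    text_field: str = "text",  # hypothetical: which field counts as primary text
) -> Tuple[bool, str]:
    """Return (passed, reason) using a whitespace tokenizer."""
    # Optional allow-list of task types.
    if enforce_task_types and sample.get("task_type") not in enforce_task_types:
        return False, "task_type_not_allowed"
    # Required fields must be present and non-empty after stripping whitespace.
    for name in required_fields:
        if not str(sample.get(name) or "").strip():
            return False, f"missing_field:{name}"
    # Whitespace token count, mirroring use_tiktoken=False.
    n_tokens = len(str(sample.get(text_field) or "").split())
    if n_tokens < min_tokens:
        return False, "too_short"
    if n_tokens > max_tokens:
        return False, "too_long"
    return True, "ok"


sample = {
    "task_type": "sft",
    "instruction": "Summarise the text",
    "text": "one two three four five six seven eight nine ten eleven",
}
outcome = check_schema(sample, required_fields=["instruction"])
```

Whitespace counts are only a rough proxy for model tokens. With `use_tiktoken=True` the count would presumably come from something like `tiktoken.get_encoding("cl100k_base").encode(text)`, which typically yields more tokens than whitespace splitting for the same text.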
