Quality Filtering

Quality gates run after generation (or after cleaning, for non-generation pipelines). Each gate produces (passed, rejected) — rejected samples are never silently dropped; they become RejectedSample objects with a structured rejection_reason string and land in rejected.jsonl.

Gates are optional and additive: configure any combination. The execution order of the post-generation gates is fixed (SchemaGate runs earlier, immediately after the readers):

[Generation] → HallucinationGate → RewardGate → DiversityGate → [Exporters]
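The (passed, rejected) contract and the fixed ordering can be illustrated with a minimal sketch. Everything here is hypothetical scaffolding rather than the library's implementation: `Rejected` stands in for RejectedSample, and the two toy checks stand in for real gates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rejected:
    """Stand-in for the library's RejectedSample (field names are assumed)."""
    sample: dict
    rejection_reason: str


# A check returns None to accept a sample, or a reason string to reject it.
Check = Callable[[dict], Optional[str]]


def run_gate(check: Check, samples: List[dict]) -> Tuple[List[dict], List[Rejected]]:
    """Split samples into (passed, rejected); nothing is silently dropped."""
    passed, rejected = [], []
    for sample in samples:
        reason = check(sample)
        if reason is None:
            passed.append(sample)
        else:
            rejected.append(Rejected(sample, reason))
    return passed, rejected


def run_gates(checks: List[Check], samples: List[dict]) -> Tuple[List[dict], List[Rejected]]:
    """Run checks in a fixed order; each gate only sees the previous gate's survivors."""
    all_rejected: List[Rejected] = []
    for check in checks:
        samples, rejected = run_gate(check, samples)
        all_rejected.extend(rejected)
    return samples, all_rejected


# Toy checks (hypothetical) standing in for real gates.
def too_short(sample: dict) -> Optional[str]:
    return "toy:too_short" if len(sample["output"]) < 5 else None


def contains_spam(sample: dict) -> Optional[str]:
    return "toy:contains_spam" if "spam" in sample["output"] else None


samples = [{"output": "ok"}, {"output": "a long answer"}, {"output": "spam spam spam"}]
passed, rejected = run_gates([too_short, contains_spam], samples)
```

Every input sample ends up in exactly one of the two lists, which is the property the rejected.jsonl file relies on.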

SchemaGate

Runs by default, immediately after all readers. Disable with schema_gate=False.

Validates field presence, token length, and encoding for each task_type:

task_type	Required fields
`preference`, `implicit_preference`	`chosen` + `rejected`
`grpo`	`instruction` + `responses` (non-empty list)
`language_modeling`	`output`
`prompt_only`	`instruction`
`source_chunk`	`input`
Default (SFT)	`instruction` + `output`

CuratorConfig(
    min_tokens    = 10,
    max_tokens    = 2048,
    use_tiktoken  = False,   # True for tiktoken cl100k token counts
    schema_gate   = True,
)

Rejection reasons: missing_field:{field}, below_min_tokens:{n}, above_max_tokens:{n}, encoding_error:null_byte_in_{field}
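A rough sketch of the checks described above, using the required-fields table as data. Several details are assumptions made for illustration: tokens are counted by whitespace splitting (roughly what use_tiktoken=False might imply), the count is summed over the required fields, and {n} in the reason string is taken to be the observed count. The function name `schema_check` is hypothetical.

```python
from typing import Optional

# Required fields per task_type, copied from the table above.
REQUIRED_FIELDS = {
    "preference": ["chosen", "rejected"],
    "implicit_preference": ["chosen", "rejected"],
    "grpo": ["instruction", "responses"],
    "language_modeling": ["output"],
    "prompt_only": ["instruction"],
    "source_chunk": ["input"],
}
DEFAULT_FIELDS = ["instruction", "output"]  # SFT


def schema_check(sample: dict, task_type: Optional[str] = None,
                 min_tokens: int = 10, max_tokens: int = 2048) -> Optional[str]:
    """Return a rejection reason string, or None if the sample passes."""
    fields = REQUIRED_FIELDS.get(task_type, DEFAULT_FIELDS)

    # Field presence: missing keys, empty strings, and empty lists all fail.
    for field in fields:
        if not sample.get(field):
            return f"missing_field:{field}"

    # Encoding: reject null bytes in any required field.
    texts = []
    for field in fields:
        value = sample[field]
        parts = value if isinstance(value, list) else [value]
        for part in parts:
            text = str(part)
            if "\x00" in text:
                return f"encoding_error:null_byte_in_{field}"
            texts.append(text)

    # Token length: whitespace split is an assumption, not the library's counter.
    n = sum(len(text.split()) for text in texts)
    if n < min_tokens:
        return f"below_min_tokens:{n}"
    if n > max_tokens:
        return f"above_max_tokens:{n}"
    return None


missing = schema_check({"instruction": "hi"})
short = schema_check({"instruction": "a b c", "output": "d e"})
empty_list = schema_check({"instruction": "q", "responses": []}, task_type="grpo")
null_byte = schema_check({"instruction": "a\x00b", "output": "x"})
ok = schema_check({"instruction": "one two three four five six",
                   "output": "seven eight nine ten eleven twelve"})
```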

HallucinationGate

Verifies that each generated answer is grounded in its source text using an LLM judge (CheckEval-style). Only fires when hallucination_threshold is set.

CuratorConfig(
    hallucination_threshold = 0.7,    # None = gate off
    judge_llm_model         = "openai/gpt-4o",   # recommended: different from generator
)

What it checks: The judge receives source_text (from sample.input) and answer (from sample.output or sample.chosen) and scores grounding 0.0–1.0.

Key behaviour:
- Samples with no input (no source context) pass through silently by default. To reject them instead, use skip_if_no_context: false in the YAML gate config or construct HallucinationGate(skip_if_no_context=False) directly; this parameter is not exposed in CuratorConfig.
- The judge always sees the exact source chunk that generated the sample, not a retrieved approximation.

Rejection reason: hallucination_contract_failed:{score:.2f}
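The decision logic can be sketched as follows. This is an illustration under stated assumptions, not the gate's actual code: `toy_judge` is a word-overlap stub standing in for the LLM judge, a sample is assumed to fail when its score is strictly below the threshold, and `NO_CONTEXT_REASON` is a placeholder because the real reason string for that case is not documented here.

```python
from typing import Callable, Optional

# Hypothetical placeholder; the actual reason string for this case is not given above.
NO_CONTEXT_REASON = "no_source_context"


def toy_judge(source_text: str, answer: str) -> float:
    """Stub judge: fraction of answer words that also appear in the source."""
    source_words = set(source_text.lower().split())
    answer_words = answer.lower().split()
    if not answer_words:
        return 0.0
    return sum(word in source_words for word in answer_words) / len(answer_words)


def hallucination_check(sample: dict, judge: Callable[[str, str], float],
                        threshold: float, skip_if_no_context: bool = True) -> Optional[str]:
    """Return a rejection reason, or None if the sample passes."""
    source_text = sample.get("input")
    answer = sample.get("output") or sample.get("chosen") or ""

    if not source_text:
        # No source context: pass silently by default, reject if configured to.
        return None if skip_if_no_context else NO_CONTEXT_REASON

    score = judge(source_text, answer)
    if score < threshold:  # assumed comparison direction
        return f"hallucination_contract_failed:{score:.2f}"
    return None


source = "the cat sat on the mat"
grounded = hallucination_check({"input": source, "output": "the cat sat"}, toy_judge, 0.7)
ungrounded = hallucination_check({"input": source, "output": "the dog barked loudly"}, toy_judge, 0.7)
no_context = hallucination_check({"output": "anything"}, toy_judge, 0.7)
strict = hallucination_check({"output": "anything"}, toy_judge, 0.7, skip_if_no_context=False)
```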

Same-model warning: If judge_llm_model is the same model as llm_model, the judge tends to score its own outputs leniently. Use a different model for the judge whenever possible.

RewardGate

Scores response quality across configurable dimensions using an LLM judge (UltraFeedback rubric). Only fires when reward_threshold is set.

CuratorConfig(
    reward_threshold   = 0.7,     # None = gate off
    reward_dimensions  = ["helpfulness", "honesty", "instruction_following", "depth"],
    judge_llm_model    = "openai/gpt-4o",
)

Available dimensions (all built-in, no custom strings):

Dimension	What it penalises
`helpfulness`	Unhelpful, irrelevant responses
`honesty`	False or misleading claims
`instruction_following`	Not addressing what was asked
`truthfulness`	Factual errors
`depth`	Shallow, vague answers lacking specifics
`creativity`	Formulaic, uncreative responses
`coherence`	Poor structure, hard to follow

The overall_score is the mean of all dimension scores. Samples where overall_score < threshold are rejected.

Custom rubrics: If the built-in dimensions don't fit your domain, use reward_prompt_template to write any rubric you want. See Customisation.

Rejection reason: below_reward_threshold:{score:.2f}
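As a small worked example of the scoring rule, the sketch below averages per-dimension scores and applies the threshold. The function name `reward_check` and the example scores are illustrative only; in the real gate the per-dimension scores come from the LLM judge.

```python
from statistics import mean
from typing import Dict, Optional


def reward_check(dimension_scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Reject when the mean of the dimension scores falls below the threshold."""
    overall_score = mean(dimension_scores.values())
    if overall_score < threshold:
        return f"below_reward_threshold:{overall_score:.2f}"
    return None


# One weak dimension drags the mean to 0.65, below the 0.7 threshold.
weak_depth = reward_check(
    {"helpfulness": 0.9, "honesty": 0.8, "instruction_following": 0.7, "depth": 0.2},
    threshold=0.7,
)

# A mean can also hide a weak dimension: this mean is 0.75, so the sample passes.
masked = reward_check(
    {"helpfulness": 0.9, "honesty": 0.9, "instruction_following": 0.9, "depth": 0.3},
    threshold=0.7,
)
```

Because the rule uses a plain mean, choosing fewer, more targeted dimensions makes each one weigh more heavily in the pass/fail decision.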

Dual-scoring for DPO preference pairs

When the sample has both chosen and rejected fields, RewardGate scores both:

chosen_score  >= threshold  → chosen is good enough
rejected_score < threshold  → rejected is noticeably worse

Both conditions must hold to pass.

Failure	Rejection reason	Meaning
`chosen_score < threshold`	`dpo_pair_failed:chosen_below_threshold:{score}`	Chosen response too weak
`rejected_score >= threshold`	`dpo_pair_failed:rejected_above_threshold:{score}`	Rejected response too good — insufficient quality contrast
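The two conditions and their failure reasons can be written as one small function. This is a sketch: the function name is hypothetical, the order in which the two failures are checked is an assumption (the table order is followed), and the two-decimal score formatting is borrowed from the other gates' reason strings.

```python
from typing import Optional


def dpo_pair_check(chosen_score: float, rejected_score: float,
                   threshold: float) -> Optional[str]:
    """Pass only if chosen >= threshold AND rejected < threshold."""
    if chosen_score < threshold:
        return f"dpo_pair_failed:chosen_below_threshold:{chosen_score:.2f}"
    if rejected_score >= threshold:
        return f"dpo_pair_failed:rejected_above_threshold:{rejected_score:.2f}"
    return None


good_pair = dpo_pair_check(chosen_score=0.85, rejected_score=0.40, threshold=0.7)
weak_chosen = dpo_pair_check(chosen_score=0.55, rejected_score=0.40, threshold=0.7)
low_contrast = dpo_pair_check(chosen_score=0.85, rejected_score=0.80, threshold=0.7)
```

Note that the threshold acts as a dividing line between the two responses: a pair passes only when the scores fall on opposite sides of it, regardless of how far apart they are.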

rejected_above_threshold is a generation problem, not a gate problem. The probe cannot recover it (it exits immediately with 0 LLM calls). Fix it by: raising reward_threshold, adding "depth" to reward_dimensions, using preference_mode="two_pass", or using a different judge model.

DiversityGate

Filters samples that are too semantically similar to already-accepted samples in the batch. Uses sentence-transformer embeddings + cosine similarity. Requires pip install "curatorkit[embedding]".

CuratorConfig(
    # None = gate off. Samples with similarity ABOVE this value are rejected,
    # so a lower threshold rejects more (stricter).
    diversity_threshold = 0.92,
    embedding_model     = "sentence-transformers/all-MiniLM-L6-v2",
    embedding_device    = None,   # "cuda" | "cpu" | None (auto-detect)
)

What text is embedded (automatic by task_type):
- preference → instruction + chosen
- grpo → instruction + first_response
- conversational → instruction + output
- Default → instruction + output

ANN backend: Uses FAISS if installed (the [embedding-faiss] extra) and falls back to a NumPy brute-force search otherwise.

Rejection reason: diversity_gate:too_similar:{similarity:.3f}
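The filtering rule can be sketched as a greedy pass over the batch: each sample is compared against the samples accepted so far and rejected if its highest cosine similarity exceeds the threshold. The sketch uses hand-written 2-D vectors in place of sentence-transformer embeddings and pure-Python brute force in place of FAISS or NumPy; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diversity_filter(embeddings: List[Sequence[float]],
                     threshold: float) -> Tuple[List[int], List[Tuple[int, str]]]:
    """Return (accepted indices, [(rejected index, reason), ...])."""
    accepted: List[int] = []
    rejected: List[Tuple[int, str]] = []
    for i, vec in enumerate(embeddings):
        # Compare only against samples that were already accepted.
        max_sim = max((cosine(vec, embeddings[j]) for j in accepted), default=0.0)
        if max_sim > threshold:
            rejected.append((i, f"diversity_gate:too_similar:{max_sim:.3f}"))
        else:
            accepted.append(i)
    return accepted, rejected


# Toy embeddings: the second vector is nearly parallel to the first.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
accepted, rejected = diversity_filter(toy_embeddings, threshold=0.92)
lenient_accepted, _ = diversity_filter(toy_embeddings, threshold=0.999)
```

Because the pass is greedy, the outcome depends on sample order: the first of two near-duplicates is kept and the later one is rejected.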

Cross-run embedding deduplication

Separate from DiversityGate — this persists an embedding index across multiple runs so you can deduplicate against previously generated data.

CuratorConfig(
    embedding_dedup           = True,
    embedding_index_dir       = "output/embedding_index",
    embedding_dedup_threshold = 0.92,
    embedding_reset_index     = False,   # True = start fresh
)
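To make the idea of a persistent index concrete, here is a toy sketch that stores vectors in a JSON file and checks each new vector against everything saved by earlier runs. The on-disk format, the file name `index.json`, and the function name are all hypothetical; the library's real index format is not described here.

```python
import json
import math
import os
import tempfile
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedup_against_index(vectors: List[List[float]], index_dir: str,
                        threshold: float, reset_index: bool = False) -> List[List[float]]:
    """Keep vectors that are not too similar to anything in the saved index."""
    os.makedirs(index_dir, exist_ok=True)
    path = os.path.join(index_dir, "index.json")  # hypothetical file name

    stored: List[List[float]] = []
    if not reset_index and os.path.exists(path):
        with open(path) as f:
            stored = json.load(f)

    kept = []
    for vec in vectors:
        if any(cosine(vec, old) > threshold for old in stored):
            continue  # near-duplicate of something from this or an earlier run
        stored.append(list(vec))
        kept.append(vec)

    with open(path, "w") as f:
        json.dump(stored, f)  # persist for the next run
    return kept


with tempfile.TemporaryDirectory() as index_dir:
    run1 = dedup_against_index([[1.0, 0.0], [0.0, 1.0]], index_dir, threshold=0.92)
    # Second run: the first vector nearly duplicates one saved by run 1.
    run2 = dedup_against_index([[1.0, 0.01], [1.0, 1.0]], index_dir, threshold=0.92)
    # Resetting the index forgets earlier runs.
    run3 = dedup_against_index([[1.0, 0.01]], index_dir, threshold=0.92, reset_index=True)
```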

Combining gates

HallucinationGate, RewardGate, and DiversityGate can all run together. Because they run in sequence, each sample is rejected by at most one gate: a sample that fails HallucinationGate never reaches RewardGate. Read per-gate counts in result.stage_counts:

result.stage_counts["HallucinationGate"]
# {"input_count": 2850, "output_count": 2210, "probe_recovered": 0, "rejected_count": 640}
result.stage_counts["RewardGate"]
# {"input_count": 2210, "output_count": 1890, "probe_recovered": 0, "rejected_count": 320}
result.stage_counts["DiversityGate"]
# {"input_count": 1890, "output_count": 1740, "probe_recovered": 0, "rejected_count": 150}

(probe_recovered is non-zero when the diagnostic probe is enabled and recovers samples inline.)
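The counts above are enough to compute per-gate rejection rates and the overall yield. The sketch below copies the example dictionaries into a plain dict rather than calling the library, so `stage_counts` here is just sample data and `summarize` is a hypothetical helper.

```python
# Example counts copied from the stage_counts output above.
stage_counts = {
    "HallucinationGate": {"input_count": 2850, "output_count": 2210,
                          "probe_recovered": 0, "rejected_count": 640},
    "RewardGate": {"input_count": 2210, "output_count": 1890,
                   "probe_recovered": 0, "rejected_count": 320},
    "DiversityGate": {"input_count": 1890, "output_count": 1740,
                      "probe_recovered": 0, "rejected_count": 150},
}


def summarize(counts: dict) -> list:
    """Return (gate, rejected_count, rejection_rate) rows in pipeline order."""
    rows = []
    for gate, c in counts.items():
        rate = c["rejected_count"] / c["input_count"] if c["input_count"] else 0.0
        rows.append((gate, c["rejected_count"], round(rate, 3)))
    return rows


rows = summarize(stage_counts)

# Overall yield: samples leaving the last gate over samples entering the first.
overall_yield = round(
    stage_counts["DiversityGate"]["output_count"]
    / stage_counts["HallucinationGate"]["input_count"],
    3,
)
```

In this example each gate's output_count equals the next gate's input_count, which is a quick sanity check that no samples went missing between stages.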

Rejection reason reference

Reason string	Gate	Fix
`missing_field:{field}`	Schema	Check field_mapping or generation task output
`below_min_tokens:{n}`	Schema	Lower `min_tokens` or filter short source chunks
`above_max_tokens:{n}`	Schema	Raise `max_tokens` or chunk source documents
`hallucination_contract_failed:{score}`	Hallucination	Lower threshold, use different judge, or enable probe
`below_reward_threshold:{score}`	Reward	Lower threshold, change dimensions, or enable refiner
`dpo_pair_failed:chosen_below_threshold:{score}`	Reward	Enable refiner; chosen response too weak
`dpo_pair_failed:rejected_above_threshold:{score}`	Reward	Generation contrast too small; fix in generation config
`diversity_gate:too_similar:{similarity}`	Diversity	Raise `diversity_threshold` or reduce `num_questions`
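Since every reason string in the table ends with a variable part after the last colon, rejections can be grouped by stripping that final segment. The sketch below tallies reasons from JSONL lines; it assumes each line of rejected.jsonl is a JSON object with a `rejection_reason` key, which is a guess at the file's schema, and it uses in-memory lines instead of reading a real file.

```python
import json
from collections import Counter
from typing import Iterable


def reason_key(reason: str) -> str:
    """Drop the variable suffix, e.g. 'below_min_tokens:5' -> 'below_min_tokens'."""
    head, sep, _ = reason.rpartition(":")
    return head if sep else reason


def tally_reasons(jsonl_lines: Iterable[str]) -> Counter:
    """Count rejections per reason family, skipping blank lines."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[reason_key(record["rejection_reason"])] += 1  # assumed key name
    return counts


# Illustrative records; with a real run, iterate over open("rejected.jsonl").
example_lines = [
    '{"rejection_reason": "missing_field:output"}',
    '{"rejection_reason": "below_min_tokens:5"}',
    '{"rejection_reason": "below_min_tokens:7"}',
    '{"rejection_reason": "dpo_pair_failed:rejected_above_threshold:0.80"}',
    '{"rejection_reason": "diversity_gate:too_similar:0.953"}',
]
counts = tally_reasons(example_lines)
```

A tally like this shows which row of the table above to consult first when too many samples are being rejected.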