Skip to content

CuratorKIT Architecture

DataSample contract

DataSample is the canonical unit of data moving through the pipeline.

Field Type Default Notes
id str auto UUID4 Stable identifier
source_uri str required URI of the source document or file
instruction str "" Prompt / instruction (SFT, preference prompt)
input str "" Optional Alpaca-style context; also stores source chunk text for QA-generated samples
output str "" Target response (SFT); full text (pretrain)
chosen str "" Preferred completion (DPO preference)
rejected str "" Dispreferred completion (DPO preference)
label float \| None None Quality label (unpaired preference / reward)
responses list[str] [] GRPO group rollouts
reward_scores list[float] [] Paired with responses
task_type str "instruction_following" See vocabulary below
metadata dict {} Unmapped columns and step-specific data
provenance_chain list[ProvenanceRecord] [] Append-only

task_type vocabulary

Value Active fields Typical use
instruction_following instruction, input, output Alpaca SFT
conversational instruction, output, metadata["turns"] ShareGPT / ChatML SFT
preference instruction, chosen, rejected DPO with explicit prompt
implicit_preference chosen, rejected (instruction = extracted prefix) DPO implicit prompt
unpaired_preference instruction, output, label Reward model training
grpo instruction, responses, reward_scores GRPO rollouts
prompt_only instruction PPO prompt collection
language_modeling output Continued pre-training; PDF chunk mode
source_chunk input Raw corpus chunks awaiting generation

RejectedSample contract

RejectedSample extends DataSample and is emitted by every step that drops a sample. No step silently discards — every rejection produces a RejectedSample with a structured reason.

Field Type Notes
rejection_reason str Structured string, e.g. "hallucination_contract_failed:0.42"
rejecting_step str Class name of the step that rejected the sample
diagnosis FailureDiagnosis \| None Populated by DiagnosticProbe when enable_diagnostic_probe=True; None otherwise

The diagnosis field is serialised to rejected.jsonl via an overridden model_dump()FailureDiagnosis is a plain dataclass, not a Pydantic model, so the override calls .to_dict() before serialisation.


Plugin interface guide

All pipeline components implement one of four ABCs in curatorkit/interfaces.py:

BaseReader     → read() -> tuple[list[DataSample], list[RejectedSample]]
BaseGate       → run(samples) -> tuple[list[DataSample], list[RejectedSample]]
BaseNormalizer → run(samples) -> list[DataSample]
BaseExporter   → export(samples, output_dir) -> None

Note: BaseReader.read() returns a tuple identical to BaseGate.run(). Reader-level parse failures are first-class RejectedSample objects.


Connector layer

All connectors inherit from BaseConnector in curatorkit/connectors/base.py. Subclasses only implement _iter_rows() which yields (line_no, raw_dict) tuples. The base class owns the full pipeline for each row:

raw_dict
  → preprocessing_fn   (optional user callable — structural normalisation)
  → field_mapping      (flat key renames; dot notation for nested paths)
  → FormatDetector     (3-layer detection, committed from first N rows)
  → DataSample construction (format-specific, with role normalisation)
  → RejectedSample     (for every failure — no silent drops)

Supported connector types

Type Class File Optional dep
jsonl JSONLReader connectors/jsonl.py none
json JSONReader connectors/json_reader.py none
csv CSVReader connectors/csv_reader.py none
parquet ParquetReader connectors/parquet_reader.py pyarrow
huggingface HuggingFaceReader connectors/huggingface.py datasets
pdf PDFReader connectors/pdf.py mineru (install extra: pdf)

PDFReader

PDFReader chunks PDF documents and can operate in two modes controlled by output_mode:

output_mode Behaviour
"chunk" Emits one DataSample per text chunk (output = chunk text)
"qa" / "preference" / "grpo" / "multiturn" Internally runs an LLM generation pass and emits generated samples directly

Chunks use a heading-aware strategy with configurable chunk_max_tokens (default 512) and chunk_overlap_tokens (default 50). Each chunk sample carries metadata["chunk_index"], metadata["page"], metadata["parent_heading"], and metadata["source_file"] — these are forwarded to generated samples so chunk-level provenance survives generation.


Format detection system

curatorkit/detection/detector.pyFormatDetector

Layer 1 — column-set candidate generation

Uses semantic equivalence classes rather than exact string matches: - INSTRUCTION_COLS: instruction, prompt, query, question, input, ... - OUTPUT_COLS: output, response, completion, answer, ... - CHOSEN_COLS: chosen, preferred, accepted, response_a, ... - REJECTED_COLS: rejected, refused, dispreferred, response_b, ...

Generates all matching candidates in priority order. Does not stop at first match.

Layer 2 — value-type validation

Validates actual cell values against the expected type for each candidate format. A sharegpt candidate requires the conversations column value to be list[dict], not a flat string. Candidates that fail layer 2 are dropped; the next candidate is tried.

Layer 3 — role alias normalisation

Applied during DataSample construction after format is committed. Normalises from/valuerole/content and expands role aliases: - human, user, input"user" - gpt, assistant, model, output"assistant" - system"system"

Detection result confidence

Level Meaning
HIGH Exact or near-exact column-set match, layer 2 passed
MEDIUM Partial match or layer 2 passed without sample rows
LOW Best guess only — pipeline warns, emits provenance note
UNKNOWN No candidate survived — rows become RejectedSample

SchemaGate

SchemaGate is task-type-aware.

Default field validation per task_type:

task_type Fields checked
SFT family instruction + output non-empty
preference / implicit_preference chosen + rejected non-empty
grpo instruction + responses (non-empty list)
language_modeling output non-empty
prompt_only instruction non-empty; output may be empty
source_chunk input non-empty; instruction/output generated later

Token counting is also task-type-aware: - SFT: instruction + output - Preference: instruction + the longer of chosen / rejected - GRPO: instruction + longest response - Pretrain: output - Source chunk: input


LLM abstraction layer

curatorkit/llm/ provides a provider-agnostic LLM interface.

BaseLLM (ABC)
  .generate(messages, **kwargs) -> LLMResponse     # sync
  .agenerate(messages, **kwargs) -> LLMResponse    # async
  .config_hash() -> str                            # stable hash for provenance
  .temperature: float                              # mutable default; callers can override per call

LiteLLMBackend   → routes to OpenAI, Anthropic, Gemini, Mistral, vLLM, etc.
OllamaBackend    → routes to local Ollama server

LLMResponse carries .text, .model, .usage (token counts), and .to_provenance_dict() which serialises model name, token usage, and finish reason into the ProvenanceRecord.notes dict.


Generation tasks

curatorkit/generators/ — all tasks extend BaseGenerationTask(BaseNormalizer).

BaseGenerationTask
  ._build_messages(sample) -> list[dict]   # abstract — build the LLM prompt
  ._parse_response(sample, response) -> list[DataSample]  # abstract — parse output
  .run(samples) -> list[DataSample]        # sync, one LLM call per sample
  .run_async(samples) -> list[DataSample]  # async, concurrency-controlled
  .flush_rejected() -> list[RejectedSample]  # return and clear accumulated failures

Parse failure handling: if _parse_response() returns [] (parse failed), the source sample is emitted as a RejectedSample with rejection_reason="generation_parse_failed:{task_name}". No silent drops.

Available generation tasks

Task Class Output task_type Notes
QA generation QAGenerationTask instruction_following Stores source chunk in sample.input; forwards chunk_index, page, parent_heading to metadata
Preference pairs PreferenceGenerationTask preference mode="single_call" (1 LLM call) or mode="two_pass" (2 calls, higher quality contrast)
GRPO rollouts GRPORolloutTask grpo Generates N responses per prompt; optional reward scoring via separate LLM call
Multi-turn MultiTurnTask conversational Generates N turn pairs in a simulated dialogue
Evol-Instruct EvolInstructTask instruction_following Rewrites instructions at increasing complexity; optionally generates answers
Chain-of-thought ChainOfThoughtTask instruction_following mode="generate" (new CoT) or mode="wrap" (wrap existing answer)
Adversarial preference AdversarialPreferenceTask preference Faithful chosen + adversarially corrupted rejected; corruption controlled by injection_rate / injection_types
Adversarial QA AdversarialQAGenerationTask instruction_following QA pairs with a controlled fraction of injected failure modes; marks metadata["injected_failure"] and metadata["injection_type"]

Each generated sample carries a ProvenanceRecord with the LLM's model name, token usage, prompt hash, and source sample ID.


Quality gates

Quality gates sit after the generation step and filter on content rather than structure.

HallucinationGate

curatorkit/gates/hallucination.py

Uses an LLM judge to verify that each generated answer is grounded in its source chunk. The source is read from sample.input — the exact chunk text stored by QAGenerationTask. This is provenance-exact: the judge always sees the same chunk that generated the sample, not a retrieved approximation.

Scoring prompt asks the judge to rate grounding on a 0–1 scale. Samples below threshold (default 0.7) become RejectedSample with rejection_reason="hallucination_contract_failed:{score:.2f}". Samples with no source context pass through silently by default (skip_if_no_context=True); with skip_if_no_context=False they are rejected with rejection_reason="hallucination_gate:no_source_context".

The ProvenanceRecord includes grounding_score, unsupported_claims, and the judge verdict — the DiagnosticProbe reads these for score-conditioned routing and fallback classification.

Probe attachment point: when enable_diagnostic_probe=True, Curator._build_steps() sets gate.probe = DiagnosticProbe(...) on both the HallucinationGate and the RewardGate instances. The pipeline's gate block checks getattr(step, "probe", None) and calls probe.diagnose_batch(rejected) on each gate's rejections — see the gate block below.

RewardGate

curatorkit/gates/reward.py

Uses an LLM judge to score sample quality across configurable dimensions (default: helpfulness, honesty, instruction_following). Scores are averaged; samples below threshold are rejected. The judge does not receive the source chunk — this is purely instruction-following quality, not faithfulness.

DiversityGate

curatorkit/gates/diversity.py

Embedding-based near-duplicate filter. Uses sentence-transformers to encode each sample and reject those with cosine similarity above similarity_threshold against already-accepted samples in the current batch.


Cross-run embedding deduplication

curatorkit/normalizers/embedding_dedup.pyEmbeddingDeduplicator

Extends within-batch deduplication with a persistent embedding index that survives across pipeline runs. The index is stored as embeddings.npy + metadata.json in embedding_index_dir.

  • Cross-run dedup: new samples are checked against the full index before being checked against each other.
  • Within-run dedup: samples in the current batch that pass the index check are still deduplicated against each other.
  • FAISS-optional: uses FAISS for ANN if installed (pip install "curatorkit[embedding-faiss]"); falls back to NumPy brute-force otherwise.
  • The text_field="auto" mode picks the embedding target based on task_type (e.g. chosen for preference, instruction + output for SFT).

Pipeline orchestration

curatorkit/pipeline.pyPipeline

Pipeline(steps, output_dir, diagnostics=None)
  .run()        -> PipelineResult    # sync
  .run_async()  -> PipelineResult    # async

The runner dispatches on step type. Generation tasks (BaseGenerationTask subclasses) are BaseNormalizer instances with an additional run_async() and flush_rejected(). The pipeline calls flush_rejected() after each generation step and merges the failures into all_rejected.

Gate block — with diagnostic probe

Both run() and run_async() contain identical gate blocks. When a gate has a probe attribute attached, the block diagnoses every rejection in one concurrent batch and merges probe-recovered samples back into the stream:

passed, rejected = gate.run(samples)

probe = getattr(step, "probe", None)
probe_recovered = []
if probe is not None and rejected:
    diagnoses = probe.diagnose_batch(rejected)        # ← concurrent inline probe
    for r, diag in zip(rejected, diagnoses):
        r.diagnosis = diag
        if self._diagnostics is not None:
            self._diagnostics.record(r)
        if diag.recovered_sample is not None:
            probe_recovered.append(diag.recovered_sample)

all_rejected.extend(rejected)
samples = passed + probe_recovered      # recovered samples flow into the next step

Pipelines without a probe are unaffected — getattr(step, "probe", None) returns None and the block is skipped. Each gate's stage_counts entry records the number of probe-recovered samples.

PipelineResult

@dataclass
class PipelineResult:
    passed:             list[DataSample]
    rejected:           list[RejectedSample]
    stage_counts:       dict[str, dict[str, int]]
    wall_clock_seconds: float
    diagnostics:        PipelineDiagnostics | None   # None when probe is inactive

stage_counts has one entry per step: {input_count, output_count, probe_recovered, rejected_count} for gates; {output_count, rejected_count} for readers; {input_count, output_count} for normalizers (generation tasks add rejected_count); {exported_count} for exporters.


CuratorConfig and Curator entry point

curatorkit/curator.py

CuratorConfig is a flat @dataclass (not Pydantic). All pipeline behaviour is driven by its fields — no nested sub-objects.

Key config field groups

Group Fields
Source dataset, split, subset, streaming, max_samples
LLM llm_model, llm_temperature, llm_max_tokens, llm_api_key, llm_api_base, llm_concurrency
Generation generation_task, num_questions, num_responses, num_turns, num_evolutions, difficulty, preference_mode, cot_mode
Quality gates hallucination_threshold, reward_threshold, reward_dimensions, diversity_threshold
Cross-run dedup embedding_dedup, embedding_index_dir, embedding_dedup_threshold, embedding_model
PDF pdf_output_mode, pdf_extract_tables
Adaptive recovery enable_diagnostic_probe, probe_temperatures, probe_score_split, probe_generator_model, probe_extra_templates, enable_reward_refiner
Export output_dir, export_formats

apply_patch()

CuratorConfig.apply_patch(patch: dict) -> CuratorConfig returns a shallow copy with only the patched fields changed. The original config is never mutated. Used by AdaptivePass2Runner (an offline utility — see the diagnostic module section) to apply config patches when re-running rejected samples:

Patch key Config field modified
llm_temperature config.llm_temperature
prompt_template config.llm_prompt_template
context_window consumed by AdaptivePass2Runner directly, not stored
regenerate_field consumed by AdaptivePass2Runner directly, not stored

Curator._build_steps() — step construction order

1.  Readers            (one per dataset in config.dataset)
2.  SchemaGate
3.  ExactDeduplicator  (if dedup != "none")
4.  MinHashDeduplicator (if dedup == "minhash")
5.  TextCleaner        (if clean=True)
── Data hygiene (pre-generation) ──────────────────────────────
6.  SecretsGate        (if secrets_gate=True)
7.  PIIPseudonymizer   (if pii_pseudonymize=True)
8.  ToxicityGate       (if toxicity_gate=True)
────────────────────────────────────────────────────────────────
9.  Generation task    (if generation_task + llm_model set)
10. HallucinationGate  (if hallucination_threshold is not None)
11. RewardGate         (if reward_threshold set)
    └── DiagnosticProbe attached to gates 10 and 11 (if enable_diagnostic_probe=True)
12. DiversityGate      (if diversity_threshold set)
13. EmbeddingDeduplicator (if embedding_dedup=True)
14. StratifiedSampler  (if resample=True)
15. MaxSamplesTruncator (if max_samples set) — applied after resampling so the final distribution is preserved
16. Exporters          (one per format in export_formats)

Curator._build_llm() routes to OllamaBackend for ollama/ prefixed models and LiteLLMBackend for everything else.

When enable_reward_refiner=True, a RewardRefiner is held on the Curator instance (it is not a pipeline step) and runs post-pipeline on RewardGate rejections.


Diagnostic module (adaptive recovery)

curatorkit/diagnostic/ — diagnoses why a gate rejected a sample and attempts an inline, targeted repair before the sample is declared lost.

FailureMode taxonomy

curatorkit/diagnostic/failure_modes.py defines nine modes:

FailureMode (9 modes)
├── HallucinationGate causes
│   ├── SOURCE_AMBIGUOUS       source/response relationship unclear
│   ├── GENERATOR_TEMPERATURE  high temperature causes source drift
│   ├── GENERATOR_PARAMETRIC   model ignores source, uses prior knowledge
│   └── THRESHOLD_MARGINAL     score just below threshold, unstable
├── RewardGate causes
│   ├── INSTRUCTION_QUALITY    generated question is poor quality
│   ├── RESPONSE_QUALITY       generated answer is poor quality
│   └── DOMAIN_MISMATCH        generation prompt wrong for this domain
├── DiversityGate cause
│   └── NEAR_DUPLICATE         too similar to an already-accepted sample
└── Fallback
    └── UNKNOWN                probe inconclusive

The same module defines PROMPT_TEMPLATES — the named re-generation prompts (strict_grounding, domain_specific, generate_question, default) — and the FailureDiagnosis dataclass attached to RejectedSample.diagnosis:

Field Meaning
mode Diagnosed FailureMode
evidence Temperature-sweep pass/fail pattern, e.g. [True, False]
probe_calls Total LLM calls consumed by the probes
notes Extra info (e.g. which prompt variant succeeded)
recovered_sample The passing re-generation, or None if all probes were exhausted

Recovery is inline: when a probe path produces a re-generation that passes the gate, the passing sample is stored in recovered_sample and the pipeline routes it forward immediately. There is no separate second pass.

DiagnosticProbe

curatorkit/diagnostic/probe.py

Score-conditioned probe sequence, run by the pipeline gate block via diagnose_batch(). Routing is conditioned on the grounding_score the HallucinationGate wrote to the sample's provenance chain:

grounding_score >= score_split (default 0.5)   — near-boundary path
  Probe 1 — temperature sweep (default [0.3, 0.5])              max 2 calls
    all pass        → THRESHOLD_MARGINAL
    low-T pass only → GENERATOR_TEMPERATURE
    mixed           → THRESHOLD_MARGINAL
    all fail        → continue to prompt variants
  Probe 2a — strict_grounding prompt                            max 1 call
    passes gate     → GENERATOR_PARAMETRIC
  Probe 2b — domain prompt (domain_specific, or the template    max 1 call
             named by metadata["domain_prompt_key"])
    passes gate     → DOMAIN_MISMATCH
  Probe 2c — instruction re-generation (generate_question)     max 1 call
    passes gate     → INSTRUCTION_QUALITY

grounding_score < score_split                  — clearly-failing path
  Probe 2a (strict_grounding) runs first, then the temperature
  sweep, then the remaining prompt variants.

All probes exhausted → SOURCE_AMBIGUOUS (provenance notes present)
                       or UNKNOWN (no provenance notes at all)

Worst case: 5 LLM calls per rejected sample (2 temperature sweep + 3 prompt variants). DPO pairs rejected with rejected_above_threshold exit immediately as RESPONSE_QUALITY with 0 calls — re-generating the chosen answer cannot fix an insufficient quality contrast.

diagnose() never raises — it catches all exceptions and returns UNKNOWN. diagnose_batch(rejected, concurrency=32) diagnoses samples in parallel; each sample's probe sequence stays sequential internally because each result conditions the next probe.

Source context is read from rejected.input (the exact chunk text stored by the generation task). Context extension with adjacent chunks is deliberately excluded: generator and judge operate on the same fixed context, so cross-chunk probing would produce coincidental rather than causal diagnoses.

Every recovered sample receives a DiagnosticProbe ProvenanceRecord recording the successful probe path and the diagnosed mode, so the provenance chain shows exactly how the sample was recovered.

RewardRefiner

curatorkit/diagnostic/reward_refine.py

Targeted single-retry recovery for RewardGate rejections, run post-pipeline by Curator when enable_reward_refiner=True. For each rejected sample it reads the lowest-scoring dimension and the judge's weakness note from provenance, prompts the generator to rewrite the answer to improve that specific dimension (one LLM call per sample, concurrent), then re-evaluates all candidates in a single bulk RewardGate call. With refine_instruction=True it also rewrites poorly formed questions. rejected_above_threshold pairs are skipped. Refined samples carry metadata["reward_refined"], metadata["refinement_axis"], metadata["refinement_type"], and a RewardRefiner provenance record; for DPO pairs the refined answer becomes the new chosen while the original rejected response is preserved.

PipelineDiagnostics

curatorkit/diagnostic/diagnostics.py

Run-level accumulator attached to Pipeline when enable_diagnostic_probe=True. Provides: - mode_counts()dict[str, int] of failure mode frequencies - probe_recovery_count() — number of samples where the probe produced an inline passing re-generation - total_probe_calls() — total LLM calls consumed by all probes - to_dict() / write_summary(path) — summary dict / writes diagnostic_summary.json

Accessible as CuratorResult.diagnostics after a run.

Offline utilities

Two modules in curatorkit/diagnostic/ are not used by production pipelines:

  • pass2.pyAdaptivePass2Runner re-runs diagnosed rejections against a gate with an explicit force_patch config patch (applied via CuratorConfig.apply_patch()), for offline what-if analysis of repair strategies. Returns a Pass2Result with accepted, rejected, conversion_rate, and per-mode rates in summary; writes rejected_pass2.jsonl and pass2_summary.json to output_dir.
  • retriever.pyPostHocRetriever retrieves the most similar chunk for a query by embedding cosine similarity, for comparing exact provenance tracking against post-hoc retrieval.

Provenance manifest

curatorkit/manifest.py

manifest.json is always written after a run. Key fields:

Field Description
pipeline_config_hash SHA-256 of key config fields
run_timestamp ISO-8601 UTC
stage_counts Per-step input / output / rejected counts
rejected_breakdown Rejection reason → count
dedup_stats Extracted from MinHashDeduplicator provenance
diagnostic_stats PipelineDiagnostics.to_dict()null when the probe is inactive
diagnostic_files ["diagnostic_summary.json"] when the probe is active, else []
tool_versions curatorkit and python version strings

rejected.jsonl is always written (even when empty). Each line is a RejectedSample serialised via model_dump_json(). When a diagnosis is present it appears as a nested diagnosis object with mode, was_recovered, evidence, probe_calls, notes.

Checksums (checksums.txt) cover both *.jsonl and *.json output files.


Exporter registry

Exporter Output file Format
AlpacaExporter sft_alpaca.jsonl {instruction, input, output}
ShareGPTExporter sft_sharegpt.jsonl {conversations: [{from, value}]}
DPOExporter dpo.jsonl {prompt, chosen, rejected}
GRPOExporter grpo.jsonl {prompt, responses, rewards}
PPOExporter ppo.jsonl {prompt}
CorpusExporter corpus.jsonl Raw corpus chunks with full chunk metadata and provenance

YAML schema reference

Validated by PipelineConfig in curatorkit/config.py (pydantic) at CLI startup — a malformed config fails immediately with a clear error, not mid-pipeline. See CLI and YAML for the complete per-key reference.

name: curatorkit_pipeline
version: "0.2.0"

max_samples: null              # cap on total samples, applied right after readers
output_split: null             # e.g. {train: 0.8, val: 0.1, test: 0.1}; fractions sum to 1.0
output_split_seed: 42

readers:
  - type: jsonl | json | csv | parquet | huggingface | pdf
    path: data/file.jsonl

    format: auto | alpaca | sharegpt | preference | implicit_preference |
            unpaired_preference | grpo | prompt_only | pretrain

    field_mapping:
      source_key: canonical_key
      nested.source.key: canonical_key

    preprocessing_fn: "mymodule.my_fn"
    detection_sample_size: 10
    source_uri: null                 # provenance override

    # JSON-specific
    json_data_key: null
    # CSV-specific
    csv_delimiter: null
    csv_parse_json_cells: true
    # Parquet-specific
    parquet_columns: null
    parquet_batch_size: 1000

    # HuggingFace-specific
    hf_split: train
    hf_subset: null
    hf_streaming: false
    hf_token: "${HF_TOKEN}"
    hf_columns: null

    # PDF-specific
    chunk_strategy: heading | sentence | fixed
    chunk_max_tokens: 512
    chunk_overlap_tokens: 50
    extract_tables: false
    ocr: false
    min_section_tokens: 30
    output_mode: chunk | qa | preference | grpo | multiturn
    llm_model: null                  # per-reader LLM override

llm:                                 # global LLM config, shared by generators and gates
  model: openai/gpt-4o-mini
  temperature: 0.7
  max_tokens: 1024
  api_key: null
  api_base: null
  concurrency: 10
  timeout: 120.0
  max_retries: 3
  drop_params: true
  extra_body: {}

generators:
  - type: qa | evol_instruct | preference | grpo | multiturn | cot |
          adversarial_preference | adversarial_qa

    # qa
    num_questions: 3
    difficulty: medium
    # evol_instruct
    num_evolutions: 1
    generate_answers: true
    # preference
    preference_mode: single_call | two_pass
    # grpo
    num_responses: 4
    score_responses: true
    temperature_spread: 0.6
    # multiturn
    num_turns: 3
    # cot
    cot_mode: generate | wrap
    # adversarial_preference / adversarial_qa
    injection_rate: 0.5
    injection_types: []              # empty = all types
    injection_seed: 42
    high_temp: 1.4

    prompt_template: null            # custom prompt override
    llm_model: null                  # per-generator LLM override

gates:
  - type: schema
    min_tokens: 10
    max_tokens: 2048
    use_tiktoken: false

  # ── Hygiene gates (before generators) ──
  - type: secrets
    secrets_code_corpus_mode: false  # true for code datasets

  - type: toxicity
    toxicity_classifier_pass_threshold: 0.1
    toxicity_classifier_reject_threshold: 0.5
    toxicity_detoxify_model: unbiased
    # toxicity_llm_model: openai/gpt-4o-mini   # enable Stage 2 LLM judge for borderline

  # ── Generation-time gates ─────────────────────
  - type: hallucination
    hallucination_threshold: 0.7
    skip_if_no_context: true
    hallucination_llm_model: null    # override global LLM

  - type: reward
    reward_threshold: 0.7
    reward_dimensions: [helpfulness, honesty, instruction_following]
    store_score_in_label: true
    reward_llm_model: null           # override global LLM

  - type: diversity
    similarity_threshold: 0.92
    embedding_model: sentence-transformers/all-MiniLM-L6-v2
    embedding_device: null           # cuda | cpu | mps | null (auto)

normalizers:
  - type: exact_dedup | minhash_dedup | text_cleaner | stratified_sampler |
          embedding_dedup | pii_pseudonymizer

  # PIIPseudonymizer (hygiene — after SecretsGate, before generators)
  - type: pii_pseudonymizer
    pii_score_threshold: 0.7
    pii_spacy_model: en_core_web_lg
    pii_faker_seed: 42
    # pii_entity_types: []           # empty = default types; extend for clinical/legal

  - type: embedding_dedup
    embedding_index_dir: output/embedding_index
    embedding_threshold: 0.92

diagnostic:                          # diagnostic probe (opt-in)
  enable_probe: true
  probe_temperatures: [0.3, 0.5]
  probe_generator_model: null        # null = gate's LLM model
  score_split: 0.5
  extra_templates: {}                # override built-in probe prompt templates

exporters:
  - type: alpaca | sharegpt | dpo | grpo | ppo | corpus

Output file reference

File Always written Description
manifest.json yes Pipeline config hash, stage counts, rejection breakdown
rejected.jsonl yes All RejectedSample objects (includes diagnosis when probe active)
checksums.txt yes SHA-256 for all .jsonl and .json output files
dataset_card.md yes Human-readable pipeline summary
sft_alpaca.jsonl if exported Alpaca-format SFT data
sft_sharegpt.jsonl if exported ShareGPT-format SFT data
dpo.jsonl if exported DPO preference data
grpo.jsonl if exported GRPO rollout data
ppo.jsonl if exported PPO prompt data
corpus.jsonl if exported Raw corpus chunks with provenance metadata
diagnostic_summary.json when probe active Mode counts, recovery counts, probe call totals
rejected_pass2.jsonl offline utility only Written by AdaptivePass2Runner — samples that failed the gate re-check
pass2_summary.json offline utility only Written by AdaptivePass2Runner — per-mode conversion rates