CuratorKIT Architecture¶
DataSample contract¶
DataSample is the canonical unit of data moving through the pipeline.
| Field | Type | Default | Notes |
|---|---|---|---|
id |
str |
auto UUID4 | Stable identifier |
source_uri |
str |
required | URI of the source document or file |
instruction |
str |
"" |
Prompt / instruction (SFT, preference prompt) |
input |
str |
"" |
Optional Alpaca-style context; also stores source chunk text for QA-generated samples |
output |
str |
"" |
Target response (SFT); full text (pretrain) |
chosen |
str |
"" |
Preferred completion (DPO preference) |
rejected |
str |
"" |
Dispreferred completion (DPO preference) |
label |
float \| None |
None |
Quality label (unpaired preference / reward) |
responses |
list[str] |
[] |
GRPO group rollouts |
reward_scores |
list[float] |
[] |
Paired with responses |
task_type |
str |
"instruction_following" |
See vocabulary below |
metadata |
dict |
{} |
Unmapped columns and step-specific data |
provenance_chain |
list[ProvenanceRecord] |
[] |
Append-only |
task_type vocabulary¶
| Value | Active fields | Typical use |
|---|---|---|
instruction_following |
instruction, input, output | Alpaca SFT |
conversational |
instruction, output, metadata["turns"] | ShareGPT / ChatML SFT |
preference |
instruction, chosen, rejected | DPO with explicit prompt |
implicit_preference |
chosen, rejected (instruction = extracted prefix) | DPO implicit prompt |
unpaired_preference |
instruction, output, label | Reward model training |
grpo |
instruction, responses, reward_scores | GRPO rollouts |
prompt_only |
instruction | PPO prompt collection |
language_modeling |
output | Continued pre-training; PDF chunk mode |
source_chunk |
input | Raw corpus chunks awaiting generation |
RejectedSample contract¶
RejectedSample extends DataSample and is emitted by every step that drops a sample. No step silently discards — every rejection produces a RejectedSample with a structured reason.
| Field | Type | Notes |
|---|---|---|
rejection_reason |
str |
Structured string, e.g. "hallucination_contract_failed:0.42" |
rejecting_step |
str |
Class name of the step that rejected the sample |
diagnosis |
FailureDiagnosis \| None |
Populated by DiagnosticProbe when enable_diagnostic_probe=True; None otherwise |
The diagnosis field is serialised to rejected.jsonl via an overridden model_dump() — FailureDiagnosis is a plain dataclass, not a Pydantic model, so the override calls .to_dict() before serialisation.
Plugin interface guide¶
All pipeline components implement one of four ABCs in curatorkit/interfaces.py:
BaseReader → read() -> tuple[list[DataSample], list[RejectedSample]]
BaseGate → run(samples) -> tuple[list[DataSample], list[RejectedSample]]
BaseNormalizer → run(samples) -> list[DataSample]
BaseExporter → export(samples, output_dir) -> None
Note: BaseReader.read() returns a tuple identical to BaseGate.run().
Reader-level parse failures are first-class RejectedSample objects.
Connector layer¶
All connectors inherit from BaseConnector in curatorkit/connectors/base.py.
Subclasses only implement _iter_rows() which yields (line_no, raw_dict) tuples.
The base class owns the full pipeline for each row:
raw_dict
→ preprocessing_fn (optional user callable — structural normalisation)
→ field_mapping (flat key renames; dot notation for nested paths)
→ FormatDetector (3-layer detection, committed from first N rows)
→ DataSample construction (format-specific, with role normalisation)
→ RejectedSample (for every failure — no silent drops)
Supported connector types¶
| Type | Class | File | Optional dep |
|---|---|---|---|
jsonl |
JSONLReader |
connectors/jsonl.py |
none |
json |
JSONReader |
connectors/json_reader.py |
none |
csv |
CSVReader |
connectors/csv_reader.py |
none |
parquet |
ParquetReader |
connectors/parquet_reader.py |
pyarrow |
huggingface |
HuggingFaceReader |
connectors/huggingface.py |
datasets |
pdf |
PDFReader |
connectors/pdf.py |
mineru (install extra: pdf) |
PDFReader¶
PDFReader chunks PDF documents and can operate in two modes controlled by output_mode:
output_mode |
Behaviour |
|---|---|
"chunk" |
Emits one DataSample per text chunk (output = chunk text) |
"qa" / "preference" / "grpo" / "multiturn" |
Internally runs an LLM generation pass and emits generated samples directly |
Chunks use a heading-aware strategy with configurable chunk_max_tokens (default 512) and chunk_overlap_tokens (default 50). Each chunk sample carries metadata["chunk_index"], metadata["page"], metadata["parent_heading"], and metadata["source_file"] — these are forwarded to generated samples so chunk-level provenance survives generation.
Format detection system¶
curatorkit/detection/detector.py — FormatDetector
Layer 1 — column-set candidate generation
Uses semantic equivalence classes rather than exact string matches:
- INSTRUCTION_COLS: instruction, prompt, query, question, input, ...
- OUTPUT_COLS: output, response, completion, answer, ...
- CHOSEN_COLS: chosen, preferred, accepted, response_a, ...
- REJECTED_COLS: rejected, refused, dispreferred, response_b, ...
Generates all matching candidates in priority order. Does not stop at first match.
Layer 2 — value-type validation
Validates actual cell values against the expected type for each candidate format.
A sharegpt candidate requires the conversations column value to be list[dict],
not a flat string. Candidates that fail layer 2 are dropped; the next candidate is tried.
Layer 3 — role alias normalisation
Applied during DataSample construction after format is committed.
Normalises from/value → role/content and expands role aliases:
- human, user, input → "user"
- gpt, assistant, model, output → "assistant"
- system → "system"
Detection result confidence¶
| Level | Meaning |
|---|---|
HIGH |
Exact or near-exact column-set match, layer 2 passed |
MEDIUM |
Partial match or layer 2 passed without sample rows |
LOW |
Best guess only — pipeline warns, emits provenance note |
UNKNOWN |
No candidate survived — rows become RejectedSample |
SchemaGate¶
SchemaGate is task-type-aware.
Default field validation per task_type:
| task_type | Fields checked |
|---|---|
| SFT family | instruction + output non-empty |
| preference / implicit_preference | chosen + rejected non-empty |
| grpo | instruction + responses (non-empty list) |
| language_modeling | output non-empty |
| prompt_only | instruction non-empty; output may be empty |
| source_chunk | input non-empty; instruction/output generated later |
Token counting is also task-type-aware:
- SFT: instruction + output
- Preference: instruction + the longer of chosen / rejected
- GRPO: instruction + longest response
- Pretrain: output
- Source chunk: input
LLM abstraction layer¶
curatorkit/llm/ provides a provider-agnostic LLM interface.
BaseLLM (ABC)
.generate(messages, **kwargs) -> LLMResponse # sync
.agenerate(messages, **kwargs) -> LLMResponse # async
.config_hash() -> str # stable hash for provenance
.temperature: float # mutable default; callers can override per call
LiteLLMBackend → routes to OpenAI, Anthropic, Gemini, Mistral, vLLM, etc.
OllamaBackend → routes to local Ollama server
LLMResponse carries .text, .model, .usage (token counts), and .to_provenance_dict() which serialises model name, token usage, and finish reason into the ProvenanceRecord.notes dict.
Generation tasks¶
curatorkit/generators/ — all tasks extend BaseGenerationTask(BaseNormalizer).
BaseGenerationTask
._build_messages(sample) -> list[dict] # abstract — build the LLM prompt
._parse_response(sample, response) -> list[DataSample] # abstract — parse output
.run(samples) -> list[DataSample] # sync, one LLM call per sample
.run_async(samples) -> list[DataSample] # async, concurrency-controlled
.flush_rejected() -> list[RejectedSample] # return and clear accumulated failures
Parse failure handling: if _parse_response() returns [] (parse failed), the source sample is emitted as a RejectedSample with rejection_reason="generation_parse_failed:{task_name}". No silent drops.
Available generation tasks¶
| Task | Class | Output task_type |
Notes |
|---|---|---|---|
| QA generation | QAGenerationTask |
instruction_following |
Stores source chunk in sample.input; forwards chunk_index, page, parent_heading to metadata |
| Preference pairs | PreferenceGenerationTask |
preference |
mode="single_call" (1 LLM call) or mode="two_pass" (2 calls, higher quality contrast) |
| GRPO rollouts | GRPORolloutTask |
grpo |
Generates N responses per prompt; optional reward scoring via separate LLM call |
| Multi-turn | MultiTurnTask |
conversational |
Generates N turn pairs in a simulated dialogue |
| Evol-Instruct | EvolInstructTask |
instruction_following |
Rewrites instructions at increasing complexity; optionally generates answers |
| Chain-of-thought | ChainOfThoughtTask |
instruction_following |
mode="generate" (new CoT) or mode="wrap" (wrap existing answer) |
| Adversarial preference | AdversarialPreferenceTask |
preference |
Faithful chosen + adversarially corrupted rejected; corruption controlled by injection_rate / injection_types |
| Adversarial QA | AdversarialQAGenerationTask |
instruction_following |
QA pairs with a controlled fraction of injected failure modes; marks metadata["injected_failure"] and metadata["injection_type"] |
Each generated sample carries a ProvenanceRecord with the LLM's model name, token usage, prompt hash, and source sample ID.
Quality gates¶
Quality gates sit after the generation step and filter on content rather than structure.
HallucinationGate¶
curatorkit/gates/hallucination.py
Uses an LLM judge to verify that each generated answer is grounded in its source chunk. The source is read from sample.input — the exact chunk text stored by QAGenerationTask. This is provenance-exact: the judge always sees the same chunk that generated the sample, not a retrieved approximation.
Scoring prompt asks the judge to rate grounding on a 0–1 scale. Samples below threshold (default 0.7) become RejectedSample with rejection_reason="hallucination_contract_failed:{score:.2f}". Samples with no source context pass through silently by default (skip_if_no_context=True); with skip_if_no_context=False they are rejected with rejection_reason="hallucination_gate:no_source_context".
The ProvenanceRecord includes grounding_score, unsupported_claims, and the judge verdict — the DiagnosticProbe reads these for score-conditioned routing and fallback classification.
Probe attachment point: when enable_diagnostic_probe=True, Curator._build_steps() sets gate.probe = DiagnosticProbe(...) on both the HallucinationGate and the RewardGate instances. The pipeline's gate block checks getattr(step, "probe", None) and calls probe.diagnose_batch(rejected) on each gate's rejections — see the gate block below.
RewardGate¶
curatorkit/gates/reward.py
Uses an LLM judge to score sample quality across configurable dimensions (default: helpfulness, honesty, instruction_following). Scores are averaged; samples below threshold are rejected. The judge does not receive the source chunk — this is purely instruction-following quality, not faithfulness.
DiversityGate¶
curatorkit/gates/diversity.py
Embedding-based near-duplicate filter. Uses sentence-transformers to encode each sample and reject those with cosine similarity above similarity_threshold against already-accepted samples in the current batch.
Cross-run embedding deduplication¶
curatorkit/normalizers/embedding_dedup.py — EmbeddingDeduplicator
Extends within-batch deduplication with a persistent embedding index that survives across pipeline runs. The index is stored as embeddings.npy + metadata.json in embedding_index_dir.
- Cross-run dedup: new samples are checked against the full index before being checked against each other.
- Within-run dedup: samples in the current batch that pass the index check are still deduplicated against each other.
- FAISS-optional: uses FAISS for ANN if installed (
pip install "curatorkit[embedding-faiss]"); falls back to NumPy brute-force otherwise. - The
text_field="auto"mode picks the embedding target based ontask_type(e.g.chosenfor preference,instruction + outputfor SFT).
Pipeline orchestration¶
curatorkit/pipeline.py — Pipeline
Pipeline(steps, output_dir, diagnostics=None)
.run() -> PipelineResult # sync
.run_async() -> PipelineResult # async
The runner dispatches on step type. Generation tasks (BaseGenerationTask subclasses) are BaseNormalizer instances with an additional run_async() and flush_rejected(). The pipeline calls flush_rejected() after each generation step and merges the failures into all_rejected.
Gate block — with diagnostic probe¶
Both run() and run_async() contain identical gate blocks. When a gate has a probe attribute attached, the block diagnoses every rejection in one concurrent batch and merges probe-recovered samples back into the stream:
passed, rejected = gate.run(samples)
probe = getattr(step, "probe", None)
probe_recovered = []
if probe is not None and rejected:
diagnoses = probe.diagnose_batch(rejected) # ← concurrent inline probe
for r, diag in zip(rejected, diagnoses):
r.diagnosis = diag
if self._diagnostics is not None:
self._diagnostics.record(r)
if diag.recovered_sample is not None:
probe_recovered.append(diag.recovered_sample)
all_rejected.extend(rejected)
samples = passed + probe_recovered # recovered samples flow into the next step
Pipelines without a probe are unaffected — getattr(step, "probe", None) returns None and the block is skipped. Each gate's stage_counts entry records the number of probe-recovered samples.
PipelineResult¶
@dataclass
class PipelineResult:
passed: list[DataSample]
rejected: list[RejectedSample]
stage_counts: dict[str, dict[str, int]]
wall_clock_seconds: float
diagnostics: PipelineDiagnostics | None # None when probe is inactive
stage_counts has one entry per step: {input_count, output_count, probe_recovered, rejected_count} for gates; {output_count, rejected_count} for readers; {input_count, output_count} for normalizers (generation tasks add rejected_count); {exported_count} for exporters.
CuratorConfig and Curator entry point¶
curatorkit/curator.py
CuratorConfig is a flat @dataclass (not Pydantic). All pipeline behaviour is driven by its fields — no nested sub-objects.
Key config field groups¶
| Group | Fields |
|---|---|
| Source | dataset, split, subset, streaming, max_samples |
| LLM | llm_model, llm_temperature, llm_max_tokens, llm_api_key, llm_api_base, llm_concurrency |
| Generation | generation_task, num_questions, num_responses, num_turns, num_evolutions, difficulty, preference_mode, cot_mode |
| Quality gates | hallucination_threshold, reward_threshold, reward_dimensions, diversity_threshold |
| Cross-run dedup | embedding_dedup, embedding_index_dir, embedding_dedup_threshold, embedding_model |
pdf_output_mode, pdf_extract_tables |
|
| Adaptive recovery | enable_diagnostic_probe, probe_temperatures, probe_score_split, probe_generator_model, probe_extra_templates, enable_reward_refiner |
| Export | output_dir, export_formats |
apply_patch()¶
CuratorConfig.apply_patch(patch: dict) -> CuratorConfig returns a shallow copy with only the patched fields changed. The original config is never mutated. Used by AdaptivePass2Runner (an offline utility — see the diagnostic module section) to apply config patches when re-running rejected samples:
| Patch key | Config field modified |
|---|---|
llm_temperature |
config.llm_temperature |
prompt_template |
config.llm_prompt_template |
context_window |
consumed by AdaptivePass2Runner directly, not stored |
regenerate_field |
consumed by AdaptivePass2Runner directly, not stored |
Curator._build_steps() — step construction order¶
1. Readers (one per dataset in config.dataset)
2. SchemaGate
3. ExactDeduplicator (if dedup != "none")
4. MinHashDeduplicator (if dedup == "minhash")
5. TextCleaner (if clean=True)
── Data hygiene (pre-generation) ──────────────────────────────
6. SecretsGate (if secrets_gate=True)
7. PIIPseudonymizer (if pii_pseudonymize=True)
8. ToxicityGate (if toxicity_gate=True)
────────────────────────────────────────────────────────────────
9. Generation task (if generation_task + llm_model set)
10. HallucinationGate (if hallucination_threshold is not None)
11. RewardGate (if reward_threshold set)
└── DiagnosticProbe attached to gates 10 and 11 (if enable_diagnostic_probe=True)
12. DiversityGate (if diversity_threshold set)
13. EmbeddingDeduplicator (if embedding_dedup=True)
14. StratifiedSampler (if resample=True)
15. MaxSamplesTruncator (if max_samples set) — applied after resampling so the final distribution is preserved
16. Exporters (one per format in export_formats)
Curator._build_llm() routes to OllamaBackend for ollama/ prefixed models and LiteLLMBackend for everything else.
When enable_reward_refiner=True, a RewardRefiner is held on the Curator instance (it is not a pipeline step) and runs post-pipeline on RewardGate rejections.
Diagnostic module (adaptive recovery)¶
curatorkit/diagnostic/ — diagnoses why a gate rejected a sample and attempts an inline, targeted repair before the sample is declared lost.
FailureMode taxonomy¶
curatorkit/diagnostic/failure_modes.py defines nine modes:
FailureMode (9 modes)
├── HallucinationGate causes
│ ├── SOURCE_AMBIGUOUS source/response relationship unclear
│ ├── GENERATOR_TEMPERATURE high temperature causes source drift
│ ├── GENERATOR_PARAMETRIC model ignores source, uses prior knowledge
│ └── THRESHOLD_MARGINAL score just below threshold, unstable
├── RewardGate causes
│ ├── INSTRUCTION_QUALITY generated question is poor quality
│ ├── RESPONSE_QUALITY generated answer is poor quality
│ └── DOMAIN_MISMATCH generation prompt wrong for this domain
├── DiversityGate cause
│ └── NEAR_DUPLICATE too similar to an already-accepted sample
└── Fallback
└── UNKNOWN probe inconclusive
The same module defines PROMPT_TEMPLATES — the named re-generation prompts (strict_grounding, domain_specific, generate_question, default) — and the FailureDiagnosis dataclass attached to RejectedSample.diagnosis:
| Field | Meaning |
|---|---|
mode |
Diagnosed FailureMode |
evidence |
Temperature-sweep pass/fail pattern, e.g. [True, False] |
probe_calls |
Total LLM calls consumed by the probes |
notes |
Extra info (e.g. which prompt variant succeeded) |
recovered_sample |
The passing re-generation, or None if all probes were exhausted |
Recovery is inline: when a probe path produces a re-generation that passes the gate, the passing sample is stored in recovered_sample and the pipeline routes it forward immediately. There is no separate second pass.
DiagnosticProbe¶
curatorkit/diagnostic/probe.py
Score-conditioned probe sequence, run by the pipeline gate block via diagnose_batch(). Routing is conditioned on the grounding_score the HallucinationGate wrote to the sample's provenance chain:
grounding_score >= score_split (default 0.5) — near-boundary path
Probe 1 — temperature sweep (default [0.3, 0.5]) max 2 calls
all pass → THRESHOLD_MARGINAL
low-T pass only → GENERATOR_TEMPERATURE
mixed → THRESHOLD_MARGINAL
all fail → continue to prompt variants
Probe 2a — strict_grounding prompt max 1 call
passes gate → GENERATOR_PARAMETRIC
Probe 2b — domain prompt (domain_specific, or the template max 1 call
named by metadata["domain_prompt_key"])
passes gate → DOMAIN_MISMATCH
Probe 2c — instruction re-generation (generate_question) max 1 call
passes gate → INSTRUCTION_QUALITY
grounding_score < score_split — clearly-failing path
Probe 2a (strict_grounding) runs first, then the temperature
sweep, then the remaining prompt variants.
All probes exhausted → SOURCE_AMBIGUOUS (provenance notes present)
or UNKNOWN (no provenance notes at all)
Worst case: 5 LLM calls per rejected sample (2 temperature sweep + 3 prompt variants). DPO pairs rejected with rejected_above_threshold exit immediately as RESPONSE_QUALITY with 0 calls — re-generating the chosen answer cannot fix an insufficient quality contrast.
diagnose() never raises — it catches all exceptions and returns UNKNOWN. diagnose_batch(rejected, concurrency=32) diagnoses samples in parallel; each sample's probe sequence stays sequential internally because each result conditions the next probe.
Source context is read from rejected.input (the exact chunk text stored by the generation task). Context extension with adjacent chunks is deliberately excluded: generator and judge operate on the same fixed context, so cross-chunk probing would produce coincidental rather than causal diagnoses.
Every recovered sample receives a DiagnosticProbe ProvenanceRecord recording the successful probe path and the diagnosed mode, so the provenance chain shows exactly how the sample was recovered.
RewardRefiner¶
curatorkit/diagnostic/reward_refine.py
Targeted single-retry recovery for RewardGate rejections, run post-pipeline by Curator when enable_reward_refiner=True. For each rejected sample it reads the lowest-scoring dimension and the judge's weakness note from provenance, prompts the generator to rewrite the answer to improve that specific dimension (one LLM call per sample, concurrent), then re-evaluates all candidates in a single bulk RewardGate call. With refine_instruction=True it also rewrites poorly formed questions. rejected_above_threshold pairs are skipped. Refined samples carry metadata["reward_refined"], metadata["refinement_axis"], metadata["refinement_type"], and a RewardRefiner provenance record; for DPO pairs the refined answer becomes the new chosen while the original rejected response is preserved.
PipelineDiagnostics¶
curatorkit/diagnostic/diagnostics.py
Run-level accumulator attached to Pipeline when enable_diagnostic_probe=True. Provides:
- mode_counts() — dict[str, int] of failure mode frequencies
- probe_recovery_count() — number of samples where the probe produced an inline passing re-generation
- total_probe_calls() — total LLM calls consumed by all probes
- to_dict() / write_summary(path) — summary dict / writes diagnostic_summary.json
Accessible as CuratorResult.diagnostics after a run.
Offline utilities¶
Two modules in curatorkit/diagnostic/ are not used by production pipelines:
pass2.py—AdaptivePass2Runnerre-runs diagnosed rejections against a gate with an explicitforce_patchconfig patch (applied viaCuratorConfig.apply_patch()), for offline what-if analysis of repair strategies. Returns aPass2Resultwithaccepted,rejected,conversion_rate, and per-mode rates insummary; writesrejected_pass2.jsonlandpass2_summary.jsontooutput_dir.retriever.py—PostHocRetrieverretrieves the most similar chunk for a query by embedding cosine similarity, for comparing exact provenance tracking against post-hoc retrieval.
Provenance manifest¶
curatorkit/manifest.py
manifest.json is always written after a run. Key fields:
| Field | Description |
|---|---|
pipeline_config_hash |
SHA-256 of key config fields |
run_timestamp |
ISO-8601 UTC |
stage_counts |
Per-step input / output / rejected counts |
rejected_breakdown |
Rejection reason → count |
dedup_stats |
Extracted from MinHashDeduplicator provenance |
diagnostic_stats |
PipelineDiagnostics.to_dict() — null when the probe is inactive |
diagnostic_files |
["diagnostic_summary.json"] when the probe is active, else [] |
tool_versions |
curatorkit and python version strings |
rejected.jsonl is always written (even when empty). Each line is a RejectedSample serialised via model_dump_json(). When a diagnosis is present it appears as a nested diagnosis object with mode, was_recovered, evidence, probe_calls, notes.
Checksums (checksums.txt) cover both *.jsonl and *.json output files.
Exporter registry¶
| Exporter | Output file | Format |
|---|---|---|
AlpacaExporter |
sft_alpaca.jsonl |
{instruction, input, output} |
ShareGPTExporter |
sft_sharegpt.jsonl |
{conversations: [{from, value}]} |
DPOExporter |
dpo.jsonl |
{prompt, chosen, rejected} |
GRPOExporter |
grpo.jsonl |
{prompt, responses, rewards} |
PPOExporter |
ppo.jsonl |
{prompt} |
CorpusExporter |
corpus.jsonl |
Raw corpus chunks with full chunk metadata and provenance |
YAML schema reference¶
Validated by PipelineConfig in curatorkit/config.py (pydantic) at CLI startup — a malformed config fails immediately with a clear error, not mid-pipeline. See CLI and YAML for the complete per-key reference.
name: curatorkit_pipeline
version: "0.2.0"
max_samples: null # cap on total samples, applied right after readers
output_split: null # e.g. {train: 0.8, val: 0.1, test: 0.1}; fractions sum to 1.0
output_split_seed: 42
readers:
- type: jsonl | json | csv | parquet | huggingface | pdf
path: data/file.jsonl
format: auto | alpaca | sharegpt | preference | implicit_preference |
unpaired_preference | grpo | prompt_only | pretrain
field_mapping:
source_key: canonical_key
nested.source.key: canonical_key
preprocessing_fn: "mymodule.my_fn"
detection_sample_size: 10
source_uri: null # provenance override
# JSON-specific
json_data_key: null
# CSV-specific
csv_delimiter: null
csv_parse_json_cells: true
# Parquet-specific
parquet_columns: null
parquet_batch_size: 1000
# HuggingFace-specific
hf_split: train
hf_subset: null
hf_streaming: false
hf_token: "${HF_TOKEN}"
hf_columns: null
# PDF-specific
chunk_strategy: heading | sentence | fixed
chunk_max_tokens: 512
chunk_overlap_tokens: 50
extract_tables: false
ocr: false
min_section_tokens: 30
output_mode: chunk | qa | preference | grpo | multiturn
llm_model: null # per-reader LLM override
llm: # global LLM config, shared by generators and gates
model: openai/gpt-4o-mini
temperature: 0.7
max_tokens: 1024
api_key: null
api_base: null
concurrency: 10
timeout: 120.0
max_retries: 3
drop_params: true
extra_body: {}
generators:
- type: qa | evol_instruct | preference | grpo | multiturn | cot |
adversarial_preference | adversarial_qa
# qa
num_questions: 3
difficulty: medium
# evol_instruct
num_evolutions: 1
generate_answers: true
# preference
preference_mode: single_call | two_pass
# grpo
num_responses: 4
score_responses: true
temperature_spread: 0.6
# multiturn
num_turns: 3
# cot
cot_mode: generate | wrap
# adversarial_preference / adversarial_qa
injection_rate: 0.5
injection_types: [] # empty = all types
injection_seed: 42
high_temp: 1.4
prompt_template: null # custom prompt override
llm_model: null # per-generator LLM override
gates:
- type: schema
min_tokens: 10
max_tokens: 2048
use_tiktoken: false
# ── Hygiene gates (before generators) ──
- type: secrets
secrets_code_corpus_mode: false # true for code datasets
- type: toxicity
toxicity_classifier_pass_threshold: 0.1
toxicity_classifier_reject_threshold: 0.5
toxicity_detoxify_model: unbiased
# toxicity_llm_model: openai/gpt-4o-mini # enable Stage 2 LLM judge for borderline
# ── Generation-time gates ─────────────────────
- type: hallucination
hallucination_threshold: 0.7
skip_if_no_context: true
hallucination_llm_model: null # override global LLM
- type: reward
reward_threshold: 0.7
reward_dimensions: [helpfulness, honesty, instruction_following]
store_score_in_label: true
reward_llm_model: null # override global LLM
- type: diversity
similarity_threshold: 0.92
embedding_model: sentence-transformers/all-MiniLM-L6-v2
embedding_device: null # cuda | cpu | mps | null (auto)
normalizers:
- type: exact_dedup | minhash_dedup | text_cleaner | stratified_sampler |
embedding_dedup | pii_pseudonymizer
# PIIPseudonymizer (hygiene — after SecretsGate, before generators)
- type: pii_pseudonymizer
pii_score_threshold: 0.7
pii_spacy_model: en_core_web_lg
pii_faker_seed: 42
# pii_entity_types: [] # empty = default types; extend for clinical/legal
- type: embedding_dedup
embedding_index_dir: output/embedding_index
embedding_threshold: 0.92
diagnostic: # diagnostic probe (opt-in)
enable_probe: true
probe_temperatures: [0.3, 0.5]
probe_generator_model: null # null = gate's LLM model
score_split: 0.5
extra_templates: {} # override built-in probe prompt templates
exporters:
- type: alpaca | sharegpt | dpo | grpo | ppo | corpus
Output file reference¶
| File | Always written | Description |
|---|---|---|
manifest.json |
yes | Pipeline config hash, stage counts, rejection breakdown |
rejected.jsonl |
yes | All RejectedSample objects (includes diagnosis when probe active) |
checksums.txt |
yes | SHA-256 for all .jsonl and .json output files |
dataset_card.md |
yes | Human-readable pipeline summary |
sft_alpaca.jsonl |
if exported | Alpaca-format SFT data |
sft_sharegpt.jsonl |
if exported | ShareGPT-format SFT data |
dpo.jsonl |
if exported | DPO preference data |
grpo.jsonl |
if exported | GRPO rollout data |
ppo.jsonl |
if exported | PPO prompt data |
corpus.jsonl |
if exported | Raw corpus chunks with provenance metadata |
diagnostic_summary.json |
when probe active | Mode counts, recovery counts, probe call totals |
rejected_pass2.jsonl |
offline utility only | Written by AdaptivePass2Runner — samples that failed the gate re-check |
pass2_summary.json |
offline utility only | Written by AdaptivePass2Runner — per-mode conversion rates |