Data models
Every sample flowing through a pipeline is a DataSample; rejections are wrapped in
RejectedSample; each step appends an immutable ProvenanceRecord.
curatorkit.schema.DataSample
Bases: BaseModel
The canonical unit of data moving through the pipeline.
task_type vocabulary (use these strings in YAML and code):
  instruction_following: single-turn SFT (Alpaca family)
  conversational: multi-turn SFT (ShareGPT / ChatML)
  preference: DPO with explicit chosen/rejected + optional prompt
  implicit_preference: DPO where prompt is embedded inside chosen/rejected turns
  unpaired_preference: single completion with scalar quality label
  grpo: group rollouts with reward scores
  prompt_only: PPO-style, prompt only; response generated at runtime
  language_modeling: continued pre-training, full text sequences
Field usage by task_type
  instruction_following → instruction, input (optional), output
  conversational → instruction (first human turn), output (first assistant turn), metadata["turns"] for subsequent turns, metadata["system_prompt"] if present
  preference → instruction (prompt), chosen, rejected
  implicit_preference → chosen, rejected (instruction extracted from common prefix)
  unpaired_preference → instruction (prompt), output (completion), label
  grpo → instruction (prompt), responses, reward_scores
  prompt_only → instruction (prompt)
  language_modeling → output (full text sequence)
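The field mapping above can be turned into a quick pre-flight check. The following is a minimal sketch that works on plain dicts rather than on DataSample itself; REQUIRED_FIELDS and missing_fields are hypothetical names, not part of curatorkit, and treating the listed fields as mandatory is an assumption (input is left out because it is marked optional, and instruction is left out for preference because the vocabulary describes the prompt as optional).

```python
# Hypothetical helper, not part of curatorkit: checks a plain dict against
# the "Field usage by task_type" mapping. Which fields are strictly
# required is an assumption made for this sketch.
REQUIRED_FIELDS = {
    "instruction_following": ("instruction", "output"),
    "conversational": ("instruction", "output"),
    "preference": ("chosen", "rejected"),
    "implicit_preference": ("chosen", "rejected"),
    "unpaired_preference": ("instruction", "output", "label"),
    "grpo": ("instruction", "responses", "reward_scores"),
    "prompt_only": ("instruction",),
    "language_modeling": ("output",),
}


def missing_fields(task_type, sample):
    """Return the names of expected fields that are absent or None."""
    if task_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown task_type: {task_type!r}")
    return [name for name in REQUIRED_FIELDS[task_type] if sample.get(name) is None]


grpo_sample = {"instruction": "Summarise the text.", "responses": ["a", "b"]}
print(missing_fields("grpo", grpo_sample))
```

Running a check like this before building samples gives an early signal that a record would probably end up as a RejectedSample.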
curatorkit.schema.RejectedSample
Bases: DataSample
A DataSample that could not be parsed or failed a gate check.
rejection_reason is a structured string:
  "missing_field:{field}"
  "json_decode_error:{msg}"
  "format_mismatch:{detail}"
  "unrecognized_format:{detail}"
  "preprocessing_fn_error:{msg}"
  "low_confidence_format:{candidate}"
  "encoding_error:{detail}"
  "below_min_tokens:{count}"
  "above_max_tokens:{count}"
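Because every reason follows a code:detail pattern, it can be split on the first colon and tallied. This is a minimal sketch operating on raw strings; split_rejection_reason and count_rejection_codes are hypothetical helpers, not curatorkit functions.

```python
from collections import Counter


def split_rejection_reason(reason):
    """Split 'code:detail' on the first colon; the detail may contain colons."""
    code, _, detail = reason.partition(":")
    return code, detail


def count_rejection_codes(reasons):
    """Tally how often each rejection code occurs."""
    return Counter(split_rejection_reason(r)[0] for r in reasons)


reasons = [
    "missing_field:output",
    "below_min_tokens:3",
    "missing_field:chosen",
]
print(count_rejection_codes(reasons))
```

Splitting only on the first colon matters for codes such as json_decode_error, whose message can itself contain colons.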
curatorkit.schema.ProvenanceRecord
Bases: BaseModel
Immutable record appended by each pipeline step.
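To illustrate what "immutable record appended by each step" means in practice, here is a minimal sketch using a frozen dataclass as a stand-in. StepRecord, its step and note fields, and append_record are all hypothetical; the real ProvenanceRecord is a Pydantic model whose fields are not listed on this page.

```python
import dataclasses


# Hypothetical stand-in for ProvenanceRecord; field names are assumptions.
@dataclasses.dataclass(frozen=True)
class StepRecord:
    step: str
    note: str = ""


def append_record(history, record):
    """Return a new history tuple with the record added; the input is untouched."""
    return history + (record,)


history = ()
history = append_record(history, StepRecord("dedup", "dropped=0"))
history = append_record(history, StepRecord("length_gate"))

try:
    history[0].step = "edited"  # frozen dataclasses reject attribute assignment
except dataclasses.FrozenInstanceError:
    print("records cannot be modified after creation")
```

Keeping the history as a tuple of frozen records means each step can only add to the trail and cannot rewrite earlier entries.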