Data models

Every sample flowing through a pipeline is a DataSample; rejections are wrapped in RejectedSample; each step appends an immutable ProvenanceRecord.

curatorkit.schema.DataSample

Bases: BaseModel

The canonical unit of data moving through the pipeline.

task_type vocabulary (use these strings in YAML and code):

instruction_following: single-turn SFT (Alpaca family)
conversational: multi-turn SFT (ShareGPT / ChatML)
preference: DPO with explicit chosen/rejected + optional prompt
implicit_preference: DPO where the prompt is embedded inside the chosen/rejected turns
unpaired_preference: single completion with a scalar quality label
grpo: group rollouts with reward scores
prompt_only: PPO-style, prompt only; the response is generated at runtime
language_modeling: continued pre-training, full text sequences

Field usage by task_type

instruction_following → instruction, input (optional), output
conversational → instruction (first human turn), output (first assistant turn), metadata["turns"] for subsequent turns, metadata["system_prompt"] if present
preference → instruction (prompt), chosen, rejected
implicit_preference → chosen, rejected (instruction extracted from common prefix)
unpaired_preference → instruction (prompt), output (completion), label
grpo → instruction (prompt), responses, reward_scores
prompt_only → instruction (prompt)
language_modeling → output (full text sequence)
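
The field usage above can be read as a lookup from task_type to the fields a sample is expected to populate. The following is a minimal sketch of that reading, not curatorkit code: REQUIRED_FIELDS and missing_fields are hypothetical names, samples are represented as plain dicts rather than DataSample instances, and fields described as optional (input, and the prompt for preference) are left out.

```python
# Hypothetical helper, not part of curatorkit. Maps each task_type to the
# fields listed above as populated, excluding those marked optional.
REQUIRED_FIELDS = {
    "instruction_following": ("instruction", "output"),
    "conversational": ("instruction", "output"),
    "preference": ("chosen", "rejected"),
    "implicit_preference": ("chosen", "rejected"),
    "unpaired_preference": ("instruction", "output", "label"),
    "grpo": ("instruction", "responses", "reward_scores"),
    "prompt_only": ("instruction",),
    "language_modeling": ("output",),
}


def missing_fields(task_type, sample):
    """Return the expected fields that are absent or None in a plain dict."""
    if task_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown task_type: {task_type!r}")
    # Compare against None so falsy values such as a label of 0 still count.
    return [name for name in REQUIRED_FIELDS[task_type] if sample.get(name) is None]
```

A check like this could run before constructing a sample, so that an incomplete record can be reported with a "missing_field:{field}" reason instead of failing later in the pipeline.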

curatorkit.schema.RejectedSample

Bases: DataSample

A DataSample that could not be parsed or failed a gate check.

rejection_reason is a structured string in one of these forms:

"missing_field:{field}"
"json_decode_error:{msg}"
"format_mismatch:{detail}"
"unrecognized_format:{detail}"
"preprocessing_fn_error:{msg}"
"low_confidence_format:{candidate}"
"encoding_error:{detail}"
"below_min_tokens:{count}"
"above_max_tokens:{count}"

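Because every rejection_reason form above is a code followed by a colon and a detail, the string can be split at the first colon to group rejections by cause. This is a minimal sketch under that assumption; split_rejection_reason and count_by_code are hypothetical helper names, not part of curatorkit.

```python
from collections import Counter


def split_rejection_reason(reason):
    """Split "code:detail" at the first colon only.

    The detail part may itself contain colons (for example a JSON decoder
    message), so str.partition is used rather than str.split.
    """
    code, _, detail = reason.partition(":")
    return code, detail


def count_by_code(reasons):
    """Count rejection reasons by their leading code."""
    return Counter(split_rejection_reason(r)[0] for r in reasons)
```

Counting by code gives a quick summary of why samples were rejected, for example how many fell below the minimum token count versus how many failed to parse.
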
curatorkit.schema.ProvenanceRecord

Bases: BaseModel

Immutable record appended by each pipeline step.
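
The fields of ProvenanceRecord are not listed here, so the following only illustrates the pattern of an immutable record appended per step. It is a sketch using a frozen standard-library dataclass in place of the actual BaseModel subclass; ProvenanceSketch, its step and detail fields, and append_record are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceSketch:
    """Stand-in for an immutable provenance record (hypothetical fields)."""
    step: str          # hypothetical: name of the pipeline step
    detail: str = ""   # hypothetical: free-form note about what the step did


def append_record(history, record):
    """Return a new tuple with the record appended; the input is not mutated."""
    return history + (record,)
```

Keeping the history as a tuple of frozen records means an earlier step's entry cannot be altered by a later step; assigning to a field of a frozen dataclass instance raises dataclasses.FrozenInstanceError.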