Exporters¶

Exporters write accepted samples to disk. Multiple formats can be written in a single run. Specify them in export_formats:

CuratorConfig(
    export_formats = ["alpaca", "sharegpt", "dpo"],
    output_dir     = "output/",
)

Exporter reference¶

Format key	Output file	JSON structure
`alpaca`	`sft_alpaca.jsonl`	`{"instruction": "...", "input": "...", "output": "..."}`
`sharegpt`	`sft_sharegpt.jsonl`	`{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}`
`dpo`	`dpo.jsonl`	`{"prompt": "...", "chosen": "...", "rejected": "..."}`
`grpo`	`grpo.jsonl`	`{"prompt": "...", "responses": [...], "rewards": [...]}`
`ppo`	`ppo.jsonl`	`{"prompt": "..."}`
`corpus`	`corpus.jsonl`	Full source chunk with metadata (page, heading, chunk_index, ...)

All exporters overwrite files of the same name in output_dir on every run — they do not append. To keep the results of a previous run, point each run at a fresh output_dir (or move the files elsewhere before re-running).

Compatibility matrix¶

Not all exporters handle all task types. Incompatible samples are silently skipped (e.g. DPOExporter skips samples without chosen/rejected).

Task	alpaca	sharegpt	dpo	grpo	ppo	corpus
`qa`	✓	✓	—	—	—	—
`preference`	✓ (chosen)	—	✓	—	—	—
`grpo`	—	—	—	✓	✓	—
`multiturn`	✓ (1st turn)	✓	—	—	—	—
`evol`	✓	✓	—	—	—	—
`cot`	✓	✓	—	—	—	—
`adversarial_preference`	—	—	✓	—	—	—
`adversarial_qa`	✓	✓	—	—	—	—
Source chunks only	—	—	—	—	—	✓

It is safe to include dpo in export_formats for mixed-task pipelines — it simply won't write rows for non-preference samples.

Format details¶

Alpaca (`sft_alpaca.jsonl`)¶

Standard three-field SFT format compatible with most fine-tuning frameworks.

{"instruction": "What is backpropagation?", "input": "", "output": "Backpropagation is..."}

input is the source context (populated for corpus-grounded samples). Empty string when not applicable.

ShareGPT (`sft_sharegpt.jsonl`)¶

Conversation format. For multi-turn samples, the full turn list is encoded from metadata['turns']:

{
  "conversations": [
    {"from": "human", "value": "What does the study reveal about..."},
    {"from": "gpt",   "value": "The study reveals that..."},
    {"from": "human", "value": "Can you elaborate on the methodology?"},
    {"from": "gpt",   "value": "The methodology involved..."}
  ]
}

For non-multi-turn samples, produces a two-turn conversation (human + gpt).

DPO (`dpo.jsonl`)¶

{"prompt": "What is...", "chosen": "A thorough answer citing...", "rejected": "A vague answer that omits..."}

For conversational preference pairs, prompt and both responses are encoded as turn lists.

GRPO (`grpo.jsonl`)¶

{
  "prompt": "Explain the significance of...",
  "responses": ["Answer A...", "Answer B...", "Answer C..."],
  "rewards": [0.82, 0.61, 0.74]
}

rewards is populated only when score_responses=True (the default).

PPO (`ppo.jsonl`)¶

Prompt-only format for PPO rollout collection:

{"prompt": "Explain the significance of..."}

Corpus (`corpus.jsonl`)¶

Full metadata preservation for source chunks:

{
  "text": "The document establishes four categories...",
  "source_uri": "docs/handbook.pdf",
  "page": 3,
  "heading": "Section 2 — Categories",
  "chunk_index": 12,
  "content_type": "text",
  "task_type": "language_modeling"
}

Train / val / test splits¶

Set output_split to split accepted samples across separate subdirectories. Fractions must sum to 1.0.

CuratorConfig(
    export_formats = ["alpaca", "dpo"],
    output_split   = {"train": 0.8, "val": 0.1, "test": 0.1},
    output_dir     = "output/",
)

Output structure:

output/
  train/
    sft_alpaca.jsonl
    dpo.jsonl
  val/
    sft_alpaca.jsonl
    dpo.jsonl
  test/
    sft_alpaca.jsonl
    dpo.jsonl

Samples are shuffled before splitting. The last split receives any remainder from rounding.

Output directory layout¶

Every run writes these files regardless of which exporters are configured:

output/
  manifest.json             Pipeline config hash, stage counts, rejection breakdown
  rejected.jsonl            All rejected samples with structured reasons
  dataset_card.md           Human-readable run summary
  checksums.txt             SHA-256 for all output files
  diagnostic_summary.json   Failure mode counts, recovery stats (when probe active)
  [format files]            Only the formats you listed in export_formats

manifest.json and rejected.jsonl are always written and cannot be disabled. They are the primary audit trail for the run.

Exporters¶

Exporter reference¶

Compatibility matrix¶

Format details¶

Alpaca (sft_alpaca.jsonl)¶

ShareGPT (sft_sharegpt.jsonl)¶

DPO (dpo.jsonl)¶

GRPO (grpo.jsonl)¶

PPO (ppo.jsonl)¶

Corpus (corpus.jsonl)¶