Skip to content

Reading the output

Every run writes the same provenance set regardless of configuration, plus one file per requested export format:

output/
  sft_alpaca.jsonl          SFT data in Alpaca format
  sft_sharegpt.jsonl        SFT data in ShareGPT conversation format
  dpo.jsonl                 DPO preference pairs (only when preference task used)
  grpo.jsonl                GRPO rollouts (only when grpo task used)
  ppo.jsonl                 PPO prompts (only when ppo exporter included)
  corpus.jsonl              Raw source chunks (only when corpus exporter included)
  rejected.jsonl            Every rejected sample with a structured reason string
  manifest.json             Pipeline config hash, stage counts, rejection breakdown
  dataset_card.md           Human-readable run summary
  checksums.txt             SHA-256 for all output files
  diagnostic_summary.json   Failure mode counts, recovery rate (when probe active)

manifest.json

Per-stage sample counts: how many entered, passed, and were rejected at each stage:

{
  "pipeline_config_hash": "a3f7c2d1",
  "stage_counts": {
    "SchemaGate":          {"input_count": 1000, "output_count": 980,  "probe_recovered": 0, "rejected_count": 20},
    "QAGenerationTask":    {"input_count": 980,  "output_count": 2850, "rejected_count": 90},
    "HallucinationGate":   {"input_count": 2850, "output_count": 2210, "probe_recovered": 0, "rejected_count": 640}
  }
}

Gates also record probe_recovered, the number of rejected samples the diagnostic probe repaired and returned to the pipeline. Exporter stages record a single exported_count.

rejected.jsonl

Every line carries a structured rejection_reason and the step that rejected it. In this example the hallucination gate's grounding check (its "contract") scored the answer 0.43, below the configured threshold:

{"id": "...", "instruction": "...", "rejection_reason": "hallucination_contract_failed:0.43", "rejecting_step": "HallucinationGate"}

Use these to tune thresholds, not to debug code.

Inspecting results in code

result.print_summary()
# ────────────────────────────────────────────
#   passed   :      2,210
#   rejected :        640
#   time     :      87.3s
#   output   : output/qa
# ────────────────────────────────────────────

result.sample(n=3)   # print first 3 passed samples

len(result.passed)   # list of DataSample
len(result.rejected) # list of RejectedSample

# per-stage counts
result.stage_counts["HallucinationGate"]
# → {"input_count": 2850, "output_count": 2210, "probe_recovered": 0, "rejected_count": 640}

Next: the guides cover each pipeline stage in depth.