
Manifest & dataset card

The provenance artifacts written after every run: the manifest builder, the rejected-sample sidecar, checksums, and the dataset card generator.

curatorkit.manifest

ProvenanceManifest and DatasetCardGenerator.

manifest.json and dataset_card.md are ALWAYS emitted after a pipeline run. They cannot be disabled via config. Provenance is a design constraint, not an optional feature.

manifest.json top-level keys:

    pipeline_config_hash   SHA-256 of the full pipeline YAML config
    run_timestamp          ISO-8601 UTC
    source_files           list of {path, sha256}
    stage_counts           {step_name: {input_count, output_count, rejected_count}}
    rejected_breakdown     {rejection_reason: count}
    dedup_stats            extracted from MinHashDeduplicator provenance notes
    minhash_threshold      float | null
    wall_clock_seconds     float
    tool_versions          {curatorkit: "", python: "..."}
    diversity_stats        reserved; currently null
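To make the key listing above concrete, here is a minimal sketch that assembles a dict with the same top-level keys using only the standard library. It is not curatorkit's implementation: the helper names `sha256_text` and `build_example_manifest` are hypothetical, every value is illustrative, and hashing the raw YAML text (rather than a canonicalized form) is an assumption. The shape of `dedup_stats` is not documented here, so it is left as an empty dict.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def sha256_text(text: str) -> str:
    """Return the hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_example_manifest(config_yaml: str) -> dict:
    """Assemble a dict with the documented top-level keys.

    All values are made up for illustration; only the key names and
    value shapes follow the listing above.
    """
    return {
        # Assumption: the hash covers the raw YAML text as-is.
        "pipeline_config_hash": sha256_text(config_yaml),
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_files": [
            {"path": "data/raw.jsonl", "sha256": sha256_text("placeholder contents")},
        ],
        "stage_counts": {
            "length_filter": {"input_count": 100, "output_count": 90, "rejected_count": 10},
        },
        "rejected_breakdown": {"too_short": 10},
        "dedup_stats": {},  # shape not documented in this section
        "minhash_threshold": None,  # float | null
        "wall_clock_seconds": 1.25,
        # The curatorkit version is left empty, as in the key listing.
        "tool_versions": {"curatorkit": "", "python": platform.python_version()},
        "diversity_stats": None,  # reserved; currently null
    }


if __name__ == "__main__":
    print(json.dumps(build_example_manifest("steps: []\n"), indent=2))
```

Python's `None` serializes to JSON `null`, which matches the `float | null` and "currently null" entries.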

ProvenanceManifest

ProvenanceManifest(result: PipelineResult, pipeline_config_hash: str = 'unknown', output_dir: Path | None = None)

Build and write the pipeline manifest after a run completes.

write_rejected_sidecar

write_rejected_sidecar() -> Path

Write rejected.jsonl. The file is always written, even when no samples were rejected.
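The "always written, even when empty" behaviour can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `write_rejected_jsonl` is a hypothetical standalone function, and the record fields are invented. The point is that the file is opened for writing unconditionally, so an empty input still produces a zero-byte rejected.jsonl.

```python
import json
from pathlib import Path


def write_rejected_jsonl(rejected: list[dict], output_dir: Path) -> Path:
    """Write one JSON object per line; create the file even if `rejected` is empty."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "rejected.jsonl"
    # Opening in "w" mode creates (or truncates) the file before any record
    # is written, so the sidecar exists regardless of the record count.
    with path.open("w", encoding="utf-8") as fh:
        for record in rejected:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Downstream tooling can then rely on the sidecar existing and treat a zero-length file as "nothing rejected" instead of handling a missing-file case.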

write_checksums

write_checksums(output_files: list[Path]) -> Path

Write SHA-256 checksums for all output files.
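As a rough illustration of what a checksum step like this involves, the sketch below hashes each file in chunks and writes one line per file. The function names, the destination argument, and the `<digest>  <filename>` line layout (borrowed from the `sha256sum` convention) are all assumptions; the actual file name and format that `write_checksums` produces are not specified in this section.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(output_files: list[Path], dest: Path) -> Path:
    """Write '<hex digest>  <file name>' lines (assumed layout) to `dest`."""
    lines = [f"{sha256_file(p)}  {p.name}" for p in output_files]
    dest.write_text("\n".join(lines) + ("\n" if lines else ""), encoding="utf-8")
    return dest
```

Reading in chunks keeps memory use flat regardless of output size, which matters when the curated dataset is much larger than RAM.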

DatasetCardGenerator

Generate a human-readable Markdown dataset card from a manifest.
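A card generator of this kind is essentially a template over the manifest. The sketch below shows the idea with a hypothetical `render_dataset_card` function that reads three of the documented manifest keys; the headings and table layout are invented for illustration and may differ from what DatasetCardGenerator actually emits.

```python
def render_dataset_card(manifest: dict) -> str:
    """Render a small Markdown summary from a manifest-shaped dict."""
    lines = [
        "# Dataset card",
        "",
        f"- Run timestamp: {manifest['run_timestamp']}",
        f"- Pipeline config hash: {manifest['pipeline_config_hash']}",
        "",
        "## Stage counts",
        "",
        "| Stage | Input | Output | Rejected |",
        "| --- | --- | --- | --- |",
    ]
    # One table row per pipeline step, in the order recorded in the manifest.
    for name, counts in manifest["stage_counts"].items():
        lines.append(
            f"| {name} | {counts['input_count']} | "
            f"{counts['output_count']} | {counts['rejected_count']} |"
        )
    return "\n".join(lines) + "\n"
```

Because the card is derived only from the manifest, it can be regenerated later from manifest.json without rerunning the pipeline.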