Manifest & dataset card
The provenance artifacts written after every run: the manifest builder, the rejected-sample sidecar, checksums, and the dataset card generator.
curatorkit.manifest
ProvenanceManifest and DatasetCardGenerator.
manifest.json and dataset_card.md are ALWAYS emitted after a pipeline run. They cannot be disabled via config. Provenance is a design constraint, not an optional feature.
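Because the manifest is always present, a downstream consumer can rely on it to check that the inputs have not changed since the run. The following is a minimal sketch of such a check, not part of curatorkit itself: the helper names `sha256_of_file` and `verify_source_files` are hypothetical, and it assumes that each `source_files` entry stores a hex-encoded SHA-256 digest under `sha256` and a path that is resolvable from the current working directory.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_source_files(manifest_path: Path) -> list[str]:
    """Return the paths that are missing or whose checksum no longer matches.

    Assumes manifest["source_files"] is a list of {"path", "sha256"} objects
    with hex-encoded digests (an assumption, not a documented guarantee).
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    mismatched = []
    for entry in manifest["source_files"]:
        source = Path(entry["path"])
        if not source.is_file() or sha256_of_file(source) != entry["sha256"]:
            mismatched.append(entry["path"])
    return mismatched
```

An empty return value means every listed source file still matches its recorded checksum; any returned path points at a file that was modified, moved, or deleted after the run.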
manifest.json top-level keys:
  pipeline_config_hash   SHA-256 of the full pipeline YAML config
  run_timestamp          ISO-8601 UTC
  source_files           list of {path, sha256}
  stage_counts           {step_name: {input_count, output_count, rejected_count}}
  rejected_breakdown     {rejection_reason: count}
  dedup_stats            extracted from MinHashDeduplicator provenance notes
  minhash_threshold      float | null
  wall_clock_seconds     float
  tool_versions          {curatorkit: "