
Format detection

Automatic dataset-format detection used by the readers when format="auto".

curatorkit.detection.detector

FormatDetector — three-layer dataset format detection engine.

  • Layer 1: Column-set candidate generation using semantic equivalence classes.
  • Layer 2: Value-type validation per candidate.
  • Layer 3: Role alias normalization (applied during sample construction, not here).

Design
  • Layer 1 generates ALL matching candidates, ranked by confidence.
  • Layer 2 validates each candidate in rank order; first passing candidate wins.
  • Unknown format after all candidates fail → emit RejectedSample upstream.
  • Confidence tiers drive how the pipeline reports and handles ambiguous inputs.
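The Layer 1 and Layer 2 flow described in the list above can be sketched as follows. This is a minimal illustration, not the curatorkit implementation: the alias sets, the format specs, the float confidences (the real engine uses DetectionConfidence tiers), and the names SLOT_ALIASES, FORMAT_SPECS, Candidate, generate_candidates, validate and detect_sketch are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical equivalence classes: column names treated as the same semantic slot.
SLOT_ALIASES = {
    "instruction": {"instruction", "prompt", "question"},
    "output": {"output", "response", "answer"},
    "chosen": {"chosen"},
    "rejected": {"rejected"},
    "text": {"text"},
}

# Hypothetical format specs: (format name, required slots, base confidence).
FORMAT_SPECS = [
    ("preference", ("instruction", "chosen", "rejected"), 0.9),
    ("instruction", ("instruction", "output"), 0.8),
    ("plain_text", ("text",), 0.5),
]


@dataclass
class Candidate:
    name: str
    confidence: float
    column_map: dict[str, str]


def resolve_slot(slot: str, keys: set[str]) -> Optional[str]:
    """Return the actual column filling a slot, or None if no alias is present."""
    matches = sorted(SLOT_ALIASES[slot] & keys)
    return matches[0] if matches else None


def generate_candidates(keys: set[str]) -> list[Candidate]:
    """Layer 1: collect every format whose required slots are all present."""
    candidates = []
    for name, slots, confidence in FORMAT_SPECS:
        column_map = {slot: resolve_slot(slot, keys) for slot in slots}
        if all(col is not None for col in column_map.values()):
            candidates.append(Candidate(name, confidence, column_map))
    # Highest confidence first, so validation runs in rank order.
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)


def validate(candidate: Candidate, rows: list[dict[str, Any]]) -> bool:
    """Layer 2 (simplified): every mapped column must hold a string in each row."""
    return all(
        isinstance(row.get(col), str)
        for row in rows
        for col in candidate.column_map.values()
    )


def detect_sketch(keys: set[str], rows: list[dict[str, Any]]) -> Optional[Candidate]:
    """Return the first candidate that passes validation, or None if all fail."""
    for candidate in generate_candidates(keys):
        if not rows or validate(candidate, rows):  # empty rows skip Layer 2
            return candidate
    return None  # unknown format; the caller would emit a rejection upstream
```

In this sketch a dataset with columns prompt, response, chosen and rejected yields two candidates. If the chosen column holds a list rather than a string, the higher-ranked preference candidate fails validation and the instruction candidate wins.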

DetectionResult dataclass

DetectionResult(format: DataFormat, confidence: DetectionConfidence, column_map: dict[str, str | None] = dict(), warnings: list[str] = list(), unrecognized_cols: list[str] = list())

The resolved detection output.

column_map maps semantic slot names to the actual column names found in the dataset. A None value means the slot is absent.

Semantic slots

instruction, output, context, conversation, chosen, rejected, label, system, responses, rewards, text
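To make the column_map convention concrete, here is a small stand-in for the result object. DetectionResultSketch is a hypothetical mirror of the documented fields that uses plain strings in place of DataFormat and DetectionConfidence; present_slots is an illustrative helper, not part of curatorkit.

```python
from dataclasses import dataclass, field
from typing import Optional

# The semantic slot names listed above.
SEMANTIC_SLOTS = (
    "instruction", "output", "context", "conversation", "chosen", "rejected",
    "label", "system", "responses", "rewards", "text",
)


@dataclass
class DetectionResultSketch:
    """Stand-in for DetectionResult; strings replace the real enum types."""
    format: str
    confidence: str
    column_map: dict[str, Optional[str]] = field(default_factory=dict)
    warnings: list[str] = field(default_factory=list)
    unrecognized_cols: list[str] = field(default_factory=list)


def present_slots(result: DetectionResultSketch) -> dict[str, str]:
    """Keep only the slots that map to an actual column (drop None values)."""
    return {slot: col for slot, col in result.column_map.items() if col is not None}


result = DetectionResultSketch(
    format="instruction",
    confidence="high",
    # "context" is a known slot, but this dataset has no column for it.
    column_map={"instruction": "prompt", "output": "response", "context": None},
    unrecognized_cols=["id"],
)
print(present_slots(result))
```

Filtering out the None entries before reading rows avoids looking up a column that the dataset does not have.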

FormatDetector

Three-layer format detection engine.

Usage

detector = FormatDetector()
result = detector.detect(keys, sample_rows)

keys is the set of top-level column names from the dataset. sample_rows is a small list of raw dicts (the first N rows) used for Layer 2 value-type validation. Pass an empty list to skip Layer 2; the Layer 1 candidates are then returned as-is with their base confidence.
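One way to prepare the two arguments from a JSON Lines source is sketched below. The helper collect_keys_and_sample is hypothetical, and taking the union of keys over the sampled rows is an assumption about what "top-level column names" should cover when rows differ.

```python
import json
from itertools import islice
from typing import Any, Iterable


def collect_keys_and_sample(
    lines: Iterable[str], n: int = 5
) -> tuple[set[str], list[dict[str, Any]]]:
    """Parse the first n non-blank JSONL lines and gather their top-level keys."""
    non_blank = (line for line in lines if line.strip())
    sample_rows = [json.loads(line) for line in islice(non_blank, n)]
    keys: set[str] = set()
    for row in sample_rows:
        keys.update(row)  # adds the row's top-level column names
    return keys, sample_rows


lines = [
    '{"prompt": "a", "response": "b"}',
    "",
    '{"prompt": "c", "response": "d", "id": 1}',
]
keys, sample_rows = collect_keys_and_sample(lines)

# With curatorkit installed, these would then be passed on:
# result = FormatDetector().detect(keys, sample_rows)
```

Keeping n small keeps detection cheap, since only the sampled rows take part in value-type validation.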

detect_with_consistency_check

detect_with_consistency_check(all_rows_keys: list[set[str]], sample_rows: list[dict[str, Any]]) -> tuple[DetectionResult, bool]

Run detection and also check whether the column set is consistent across the sampled rows. Returns (result, is_consistent).

is_consistent=False means the file may have mixed schemas — the pipeline should warn but not hard-fail.
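The consistency flag can be pictured as a comparison of per-row key sets. The following is only a sketch under that assumption: columns_are_consistent and warn_if_mixed are hypothetical names, and the real method may compare rows differently.

```python
import warnings


def columns_are_consistent(all_rows_keys: list[set[str]]) -> bool:
    """Return True when every sampled row exposes the same set of column names."""
    if not all_rows_keys:
        return True  # nothing sampled, so nothing can disagree
    first = all_rows_keys[0]
    return all(keys == first for keys in all_rows_keys[1:])


def warn_if_mixed(all_rows_keys: list[set[str]]) -> bool:
    """Warn, rather than raise, when the sampled rows disagree on columns."""
    consistent = columns_are_consistent(all_rows_keys)
    if not consistent:
        warnings.warn(
            "column set differs across sampled rows; file may have mixed schemas",
            stacklevel=2,
        )
    return consistent
```

Emitting a warning instead of raising matches the guidance above: a mixed-schema file is reported but does not stop the pipeline.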