Format detection

Automatic dataset-format detection used by the readers when format="auto".

curatorkit.detection.detector

FormatDetector — three-layer dataset format detection engine.

Layer 1: Column-set candidate generation using semantic equivalence classes.
Layer 2: Value-type validation per candidate.
Layer 3: Role alias normalization (applied during sample construction, not here).

Design

Layer 1 generates ALL matching candidates, ranked by confidence.
Layer 2 validates each candidate in rank order; first passing candidate wins.
Unknown format after all candidates fail → emit RejectedSample upstream.
Confidence tiers drive how the pipeline reports and handles ambiguous inputs.
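The rank-then-validate flow described above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: the `Candidate` class, the `CANDIDATES` table, the format labels, and the numeric confidences are all hypothetical, and plain required-column sets stand in for the semantic equivalence classes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Row = dict[str, Any]


@dataclass
class Candidate:
    name: str                        # hypothetical format label
    required: frozenset[str]         # columns that must be present (layer 1)
    confidence: float                # base confidence used for ranking
    validate: Callable[[Row], bool]  # value-type check for one row (layer 2)


def is_str(row: Row, *cols: str) -> bool:
    """True when every named column holds a string value."""
    return all(isinstance(row.get(c), str) for c in cols)


# Hypothetical candidate table; the real detector's rules are not shown here.
CANDIDATES = [
    Candidate("preference", frozenset({"chosen", "rejected"}), 0.9,
              lambda r: is_str(r, "chosen", "rejected")),
    Candidate("instruction", frozenset({"instruction", "output"}), 0.8,
              lambda r: is_str(r, "instruction", "output")),
    Candidate("plain_text", frozenset({"text"}), 0.5,
              lambda r: is_str(r, "text")),
]


def detect(keys: set[str], sample_rows: list[Row]) -> Optional[str]:
    # Layer 1: collect ALL candidates whose columns are present, best first.
    ranked = sorted((c for c in CANDIDATES if c.required <= keys),
                    key=lambda c: c.confidence, reverse=True)
    # Layer 2: validate in rank order; the first passing candidate wins.
    for cand in ranked:
        if all(cand.validate(row) for row in sample_rows):
            return cand.name
    # Every candidate failed: unknown format, left for the caller to reject.
    return None
```

In this sketch an empty `sample_rows` list makes the layer 2 check pass trivially, so the top-ranked layer 1 candidate is returned unchanged.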

DetectionResult `dataclass`

DetectionResult(format: DataFormat, confidence: DetectionConfidence, column_map: dict[str, str | None] = dict(), warnings: list[str] = list(), unrecognized_cols: list[str] = list())

The resolved detection output.

column_map maps semantic slot names to the actual column names found in the dataset. A None value means the slot is absent.

Semantic slots

instruction, output, context, conversation, chosen, rejected, label, system, responses, rewards, text
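As an illustration of the slot-to-column mapping, the snippet below builds a `column_map` by hand for a dataset whose columns are named `prompt` and `response`. The column names, the sample row, and the `read_slot` helper are hypothetical, and whether the real detector lists every slot or only the detected ones is an assumption here.

```python
from typing import Any, Optional

SLOTS = ("instruction", "output", "context", "conversation", "chosen",
         "rejected", "label", "system", "responses", "rewards", "text")

# Start with every slot absent (None), then fill in the ones that were found.
column_map: dict[str, Optional[str]] = {slot: None for slot in SLOTS}
column_map["instruction"] = "prompt"    # hypothetical dataset column
column_map["output"] = "response"       # hypothetical dataset column


def read_slot(row: dict[str, Any], column_map: dict[str, Optional[str]],
              slot: str) -> Any:
    """Return the row value for a semantic slot, or None if the slot is absent."""
    column = column_map.get(slot)
    return row[column] if column is not None else None


row = {"prompt": "Translate 'chat' to English.", "response": "cat"}
instruction = read_slot(row, column_map, "instruction")  # value of row["prompt"]
context = read_slot(row, column_map, "context")          # slot absent, so None
```

Downstream code can then address values by slot name without knowing what the source dataset called its columns.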

FormatDetector

Three-layer format detection engine.

Usage

detector = FormatDetector()
result = detector.detect(keys, sample_rows)

keys is the set of top-level column names from the dataset. sample_rows is a small list of raw dicts (the first N rows) used for Layer 2 value-type validation. Pass an empty list to skip Layer 2; the Layer 1 candidates are then returned as-is with their base confidence.
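One way to prepare the two arguments from a JSON Lines source is sketched below. The `prepare_detection_inputs` helper is hypothetical and not part of the library, and taking `keys` as the union of the sampled rows' columns (rather than, say, the first row only) is an assumption.

```python
import json
from itertools import islice
from typing import Any, Iterable


def prepare_detection_inputs(lines: Iterable[str],
                             n: int = 5) -> tuple[set[str], list[dict[str, Any]]]:
    """Parse the first n non-blank JSON lines and return (keys, sample_rows)."""
    non_blank = (line for line in lines if line.strip())
    sample_rows = [json.loads(line) for line in islice(non_blank, n)]
    # Union of the top-level column names seen in the sampled rows.
    keys: set[str] = set()
    for row in sample_rows:
        keys.update(row)
    return keys, sample_rows
```

The returned pair can be passed straight to `detector.detect(keys, sample_rows)`; passing `[]` in place of `sample_rows` would skip value-type validation as described above.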

detect_with_consistency_check

detect_with_consistency_check(all_rows_keys: list[set[str]], sample_rows: list[dict[str, Any]]) -> tuple[DetectionResult, bool]

Run detection and also check whether the column set is consistent across the sampled rows. Returns (result, is_consistent).

is_consistent=False means the file may have mixed schemas — the pipeline should warn but not hard-fail.
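A minimal sketch of what the consistency flag could mean, assuming "consistent" is taken as "every sampled row has exactly the same column set". The real check may be more lenient (for example about optional columns); both helper functions below are hypothetical.

```python
import warnings


def columns_are_consistent(all_rows_keys: list[set[str]]) -> bool:
    """True when every sampled row exposes the same set of column names."""
    if not all_rows_keys:
        return True  # nothing sampled, so nothing can disagree
    first = all_rows_keys[0]
    return all(keys == first for keys in all_rows_keys[1:])


def warn_if_mixed(is_consistent: bool, source: str) -> None:
    """Warn, without raising, when the sampled rows disagree on their columns."""
    if not is_consistent:
        warnings.warn(f"{source}: sampled rows have differing column sets; "
                      "the file may contain mixed schemas")
```

Emitting a warning rather than raising matches the guidance above that a mixed-schema file should not hard-fail the pipeline.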
