Format detection
Automatic dataset-format detection used by the readers when format="auto".
curatorkit.detection.detector
FormatDetector — three-layer dataset format detection engine.
Layer 1: Column-set candidate generation using semantic equivalence classes.
Layer 2: Value-type validation per candidate.
Layer 3: Role alias normalization (applied during sample construction, not here).
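To illustrate what a semantic equivalence class could look like in Layer 1, here is a minimal sketch. The alias sets, the SLOT_ALIASES table, and the resolve_slots helper are all hypothetical and only cover three of the slots; the real classes used by curatorkit are not shown on this page.

```python
# Hypothetical equivalence classes: each semantic slot accepts several column names.
SLOT_ALIASES = {
    "instruction": {"instruction", "prompt", "question"},
    "output": {"output", "response", "answer", "completion"},
    "context": {"context", "input"},
}


def resolve_slots(keys):
    """Map each semantic slot to a matching column name, or None if absent."""
    column_map = {}
    for slot, aliases in SLOT_ALIASES.items():
        hits = sorted(aliases & keys)  # alphabetical tie-break, an arbitrary choice
        column_map[slot] = hits[0] if hits else None
    known = set().union(*SLOT_ALIASES.values())
    unrecognized = sorted(k for k in keys if k not in known)
    return column_map, unrecognized


column_map, unrecognized = resolve_slots({"prompt", "answer", "id"})
```

Columns that belong to no class end up in the unrecognized list, mirroring the unrecognized_cols field of DetectionResult.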
Design
- Layer 1 generates ALL matching candidates, ranked by confidence.
- Layer 2 validates each candidate in rank order; first passing candidate wins.
- If every candidate fails, the format is unknown → emit RejectedSample upstream.
- Confidence tiers drive how the pipeline reports and handles ambiguous inputs.
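The rank-then-validate flow described in the design notes can be sketched as follows. This is not the library's implementation: the Candidate class, the three example formats, their confidence numbers, and the string-only validators are assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Candidate:
    format_name: str                            # hypothetical format label
    required: frozenset[str]                    # columns that must all be present
    confidence: int                             # higher value ranks first
    validate: Callable[[dict[str, Any]], bool]  # layer 2 check for one row


def all_str(row: dict[str, Any], *cols: str) -> bool:
    return all(isinstance(row.get(c), str) for c in cols)


CANDIDATES = [
    Candidate("preference", frozenset({"chosen", "rejected"}), 3,
              lambda r: all_str(r, "chosen", "rejected")),
    Candidate("instruction", frozenset({"instruction", "output"}), 2,
              lambda r: all_str(r, "instruction", "output")),
    Candidate("plain_text", frozenset({"text"}), 1,
              lambda r: all_str(r, "text")),
]


def detect(keys: set[str], sample_rows: list[dict[str, Any]]) -> str | None:
    # Layer 1: keep every candidate whose required columns exist, best first.
    matches = sorted((c for c in CANDIDATES if c.required <= keys),
                     key=lambda c: c.confidence, reverse=True)
    # Layer 2: the first candidate that validates on every sample row wins.
    for cand in matches:
        if all(cand.validate(row) for row in sample_rows):
            return cand.format_name
    return None  # unknown format; the caller decides how to reject the sample
```

With an empty sample_rows list the validation loop passes trivially, so the top-ranked layer 1 candidate is returned.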
DetectionResult (dataclass)
DetectionResult(format: DataFormat, confidence: DetectionConfidence, column_map: dict[str, str | None] = dict(), warnings: list[str] = list(), unrecognized_cols: list[str] = list())
The resolved detection output.
column_map maps semantic slot names to the actual column names found in the dataset. A None value means the slot is absent.
Semantic slots
instruction, output, context, conversation, chosen, rejected, label, system, responses, rewards, text
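As a sketch of how a caller might read column_map, the block below rebuilds a stand-in DetectionResult from the signature above. The enum members (INSTRUCTION, HIGH) and the example column names are assumptions; the real DataFormat and DetectionConfidence values are not listed on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class DataFormat(Enum):           # stand-in; real members are not documented here
    INSTRUCTION = "instruction"


class DetectionConfidence(Enum):  # stand-in; real tiers are not documented here
    HIGH = "high"


@dataclass
class DetectionResult:
    format: DataFormat
    confidence: DetectionConfidence
    column_map: dict[str, str | None] = field(default_factory=dict)
    warnings: list[str] = field(default_factory=list)
    unrecognized_cols: list[str] = field(default_factory=list)


result = DetectionResult(
    format=DataFormat.INSTRUCTION,
    confidence=DetectionConfidence.HIGH,
    column_map={"instruction": "prompt", "output": "completion", "context": None},
    unrecognized_cols=["id"],
)

# Slots that resolved to a real column versus slots marked absent with None.
present = {slot: col for slot, col in result.column_map.items() if col is not None}
absent = [slot for slot, col in result.column_map.items() if col is None]
```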
FormatDetector
Three-layer format detection engine.
Usage
detector = FormatDetector()
result = detector.detect(keys, sample_rows)
keys is the set of top-level column names from the dataset.
sample_rows is a small list of raw dicts (first N rows) used for
layer 2 value-type validation. Pass an empty list to skip layer 2
(layer 1 candidates are returned as-is with their base confidence).
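One way to prepare the two arguments from an iterable of raw rows is sketched below. The prepare_detector_inputs helper is hypothetical, and taking the union of keys over the sampled rows is an assumption about what "top-level column names" means for row-oriented data.

```python
from itertools import islice
from typing import Any, Iterable


def prepare_detector_inputs(rows: Iterable[dict[str, Any]], n: int = 5):
    """Return (keys, sample_rows) built from the first n rows."""
    sample_rows = list(islice(rows, n))
    keys: set = set()
    for row in sample_rows:
        keys.update(row.keys())  # union of column names seen in the sample
    return keys, sample_rows


rows = [
    {"prompt": "2+2?", "completion": "4"},
    {"prompt": "Capital of France?", "completion": "Paris"},
    {"prompt": "Sky colour?", "completion": "Blue"},
]
keys, sample_rows = prepare_detector_inputs(rows, n=2)
# The pair would then be passed on as: detector.detect(keys, sample_rows)
```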
detect_with_consistency_check
detect_with_consistency_check(all_rows_keys: list[set[str]], sample_rows: list[dict[str, Any]]) -> tuple[DetectionResult, bool]
Run detection and also check whether the column set is consistent across the sampled rows. Returns (result, is_consistent).
is_consistent=False means the file may have mixed schemas — the pipeline should warn but not hard-fail.
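The exact comparison behind is_consistent is not documented here. One plausible reading, sketched below, is that every sampled row must expose the same set of column names. Both helper functions are hypothetical, and the warning text is illustrative only.

```python
import warnings


def columns_are_consistent(all_rows_keys: list[set[str]]) -> bool:
    """True when every sampled row has the same set of column names."""
    if not all_rows_keys:
        return True  # nothing sampled, so nothing contradicts
    reference = all_rows_keys[0]
    return all(keys == reference for keys in all_rows_keys[1:])


def report_consistency(all_rows_keys: list[set[str]]) -> bool:
    """Warn, but do not raise, when the sampled rows disagree on columns."""
    is_consistent = columns_are_consistent(all_rows_keys)
    if not is_consistent:
        warnings.warn("sampled rows have differing column sets; "
                      "the file may mix schemas")
    return is_consistent
```

Emitting a warning instead of raising keeps the pipeline running on mixed-schema files, which matches the "warn but not hard-fail" guidance above.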