
Interfaces

The four abstract base classes for pipeline stages; each stage implements one of them. Subclass these to add components; the contributing guide covers the process.

curatorkit.interfaces

Abstract base classes for all pipeline components.

BaseReader.read() returns the same tuple contract as BaseGate.run(): (passed, rejected). Reader-level parse failures are therefore recorded in rejected.jsonl like any other rejection, rather than disappearing.

Every pipeline step implements one of these four interfaces.

BaseReader

Bases: ABC

Reads raw data from a source and produces DataSample objects.

MUST return both lists. Parse failures that produce no DataSample must be returned as RejectedSample objects — never silently dropped. Every line/row that enters the reader must leave as either a DataSample or a RejectedSample.
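As a minimal sketch of this contract, the example below implements a reader over in-memory JSONL lines. The DataSample and RejectedSample dataclasses, their fields, and the JsonlLinesReader class are stand-ins invented for illustration; the real curatorkit types and the exact read() signature may differ.

```python
from __future__ import annotations

import json
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Stand-ins for illustration only; the real curatorkit classes may differ.
@dataclass
class DataSample:
    data: dict


@dataclass
class RejectedSample:
    raw: str
    reason: str


class BaseReader(ABC):
    @abstractmethod
    def read(self) -> tuple[list[DataSample], list[RejectedSample]]:
        ...


class JsonlLinesReader(BaseReader):
    """Hypothetical reader over an in-memory list of JSONL lines."""

    def __init__(self, lines: list[str]) -> None:
        self.lines = lines

    def read(self) -> tuple[list[DataSample], list[RejectedSample]]:
        passed: list[DataSample] = []
        rejected: list[RejectedSample] = []
        for line in self.lines:
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                # A parse failure becomes a RejectedSample, never a silent drop.
                rejected.append(RejectedSample(raw=line, reason=f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(obj, dict):
                rejected.append(RejectedSample(raw=line, reason="not a JSON object"))
                continue
            passed.append(DataSample(data=obj))
        return passed, rejected


lines = ['{"text": "hello"}', "{broken", "[1, 2]"]
passed, rejected = JsonlLinesReader(lines).read()
```

Every input line ends up in exactly one of the two lists, so len(passed) + len(rejected) always equals the number of lines read.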

BaseGate

Bases: ABC

Validates samples against a contract.

MUST return both lists. Silent drops are bugs. Every sample that does not pass goes into the rejected list as a RejectedSample with a reason string.
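The sketch below shows one way a gate could honour this rule. MinLengthGate, the min_chars parameter, the run(samples) argument list, and the fields on the stand-in dataclasses are all assumptions made for the example, not the library's actual API.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


# Stand-ins for illustration only; the real curatorkit classes may differ.
@dataclass
class DataSample:
    text: str


@dataclass
class RejectedSample:
    sample: DataSample
    reason: str


class BaseGate(ABC):
    @abstractmethod
    def run(self, samples: list[DataSample]) -> tuple[list[DataSample], list[RejectedSample]]:
        ...


class MinLengthGate(BaseGate):
    """Hypothetical gate: text must have at least `min_chars` characters."""

    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def run(self, samples: list[DataSample]) -> tuple[list[DataSample], list[RejectedSample]]:
        passed: list[DataSample] = []
        rejected: list[RejectedSample] = []
        for sample in samples:
            if len(sample.text) >= self.min_chars:
                passed.append(sample)
            else:
                # Every failing sample is kept, together with a reason string.
                reason = f"text shorter than {self.min_chars} characters"
                rejected.append(RejectedSample(sample=sample, reason=reason))
        return passed, rejected


samples = [DataSample("a long enough sentence"), DataSample("hi")]
passed, rejected = MinLengthGate(min_chars=5).run(samples)
```

Because the failing branch always appends to rejected, the two output lists together account for every input sample.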

BaseNormalizer

Bases: ABC

Transforms samples in place (dedup, cleaning, sampling).

Returns the transformed list. Samples removed by a normalizer (e.g. dedup) are NOT added to the rejected list — removal is intentional, not a contract violation. The normalizer must record removal counts in a ProvenanceRecord appended to surviving samples.
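A minimal sketch of an exact-match dedup normalizer follows. The method name normalize(), the ProvenanceRecord fields (step, removed_count), and the provenance list on DataSample are hypothetical, chosen only to make the example runnable; the real classes may look different.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Stand-ins for illustration only; the real curatorkit classes may differ.
@dataclass
class ProvenanceRecord:
    step: str
    removed_count: int


@dataclass
class DataSample:
    text: str
    provenance: list[ProvenanceRecord] = field(default_factory=list)


class BaseNormalizer(ABC):
    @abstractmethod
    def normalize(self, samples: list[DataSample]) -> list[DataSample]:
        ...


class ExactDedupNormalizer(BaseNormalizer):
    """Hypothetical normalizer: keeps the first sample for each distinct text."""

    def normalize(self, samples: list[DataSample]) -> list[DataSample]:
        seen: set[str] = set()
        survivors: list[DataSample] = []
        for sample in samples:
            if sample.text in seen:
                continue  # intentional removal; not added to any rejected list
            seen.add(sample.text)
            survivors.append(sample)
        removed = len(samples) - len(survivors)
        # Record the removal count on every surviving sample.
        for sample in survivors:
            sample.provenance.append(ProvenanceRecord(step="exact_dedup", removed_count=removed))
        return survivors


samples = [DataSample("a"), DataSample("b"), DataSample("a")]
survivors = ExactDedupNormalizer().normalize(samples)
```

The removed duplicate leaves no RejectedSample behind; the only trace is the removed_count carried by the survivors.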

BaseExporter

Bases: ABC

Serialises DataSample objects to a training-ready format on disk.
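As an illustration, the sketch below writes samples as JSON Lines, one common training-ready format. The export(samples, path) method name and signature, the JsonlExporter class, and the single text field on DataSample are assumptions for the example; the source does not specify the exporter's method.

```python
from __future__ import annotations

import json
import tempfile
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from pathlib import Path


# Stand-in for illustration only; the real curatorkit class may differ.
@dataclass
class DataSample:
    text: str


class BaseExporter(ABC):
    @abstractmethod
    def export(self, samples: list[DataSample], path: Path) -> None:
        ...


class JsonlExporter(BaseExporter):
    """Hypothetical exporter: one JSON object per line."""

    def export(self, samples: list[DataSample], path: Path) -> None:
        with path.open("w", encoding="utf-8") as fh:
            for sample in samples:
                fh.write(json.dumps(asdict(sample), ensure_ascii=False) + "\n")


# Write two samples to a temporary directory and read them back.
out_path = Path(tempfile.mkdtemp()) / "train.jsonl"
JsonlExporter().export([DataSample("hello"), DataSample("world")], out_path)
written = out_path.read_text(encoding="utf-8").splitlines()
```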