Interfaces
The four abstract base classes that pipeline stages implement. Subclass one of them to add a component; the contributing guide covers the process.
curatorkit.interfaces
Abstract base classes for all pipeline components.
BaseReader.read() returns the same tuple contract as BaseGate.run(): (passed, rejected). Reader-level parse failures are therefore first-class records that flow into rejected.jsonl instead of disappearing.
Every pipeline step implements one of these four interfaces.
BaseReader
Bases: ABC
Reads raw data from a source and produces DataSample objects.
MUST return both lists. Parse failures that produce no DataSample must be returned as RejectedSample objects — never silently dropped. Every line/row that enters the reader must leave as either a DataSample or a RejectedSample.
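The reader contract above can be sketched as follows. This is a minimal illustration, not the library's implementation: the DataSample and RejectedSample classes here are simplified stand-ins with assumed fields, the BaseReader stub assumes read() takes no arguments, and JsonLinesReader is a hypothetical subclass. The point is that every input line leaves as exactly one DataSample or one RejectedSample.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Hypothetical stand-ins for the real curatorkit types; field names are assumed.
@dataclass
class DataSample:
    data: dict


@dataclass
class RejectedSample:
    raw: str
    reason: str


class BaseReader(ABC):
    """Stand-in for the interface; the real signature may differ."""

    @abstractmethod
    def read(self) -> tuple[list[DataSample], list[RejectedSample]]:
        ...


class JsonLinesReader(BaseReader):
    """Illustrative reader over in-memory lines of JSONL text."""

    def __init__(self, lines: list[str]) -> None:
        self.lines = lines

    def read(self) -> tuple[list[DataSample], list[RejectedSample]]:
        passed: list[DataSample] = []
        rejected: list[RejectedSample] = []
        for line in self.lines:
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                # A parse failure becomes a RejectedSample, never a silent drop.
                rejected.append(RejectedSample(raw=line, reason=f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(obj, dict):
                rejected.append(RejectedSample(raw=line, reason="not a JSON object"))
                continue
            passed.append(DataSample(data=obj))
        return passed, rejected


lines = ['{"text": "hello"}', "not json", "[1, 2]"]
passed, rejected = JsonLinesReader(lines).read()
# Every line that entered the reader is accounted for in one of the two lists.
assert len(passed) + len(rejected) == len(lines)
```

The final assertion is the accounting invariant the contract describes: input count equals passed plus rejected.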
BaseGate
Bases: ABC
Validates samples against a contract.
MUST return both lists. Silent drops are bugs. Every sample that does not pass goes into the rejected list as a RejectedSample with a reason string.
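A gate that follows this contract might look like the sketch below. The types are simplified stand-ins with assumed fields, the run() signature is an assumption based on the (passed, rejected) tuple described earlier, and MinLengthGate is a hypothetical example rather than a gate shipped with the library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Hypothetical stand-ins for the real curatorkit types; field names are assumed.
@dataclass
class DataSample:
    text: str


@dataclass
class RejectedSample:
    sample: DataSample
    reason: str


class BaseGate(ABC):
    """Stand-in for the interface; the real signature may differ."""

    @abstractmethod
    def run(self, samples: list[DataSample]) -> tuple[list[DataSample], list[RejectedSample]]:
        ...


class MinLengthGate(BaseGate):
    """Illustrative gate: passes samples whose text has at least min_chars characters."""

    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def run(self, samples: list[DataSample]) -> tuple[list[DataSample], list[RejectedSample]]:
        passed: list[DataSample] = []
        rejected: list[RejectedSample] = []
        for sample in samples:
            if len(sample.text) >= self.min_chars:
                passed.append(sample)
            else:
                # Failing samples carry a human-readable reason string.
                reason = f"text shorter than {self.min_chars} characters"
                rejected.append(RejectedSample(sample=sample, reason=reason))
        return passed, rejected


samples = [DataSample("a long enough sentence"), DataSample("hi")]
passed, rejected = MinLengthGate(min_chars=5).run(samples)
# No silent drops: the two output lists together cover the whole input.
assert len(passed) + len(rejected) == len(samples)
```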
BaseNormalizer
Bases: ABC
Transforms samples in place (dedup, cleaning, sampling).
Returns the transformed list. Samples removed by a normalizer (e.g. dedup) are NOT added to the rejected list — removal is intentional, not a contract violation. The normalizer must record removal counts in a ProvenanceRecord appended to surviving samples.
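The sketch below illustrates the normalizer behaviour described above using exact-match deduplication. Several names are assumptions made for the example: the run() method name, the provenance list on DataSample, and the step and removed_count fields on ProvenanceRecord. The real classes may be shaped differently.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Hypothetical stand-ins for the real curatorkit types; field names are assumed.
@dataclass
class ProvenanceRecord:
    step: str
    removed_count: int


@dataclass
class DataSample:
    text: str
    provenance: list[ProvenanceRecord] = field(default_factory=list)


class BaseNormalizer(ABC):
    """Stand-in for the interface; the real method name and signature may differ."""

    @abstractmethod
    def run(self, samples: list[DataSample]) -> list[DataSample]:
        ...


class ExactDedupNormalizer(BaseNormalizer):
    """Illustrative normalizer: keeps the first occurrence of each distinct text."""

    def run(self, samples: list[DataSample]) -> list[DataSample]:
        seen: set[str] = set()
        survivors: list[DataSample] = []
        for sample in samples:
            if sample.text in seen:
                continue  # intentional removal, so no RejectedSample is created
            seen.add(sample.text)
            survivors.append(sample)
        # Record how many samples were removed on every surviving sample.
        record = ProvenanceRecord(step="exact_dedup", removed_count=len(samples) - len(survivors))
        for sample in survivors:
            sample.provenance.append(record)
        return survivors


samples = [DataSample("a"), DataSample("b"), DataSample("a")]
survivors = ExactDedupNormalizer().run(samples)
```

Unlike a reader or gate, the normalizer returns a single list; the removal count lives in the provenance trail instead of a rejected list.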
BaseExporter
Bases: ABC
Serialises DataSample objects to a training-ready format on disk.
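As a rough illustration, an exporter that writes one JSON object per line could look like this. The export() method name, its arguments, its return value, and the DataSample fields are all assumptions for the sketch; the page does not specify the exporter's signature, and JSONL is only one possible training-ready format.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from pathlib import Path


# Hypothetical stand-in for the real curatorkit type; field names are assumed.
@dataclass
class DataSample:
    text: str
    label: str


class BaseExporter(ABC):
    """Stand-in for the interface; the real method name and signature may differ."""

    @abstractmethod
    def export(self, samples: list[DataSample], path: Path) -> int:
        ...


class JsonlExporter(BaseExporter):
    """Illustrative exporter: writes each sample as one JSON line."""

    def export(self, samples: list[DataSample], path: Path) -> int:
        with path.open("w", encoding="utf-8") as fh:
            for sample in samples:
                fh.write(json.dumps(asdict(sample), ensure_ascii=False) + "\n")
        return len(samples)  # number of records written


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "train.jsonl"
    samples = [DataSample("good film", "pos"), DataSample("dull plot", "neg")]
    written = JsonlExporter().export(samples, out)
    # Read the file back to confirm it round-trips.
    rows = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]
```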