Connectors

Readers for every supported source. All subclass BaseConnector and emit DataSample objects.

curatorkit.connectors

CSVReader

CSVReader(path: Path | str, delimiter: str | None = None, parse_json_cells: bool = True, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)

Bases: BaseConnector

Read a CSV or TSV file and produce DataSample objects.

Parameters:

Name Type Description Default
path Path | str

Path to the .csv or .tsv file.

required
delimiter str | None

Column delimiter. Auto-detected if None. Use '\t' for TSV files.

None
parse_json_cells bool

If True (default), attempt to parse cell values that look like JSON (lists, dicts, strings).

True
field_mapping dict[str, str] | None

Optional key renames applied before detection.

None
format str

Force a specific format instead of auto-detecting it. Defaults to 'auto'.

'auto'
source_uri str | None

Override for provenance records.

None
preprocessing_fn Callable | str | None

Callable or dotted import path.

None
detection_sample_size int

Number of rows to inspect before committing to a detected format.

10
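
The delimiter and parse_json_cells options can be illustrated with the standard library alone. The following is a minimal sketch, not CSVReader's actual implementation: read_rows and parse_cell are hypothetical helpers, and sniffing the delimiter from the header line is an assumption about how auto-detection might work.

```python
import csv
import io
import json


def parse_cell(value):
    """Return the decoded JSON value if the cell looks like JSON, else the raw value."""
    if not isinstance(value, str):
        return value  # missing cells come back as None from csv.DictReader
    stripped = value.strip()
    if stripped[:1] in ("[", "{", '"'):
        try:
            return json.loads(stripped)
        except json.JSONDecodeError:
            pass  # looked like JSON but was not; keep the raw text
    return value


def read_rows(text, delimiter=None, parse_json_cells=True):
    """Yield one dict per data row (hypothetical helper, not the library's code)."""
    if delimiter is None:
        # Assumption: guess the delimiter from the header line only, so that
        # commas inside JSON cells do not confuse the sniffer.
        header = text.splitlines()[0]
        delimiter = csv.Sniffer().sniff(header, delimiters=",\t;|").delimiter
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        yield {k: parse_cell(v) if parse_json_cells else v for k, v in row.items()}


sample = 'prompt\ttags\nhello\t["a", "b"]\n'
rows = list(read_rows(sample))
```

Here the tab delimiter is detected from the header, and the tags cell is decoded into a Python list while the plain-text prompt cell is left untouched.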

HuggingFaceReader

HuggingFaceReader(dataset_name: str, split: str = 'train', subset: str | None = None, streaming: bool = False, token: str | None = None, columns: list[str] | None = None, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)

Bases: BaseConnector

Load a HuggingFace dataset and produce DataSample objects.

Parameters:

Name Type Description Default
dataset_name str

Hugging Face Hub dataset name or local path, e.g. "tatsu-lab/alpaca" or "/path/to/dataset".

required
split str

Dataset split: "train", "test", "validation", etc. Defaults to "train".

'train'
subset str | None

Dataset configuration or subset name, if applicable, e.g. "helpful-base" for Anthropic/hh-rlhf.

None
streaming bool

If True, stream rows without downloading the full dataset. Disables the consistency check, because streaming provides no random access.

False
token str | None

HF access token for gated datasets.

None
columns list[str] | None

Optional list of columns to load. If None, all columns are loaded.

None
field_mapping dict[str, str] | None

Optional key renames applied before detection.

None
format str

Force a specific format instead of auto-detecting it. Defaults to 'auto'.

'auto'
source_uri str | None

Override for provenance records. Defaults to "{dataset_name}/{split}".

None
preprocessing_fn Callable | str | None

Callable or dotted import path.

None
detection_sample_size int

Number of rows to inspect before committing to a detected format.

10
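
As a rough illustration of how these arguments relate to each other, the sketch below maps them onto the keyword arguments of datasets.load_dataset and computes the documented source_uri default. Whether HuggingFaceReader forwards its arguments this way is an assumption; build_load_kwargs and default_source_uri are hypothetical helpers and no network access is needed to run them.

```python
def build_load_kwargs(dataset_name, split="train", subset=None, streaming=False, token=None):
    """Map reader arguments onto datasets.load_dataset keywords (assumed mapping)."""
    return {
        "path": dataset_name,
        "name": subset,  # load_dataset calls the configuration name "name"
        "split": split,
        "streaming": streaming,
        "token": token,
    }


def default_source_uri(dataset_name, split, source_uri=None):
    """Fall back to "{dataset_name}/{split}" when no override is given."""
    return source_uri if source_uri is not None else f"{dataset_name}/{split}"


kwargs = build_load_kwargs("tatsu-lab/alpaca", streaming=True)
uri = default_source_uri("tatsu-lab/alpaca", "train")
# With the `datasets` package installed, the load would then be:
#     datasets.load_dataset(**kwargs)
```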

JSONReader

JSONReader(path: Path | str, data_key: str | None = None, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)

Bases: BaseConnector

Read a .json file and produce DataSample objects.

Parameters:

Name Type Description Default
path Path | str

Path to the .json file.

required
data_key str | None

If the top-level JSON value is a dict wrapping a list of records, the key under which that list is stored. If None, the key is auto-detected.

None
field_mapping dict[str, str] | None

Optional key renames applied before detection.

None
format str

Force a specific format instead of auto-detecting it. Defaults to 'auto'.

'auto'
source_uri str | None

Override for provenance records.

None
preprocessing_fn Callable | str | None

Callable or dotted import path.

None
detection_sample_size int

Number of rows to inspect before committing to a detected format.

10
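
The data_key behaviour can be sketched as follows. This is an illustration under stated assumptions, not JSONReader's code: extract_records is a hypothetical helper, and the auto-detection rule used here (accept the wrapper only when exactly one value is a list) is a guess at one reasonable strategy.

```python
import json


def extract_records(payload, data_key=None):
    """Return the list of records from a parsed JSON document (hypothetical helper)."""
    if isinstance(payload, list):
        return payload  # already a bare list of records
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON list or a dict wrapping a list")
    if data_key is not None:
        return payload[data_key]
    # Assumed auto-detection rule: exactly one value in the wrapper must be a list.
    list_keys = [k for k, v in payload.items() if isinstance(v, list)]
    if len(list_keys) != 1:
        raise ValueError(f"cannot auto-detect data key among {sorted(payload)}")
    return payload[list_keys[0]]


doc = json.loads('{"version": 2, "data": [{"q": "hi"}, {"q": "bye"}]}')
records = extract_records(doc)
```

Passing data_key="data" explicitly returns the same records and avoids the ambiguity that arises when a wrapper dict holds more than one list.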

JSONLReader

JSONLReader(path: Path | str, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)

Bases: BaseConnector

Read a .jsonl file and produce DataSample objects.

Parameters:

Name Type Description Default
path Path | str

Path to the .jsonl file.

required
field_mapping dict[str, str] | None

Optional key renames applied before detection. Supports dot notation for nested source keys.

None
format str

Force a specific format instead of auto-detecting it. Defaults to 'auto'.

'auto'
source_uri str | None

Override for provenance records.

None
preprocessing_fn Callable | str | None

Callable or dotted import path. Signature: (dict) -> dict | DataSample | None

None
detection_sample_size int

Number of rows to inspect before committing to a detected format.

10
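
The sketch below shows how dot-notation renames and a preprocessing function with the (dict) -> dict | DataSample | None signature might fit together. It uses only the standard library and makes several assumptions that the reference does not confirm: field_mapping maps source key to target key, the mapping is applied before the preprocessing function, and a None return value drops the row. All helper names are hypothetical.

```python
import importlib
import json


def resolve_callable(fn):
    """Accept a callable or a dotted import path such as "package.module.func"."""
    if fn is None or callable(fn):
        return fn
    module_name, _, attr = fn.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def get_nested(row, dotted_key):
    """Follow "a.b.c" through nested dicts; return None if any level is missing."""
    current = row
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def apply_mapping(row, field_mapping):
    """Copy each mapped source key to its target name (assumed: source -> target)."""
    out = {k: v for k, v in row.items() if k not in field_mapping}
    for source, target in field_mapping.items():
        value = get_nested(row, source)
        if value is not None:
            out[target] = value
    return out


def drop_short(row):
    """Example preprocessing function: return None to drop a row (assumed meaning)."""
    return row if len(row.get("prompt", "")) >= 3 else None


lines = [
    '{"meta": {"question": "What is 2+2?"}, "answer": "4"}',
    '{"meta": {"question": "?"}, "answer": ""}',
]
mapping = {"meta.question": "prompt", "answer": "response"}
preprocess = resolve_callable(drop_short)

kept = []
for line in lines:
    row = preprocess(apply_mapping(json.loads(line), mapping))
    if row is not None:
        kept.append(row)
```

Only the first line survives: its nested meta.question value is exposed as prompt, while the second row is dropped because its prompt is shorter than three characters.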

ParquetReader

ParquetReader(path: Path | str, columns: list[str] | None = None, batch_size: int = 1000, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)

Bases: BaseConnector

Read a .parquet file and produce DataSample objects.

Parameters:

Name Type Description Default
path Path | str

Path to the .parquet file.

required
columns list[str] | None

Optional list of column names to load. If None, all columns are loaded.

None
batch_size int

Number of rows to read at a time. Default 1000.

1000
field_mapping dict[str, str] | None

Optional key renames applied before detection.

None
format str

Force a specific format instead of auto-detecting it. Defaults to 'auto'.

'auto'
source_uri str | None

Override for provenance records.

None
preprocessing_fn Callable | str | None

Callable or dotted import path.

None
detection_sample_size int

Number of rows to inspect before committing to a detected format.

10
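
The effect of columns and batch_size can be sketched without any Parquet dependency. The generator below is a hypothetical stand-in operating on plain dict rows, not ParquetReader's implementation. For real Parquet files, pyarrow's ParquetFile.iter_batches(batch_size=..., columns=...) offers comparable behaviour, though the reference does not state which backend ParquetReader uses.

```python
from itertools import islice


def iter_batches(rows, batch_size=1000, columns=None):
    """Yield lists of at most batch_size rows, keeping only the requested columns."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        if columns is not None:
            batch = [{c: row[c] for c in columns} for row in batch]
        yield batch


# Five rows read two at a time, projecting away the "extra" column.
rows = ({"id": i, "text": f"row {i}", "extra": None} for i in range(5))
batches = list(iter_batches(rows, batch_size=2, columns=["id", "text"]))
```

Reading in batches bounds memory use by batch_size rows at a time, and restricting columns avoids materialising fields that later stages never touch.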