Connectors
Readers for every supported source. All subclass BaseConnector and emit DataSample objects.
curatorkit.connectors
CSVReader
CSVReader(path: Path | str, delimiter: str | None = None, parse_json_cells: bool = True, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)
Bases: BaseConnector
Read a CSV or TSV file and produce DataSample objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to the .csv or .tsv file. | required |
| `delimiter` | `str \| None` | Column delimiter. Auto-detected if None. Use `'\t'` for TSV files. | `None` |
| `parse_json_cells` | `bool` | If True (default), attempt to parse cell values that look like JSON (lists, dicts, strings). | `True` |
| `field_mapping` | `dict[str, str] \| None` | Optional key renames applied before detection. | `None` |
| `format` | `str` | Force a format ('auto' by default). | `'auto'` |
| `source_uri` | `str \| None` | Override for provenance records. | `None` |
| `preprocessing_fn` | `Callable \| str \| None` | Callable or dotted import path. | `None` |
| `detection_sample_size` | `int` | Rows to inspect before committing detection. | `10` |
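The `delimiter` and `parse_json_cells` options can be pictured with a small standard-library sketch. It does not show how CSVReader is implemented; `guess_delimiter`, `parse_cell`, and `read_rows` are hypothetical helpers, and the header-counting heuristic is an assumption chosen only for illustration.

```python
from __future__ import annotations

import csv
import io
import json


def guess_delimiter(header_line: str, candidates: tuple[str, ...] = (",", "\t", ";", "|")) -> str:
    """Pick the candidate that occurs most often in the header line (assumed heuristic)."""
    counts = {c: header_line.count(c) for c in candidates}
    best = max(candidates, key=lambda c: counts[c])
    return best if counts[best] > 0 else ","


def parse_cell(value):
    """Decode a cell that looks like a JSON list, dict, or string; otherwise return it unchanged."""
    if not isinstance(value, str):
        return value  # e.g. None for a short row
    stripped = value.strip()
    if stripped[:1] in ("[", "{", '"'):
        try:
            return json.loads(stripped)
        except json.JSONDecodeError:
            return value
    return value


def read_rows(text: str, delimiter: str | None = None, parse_json_cells: bool = True):
    """Yield one dict per data row, optionally decoding JSON-looking cells."""
    if delimiter is None:
        lines = text.splitlines()
        delimiter = guess_delimiter(lines[0] if lines else "")
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        yield {k: parse_cell(v) if parse_json_cells else v for k, v in row.items()}


tsv_text = 'prompt\ttags\nhello\t["a", "b"]\n'
rows = list(read_rows(tsv_text))
```

Passing `delimiter='\t'` explicitly skips the guess, which is the safer choice when the header itself may contain commas.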
HuggingFaceReader
HuggingFaceReader(dataset_name: str, split: str = 'train', subset: str | None = None, streaming: bool = False, token: str | None = None, columns: list[str] | None = None, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)
Bases: BaseConnector
Load a HuggingFace dataset and produce DataSample objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | HF Hub dataset name or local path, e.g. "tatsu-lab/alpaca" or "/path/to/dataset". | required |
| `split` | `str` | Dataset split: "train", "test", "validation", etc. Defaults to "train". | `'train'` |
| `subset` | `str \| None` | Dataset configuration / subset name if applicable, e.g. "helpful-base" for Anthropic/hh-rlhf. | `None` |
| `streaming` | `bool` | If True, stream rows without a full download. Disables the consistency check (no random access). | `False` |
| `token` | `str \| None` | HF access token for gated datasets. | `None` |
| `columns` | `list[str] \| None` | Optional list of columns to load. If None, all columns are loaded. | `None` |
| `field_mapping` | `dict[str, str] \| None` | Optional key renames applied before detection. | `None` |
| `format` | `str` | Force a format ('auto' by default). | `'auto'` |
| `source_uri` | `str \| None` | Override for provenance records. Defaults to "{dataset_name}/{split}". | `None` |
| `preprocessing_fn` | `Callable \| str \| None` | Callable or dotted import path. | `None` |
| `detection_sample_size` | `int` | Rows to inspect before committing detection. | `10` |
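Because a streamed dataset has no random access, inspecting the first `detection_sample_size` rows has to be done without losing them. The sketch below shows one common way to do that with `itertools`, plus the documented `source_uri` default. `peek_rows` and `default_source_uri` are hypothetical names, and nothing here is taken from the HuggingFaceReader source.

```python
from __future__ import annotations

from itertools import chain, islice
from typing import Iterable, Iterator


def peek_rows(rows: Iterable[dict], sample_size: int = 10) -> tuple[list[dict], Iterator[dict]]:
    """Return the first `sample_size` rows plus an iterator that still yields every row."""
    iterator = iter(rows)
    sample = list(islice(iterator, sample_size))
    # Re-attach the consumed rows in front of the remaining stream.
    return sample, chain(sample, iterator)


def default_source_uri(dataset_name: str, split: str = "train") -> str:
    """Build the documented default provenance string "{dataset_name}/{split}"."""
    return f"{dataset_name}/{split}"


# A generator stands in for a streamed dataset here.
stream = ({"i": i} for i in range(25))
sample, full_stream = peek_rows(stream, sample_size=10)
all_ids = [row["i"] for row in full_stream]
```

The sample can feed format detection while `full_stream` is consumed once, in order, afterwards.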
JSONReader
JSONReader(path: Path | str, data_key: str | None = None, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)
Bases: BaseConnector
Read a .json file and produce DataSample objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to the .json file. | required |
| `data_key` | `str \| None` | If the JSON is a dict wrapping a list, use this key to extract it. If None, auto-detect. | `None` |
| `field_mapping` | `dict[str, str] \| None` | Optional key renames applied before detection. | `None` |
| `format` | `str` | Force a format ('auto' by default). | `'auto'` |
| `source_uri` | `str \| None` | Override for provenance records. | `None` |
| `preprocessing_fn` | `Callable \| str \| None` | Callable or dotted import path. | `None` |
| `detection_sample_size` | `int` | Rows to inspect before committing detection. | `10` |
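The `data_key` behaviour can be sketched as follows. The auto-detection rule used here (accept the payload only if exactly one top-level key holds a list) is an assumption made for illustration; the rule JSONReader actually applies is not documented above, and `extract_records` is a hypothetical helper.

```python
from __future__ import annotations

import json


def extract_records(payload, data_key: str | None = None) -> list:
    """Return the list of records from a JSON payload that is a list or a dict wrapping one."""
    if isinstance(payload, list):
        return payload
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON list or a dict wrapping a list")
    if data_key is not None:
        records = payload[data_key]
        if not isinstance(records, list):
            raise ValueError(f"value under {data_key!r} is not a list")
        return records
    # Assumed auto-detection rule: exactly one top-level key must hold a list.
    candidates = [key for key, value in payload.items() if isinstance(value, list)]
    if len(candidates) != 1:
        raise ValueError(f"cannot auto-detect the data key; candidates: {candidates}")
    return payload[candidates[0]]


wrapped = json.loads('{"version": 1, "data": [{"prompt": "hi"}]}')
records = extract_records(wrapped)
```

When a file has several list-valued keys, passing `data_key` explicitly removes the ambiguity.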
JSONLReader
JSONLReader(path: Path | str, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)
Bases: BaseConnector
Read a .jsonl file and produce DataSample objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to the .jsonl file. | required |
| `field_mapping` | `dict[str, str] \| None` | Optional key renames applied before detection. Supports dot notation for nested source keys. | `None` |
| `format` | `str` | Force a format ('auto' by default). | `'auto'` |
| `source_uri` | `str \| None` | Override for provenance records. | `None` |
| `preprocessing_fn` | `Callable \| str \| None` | Callable or dotted import path. Signature: `(dict) -> dict \| DataSample \| None` | `None` |
| `detection_sample_size` | `int` | Rows to inspect before committing detection. | `10` |
ParquetReader
ParquetReader(path: Path | str, columns: list[str] | None = None, batch_size: int = 1000, field_mapping: dict[str, str] | None = None, format: str = 'auto', source_uri: str | None = None, preprocessing_fn: Callable | str | None = None, detection_sample_size: int = 10)
Bases: BaseConnector
Read a .parquet file and produce DataSample objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to the .parquet file. | required |
| `columns` | `list[str] \| None` | Optional list of column names to load. If None, all columns are loaded. | `None` |
| `batch_size` | `int` | Number of rows to read at a time. Default 1000. | `1000` |
| `field_mapping` | `dict[str, str] \| None` | Optional key renames applied before detection. | `None` |
| `format` | `str` | Force a format ('auto' by default). | `'auto'` |
| `source_uri` | `str \| None` | Override for provenance records. | `None` |
| `preprocessing_fn` | `Callable \| str \| None` | Callable or dotted import path. | `None` |
| `detection_sample_size` | `int` | Rows to inspect before committing detection. | `10` |
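The effect of `batch_size` and `columns` can be illustrated without a Parquet dependency. The sketch below batches in-memory rows and keeps only the requested columns; it is a stand-in for the idea, not ParquetReader's implementation, and `iter_batches` here is a hypothetical helper written for this example.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator


def iter_batches(
    rows: Iterable[dict],
    batch_size: int = 1000,
    columns: list[str] | None = None,
) -> Iterator[list[dict]]:
    """Yield lists of at most `batch_size` rows, keeping only `columns` when given."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        if columns is not None:
            # Column projection: drop everything the caller did not ask for.
            batch = [{name: row[name] for name in columns} for row in batch]
        yield batch


table = [{"id": i, "text": str(i), "extra": None} for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(table, batch_size=1000, columns=["id", "text"])]
```

A smaller `batch_size` lowers peak memory at the cost of more read calls, and restricting `columns` avoids materialising fields that are never used.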