Pelican-backed data loader prototype: demo
- Install `pelican-data-loader` and `pytorch` from PyPI: `pip install pelican-data-loader torch`
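  A quick sanity check that the install worked (a minimal sketch; the module name `pelican_data_loader` matches the helpers referenced later in this document):

  ```python
  # Both imports should succeed after the pip install above.
  import pelican_data_loader
  import torch

  print(torch.__version__)
  ```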
- Consume data with `datasets`:

  ```python
  from datasets import load_dataset

  dataset = load_dataset(
      "csv",
      data_files="pelican://uwdf-director.chtc.wisc.edu/wisc.edu/dsi/pytorch/bird_migration_data.csv",
  )
  torch_dataset = dataset.with_format("torch")
  ```
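  With the `"torch"` format set, a split can be fed straight to a standard PyTorch `DataLoader`. A minimal sketch (assuming the CSV lands in the default `train` split; the batch size is arbitrary):

  ```python
  from torch.utils.data import DataLoader

  # load_dataset("csv", ...) returns a DatasetDict; the data lands in the
  # "train" split by default.
  loader = DataLoader(torch_dataset["train"], batch_size=32, shuffle=True)

  for batch in loader:
      # Each batch is a dict mapping column names to batched values
      # (tensors for numeric columns, lists for strings).
      print({k: type(v) for k, v in batch.items()})
      break
  ```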
- Uses `Croissant` to store and validate metadata
- Uses `pelicanfs` to locate and cache the dataset
- Uses `datasets` to convert to different ML data formats (e.g., PyTorch, TensorFlow, JAX, Polars, PyArrow, ...); see the sketch after this list
- Dataset storage is provided via UW-Madison's S3
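For example, the same loaded dataset can be re-exposed in other framework formats without re-downloading (a minimal sketch; these format names are among those accepted by the `datasets` library's `with_format`):

```python
# The same underlying Arrow data, viewed through different framework formats.
numpy_dataset = dataset.with_format("numpy")    # NumPy arrays
jax_dataset = dataset.with_format("jax")        # JAX arrays
pandas_dataset = dataset.with_format("pandas")  # pandas DataFrames
```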
Future work

- DOI minting via DataCite
- Better frontend for dataset discovery and publishing
- Backup
- Data prefetching? (at the Pelican layer?)
- Private datasets
- Telemetry?
Storage

- WISC-S3, storing:
  - the actual datasets
  - the Croissant JSON-LD
- Postgres, storing (see the record sketch after this list):
  - various metadata
  - links to the Pelican data source
  - links to the Croissant JSON-LD
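To make the S3/Postgres split concrete, here is a hypothetical shape for one Postgres metadata record; every field name below is an illustrative assumption, not the actual schema:

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    """Hypothetical metadata row; the real schema may differ."""
    name: str           # human-readable dataset name
    pelican_url: str    # link to the Pelican data source (object in WISC-S3)
    croissant_url: str  # link to the Croissant JSON-LD (also in WISC-S3)
    license_id: str     # SPDX license identifier

record = DatasetRecord(
    name="bird_migration_data",
    pelican_url="pelican://uwdf-director.chtc.wisc.edu/wisc.edu/dsi/pytorch/bird_migration_data.csv",
    croissant_url="pelican://uwdf-director.chtc.wisc.edu/wisc.edu/dsi/pytorch/bird_migration_data.json",  # assumed path
    license_id="CC-BY-4.0",  # illustrative
)
```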
Utilities

- License data: pulled from SPDX with `pelican_data_loader.data.pull_license`
- Minimal CSV-file Croissant generator: `pelican_data_loader.utils.parse_col`
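A hypothetical usage sketch of the two helpers; the call patterns below are guesses from the names alone and may not match the real signatures:

```python
# HYPOTHETICAL: arguments inferred from the function names, not the real API.
from pelican_data_loader.data import pull_license
from pelican_data_loader.utils import parse_col

# Assumed: fetch license metadata from the SPDX license list by identifier.
license_info = pull_license("CC-BY-4.0")

# Assumed: derive a Croissant column description from a CSV column name.
column_metadata = parse_col("species")
```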