pelican-data-loader

Pelican-backed data loader prototype: demo

Quickstart

  1. Install pelican-data-loader and PyTorch from PyPI

    pip install pelican-data-loader torch
  2. Consume data with datasets (a PyTorch DataLoader sketch follows this snippet)

    from datasets import load_dataset
    # pelicanfs locates and caches the file behind the pelican:// URL
    dataset = load_dataset("csv", data_files="pelican://uwdf-director.chtc.wisc.edu/wisc.edu/dsi/pytorch/bird_migration_data.csv")
    torch_dataset = dataset.with_format("torch")
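
The torch-formatted dataset also works with a standard PyTorch DataLoader. A minimal sketch, assuming the CSV loads into the default "train" split and that default collation of the raw columns is acceptable:

    from torch.utils.data import DataLoader

    # the CSV loader puts everything into a "train" split
    loader = DataLoader(torch_dataset["train"], batch_size=32)
    for batch in loader:
        # each batch is a dict of column name -> tensor (string columns stay as lists)
        print(batch)
        break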

Features

  • Uses Croissant to store and validate metadata
  • Uses pelicanfs to locate and cache datasets (see the fsspec sketch after this list)
  • Uses datasets to convert to different ML data formats (e.g., PyTorch, TensorFlow, JAX, Polars, PyArrow, ...)
  • Provides dataset storage via UW-Madison's S3
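
Because pelicanfs acts as the fsspec backend for pelican:// URLs (this is how the Quickstart URL is resolved), the same file can also be opened directly, outside of datasets. A minimal sketch, assuming pelicanfs is installed and the Quickstart file is reachable:

    import fsspec

    # pelicanfs handles locating and caching the object behind this URL
    url = "pelican://uwdf-director.chtc.wisc.edu/wisc.edu/dsi/pytorch/bird_migration_data.csv"
    with fsspec.open(url, "rt") as f:
        print(f.readline())  # first line of the CSV, i.e. the header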

Future features (Pending)

  • DOI minting via DataCite
  • Better frontend for dataset discovery and publishing
  • Backups
  • Data prefetching? (possibly at the Pelican layer)
  • Private datasets
  • Telemetry?

Backend

  • WISC-S3, storing
    • The actual datasets
    • The Croissant JSON-LD
  • Postgres, storing (see the sketch after this list)
    • Various metadata
    • Links to the Pelican data source
    • Links to the Croissant JSON-LD
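
As a rough illustration of how these pieces relate, the record below sketches one plausible shape for a Postgres row; the class and field names are assumptions made for explanation, not the actual schema:

    from dataclasses import dataclass

    @dataclass
    class DatasetRecord:
        """Hypothetical row linking metadata, the Pelican source, and the Croissant JSON-LD."""
        name: str                  # descriptive metadata kept in Postgres
        pelican_url: str           # pelican:// link to the dataset stored in WISC-S3
        croissant_jsonld_url: str  # link to the Croissant JSON-LD stored alongside it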

Dev notes

  • License data: pulled from SPDX with pelican_data_loader.data.pull_license.
  • Minimal CSV-to-Croissant generator: pelican_data_loader.utils.parse_col (a speculative usage sketch follows).
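
A speculative sketch of how these helpers might be called; the argument lists and return values below are guesses for illustration only, not the documented API, so check the module source before relying on them:

    # Hypothetical calls; signatures are assumptions, not the real API.
    from pelican_data_loader.data import pull_license
    from pelican_data_loader.utils import parse_col

    licenses = pull_license()                      # assumed: fetch the SPDX license list
    fields = parse_col("bird_migration_data.csv")  # assumed: infer Croissant fields from a CSV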
