Support parallel download of many files

Right now `UriSource` implicitly supports downloading all files directly under the input URL (by appending a wildcard character `*` to the URL), but the downloads are strictly sequential and all go through the coordinator. We should support parallel downloads at least in the REST API (if not in MyriaL) by adding a new endpoint, `parallelIngestDatasets`. It would take either a URL wildcard expression (evaluated by `org.apache.hadoop.fs.FileSystem.globStatus()`, as in `UriSource`) or a list of URLs (possibly via a separate endpoint), and distribute the downloads over all available workers, balancing them by the file sizes reported by `org.apache.hadoop.fs.FileSystem.getFileStatus().getLen()` with some greedy bin packing heuristic. We could then replace the parallel ingest API in `myria-python` with a call to this REST endpoint. Eventually we could consider supporting parallel downloads directly in MyriaL.
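
To make the size-based distribution concrete, here is a minimal sketch of one possible greedy heuristic (largest file first, always assigned to the currently least-loaded worker). It is only an illustration, not a proposed implementation: the function name `assign_files_to_workers`, its parameters, and the example URLs are hypothetical, and the file sizes are assumed to have been collected already (for example from `getFileStatus().getLen()`).

```python
import heapq
from typing import Dict, List


def assign_files_to_workers(
    file_sizes: Dict[str, int], worker_ids: List[int]
) -> Dict[int, List[str]]:
    """Greedily assign files to workers to balance total bytes per worker.

    Hypothetical helper: files are visited from largest to smallest and each
    one goes to the worker with the smallest total assigned so far.
    """
    if not worker_ids:
        raise ValueError("at least one worker is required")

    assignment: Dict[int, List[str]] = {w: [] for w in worker_ids}

    # Min-heap of (bytes assigned so far, worker id).
    heap = [(0, w) for w in worker_ids]
    heapq.heapify(heap)

    # Largest files first; ties broken by URL so the result is deterministic.
    for url, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(url)
        heapq.heappush(heap, (load + size, worker))

    return assignment


if __name__ == "__main__":
    # Example sizes in bytes; the URLs are made up for illustration.
    sizes = {
        "hdfs://host/data/part-0": 10,
        "hdfs://host/data/part-1": 7,
        "hdfs://host/data/part-2": 5,
        "hdfs://host/data/part-3": 3,
        "hdfs://host/data/part-4": 1,
    }
    for worker, urls in assign_files_to_workers(sizes, [1, 2]).items():
        print(worker, sum(sizes[u] for u in urls), urls)
```

A heuristic of this kind does not guarantee an optimal split, but it is cheap (one sort plus one heap operation per file) and only needs the file sizes, which matches the information the issue says is available.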

Support parallel download of many files #882
