Support parallel download of many files #882

@senderista

Description

Right now UriSource implicitly supports downloading all files directly under the input URL (by appending a wildcard character * to the URL), but downloads are strictly sequential and go through the coordinator. We should support parallel downloads at least in the REST API (if not in MyriaL) by adding a new endpoint, parallelIngestDatasets. It would take either a URL wildcard expression (evaluated by org.apache.hadoop.fs.FileSystem.globStatus(), as in UriSource) or a list of URLs (possibly via a separate endpoint), and distribute the downloads over all available workers, using the file sizes reported by org.apache.hadoop.fs.FileSystem.getFileStatus().getLen() and some greedy bin-packing heuristic. We could then replace the parallel ingest API in myria-python with a call to this REST endpoint. Eventually we could consider supporting parallel downloads directly in MyriaL.
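A minimal sketch of the greedy bin-packing step described above, using the longest-processing-time heuristic (largest files first, each assigned to the currently least-loaded worker). The class and method names here are hypothetical, not part of the Myria codebase; in the real implementation the sizes would come from org.apache.hadoop.fs.FileSystem.getFileStatus().getLen(), but this sketch takes a precomputed path-to-size map so it stands alone:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper: distribute files across workers so that the total
// bytes assigned to each worker are roughly balanced.
public class DownloadAssigner {
    /**
     * @param fileSizes map from file URL to its size in bytes
     *                  (in practice, from FileSystem.getFileStatus().getLen())
     * @param numWorkers number of available workers
     * @return per-worker lists of file URLs to download
     */
    public static List<List<String>> assign(Map<String, Long> fileSizes, int numWorkers) {
        // Sort files by size, largest first (LPT heuristic).
        List<Map.Entry<String, Long>> files = new ArrayList<>(fileSizes.entrySet());
        files.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        // Min-heap of workers keyed by total bytes assigned so far:
        // each entry is {assignedBytes, workerIndex}.
        PriorityQueue<long[]> workers =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        List<List<String>> assignments = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) {
            workers.add(new long[] {0L, i});
            assignments.add(new ArrayList<>());
        }

        // Greedily give each file to the currently least-loaded worker.
        for (Map.Entry<String, Long> f : files) {
            long[] w = workers.poll();
            assignments.get((int) w[1]).add(f.getKey());
            w[0] += f.getValue();
            workers.add(w);
        }
        return assignments;
    }
}
```

For example, files of 100, 60, 50, and 10 bytes split across two workers end up as 100+10 on one worker and 60+50 on the other, i.e. 110 bytes each. LPT is a simple approximation (within 4/3 of the optimal makespan), which seems good enough here since file sizes only predict, not determine, download time.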
