Description
Right now UriSource implicitly supports downloading all files directly under the input URL (by appending a wildcard character * to the URL), but the downloads are strictly sequential and go through the coordinator. We should support parallel downloads at least in the REST API (if not in MyriaL) by adding a new endpoint, parallelIngestDatasets. It would take either a URL wildcard expression (evaluated by org.apache.hadoop.fs.FileSystem.globStatus(), as in UriSource) or a list of URLs (possibly in a separate endpoint), and would distribute the downloads over all available workers, using the file sizes reported by org.apache.hadoop.fs.FileSystem.getFileStatus().getLen() and some greedy bin packing heuristic. We could then replace the parallel ingest API in myria-python with a call to this REST API. Eventually we could consider supporting parallel downloads directly in MyriaL.
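
To make the "greedy bin packing heuristic" concrete, here is a minimal sketch of one possible choice: sort the files by size in descending order and always hand the next file to the currently least-loaded worker (the longest-processing-time-first rule). It is written in Python for brevity and is not Myria code; the function name assign_files_to_workers, the example URLs, and the worker IDs are all hypothetical, and the file sizes are assumed to have been collected beforehand (for example from getFileStatus().getLen()).

```python
import heapq


def assign_files_to_workers(file_sizes, worker_ids):
    """Greedily assign files to workers so total bytes per worker stay balanced.

    file_sizes: dict mapping a file URL to its size in bytes (hypothetical input,
                assumed to be gathered from the file system beforehand).
    worker_ids: list of worker identifiers.
    Returns a dict mapping each worker ID to the list of URLs it should download.
    """
    if not worker_ids:
        raise ValueError("at least one worker is required")

    # Min-heap of (current load in bytes, tie-break index, worker ID).
    heap = [(0, index, worker) for index, worker in enumerate(worker_ids)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in worker_ids}

    # Largest files first; ties broken by URL so the result is deterministic.
    ordered = sorted(file_sizes.items(), key=lambda item: (-item[1], item[0]))
    for url, size in ordered:
        load, index, worker = heapq.heappop(heap)  # least-loaded worker
        assignment[worker].append(url)
        heapq.heappush(heap, (load + size, index, worker))

    return assignment


if __name__ == "__main__":
    # Hypothetical files and sizes, for illustration only.
    sizes = {
        "hdfs://example/data/a.csv": 10,
        "hdfs://example/data/b.csv": 7,
        "hdfs://example/data/c.csv": 5,
        "hdfs://example/data/d.csv": 3,
    }
    for worker, urls in assign_files_to_workers(sizes, [1, 2]).items():
        print(worker, urls)
```

With these example sizes, worker 1 receives a.csv and d.csv (13 bytes) and worker 2 receives b.csv and c.csv (12 bytes). This heuristic balances bytes only; it ignores data locality and per-worker bandwidth, which a real implementation might also want to consider.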