-
Posting the error I get when I run the large concat (at the moment only 1000 Datasets):

```
    273         if self.status == "error":
    274             typ, exc, tb = result
--> 275             raise exc.with_traceback(tb)
    276         elif self.status == "cancelled":
    277             raise result

KilledWorker: ('concat-some_guid', <WorkerState 'tcp://address', name: SGECluster-2-6, status: closed, memory: 0, processing: 1>)
```
-
Hmmm.... concat can be slow for large numbers of Datasets because it loops over them multiple times. Have you checked to make sure it isn't adding unnecessary dimensions (usually you want …)? Can you show what the chunk sizes of a single dataset are, and those of the concatenated result?
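A quick way to answer that question is to print the dimension sizes and chunk layout of one input Dataset and of the concatenated result. A minimal sketch, with toy dimensions and chunk sizes (none of these names come from the thread):

```python
import numpy as np
import xarray as xr

# One representative chunked Dataset; dims, shapes and chunk sizes are made up.
ds = xr.Dataset(
    {"var": (("time", "x"), np.random.rand(1, 100))},
    coords={"time": [0]},
).chunk({"time": 1, "x": 50})

print(ds.sizes)   # dimension lengths of a single dataset
print(ds.chunks)  # mapping of dimension name -> chunk sizes

# Concatenate a handful of copies and compare.
combined = xr.concat([ds] * 10, dim="time")
print(combined.sizes)   # check that no unexpected dimensions appeared
print(combined.chunks)  # many tiny chunks along 'time' is a warning sign
```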
-
Hi, I am trying to find the best way to `concat` thousands (>10_000) of `Dataset`s from a Dask point of view. Basically, I generate thousands of Datasets using a Dask Distributed cluster running on HPC hardware and would like to run a large `xr.concat` on my futures. At the moment, `xr.concat()` works for about 500 Datasets, but anything larger than that just gives up and kills my workers. I can show some outputs if you want, but I am curious: what would be the best way to merge all my Datasets in memory? Many thanks in advance!
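For illustration only, here is a minimal sketch of this kind of workflow: per-task Dataset generation on a distributed cluster, followed by one large `xr.concat` over the gathered results, plus a simple batched variant that concatenates in smaller groups first. The function `make_dataset`, the `time` dimension, and the batch size are all hypothetical and not taken from the original post:

```python
import numpy as np
import xarray as xr
from dask.distributed import Client


def make_dataset(i):
    # Hypothetical stand-in for the real per-task Dataset generation.
    return xr.Dataset(
        {"var": (("time", "x"), np.random.rand(1, 100))},
        coords={"time": [i]},
    )


if __name__ == "__main__":
    client = Client()  # the original setup used an SGECluster instead

    futures = [client.submit(make_dataset, i) for i in range(1000)]
    datasets = client.gather(futures)

    # Single large concat: simple, but loops over all inputs in one call.
    combined = xr.concat(datasets, dim="time")

    # Batched variant: concatenate in groups, then concatenate the partials.
    batch = 100  # arbitrary batch size
    partials = [
        xr.concat(datasets[i : i + batch], dim="time")
        for i in range(0, len(datasets), batch)
    ]
    combined_batched = xr.concat(partials, dim="time")
```

Whether a single call or a batched reduction works better depends on the sizes and chunking of the inputs, which is what the chunk-size question above is getting at.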