Writing a netCDF file is slow #6921
Replies: 4 comments 1 reply
-
@lassiterdc, writing a large, chunked xarray dataset to a netCDF file is always a challenge and quite slow, since the write is serial. However, you could take advantage of the …
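The comment above is cut off, so the exact suggestion is unknown. One common way to work around the serial write, sketched here as an assumption rather than a record of the advice given, is `xarray.save_mfdataset`, which writes one file per dataset instead of funneling everything through a single `to_netcdf` call. The file pattern and grouping below are hypothetical:

```python
import glob
import xarray as xr

# Assumed inputs: the thread's radar files (path pattern is hypothetical)
files = sorted(glob.glob("radar_rainfall_20140628_*.nc"))
ds = xr.open_mfdataset(files, combine="nested", concat_dim="time")

# One output file per hour; writing many small files sidesteps the
# serial single-file netCDF write.
hours, datasets = zip(*ds.groupby("time.hour"))
paths = ["rainrate_hour_{:02d}.nc".format(h) for h in hours]
xr.save_mfdataset(list(datasets), list(paths), mode="w")
```

With `compute=False`, `save_mfdataset` instead returns a dask delayed object, so the writes can be scheduled together with other work and computed later.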
-
Thanks, @andersy005. I think that …
-
Great... keep us posted once you have a working solution. I'm going to convert this issue into a discussion instead.
-
Update on expediting the export. Data: same as in the original post.

```python
import time
import xarray as xr

# Same open as in the original post
ds_comb_frm_open_mfdataset = xr.open_mfdataset(
    files, chunks={"latitude": 3500, "longitude": 7000},
    concat_dim="time", preprocess=ds_preprocessing, combine="nested")
dates, lst_ds = zip(*ds_comb_frm_open_mfdataset.groupby("time.hour"))

# Write one hour of data without loading it into memory first
start_time = time.time()
ds = lst_ds[0]
ds.to_netcdf("ds_1hr_no_loading.nc", mode="w",
             encoding={"rainrate": {"zlib": True}})
print("Time to export 1 hour worth of data without first loading it "
      "into memory: {}".format(time.time() - start_time))

# Build the full file by loading each hour and appending it
start_time = time.time()
for i, ds in enumerate(lst_ds):
    if i == 0:
        ds.load().to_netcdf("ds_by_appending.nc", mode="w",
                            encoding={"rainrate": {"zlib": True}})
        print("Time to export 1 hour worth of data after first loading it "
              "into memory: {}".format(time.time() - start_time))
    else:
        ds.load().to_netcdf("ds_by_appending.nc", mode="a",
                            encoding={"rainrate": {"zlib": True}})
print("Time to export dataset created by appending netcdf: "
      "{}".format(time.time() - start_time))
```

This fails with:

```
MemoryError: Unable to allocate 2.74 GiB for an array with shape (30, 3500, 7000) and data type float32
```
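The allocation in the traceback matches one hourly group loaded eagerly: a float32 array of shape (30, 3500, 7000) needs 30 × 3500 × 7000 × 4 bytes ≈ 2.74 GiB, which is exactly what each `.load()` call asks for. A lower-memory alternative, sketched here as an untested assumption rather than a confirmed fix, is to skip `.load()` entirely and let dask stream each group's chunks to its own file via delayed writes (`ds_comb_frm_open_mfdataset` is the dataset opened in the code above):

```python
import dask

# Sketch: one lazy write per hourly group; compute=False makes to_netcdf
# return a dask delayed object instead of writing eagerly.
writes = [
    ds_hour.to_netcdf("rainrate_hour_{:02d}.nc".format(hour), mode="w",
                      encoding={"rainrate": {"zlib": True}}, compute=False)
    for hour, ds_hour in ds_comb_frm_open_mfdataset.groupby("time.hour")
]
# Writes proceed chunk by chunk instead of materializing a full hour in RAM
dask.compute(*writes)
```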
-
What is your issue?
This has been discussed in another thread, but the proposed solution there (first `.load()` the dataset into memory before running `to_netcdf`) does not work for me, since my dataset is too large to fit into memory. The following code takes around 8 hours to run. You'll notice that I tried both `xr.open_mfdataset` and `xr.concat` in case it would make a difference, but it doesn't. I also tried profiling the code according to this example. The results are in this html (dropbox link), but I'm not really sure what I'm looking at.
Data: dropbox link to 717 netCDF files containing radar rainfall data for 6/28/2014 over the United States, around 1 GB in total.
Code:
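The code block itself is not shown above. Based on the update earlier in the thread, it presumably resembled the following minimal sketch, where `files`, `ds_preprocessing`, and the output name are assumptions:

```python
import xarray as xr

# Assumed shape of the original code, reconstructed from the update above
ds = xr.open_mfdataset(files, chunks={"latitude": 3500, "longitude": 7000},
                       concat_dim="time", preprocess=ds_preprocessing,
                       combine="nested")
ds.to_netcdf("output.nc", mode="w", encoding={"rainrate": {"zlib": True}})
```

For the profiling mentioned above, the dask diagnostics tools produce exactly this kind of HTML report. A minimal sketch, assuming the default local scheduler:

```python
from dask.diagnostics import Profiler, ResourceProfiler, visualize

# Record task and memory/CPU activity while the write runs
with Profiler() as prof, ResourceProfiler() as rprof:
    ds.to_netcdf("output.nc", mode="w",
                 encoding={"rainrate": {"zlib": True}})
visualize([prof, rprof], filename="profile.html")  # writes an HTML report
```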