
Best way to implement lazy backend for data format with multiple files? #5860

Answered by arbennett
arbennett asked this question in Q&A

Okay, after a quick side chat with @jhamman this was a trivial fix. All I had to do was expose self.read_parflow_file so that xr.open_dataset can be used with the actual files that contain the data. Then, when using the metadata file to read the full dataset, I changed:

def read_variable(self, variable_name):
    time_slice_files = self.output_metadata[variable_name]['file-series']
    # Some other metadata wrangling
    # Note: xr.concat requires a concat dimension; "time" is assumed here
    return xr.concat([self.read_parflow_file(f) for f in time_slice_files], dim='time')

to

def read_variable(self, variable_name):
    time_slice_files = self.output_metadata[variable_name]['file-series']
    # Some other metadata wrangling
    return xr.open_mfdataset(time_slice_files, c…
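For context, the pattern above relies on xarray's custom-backend API: once a `BackendEntrypoint` subclass is registered, `xr.open_dataset` and `xr.open_mfdataset` can dispatch to your per-file reader, and `open_mfdataset` handles the concatenation lazily (via dask) instead of eagerly building every dataset in memory as the `xr.concat` version did. A minimal sketch, assuming a hypothetical `ParflowBackendEntrypoint` (the class and file extension here are illustrative, not the actual parflow-xarray implementation):

```python
# Minimal sketch of a custom xarray backend entrypoint (names hypothetical).
# xr.open_mfdataset calls open_dataset once per file, then combines the
# results lazily instead of eagerly concatenating in-memory datasets.
import xarray as xr
from xarray.backends import BackendEntrypoint


class ParflowBackendEntrypoint(BackendEntrypoint):
    """Illustrative backend for a multi-file format."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # Real code would parse the on-disk format here; this sketch just
        # returns an empty Dataset so the class stands alone.
        return xr.Dataset()

    def guess_can_open(self, filename_or_obj):
        # Claim files with a .pfb extension (hypothetical).
        return str(filename_or_obj).endswith(".pfb")
```

With the entrypoint registered under xarray's backend entry-points group (or passed via `engine=`), the whole file series opens lazily in one call, e.g. `xr.open_mfdataset(time_slice_files, engine="parflow", combine="nested", concat_dim="time")`.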

Replies: 1 comment

Answer selected by arbennett