Big data #49
Replies: 9 comments 16 replies
-
We were discussing how xarray and cattrs would work together, as in PR #62. cattrs always returns some form of dictionary, which can then be passed to packages like pyyaml or tomlkit; it also handles the conversion of datetimes and encodings. But cattrs is not made for file I/O: converting everything into one big string would pull all the data into memory at once. It would be better to use something like jinja for the file writing and pass it the actual Python instance of the simulation / model / package. We can write filter functions that handle the dask chunking while writing large datasets from a netcdf file / lazy dask array. This is something we still need to prototype.
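To make that concrete, a minimal sketch of the jinja + dask idea could look like the following. The filter name (`chunked_ascii`), the template layout, and the file names are hypothetical, not an existing API:

```python
import io

import dask.array as da
import jinja2
import numpy as np


def chunked_ascii(array):
    """Yield formatted text one dask chunk at a time, never one big string."""
    for block in array.to_delayed().ravel():
        buffer = io.StringIO()
        np.savetxt(buffer, np.atleast_2d(block.compute()), fmt="%.6e")
        yield buffer.getvalue()


env = jinja2.Environment()
env.filters["chunked_ascii"] = chunked_ascii

# Hypothetical template for a block of griddata; the real MF6 layout differs.
template = env.from_string(
    "BEGIN griddata\n"
    "{% for block in array | chunked_ascii %}{{ block }}{% endfor %}"
    "END griddata\n"
)

lazy = da.random.random((100_000, 10), chunks=(10_000, 10))
with open("griddata.dat", "w") as f:
    for piece in template.generate(array=lazy):  # stream instead of render()
        f.write(piece)
```

Because `Template.generate()` streams the rendered output, only one chunk is ever materialized in memory at a time.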
-
I have worked on profiling different methods of writing ascii input files to disk, to find out which is fastest without using too much memory. I came up with the following functions:
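(The functions themselves were not preserved in this thread. Purely as an illustration, a comparison along these lines could look like the sketch below; the xarray-extras `to_csv` import path and signature should be checked against its documentation.)

```python
import time

import numpy as np
import xarray as xr
from xarray_extras.csv import to_csv  # assumed import path; verify against docs


def write_numpy(data, path):
    """Plain numpy ascii writer: simple, single-threaded."""
    np.savetxt(path, data, fmt="%.6e")


def write_xarray_extras(data, path):
    """xarray-extras CSV writer: C-accelerated according to its docs."""
    array = xr.DataArray(
        data,
        dims=("row", "col"),
        coords={"row": np.arange(data.shape[0]), "col": np.arange(data.shape[1])},
    )
    to_csv(array, path)


def profile(writer, data, path):
    start = time.perf_counter()
    writer(data, path)
    return time.perf_counter() - start


data = np.random.rand(1_000_000, 5)
for writer in (write_numpy, write_xarray_extras):
    print(f"{writer.__name__}: {profile(writer, data, 'out.dat'):.2f} s")
```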
Here are the results:
We can see that xarray-extras uses around twice as much memory at peak for larger datasets, but it is also 10x faster. I am also running tests for 1 billion points; more results later.
-
Maybe we could set a default that writes data to binary files when it exceeds a certain size, and provide an override option for users who want to fill their hard drive with ascii data.
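A minimal sketch of what such a default with an override could look like; the helper name `write_array` and the threshold value are hypothetical:

```python
import numpy as np

# Assumed default threshold; the actual value would need benchmarking.
BINARY_THRESHOLD_BYTES = 50 * 1024**2  # 50 MB


def write_array(data, path, as_binary=None):
    """Write data as ascii or binary, deciding by size unless overridden."""
    if as_binary is None:
        as_binary = data.nbytes > BINARY_THRESHOLD_BYTES
    if as_binary:
        data.astype(np.float64).tofile(path + ".bin")  # flat binary dump
        return path + ".bin"
    np.savetxt(path + ".dat", data, fmt="%.6e")  # human-readable ascii
    return path + ".dat"


# Users who insist on ascii can override the size-based default:
write_array(np.random.rand(4000, 4000), "riv_stage", as_binary=False)
```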
-
We can conclude that it is possible to build larger models by keeping data on disk as long as possible, meaning we want to delay the calculations until the moment of writing the MODFLOW input files. We can achieve this via dask arrays that are provided to components, instead of in-memory numpy arrays. We have done some tests on different writing methods and see that there are external libraries that can help here; we also see that large chunks of data can be written more efficiently with binary writers. We still have to decide where to draw the boundary for switching between binary and ascii files.
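A rough sketch of that workflow, with hypothetical file and variable names: data is opened lazily, stays lazy through any arithmetic, and is only computed chunk by chunk when an output file is written (a plain `.npy` memmap stands in for a real MF6 binary file here):

```python
import dask.array as da
import numpy as np
import xarray as xr

# Open lazily: with chunks= the variables are dask arrays backed by the file.
ds = xr.open_dataset("model_data.nc", chunks={"layer": 1})
kh = ds["kh"]                          # nothing loaded yet
transmissivity = kh * ds["thickness"]  # still lazy

# Binary route: stream chunk by chunk into a memory-mapped file on disk.
target = np.lib.format.open_memmap(
    "npf.bin.npy", mode="w+", dtype="float64", shape=transmissivity.shape
)
da.store(transmissivity.data, target)  # computes and writes one chunk at a time
```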
-
One of the bigger questions we still have is how we are going to handle large datasets, which also depends on the type of data (grid-based / array-based).
We should investigate how we can work with large data models where the data is chunked, preferably on disk. We see the benefits of using a package like Dask; packages that integrate with it include numpy, xarray, and uxarray. We are wondering whether a package like uxarray also supports our more complex DISU model (a fully unstructured 3D grid).
We have to consider how we are going to let users provide data to the flopy package. That can be via indexed arrays, but it can also be via a grid mask where NaNs fill the cells for which no data is provided. Depending on how sparse typical data is, we can decide which form of input we want users to provide via the API.
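To make the two input styles concrete (the variable names and dimensions below are illustrative only):

```python
import numpy as np
import xarray as xr

# Style 1: indexed input: only the cells that actually have data are listed.
wells = xr.Dataset(
    {"rate": ("well", [-100.0, -250.0])},
    coords={
        "layer": ("well", [0, 2]),
        "row": ("well", [9, 41]),
        "col": ("well", [6, 12]),
    },
)

# Style 2: grid-masked input: a full (layer, row, col) array, NaN where no data.
rate_grid = xr.DataArray(
    np.full((3, 50, 50), np.nan), dims=("layer", "row", "col")
)
rate_grid[0, 9, 6] = -100.0
rate_grid[2, 41, 12] = -250.0

# For sparse stresses the indexed form stays small; the masked grid scales with
# the full grid size but combines more easily with grid-based operations.
```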
Packages of interest:
XArray seems to support on-disk formats for zarr and netcdf: https://docs.xarray.dev/en/stable/api.html#dataset-methods. Letting users open these data formats would only give us a reference to the data, which we could perhaps also write to if needed. Only at the moment the data needs to go to the MF6 input format would we convert it, and not earlier, making it more efficient.
If MF6 supported netcdf formats directly, no conversion would be needed at all, saving even more time.
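For illustration, opening such a store is only a reference; `open_zarr` is the real xarray call, the file name is hypothetical:

```python
import xarray as xr

# Opening only creates a lazy, dask-backed reference; the data stays in the
# zarr store on disk until it is actually needed.
ds = xr.open_zarr("model_data.zarr")

# Conversion to the MF6 input format would happen only at write time,
# chunk by chunk, and not earlier.
```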