Big data #49
Replies: 9 comments 16 replies
-
We were discussing how xarray and cattrs would work together, as in PR #62. cattrs always returns some form of dictionary, which can then be passed to packages like pyyaml or tomlkit; it also handles the conversion of datetimes and encodings. But cattrs is not made for file I/O: converting everything into one big string would pull all the data into memory at once. It would be better to use something like jinja for the file writing and pass it the actual Python instance of the simulation / model / package. We can write filter functions that handle the dask chunking while writing large datasets from a netcdf file / lazy dask array. This is something we still need to prototype.
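To make that concrete, a minimal sketch of the jinja + dask idea could look like the following. The filter name (`chunked_ascii`), the template layout, and the file names are hypothetical, not an existing API:

```python
import io

import dask.array as da
import jinja2
import numpy as np


def chunked_ascii(array):
    """Yield formatted text one dask chunk at a time, never one big string."""
    for block in array.to_delayed().ravel():
        buffer = io.StringIO()
        np.savetxt(buffer, np.atleast_2d(block.compute()), fmt="%.6e")
        yield buffer.getvalue()


env = jinja2.Environment()
env.filters["chunked_ascii"] = chunked_ascii

# Hypothetical template for a block of griddata; the real MF6 layout differs.
template = env.from_string(
    "BEGIN griddata\n"
    "{% for block in array | chunked_ascii %}{{ block }}{% endfor %}"
    "END griddata\n"
)

lazy = da.random.random((100_000, 10), chunks=(10_000, 10))
with open("griddata.dat", "w") as f:
    for piece in template.generate(array=lazy):  # stream instead of render()
        f.write(piece)
```

Because `Template.generate()` streams the rendered output, only one chunk is ever materialized in memory at a time.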
-
I have worked on profiling different methods of writing ascii input files to disk, to find out which is fastest without using too much memory. I came up with the following functions:
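(The functions themselves were not preserved in this thread. Purely as an illustration, a comparison along these lines could look like the sketch below; the xarray-extras `to_csv` import path and signature should be checked against its documentation.)

```python
import time

import numpy as np
import xarray as xr
from xarray_extras.csv import to_csv  # assumed import path; verify against docs


def write_numpy(data, path):
    """Plain numpy ascii writer: simple, single-threaded."""
    np.savetxt(path, data, fmt="%.6e")


def write_xarray_extras(data, path):
    """xarray-extras CSV writer: C-accelerated according to its docs."""
    array = xr.DataArray(
        data,
        dims=("row", "col"),
        coords={"row": np.arange(data.shape[0]), "col": np.arange(data.shape[1])},
    )
    to_csv(array, path)


def profile(writer, data, path):
    start = time.perf_counter()
    writer(data, path)
    return time.perf_counter() - start


data = np.random.rand(1_000_000, 5)
for writer in (write_numpy, write_xarray_extras):
    print(f"{writer.__name__}: {profile(writer, data, 'out.dat'):.2f} s")
```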
Here are the results:
We can see that xarray-extras uses around twice as much memory at peak for larger datasets, but it is also 10x faster. I am also running tests for 1 billion points; more results later.
-
Maybe we could set a default that writes data to binary files when it exceeds a certain size, and provide an override option for users who want to fill their hard drive with ascii data.
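A minimal sketch of what such a default with an override could look like; the helper name `write_array` and the threshold value are hypothetical:

```python
import numpy as np

# Assumed default threshold; the actual value would need benchmarking.
BINARY_THRESHOLD_BYTES = 50 * 1024**2  # 50 MB


def write_array(data, path, as_binary=None):
    """Write data as ascii or binary, deciding by size unless overridden."""
    if as_binary is None:
        as_binary = data.nbytes > BINARY_THRESHOLD_BYTES
    if as_binary:
        data.astype(np.float64).tofile(path + ".bin")  # flat binary dump
        return path + ".bin"
    np.savetxt(path + ".dat", data, fmt="%.6e")  # human-readable ascii
    return path + ".dat"


# Users who insist on ascii can override the size-based default:
write_array(np.random.rand(4000, 4000), "riv_stage", as_binary=False)
```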
-
We can conclude that it is possible to build larger models by keeping data on disk as long as possible, meaning we want to delay the calculations until the moment of writing the MODFLOW input files. We can achieve this via dask arrays that are provided to components, instead of in-memory numpy arrays. We have done some tests on different writing methods and see that there are external libraries that can help here; we also see that large chunks of data can be written more efficiently with binary writers. We still have to decide where to draw the boundary for switching between binary and ascii files.
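A rough sketch of that workflow, with hypothetical file and variable names: data is opened lazily, stays lazy through any arithmetic, and is only computed chunk by chunk when an output file is written (a plain `.npy` memmap stands in for a real MF6 binary file here):

```python
import dask.array as da
import numpy as np
import xarray as xr

# Open lazily: with chunks= the variables are dask arrays backed by the file.
ds = xr.open_dataset("model_data.nc", chunks={"layer": 1})
kh = ds["kh"]                          # nothing loaded yet
transmissivity = kh * ds["thickness"]  # still lazy

# Binary route: stream chunk by chunk into a memory-mapped file on disk.
target = np.lib.format.open_memmap(
    "npf.bin.npy", mode="w+", dtype="float64", shape=transmissivity.shape
)
da.store(transmissivity.data, target)  # computes and writes one chunk at a time
```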
-
One of the bigger questions we still have is how we are going to handle large datasets, which also depends on the type of data (grid-based / array-based).
We should investigate how we can work with large data models where the data is chunked, preferably on disk. We see the benefits of using a package like Dask; packages that integrate with it include numpy, xarray, and uxarray. We are wondering whether a package like uxarray also supports our more complex DISU model (a fully unstructured 3D grid).
We have to consider how we are going to let users provide data to the flopy package. That can be via indexed arrays, but it can also be via a grid mask where NaNs fill the cells for which no data is provided. Depending on how sparse typical data is, we can decide which form of input we want users to provide via the API.
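To make the two input styles concrete (the variable names and dimensions below are illustrative only):

```python
import numpy as np
import xarray as xr

# Style 1: indexed input: only the cells that actually have data are listed.
wells = xr.Dataset(
    {"rate": ("well", [-100.0, -250.0])},
    coords={
        "layer": ("well", [0, 2]),
        "row": ("well", [9, 41]),
        "col": ("well", [6, 12]),
    },
)

# Style 2: grid-masked input: a full (layer, row, col) array, NaN where no data.
rate_grid = xr.DataArray(
    np.full((3, 50, 50), np.nan), dims=("layer", "row", "col")
)
rate_grid[0, 9, 6] = -100.0
rate_grid[2, 41, 12] = -250.0

# For sparse stresses the indexed form stays small; the masked grid scales with
# the full grid size but combines more easily with grid-based operations.
```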
Packages of interest:
XArray seems to support on-disk formats for zarr and netcdf: https://docs.xarray.dev/en/stable/api.html#dataset-methods. Letting users open these data formats would only give us a reference to the data, which we could perhaps also write to if needed. Only at the moment the data needs to go to the MF6 input format would we convert it, and not earlier, making it more efficient.
If MF6 supported netcdf formats directly, no conversion would be needed at all, saving even more time.
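For illustration, opening such a store is only a reference; `open_zarr` is the real xarray call, the file name is hypothetical:

```python
import xarray as xr

# Opening only creates a lazy, dask-backed reference; the data stays in the
# zarr store on disk until it is actually needed.
ds = xr.open_zarr("model_data.zarr")

# Conversion to the MF6 input format would happen only at write time,
# chunk by chunk, and not earlier.
```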