Seeking advice on organizing the data access for a particular type of data #7715

haykh · 2023-04-04T05:48:47Z

Hi, I've been using xarray in conjunction with dask for some time now, and I'm enjoying its intuitive API. So kudos to the dev team and the community for making this package so awesome!

I have a question about a particular data type used in my simulations (I do plasma astrophysics) which I have so far had trouble wrapping nicely in the neat xarray-dataset-style container.

Problem. The data consists of different properties of particles (their x, y, z positions, velocity components, weights, etc.), which are saved for each timestep of my simulation. Each particle has a unique ID, which is the same across timesteps (so the particle can be tracked in time). However, and this is the complicated part, at each timestep particles can appear in or disappear from the dataset (they are either injected into or flow out of my simulation domain), so the number of particles output at each timestep may differ. The particle lifetime is guaranteed to be continuous; in other words, if a particle disappears at time t1, it will not reemerge at t>t1.

Requirements. A few features (access abilities) I would like to implement in the container are the following:

- pick a particle by its index (ID) and load all of its data throughout its lifetime, something similar to ds.sel(index=...);
- pick a given timestep and plot different quantities of the particles tracked at that timestep, like ds.sel(t=...).plot.scatter(...).

Now obviously, these datasets are huge (of the order of 100 GB+), so I cannot load everything into memory at once, so ideally all these operations will be done lazily.

My attempt. When it's the same population of particles across the timesteps (i.e., no injection/deletion), I was able to make a nice xarray dataset with ID and time as dimensions, and that did the job flawlessly. However, in the case when particles can appear and disappear, this makes it quite tricky (maybe there is a way to mask or fill with NaN-s?).

Any thoughts on how I could implement this type of data structure are very welcome! Also, I would appreciate any refs or links where I can dig further.

Below is a code snippet that generates example data in the form of a simple Python dict of numpy arrays.

import numpy as np


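def simulate_particles(Ndot, nt):
    """Inject Ndot particles per step for nt steps, dropping ~10% of existing ones each step."""
    data = {}
    # start from empty arrays so boolean masking works and IDs stay integers
    x, y, vx, vy = (np.empty(0) for _ in range(4))
    ID = np.empty(0, dtype=int)
    cntr = 0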
    for i in range(nt):
        # remove
        if len(x) > 0:
            remove = np.random.rand(len(x)) < 0.1
            x, y, vx, vy, ID = (
                x[~remove],
                y[~remove],
                vx[~remove],
                vy[~remove],
                ID[~remove],
            )

        # add
        x = np.hstack([x, np.random.rand(Ndot)])
        y = np.hstack([y, np.random.rand(Ndot)])
        vx = np.hstack([vx, np.random.randn(Ndot)])
        vy = np.hstack([vy, np.random.randn(Ndot)])
        ID = np.hstack([ID, np.arange(cntr, cntr + Ndot)])
        cntr += Ndot

        # update
        x, y = x + vx, y + vy
        vx, vy = vx + np.random.randn(len(vx)), vy + np.random.randn(len(vy))

        # save
        data[i] = {"x": x, "y": y, "vx": vx, "vy": vy, "ID": ID}

    return data


DATA = simulate_particles(10, 100)

TomNicholas (Maintainer) · 2023-04-07T20:35:47Z

Hi @haykh!

> Now obviously, these datasets are huge (of the order of 100 GB+), so I cannot load everything into memory at once, so ideally all these operations will be done lazily.

If your data can be represented as something that xarray can wrap (e.g. a numpy array), then xarray can handle the lazy indexing. To handle the huge data you will want to chunk it using dask. But this is not a trivial use case you have here, at least not if you want to do it at scale.

> My attempt. When it's the same population of particles across the timesteps (i.e., no injection/deletion), I was able to make a nice xarray dataset with ID and time as dimensions, and that did the job flawlessly. However, in the case when particles can appear and disappear, this makes it quite tricky (maybe there is a way to mask or fill with NaN-s?).

You could totally try representing the times in which particles have disappeared with NaNs. That should work fine for in-memory data, but I wonder what factor overhead that would introduce? i.e. if each of your particles only lives on average for a time 1/10th the length of the whole simulation time, then 90% of your array will be NaNs.
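
To make the NaN-padding idea and its overhead concrete, here is a minimal sketch. It assumes the per-timestep dict layout produced by `simulate_particles` in the original post; a tiny hand-written stand-in called `steps` is used so the block runs on its own, and the helper name `to_padded_dataset` is made up for this illustration. It builds everything in memory, so it shows the layout only, not a way to handle 100 GB+.

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the DATA dict from the original post:
# particle 0 leaves after t=1, particle 2 enters at t=1.
steps = {
    0: {"x": np.array([0.1, 0.2]), "ID": np.array([0, 1])},
    1: {"x": np.array([0.3, 0.4, 0.5]), "ID": np.array([0, 1, 2])},
    2: {"x": np.array([0.6, 0.7]), "ID": np.array([1, 2])},
}


def to_padded_dataset(steps):
    """Build a dense (t, ID) Dataset, filling absent particles with NaN."""
    times = sorted(steps)
    ids = np.unique(np.concatenate([steps[t]["ID"] for t in times])).astype(int)
    names = [k for k in steps[times[0]] if k != "ID"]
    out = {k: np.full((len(times), len(ids)), np.nan) for k in names}
    for i, t in enumerate(times):
        # ids is sorted, so searchsorted maps each particle ID to its column
        cols = np.searchsorted(ids, steps[t]["ID"])
        for k in names:
            out[k][i, cols] = steps[t][k]
    return xr.Dataset(
        {k: (("t", "ID"), v) for k, v in out.items()},
        coords={"t": times, "ID": ids},
    )


ds = to_padded_dataset(steps)
track = ds.sel(ID=2).dropna("t")  # one particle over its lifetime
snapshot = ds.sel(t=1)  # all particles at one timestep
nan_fraction = float(ds["x"].isnull().mean())  # share of padded entries
```

Both access patterns from the original post work on this layout, and `nan_fraction` gives a direct measure of how much of the array is padding for a given dataset.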

The ultimate way to solve your problem would be if xarray could wrap so-called "ragged" arrays. Then you would not be storing any NaNs. Xarray cannot yet do that, but it's something we are looking at supporting. See #4285 for in-depth and ongoing discussions on wrapping ragged arrays in xarray. If you want to get involved there we would welcome your input!
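
Until something like that exists in xarray, the ragged idea can be sketched by hand with plain numpy. The block below is not an xarray feature; it is an illustrative layout (similar in spirit to the contiguous ragged array representation in the CF conventions) in which all observations live in flat arrays sorted by particle, plus an offsets array marking where each track starts. The names `get_track` and `get_snapshot` are hypothetical helpers for this example.

```python
import numpy as np

# Flat "long" records: one entry per (timestep, particle) observation.
t = np.array([0, 0, 1, 1, 1, 2, 2])
pid = np.array([0, 1, 0, 1, 2, 1, 2])
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])

# Sort by particle ID, then time, so each track is one contiguous slice.
order = np.lexsort((t, pid))
t, pid, x = t[order], pid[order], x[order]

# offsets[i]:offsets[i + 1] spans the track of the i-th unique particle
unique_ids, starts = np.unique(pid, return_index=True)
offsets = np.append(starts, len(pid))


def get_track(particle_id):
    """Return (times, x) for one particle without touching other tracks."""
    i = np.searchsorted(unique_ids, particle_id)
    if i == len(unique_ids) or unique_ids[i] != particle_id:
        raise KeyError(particle_id)
    sl = slice(offsets[i], offsets[i + 1])
    return t[sl], x[sl]


def get_snapshot(timestep):
    """Return (ids, x) for all particles present at one timestep."""
    mask = t == timestep
    return pid[mask], x[mask]
```

No NaNs are stored, and a single track is a cheap slice. The trade-off in this sketch is that a snapshot scans the whole time column; keeping a second copy sorted by time would be one way around that.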

(Also I notice we're both at Columbia, and I used to do plasma physics too! We're keen to see xarray used in a wider range of fields (see also xarray-contrib/xarray.dev#272), so if you wanted to chat about xarray / open source in astrophysics at some point then let me know)

haykh (Author) · 2023-04-09T01:25:12Z

Hi @TomNicholas! Thanks for the response :)

> If your data can be represented as something that xarray can wrap (e.g. a numpy array), then xarray can handle the lazy indexing. To handle the huge data you will want to chunk it using dask. But this is not a trivial use case you have here, at least not if you want to do it at scale.

Yes, that's what I'm currently doing, and for data such as gridded fields (over many timesteps) it works flawlessly.

> You could totally try representing the times in which particles have disappeared with NaNs. That should work fine for in-memory data, but I wonder what factor overhead that would introduce? i.e. if each of your particles only lives on average for a time 1/10th the length of the whole simulation time, then 90% of your array will be NaNs.

Yeah, I think this is not really viable, because most of the data will end up being NaN.

> The ultimate way to solve your problem would be if xarray could wrap so-called "ragged" arrays. Then you would not be storing any NaNs. Xarray cannot yet do that, but it's something we are looking at supporting. #4285. If you want to get involved there we would welcome your input!

Indeed, I think "ragged arrays" was the keyword I've been looking for. Will have a look at the thread, and see if I can help with anything.

> (Also I notice we're both at Columbia, and I used to do plasma physics too! We're keen to see xarray used in a wider range of fields (see also xarray-contrib/xarray.dev#272), so if you wanted to chat about xarray / open source in astrophysics at some point then let me know)

Oh wow, I'll be happy to meet and talk; sent you an email )

dcherian (Maintainer) · 2023-04-10T14:45:28Z

Another alternative is sparse.
