Replies: 3 comments
-
Hi @haykh!
If your data can be represented as something that xarray can wrap (e.g. a numpy array), then xarray can handle the lazy indexing. To handle the huge data you will want to chunk it using dask. But this is not a trivial use case you have here, at least not if you want to do it at scale.
You could totally try representing the times at which particles have disappeared with NaNs. That should work fine for in-memory data, but I wonder how much overhead that would introduce? For example, if each of your particles lives on average for only 1/10th of the whole simulation time, then 90% of your array will be NaNs.
The ultimate way to solve your problem would be if xarray could wrap so-called "ragged" arrays. Then you would not be storing any NaNs. Xarray cannot yet do that, but it's something we are looking at supporting. See here for in-depth and ongoing discussions on wrapping ragged arrays in xarray. If you want to get involved there we would welcome your input!
(Also I notice we're both at Columbia, and I used to do plasma physics too! We're keen to see xarray used in a wider range of fields (see also xarray-contrib/xarray.dev#272), so if you wanted to chat about xarray / open source in astrophysics at some point then let me know.)
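To make the trade-off between NaN padding and a ragged layout a bit more concrete, here is a minimal numpy-only sketch. It is not an xarray API: the toy lifetimes, the array names, and the helpers `particle_track` and `snapshot` are all assumptions made up for illustration. It stores the same data once as a NaN-padded (particle, time) array and once as a "contiguous ragged" layout (flat values plus per-particle offsets), and shows both access patterns on the ragged form.

```python
import numpy as np

# Hypothetical toy data: 4 particles over 10 timesteps, each alive for one
# contiguous window [start, stop) of timesteps.
n_steps = 10
ids = np.array([0, 1, 2, 3])
start = np.array([0, 2, 5, 7])
stop = np.array([3, 6, 10, 8])
lengths = stop - start

# Dense layout: a (particle, time) array padded with NaN where a particle
# does not exist.
dense_x = np.full((ids.size, n_steps), np.nan)
for row, (t0, t1) in enumerate(zip(start, stop)):
    # Fake positions, chosen only so that values are easy to recognise.
    dense_x[row, t0:t1] = np.arange(t0, t1, dtype=float) + 100.0 * row

nan_fraction = np.isnan(dense_x).mean()  # 27 of the 40 cells are padding

# Contiguous ragged layout: all samples in one flat array, plus offsets that
# mark where each particle's track begins and ends.
offsets = np.concatenate(([0], np.cumsum(lengths)))
flat_t = np.concatenate([np.arange(t0, t1) for t0, t1 in zip(start, stop)])
flat_id = np.repeat(ids, lengths)
flat_x = dense_x[~np.isnan(dense_x)]  # row-major order matches particle order


def particle_track(row):
    """Return (times, x) for one particle by slicing between its offsets."""
    sl = slice(offsets[row], offsets[row + 1])
    return flat_t[sl], flat_x[sl]


def snapshot(t):
    """Return (ids, x) of all particles present at timestep t."""
    mask = flat_t == t
    return flat_id[mask], flat_x[mask]


print(f"NaN fraction in dense layout: {nan_fraction:.3f}")
print("track of particle 1:", particle_track(1))
print("snapshot at t=5:", snapshot(5))
```

In this layout the per-particle lookup is a cheap slice, while the per-timestep lookup scans the flat time array; a real implementation at the 100 GB scale would need an index or a time-sorted copy for the second access pattern.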
-
Hi @TomNicholas! Thanks for the response :)
Yes, that's what I'm currently doing, and for data types such as gridded fields (over many timesteps) that works flawlessly.
Yeah, I think this is not really viable, because most of the data will end up being NaN.
Indeed, I think "ragged arrays" was the keyword I've been looking for. Will have a look at the thread, and see if I can help with anything.
Oh wow, I'll be happy to meet and talk; sent you an email )
-
Another alternative is `sparse`.
-
Hi, I've been using `xarray` in conjunction with `dask` for some time now, and I'm enjoying its intuitive API. So kudos to the dev team and the community for making this package so awesome!
I have a question about a particular data type used in my simulations (I do plasma astrophysics), which I have so far had trouble wrapping nicely in a neat `xarray`-dataset-style container.
Problem. The data consists of different properties of particles (their `x`, `y`, `z` positions, velocity components, weights, etc.), which are saved for each timestep of my simulation. Each particle has a unique ID, which is the same across timesteps (so the particle can be tracked in time). However, and this is the complicated part, at each timestep particles can appear in or disappear from the dataset (they are either injected into or flow out of my simulation domain), and so the number of particles output at each timestep may differ. The particle lifetime is guaranteed to be continuous; in other words, if a particle disappears at time `t1`, it will not reemerge at `t > t1`.
Requirements. A few features (access abilities) I would like to implement in the container are the following:
- pick a particle by its index (ID) and load all of its data throughout its lifetime; something similar to `ds.sel(index=...)`;
- pick a given timestep, and plot different quantities of particles tracked at that particular timestep; like `ds.sel(t=...).plot.scatter(...)`.
Now obviously, these datasets are huge (of the order of 100 GB+), so I cannot load everything into memory at once; ideally, all these operations would be done lazily.
My attempt. When it's the same population of particles across the timesteps (i.e., no injection/deletion), I was able to make a nice `xarray` dataset with `ID` and `time` as dimensions, and that did the job flawlessly. However, when particles can appear and disappear, this becomes quite tricky (maybe there is a way to mask or fill with NaNs?).
Any thoughts on how I could implement this type of data structure are very welcome! Also, I would appreciate any refs or links where I can dig further.
Below is a code snippet that generates an example dataset in the form of a simple Python `dict` with `numpy` arrays.
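For illustration, a minimal sketch of what such example data could look like is given here. Everything about it is an assumption rather than the original snippet: the dict is keyed by timestep, each entry holds `id`, `x`, `y`, `z` arrays for the particles present at that step, and the sizes (`n_steps`, `n_total`) are arbitrary. Each particle gets one contiguous lifetime window, matching the continuity guarantee described above.

```python
import numpy as np

rng = np.random.default_rng(42)

n_steps = 20   # hypothetical number of output timesteps
n_total = 50   # hypothetical number of particles ever injected

# Each particle lives over one contiguous window [t_in, t_out).
t_in = rng.integers(0, n_steps, size=n_total)
t_out = np.minimum(t_in + rng.integers(1, n_steps + 1, size=n_total), n_steps)

# One entry per timestep; array lengths differ from step to step because
# particles are injected and removed.
data = {}
for t in range(n_steps):
    alive = np.flatnonzero((t_in <= t) & (t < t_out))  # IDs present at step t
    data[t] = {
        "id": alive,
        "x": rng.random(alive.size),
        "y": rng.random(alive.size),
        "z": rng.random(alive.size),
    }

for t in (0, n_steps // 2, n_steps - 1):
    print(f"t={t}: {data[t]['id'].size} particles")
```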