Hello,
Just trying to figure out whether I am using xarray functionality in the most efficient way or hitting some limitation of xarray. I am getting a pretty slow runtime compared to our Matlab prototype code, which runs much faster (minutes in Matlab vs. hours in Python).

We have to apply the following filter, which bins data based on the `dt` variable values and then identifies invalid entries in the `data` variable. It takes about 2.1-4 seconds to process 100 spatial points (`ds.data[:, y, 100]`) with dask in parallel, which is rather slow. At that rate, running the filter on the whole dataset (with dimensions (40981, 834, 834)) would take ~3.5 hours for one data variable, and we need to run it for multiple variables. The Matlab code seems to do the same job much faster and processes the whole dataset in minutes. So I wonder if I am doing something inefficient with xarray here and there is a better, more efficient way to do it, or if I have just hit some inefficiency of xarray. Here is the code snippet that reproduces the problem:
Sequential processing actually seems to run slightly faster than parallel processing with 4 threads on 4 CPUs: I am getting 4.25 s for the sequential run vs. 4.45 s for the parallel run. What could be the reason?
I posted the question on the Pangeo discussion board, but thought I'd try asking at the source, i.e. xarray :) Any help is much appreciated!
Sequential processing:
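In outline, the sequential version does something like the following (a simplified sketch with synthetic stand-in data; the bin edges, the min-count threshold, and the invalidation rule are placeholders for our real values):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the real dataset (which is (40981, 834, 834)):
# a (time, y, x) cube with a per-time "dt" variable used for binning.
rng = np.random.default_rng(0)
nt, ny, nx = 1000, 4, 4
ds = xr.Dataset(
    {
        "data": (("time", "y", "x"), rng.random((nt, ny, nx))),
        "dt": ("time", rng.uniform(0.0, 10.0, nt)),
    }
)
bins = np.arange(0.0, 10.5, 0.5)  # placeholder bin edges
min_count = 30                    # placeholder validity threshold

# Bin one spatial column of "data" by the dt values, then treat entries
# that fall into under-populated bins as invalid.
column = ds.data[:, 0, 0]
groups = column.groupby_bins(ds.dt, bins, right=False, include_lowest=True)
counts = groups.count()
invalid_bins = counts.where(counts < min_count, drop=True)
```

The real filter repeats this per-column step over all 834×834 spatial points, which is where the hours come from.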
Parallel processing:
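The parallel variant wraps the same per-column binning in `dask.delayed` and computes the columns on the threaded scheduler (again a simplified sketch; the 4 workers and the toy count-per-bin filter are placeholders):

```python
import dask
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
nt, ny = 1000, 8
ds = xr.Dataset(
    {
        "data": (("time", "y"), rng.random((nt, ny))),
        "dt": ("time", rng.uniform(0.0, 10.0, nt)),
    }
)
bins = np.arange(0.0, 10.5, 0.5)  # placeholder bin edges

def filter_column(column, dt, bins):
    # Stand-in per-column filter: bin by dt and count entries per bin.
    return column.groupby_bins(dt, bins, right=False, include_lowest=True).count()

# One delayed task per spatial column, computed on 4 threads.
tasks = [dask.delayed(filter_column)(ds.data[:, y], ds.dt, bins) for y in range(ny)]
results = dask.compute(*tasks, scheduler="threads", num_workers=4)
```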
Also, creating the groups with

`groups = data.groupby_bins(ds.dt, bins, right=False, include_lowest=True)`

seems to take ~5 times longer, which I am also curious about. Why does it take so much longer than:
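For a rough sense of scale, binning the same kind of values with plain NumPy (purely an illustrative stand-in, not necessarily the exact variant being compared above) can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = rng.uniform(0.0, 10.0, 40981)  # placeholder dt values
bins = np.arange(0.0, 10.5, 0.5)    # placeholder bin edges: 0.0, 0.5, ..., 10.0

# Assign every dt value to a left-closed bin [edge, next_edge) and count
# bin occupancy directly, without constructing per-group objects.
idx = np.digitize(dt, bins)                        # 1..len(bins)-1 for in-range values
counts = np.bincount(idx, minlength=len(bins) + 1)
```

Part of the gap presumably comes from `groupby_bins` also building pandas interval labels and group objects on top of the raw index assignment, but I'd like to understand it better.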
Here is my xarray.show_versions():