Dimension order impact on simple workflow (dataset -> sel -> to_dataframe) #5579
Hi @theomasson, I'm not that experienced in this area but I'll try to respond briefly. I get the same results as you for the local files. That does suggest the slowness is unrelated to the dimension order. Is there anything else that's different between the files?
Hi everybody, we ran into an issue when using datasets, and I hope this report is helpful. My colleagues and I think it belongs in the bug section rather than the help section. Thank you in advance, and sorry if this is not the right place.
What happened:
We use netcdf4 to store large datasets in a Google bucket. These datasets have at least 3 dimensions (latitude, longitude, time) and sometimes more. The files are often several GB in size, with 3 to 100 variables.
To comply with the netcdf4 recommendation, all our recent files contain variables with the following dimension order: [time, latitude, longitude]
From these netcdf4 files, we usually want to extract a point (in latitude/longitude) for all times and convert it to a dataframe to use it. So the workflow is pretty simple: open the dataset, sel the point, then to_dataframe.
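The three steps above can be sketched roughly as follows. This is only an illustration, not the original code from this post: the variable name `temperature`, the grid, and the point coordinates are made up, and a small in-memory dataset stands in for the file so the snippet runs without a bucket. With a real file, `ds` would come from `xr.open_dataset(...)` instead.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for a file opened with xr.open_dataset(path).
times = pd.date_range("2021-01-01", periods=24, freq="D")
lats = np.linspace(-90.0, 90.0, 19)
lons = np.linspace(-180.0, 170.0, 36)
data = np.random.default_rng(0).random((times.size, lats.size, lons.size))

ds = xr.Dataset(
    {"temperature": (("time", "latitude", "longitude"), data)},
    coords={"time": times, "latitude": lats, "longitude": lons},
)

# Step 2: select one grid point for all times (nearest-neighbour lookup).
point = ds.sel(latitude=45.2, longitude=3.1, method="nearest")

# Step 3: convert the remaining time series to a pandas DataFrame.
df = point.to_dataframe()
print(df.head())
```

With a file-backed dataset the selection is lazy, so the actual read from disk or bucket would typically happen in the `to_dataframe` (or `.values`) step rather than in `sel`.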
When I ran this operation on our file history, the three steps took very little time (3-5 ms) for some of the files and were very slow for the others (60-120 s!). The only difference between these files is that the fast ones have [latitude, longitude, time] as the variable dimension order, while the slow ones have [time, latitude, longitude].
More precisely, when trying to reproduce the problem locally (without the remote bucket), I found a factor of 3 in time between the two options (3-variable dataset, 1.7 GB file), so requesting a point still takes milliseconds. But if I use a slow storage device or the remote bucket again, the time increases drastically. From my observations, it seems that in the second case the to_dataframe step (or a .values call) loads the entire dataset. With remote files it therefore downloads the entire file, which is consistent with my colleagues' timings (on a slow connection my 50 s can easily become 10-15 min).
Tests were made with decode_times=True, decode_cf=True in open_dataset, without improvement. The only thing that actually works is to transpose the dimensions after opening the dataset. (Edit: this actually did not work as expected; I must have missed something.)
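One possible explanation for the "loads the entire dataset" observation is the on-disk layout. The sketch below is plain arithmetic, not a measurement: it assumes a variable stored contiguously in row-major order, without chunking or compression, which may not match how these files were written. The grid sizes and helper names are made up for illustration. It computes which element offsets a single-point time series touches in each dimension order.

```python
def c_strides(shape):
    """Element strides of a row-major (C-ordered) array with this shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def point_series_offsets(dims, sizes, lat_idx, lon_idx):
    """Flat element offsets of one (lat, lon) point for every time step."""
    shape = tuple(sizes[d] for d in dims)
    strides = dict(zip(dims, c_strides(shape)))
    base = lat_idx * strides["latitude"] + lon_idx * strides["longitude"]
    return [base + t * strides["time"] for t in range(sizes["time"])]


# Hypothetical grid: 1000 time steps on a 180 x 360 grid.
sizes = {"time": 1000, "latitude": 180, "longitude": 360}

fast = point_series_offsets(("latitude", "longitude", "time"), sizes, 10, 20)
slow = point_series_offsets(("time", "latitude", "longitude"), sizes, 10, 20)

# Number of elements between the first and last value that must be read.
span_fast = fast[-1] - fast[0] + 1
span_slow = slow[-1] - slow[0] + 1
total = sizes["time"] * sizes["latitude"] * sizes["longitude"]

print("time last :", span_fast, "of", total, "elements")
print("time first:", span_slow, "of", total, "elements")
```

Under these assumptions, the time-last layout keeps the series in one contiguous run of 1000 elements, while the time-first layout spreads it over almost the whole variable. A reader that fetches byte ranges rather than individual values could then end up transferring most of the file. With chunked storage the picture depends on the chunk shape instead.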
What you expected to happen:
The same request time with both dimension orders. Not downloading the whole file to extract one point.
Minimal Complete Verifiable Example:
An MCVE is not easy to provide, as the tests are more relevant with remote storage.
Results for a local file (confirmed with a mean over 100 iterations):
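For anyone who wants to reproduce the per-step timings on their own files, a helper along these lines could be used. It is a hypothetical sketch, not the script behind the numbers in this post: `mean_step_times` is an invented name, and the two lambdas are placeholders so the snippet runs without any data file. In practice they would be replaced by the real `sel` and `to_dataframe` calls.

```python
import time


def mean_step_times(steps, iterations=100):
    """Return the mean wall-clock time in seconds of each named step.

    `steps` is a list of (label, function) pairs. Each function receives
    the result of the previous step (the first one receives None).
    """
    totals = {label: 0.0 for label, _ in steps}
    for _ in range(iterations):
        result = None
        for label, func in steps:
            start = time.perf_counter()
            result = func(result)
            totals[label] += time.perf_counter() - start
    return {label: total / iterations for label, total in totals.items()}


# Placeholder steps; swap in e.g. ds.sel(...) and selected.to_dataframe().
steps = [
    ("sel", lambda _: list(range(1000))),
    ("to dataframe", lambda selected: sum(selected)),
]

timings = mean_step_times(steps, iterations=10)
for label, seconds in timings.items():
    print(label, seconds)
```

Note that repeating the same request on an already opened file may benefit from caching, so a mean over many iterations can look better than a cold first read, especially on remote storage.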
time in third (fast)
sel 0.0072247982025146484
to dataframe 0.0017933845520019531
time in first (slow)
sel 0.007394075393676758
to dataframe 0.0036745071411132812
Results for a bucket file (my download speed is around 30 MB/s):
time in third (fast)
sel 0.5228869915008545
to dataframe 0.45035338401794434
time in first (slow)
sel 0.5486464500427246
to dataframe 54.501604318618774
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.7 (default, Mar 26 2020, 15:48:22)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-77-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.18.0
pandas: 1.2.4
numpy: 1.19.2
scipy: 1.4.1
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.11.0
h5py: 2.10.0
Nio: None
zarr: 2.8.1
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.0.21
cfgrib: 0.9.8.2
iris: None
bottleneck: None
dask: 2021.06.2
distributed: 2021.06.2
matplotlib: 3.1.3
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 52.0.0.post20210125
pip: 21.1.2
conda: None
pytest: 6.2.4
IPython: 7.19.0
sphinx: None