
Commit 472b66f

Micro optimize dataset.isel for speed on large datasets
This targets datasets with many "scalar" variables (that is, variables without any dimensions). These arise when you have many pieces of small metadata describing various facts about an experimental condition. For example, we have about 80 of these in our datasets (and I want to increase that number). Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension in the tens of thousands. However, indexing into the dataset has become quite slow. We therefore often carefully slice out the metadata we need before doing anything with our dataset, but that isn't really possible when you want to orchestrate things from a parent application.

These optimizations are likely "minor", but considering the benchmark results, I think they are quite worthwhile:

* main (as of #9001): 2.5k its/s
* with #9002: 4.2k its/s
* with this pull request (on top of #9002): 6.1k its/s

Thanks for considering.
1 parent 50f8726 commit 472b66f
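The motivating pattern can be sketched in plain Python, independent of xarray internals (the names `indexers`, `variables`, and `payload` below are illustrative, not xarray's API): with many dimensionless "scalar" metadata variables, most iterations of an isel-style loop should do no work at all.

```python
# Illustrative sketch, assuming a dataset-like dict of
# name -> (dims tuple, payload): 80 dimensionless "scalar" metadata
# variables plus one large data variable, mirroring the shape of the
# datasets described above.
indexers = {"time": slice(0, 10)}

variables = {f"meta_{i}": ((), i) for i in range(80)}
variables["signal"] = (("time", "channel"), "big-array")

all_keys = set(indexers)
sliced = {}
for name, (dims, payload) in variables.items():
    if dims:  # fastpath: dimensionless variables skip everything below
        shared = all_keys.intersection(dims)
        if shared:
            # only now build the per-variable indexer dict
            payload = ("sliced", {k: indexers[k] for k in shared})
    sliced[name] = (dims, payload)
```

Here only `signal` pays the cost of building an indexer dict; the 80 scalar variables fall through the `if dims:` check untouched, which is where the speedup for metadata-heavy datasets comes from.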

File tree

1 file changed (+19 −4 lines)

xarray/core/dataset.py

Lines changed: 19 additions & 4 deletions
```diff
@@ -2980,20 +2980,35 @@ def isel(
         coord_names = self._coord_names.copy()

         indexes, index_variables = isel_indexes(self.xindexes, indexers)
+        all_keys = set(indexers.keys())

         for name, var in self._variables.items():
             # preserve variable order
             if name in index_variables:
                 var = index_variables[name]
-            else:
-                var_indexers = {k: v for k, v in indexers.items() if k in var.dims}
-                if var_indexers:
+                dims.update(zip(var.dims, var.shape))
+            # Fastpath, skip all of this for variables with no dimensions
+            # Keep the result cached for future dictionary update
+            elif var_dims := var.dims:
+                # Large datasets with a lot of metadata will have many scalars
+                # without any relevant dimensions for slicing.
+                # Pick those out quickly.
+                # Very likely many variables will not interact with the keys
+                # at all; just avoid iterating through them.
+                var_indexer_keys = all_keys.intersection(var_dims)
+                if var_indexer_keys:
+                    var_indexers = {
+                        k: indexers[k]
+                        for k in var_indexer_keys
+                    }
                     var = var.isel(var_indexers)
                     if drop and var.ndim == 0 and name in coord_names:
                         coord_names.remove(name)
                         continue
+                    # Update after slicing
+                    var_dims = var.dims
+                dims.update(zip(var_dims, var.shape))
             variables[name] = var
-            dims.update(zip(var.dims, var.shape))

         return self._construct_direct(
             variables=variables,
```
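The core micro-optimization above replaces a per-variable dict comprehension over all indexers with a key set precomputed once per call, intersected against each variable's dims. A minimal sketch of why the two paths produce the same indexer dict (the values here are made up for illustration):

```python
# Old path vs. new path on one variable's dims. Both should select
# exactly the indexers whose key is one of the variable's dimensions.
indexers = {"time": slice(5), "channel": slice(2)}
var_dims = ("time", "x", "y")

# old path: scan every (key, value) pair of indexers, per variable
old = {k: v for k, v in indexers.items() if k in var_dims}

# new path: build the key set once, intersect with dims at C speed,
# then look up only the matching keys
all_keys = set(indexers)
new = {k: indexers[k] for k in all_keys.intersection(var_dims)}

assert old == new
```

The win is that `set.intersection` runs in C and, for the many scalar variables described in the commit message, the surrounding `elif var_dims := var.dims:` check means this code is never reached at all.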
