-
Hi all. I'm trying to run an xarray computation, and I can't figure out how to do it in a way that doesn't blow up my memory usage.

>> foo
<xarray.DataArray 'foo' (time: 125560, lat: 192, lon: 288)>
dask.array<open_dataset-6cd755527450782900ac724b8dfc6443foo, shape=(125560, 192, 288), dtype=float32, chunksize=(496, 192, 288), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0
* lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
* time (time) datetime64[ns] 2015-01-01 ... 2100-12-31T18:00:00
>> time
<xarray.DataArray 'time' (date: 31390, lat: 192, lon: 288)>
dask.array<open_dataset-ec860d52be70a7aa840f50a6c063d53ctime, shape=(31390, 192, 288), dtype=datetime64[ns], chunksize=(1962, 12, 36), chunktype=numpy.ndarray>
Coordinates:
* date (date) datetime64[ns] 2015-01-01 2015-01-02 ... 2100-12-31
* lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0
* lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8

I'd like to run a vectorized selection of foo at the per-gridpoint timestamps in time. I have a dask cluster available, but when I run this under the distributed scheduler, xarray seems to want to load everything into memory eagerly.
I don't understand this behaviour: neither the eager evaluation, nor why xarray feels like it needs to gather the entire indexer into memory. Is there a way to run this indexing in a parallelised fashion? A small self-contained sketch of what I mean is below. Thanks!
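Here is that sketch, with tiny synthetic stand-ins for foo and time (the exact call in my script is a vectorized .sel along time; the synthetic data and the precise method call are just for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-ins for `foo` and `time`: same dimension layout as the
# reprs above, just much smaller, with `foo` chunked via dask.
timestamps = pd.date_range("2015-01-01", periods=8, freq="6h")
lat = np.linspace(-90, 90, 4)
lon = np.arange(0, 360, 60.0)

foo = xr.DataArray(
    np.random.rand(len(timestamps), len(lat), len(lon)).astype("float32"),
    coords={"time": timestamps, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="foo",
).chunk({"time": 4})

# One timestamp to pick out of `foo` for every (date, lat, lon) point.
dates = pd.date_range("2015-01-01", periods=2, freq="D")
time = xr.DataArray(
    np.broadcast_to(
        dates.values[:, None, None], (len(dates), len(lat), len(lon))
    ).copy(),
    coords={"date": dates, "lat": lat, "lon": lon},
    dims=("date", "lat", "lon"),
    name="time",
)

# Vectorized (pointwise) label-based indexing: the result has dims
# (date, lat, lon). In my real case `time` is itself dask-backed, and
# this is the step that blows up memory under the distributed scheduler.
result = foo.sel(time=time)
```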
-
This is not supported by dask. On main I get the nice error:

The indexer needs to be in memory for vectorized indexing.
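In practice that means something like this (a sketch, assuming the computed indexer fits in memory):

```python
# Load the indexer into memory first; `foo` can stay dask-backed, and
# the vectorized selection is then handled by dask, so the data itself
# is still indexed lazily, chunk by chunk.
time_in_memory = time.compute()   # or time.load()
result = foo.sel(time=time_in_memory)
```

If the computed indexer is itself too large to hold at once, doing the selection in blocks along date (compute a slice of time, select, write out, repeat) keeps the peak memory bounded.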