Retrieving data to local storage without loading it into memory #7371

observingClouds · 2022-12-09T00:00:40Z

observingClouds
Dec 9, 2022

Hi there 👋,
I was wondering if there is a way to cache data locally without loading it directly into memory as methods like load, persist or compute do.

Why would I like to know this?

There are two cases I can think of:

- High-latency filesystems, where retrieving data takes a long time and the compute resources needed for the retrieval may differ from those needed for the actual computation. One example of such a filesystem is a tape archive.
- Caching data for later use, when there is no internet access and the dataset cannot be loaded on the fly.

Workflow that achieves a similar goal, but is not quite what I want

import xarray as xr

# Load the tutorial dataset and write the first 100 time steps to a local Zarr store
ds = xr.tutorial.load_dataset("air_temperature")
ds.isel(time=slice(0, 100)).to_zarr("~/cache_folder/air_temperature.zarr")

This example has the following issues:

- Extending the dataset is not straightforward if different time slices should be cached at a later time.
- The format of the data can change.
- The dataset is no longer connected with the original dataset, potentially losing its reference.

What I would like to have

url = "https://huggingface.co/datasets/openclimatefix/era5-reanalysis/resolve/main/data/surface/2022/01/20220101.zarr.zip"
url_chain = f"simplecache::zip:///::{url}"
ds = xr.open_dataset(url_chain, engine="zarr", chunks={})
data_of_interest = ds.isel(time=0).z
data_of_interest.retrieve()  # anticipated function that retrieves the data from the url chain, but does not load it into memory.
data_of_interest.mean().compute()  # would work as usual, but because all data is already in the local cache, the computation works without internet access

I am curious to hear whether there are already solutions or hacks for this, or in which direction I should look.

dcherian · 2022-12-12T15:54:13Z

dcherian
Dec 12, 2022
Maintainer

pinging @martindurant @andersy005 for ideas

0 replies

martindurant · 2022-12-12T16:38:34Z

martindurant
Dec 12, 2022

What is the problem with

data_of_interest.mean().compute()

?
Is it that you want a check on the original data to see whether it has changed? There are some options to "simplecache", particularly expiry_time, that can force a refetch. We haven't implemented a way to check the original for a change, but this is planned.

1 reply

observingClouds Dec 12, 2022
Author

The issue is that data retrievals from, e.g., a tape archive take a lot of time but do not need a lot of compute resources. .mean().compute(), on the other hand, can require a lot of resources, which would be mostly idle during the tape retrieval (and, in an HPC environment, unavailable to other users). Separating the retrieval of data from the computation/loading into memory would make it possible to allocate only the necessary resources for each task.

martindurant · 2022-12-12T18:29:25Z

martindurant
Dec 12, 2022

I will think about it. I don't think there's a way to access the data without using dask worker threads if you are working via the xarray interface. You could, however, use larger dask partitions, so that each task at least waits on more chunks at once, but this will transiently take up memory, whatever you then do with that partition (e.g., your mean call).

Using the filesystem interface directly, you have some more options. You could choose to .get() files to a local copy explicitly. Alternatively, if you wish to use the cache URL in the future and want the same filename hashing, you can make exactly the same filesystem and get the first byte of each file to trigger the download.

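import fsspec

fs, _ = fsspec.core.url_to_fs(url_chain)  # same chained URL as above, so the cache filenames match
allfiles = fs.find("")  # every file inside the archive
fs.cat_ranges(allfiles, [0] * len(allfiles), [1] * len(allfiles))  # reading one byte per file triggers the download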

This should work; better still, break the file list into batches.
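The batching suggested here could look roughly like the sketch below. The names batches, warm_cache and batch_size are made up for illustration and are not part of fsspec; fs is assumed to be the filesystem built from the chained cache URL, and the only fsspec call used is cat_ranges with explicit start and end lists.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def warm_cache(fs, paths, batch_size=50):
    """Read the first byte of every path, one batch at a time.

    On a caching filesystem this should trigger the download of each file
    into the local cache. The returned bytes are discarded, so nothing
    beyond one byte per file is kept in memory.
    """
    touched = 0
    for batch in batches(paths, batch_size):
        # One start offset and one end offset per path in the batch
        fs.cat_ranges(batch, [0] * len(batch), [1] * len(batch))
        touched += len(batch)
    return touched
```

It would be called as warm_cache(fs, fs.find("")). A sensible batch_size depends on the backend, so the default of 50 is only a placeholder.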

Also, I notice that the data of the one file I checked is uncompressed so this might be of interest.

1 reply

martindurant Dec 12, 2022

(I mean, zarr compressed it, not ZIP)

observingClouds · 2022-12-13T07:21:04Z

observingClouds
Dec 13, 2022
Author

Thanks for sharing your thoughts, @martindurant. I appreciate it. The get method is basically what I would like to call from within xarray, to make it convenient to retrieve only the chunks of interest. The mentioned file is just an example, which I would access via kerchunk/references, as you suggested, as long as it is stored in the cloud as it currently is. Imagine, however, that this file and several others of its kind were on a tape archive and formed a combined dataset. In that case I would need to retrieve an entire zip file to a local folder or cache before it could be accessed in any way. Because this process takes a lot of time but few resources, I would like to start it before loading the data into memory.

Is there a way to see, from within xarray, which byte ranges and files each Dask chunk belongs to? This might allow me to write an accessor that retrieves data chunks without loading them into memory.
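For a Zarr array with regular chunks, the file side of this question can be worked out from the chunk shape alone. The sketch below is not an xarray or zarr API: chunk_keys is a hypothetical helper, and it assumes the Zarr v2 key layout (array path, a slash, then chunk indices joined by a "." separator) and a non-empty selection.

```python
from itertools import product


def chunk_keys(array_path, chunk_shape, selection, separator="."):
    """Return the Zarr v2 chunk keys overlapped by `selection`.

    `selection` holds one (start, stop) pair of element indices per
    dimension, with `stop` exclusive, like a Python slice.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size          # chunk containing the first element
        last = (stop - 1) // size      # chunk containing the last element
        per_dim.append(range(first, last + 1))
    return [
        f"{array_path}/{separator.join(str(i) for i in index)}"
        for index in product(*per_dim)
    ]
```

The resulting keys could then be handed to a filesystem call such as cat_ranges to fetch only those files. Mapping keys to byte ranges inside a ZIP archive is a separate step that this sketch does not cover.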

1 reply

martindurant Dec 14, 2022

The function for dereferencing into a ZIP archive from kerchunk does basically exactly what you want, but it is a bit specific to kerchunk's use. You could make a kerchunk reference set of your zarr-in-ZIP for this purpose, or just rip a little of the code out of there to find the specific byte ranges of each of the contained files. I don't know of a way to figure out which chunks/keys xarray will be accessing. You could probably make a fake store that returns nothing (i.e., it raises KeyError for all data keys, so all values will appear to be NaN) but records which keys it tried to load. This is all a bit manual, but if you come up with a workflow that's useful to you, it might be generalisable for others.
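A minimal sketch of such a fake store is shown below. It is an illustration, not an existing zarr or xarray class: RecordingStore, requested and META_SUFFIXES are made-up names, and it assumes a Zarr v2 style store, i.e. a mapping from string keys to bytes in which metadata keys end in .zarray, .zattrs, .zgroup or .zmetadata.

```python
from collections.abc import MutableMapping

# Zarr v2 metadata keys; everything else is treated as a data chunk
META_SUFFIXES = (".zarray", ".zattrs", ".zgroup", ".zmetadata")


class RecordingStore(MutableMapping):
    """Wrap a store, serve its metadata, and pretend every chunk is missing."""

    def __init__(self, store):
        self.store = store
        self.requested = []  # data keys that a reader tried to load

    def __getitem__(self, key):
        if key.endswith(META_SUFFIXES):
            return self.store[key]
        self.requested.append(key)
        raise KeyError(key)  # the reader falls back to the fill value

    def __contains__(self, key):
        return key in self.store  # membership checks are not recorded

    def __iter__(self):
        return iter(self.store)

    def __len__(self):
        return len(self.store)

    def __setitem__(self, key, value):
        raise TypeError("RecordingStore is read-only")

    def __delitem__(self, key):
        raise TypeError("RecordingStore is read-only")
```

Wrapping the real store mapping in RecordingStore and running the intended selection once should, in principle, leave the chunk keys of interest in requested. Whether xarray accepts such a mapping directly depends on the installed zarr version, so treat that step as an assumption to verify.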
