Replies: 3 comments 5 replies
-
I like the idea. Until we have chunking info available as metadata, we should allow users to override our percentage-based calculation with a value of their choosing, in case they want to optimize for their particular dataset.
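A minimal sketch of what that override could look like; the function name, parameter names, and the 5% default are hypothetical, not existing earthaccess API:

```python
from typing import Optional

def resolve_block_size(
    granule_size: int,
    user_block_size: Optional[int] = None,
    percentage: float = 0.05,  # hypothetical default: 5% of the granule
) -> int:
    """Return the fsspec cache block size (bytes) for a granule."""
    if user_block_size is not None:
        # an explicit user value always wins over the percentage heuristic
        return user_block_size
    return max(1024, int(granule_size * percentage))
```

A user optimizing for their dataset could then pass e.g. `resolve_block_size(size, user_block_size=4 * 1024**2)` to force 4 MiB blocks.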
-
A dev note: I'm running into an interesting behavior with xarray and dask distributed; a recently merged PR may be related:

```python
import earthaccess
import xarray as xr

auth = earthaccess.login()
results = earthaccess.search_data(short_name="MUR25-JPL-L4-GLOB-v04.2", count=2)

# I'm testing it with fileset = earthaccess.open(results, smart_open=True),
# but this code is not in earthaccess yet
fileset = earthaccess.open(results)

ds = xr.open_dataset(fileset[0], engine="h5netcdf")

# now we can inspect fileset[0] to track the IO and caching stats
fileset[0].cache
```

We will get an output like:

```
<BlockCache:
    block size  : 102400
    block count : 19
    file size   : 1878361
    cache hits  : 75
    cache misses: 1
    total requested bytes: 102400>
```

However, if we run:

```python
# this should use a Dask cluster if we have one
ds = xr.open_mfdataset(
    fileset,
    engine="h5netcdf",
    compat="override",
    coords="minimal",
    parallel=True,
)
```

and now inspect `fileset[0].cache`, we get:

```
<BlockCache:
    block size  : 102400
    block count : 19
    file size   : 1895805
    cache hits  : 0
    cache misses: 0
    total requested bytes: 0>
```

meaning our actual file-like object hasn't been used internally by xarray/dask. I'll keep debugging this and will open an issue/discussion in xarray when I have more information on what's happening. Maybe @dcherian has a better idea of what may be happening here.
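One way to narrow this down, assuming the bypass is tied to `parallel=True` (dask serializes the open calls and may re-open the files on workers), is to compare cache stats with parallel opening turned off; this is a debugging sketch, not a confirmed diagnosis:

```python
# same call as above, but without dask.delayed opens on workers
ds = xr.open_mfdataset(
    fileset,
    engine="h5netcdf",
    compat="override",
    coords="minimal",
    parallel=False,
)
print(fileset[0].cache)  # non-zero hits/misses here would implicate parallel=True
```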
-
cc @kmuehlbauer, it might be interesting to revive the
-
One of the core features in `earthaccess` is accessing remote files without having to download them, using `earthaccess.open()`. Under the hood we are using fsspec. The default cache is called read-ahead; it works fine for text files (e.g. when we read contiguous lines of text), but read-ahead is very inefficient for scientific data (HDF/NetCDF).

What are the alternatives? Fortunately fsspec has different caching strategies, and two of them seem like better options for us in the short term: `blockcache` and `first`. A third caching implementation (`KnownPartsOfAFile`) could improve this even further down the road; it needs a file's internal byte layout up front, which we could get from `.dmrpp` sidecar files if they are available, or by opening a file and inspecting its structure (very inefficient). If we write a parser for `.dmrpp`, or have this information available at the metadata level, earthaccess could improve what it caches in a very efficient way.

I've tested the first two implementations with data from several missions, and the improvements in access times and data transfers are very promising, in some cases an order of magnitude.
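For reference, a minimal sketch of selecting one of these strategies directly through fsspec; it assumes earthaccess's `get_fsspec_https_session()` helper for the authenticated filesystem, and the URL and the 10% block size are placeholders:

```python
import earthaccess

earthaccess.login()
fs = earthaccess.get_fsspec_https_session()  # authenticated fsspec filesystem

url = "https://data.example.nasa.gov/granule.nc"  # placeholder granule URL
granule_size = fs.size(url)

f = fs.open(
    url,
    mode="rb",
    cache_type="blockcache",              # or "first"; names from fsspec.caching
    block_size=int(granule_size * 0.10),  # e.g. 10% of the granule size
)
print(f.cache)  # the same BlockCache stats shown in the reply above
```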
Proposal: earthaccess should use one of these two caching strategies depending on the file type. We should set the cache size to a percentage of the granule size, and down the road we can work on further optimizations if relevant information about the chunking of a given dataset becomes available to us via CMR/STAC.
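A sketch of the per-file-type dispatch this implies; the extension list and the mapping are assumptions, not a spec:

```python
def pick_cache_type(filename: str) -> str:
    """Choose an fsspec cache strategy from the file type (hypothetical)."""
    scientific = (".nc", ".nc4", ".h5", ".hdf", ".hdf5", ".he5")
    if filename.lower().endswith(scientific):
        return "blockcache"  # chunked binary data: scattered random reads
    return "readahead"       # text-like data: contiguous reads
```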
To avoid complications with changing or breaking the API, we can start prototyping this in a top-level `smart_open()` method, as we talked about with @itcarroll and @chuckwondo.
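A possible shape for that prototype; the signature and defaults are hypothetical, and everything except `earthaccess.open()` itself is an assumption:

```python
import earthaccess

def smart_open(granules, cache_type=None, block_size=None, percentage=0.10):
    """Prototype wrapper around earthaccess.open() with tunable fsspec caching.

    When cache_type/block_size are not given they would be derived per file,
    e.g. blockcache sized to a percentage of the granule size.
    """
    # placeholder: a real implementation would thread these options into the
    # fsspec open() calls that earthaccess performs internally
    return earthaccess.open(granules)
```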