Replies: 3 comments 5 replies
-
I like the idea. Until we have chunking info available as metadata, we should allow users to override our percentage-based calculation with a value of their choosing, in case they want to optimize for their particular dataset.
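A minimal sketch of what that override could look like; the function name, parameter names, and the 5% default are hypothetical, not existing earthaccess API:

```python
from typing import Optional

def resolve_block_size(
    granule_size: int,
    user_block_size: Optional[int] = None,
    percentage: float = 0.05,  # hypothetical default: 5% of the granule
) -> int:
    """Return the fsspec cache block size (bytes) for a granule."""
    if user_block_size is not None:
        # an explicit user value always wins over the percentage heuristic
        return user_block_size
    return max(1024, int(granule_size * percentage))
```

A user optimizing for their dataset could then pass e.g. `resolve_block_size(size, user_block_size=4 * 1024**2)` to force 4 MiB blocks.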
-
A dev note: I'm running into an interesting behavior with xarray and dask distributed; a recently merged PR may be related:

```python
import earthaccess
import xarray as xr

auth = earthaccess.login()
results = earthaccess.search_data(short_name="MUR25-JPL-L4-GLOB-v04.2", count=2)

# I'm testing it with fileset = earthaccess.open(results, smart_open=True),
# but this code is not in earthaccess yet
fileset = earthaccess.open(results)

ds = xr.open_dataset(fileset[0], engine="h5netcdf")

# now we can inspect fileset[0] to track the IO and caching stats
fileset[0].cache
```

We will get an output like:

```
<BlockCache:
    block size  : 102400
    block count : 19
    file size   : 1878361
    cache hits  : 75
    cache misses: 1
    total requested bytes: 102400>
```

However, if we run:

```python
# this should use a Dask cluster if we have one
ds = xr.open_mfdataset(
    fileset,
    engine="h5netcdf",
    compat="override",
    coords="minimal",
    parallel=True,
)
```

and now inspect `fileset[0].cache`, we get:

```
<BlockCache:
    block size  : 102400
    block count : 19
    file size   : 1895805
    cache hits  : 0
    cache misses: 0
    total requested bytes: 0>
```

meaning our actual file-like object hasn't been used internally by xarray/dask. I'll keep debugging this and will open an issue/discussion in xarray when I have more information on what's happening. Maybe @dcherian has a better idea of what may be happening here.
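One way to narrow this down, assuming the bypass is tied to `parallel=True` (dask serializes the open calls and may re-open the files on workers), is to compare cache stats with parallel opening turned off; this is a debugging sketch, not a confirmed diagnosis:

```python
# same call as above, but without dask.delayed opens on workers
ds = xr.open_mfdataset(
    fileset,
    engine="h5netcdf",
    compat="override",
    coords="minimal",
    parallel=False,
)
print(fileset[0].cache)  # non-zero hits/misses here would implicate parallel=True
```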
-
cc @kmuehlbauer, it might be interesting to revive the
-
One of the core features in `earthaccess` is accessing remote files without having to download them, using `earthaccess.open()`. Under the hood we are using fsspec. The default cache is called read-ahead; it works fine for text files (e.g. when we read contiguous lines of text), but read-ahead is very inefficient for scientific data (HDF/NetCDF).

What are the alternatives? Fortunately fsspec has different caching strategies, and two of them seem like better options for us in the short term: `blockcache` and `first`. A third caching implementation (`KnownPartsOfAFile`) could improve this even further down the road; it needs a file's internal byte layout up front, which we could get from `.dmrpp` sidecar files if they are available, or by opening a file and inspecting its structure (very inefficient). If we write a parser for `.dmrpp`, or have this information available at the metadata level, earthaccess could improve what it caches in a very efficient way.

I've tested the first two implementations with data from several missions, and the improvements in access times and data transfers are very promising, in some cases an order of magnitude.
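For reference, a minimal sketch of selecting one of these strategies directly through fsspec; it assumes earthaccess's `get_fsspec_https_session()` helper for the authenticated filesystem, and the URL and the 10% block size are placeholders:

```python
import earthaccess

earthaccess.login()
fs = earthaccess.get_fsspec_https_session()  # authenticated fsspec filesystem

url = "https://data.example.nasa.gov/granule.nc"  # placeholder granule URL
granule_size = fs.size(url)

f = fs.open(
    url,
    mode="rb",
    cache_type="blockcache",              # or "first"; names from fsspec.caching
    block_size=int(granule_size * 0.10),  # e.g. 10% of the granule size
)
print(f.cache)  # the same BlockCache stats shown in the reply above
```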
Proposal: earthaccess should use one of these two caching strategies depending on the file type. We should set the cache size to a percentage of the granule size, and down the road we can work on further optimizations if relevant information about the chunking of a given dataset becomes available to us via CMR/STAC.
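A sketch of the per-file-type dispatch this implies; the extension list and the mapping are assumptions, not a spec:

```python
def pick_cache_type(filename: str) -> str:
    """Choose an fsspec cache strategy from the file type (hypothetical)."""
    scientific = (".nc", ".nc4", ".h5", ".hdf", ".hdf5", ".he5")
    if filename.lower().endswith(scientific):
        return "blockcache"  # chunked binary data: scattered random reads
    return "readahead"       # text-like data: contiguous reads
```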
To avoid complications with changing or breaking the API, we can start prototyping this in a top-level `smart_open()` method, as we talked about with @itcarroll and @chuckwondo.
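A possible shape for that prototype; the signature and defaults are hypothetical, and everything except `earthaccess.open()` itself is an assumption:

```python
import earthaccess

def smart_open(granules, cache_type=None, block_size=None, percentage=0.10):
    """Prototype wrapper around earthaccess.open() with tunable fsspec caching.

    When cache_type/block_size are not given they would be derived per file,
    e.g. blockcache sized to a percentage of the granule size.
    """
    # placeholder: a real implementation would thread these options into the
    # fsspec open() calls that earthaccess performs internally
    return earthaccess.open(granules)
```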