-
I want to look up 8760 hourly time steps for a single lat/lon combination in less than a second from a 43.82 GB file of wind data containing:
The best time we achieved for a single-year look-up of both the u100 and v100 wind components at 100 m was 16 seconds. We need sub-second look-ups for the whole year, since this file read will happen on every user request in our API.
Output:
Any help would be really appreciated. Thanks!
Replies: 2 comments 1 reply
-
What does it look like if you remove dask from the equation using
-
Your data is chunked with Zarr into blocks of size `(50, 721, 1440)`, which means that every request reads data for the entire world. Every look-up therefore loads the full 42 GB of data, stored across several hundred files! To enable efficient queries, you will need to "rechunk" the data so you can query a single location with less waste, e.g., by chunking along latitude and longitude. You can do this with `.chunk()` and writing a new Zarr file in xarray, or with a tool like Rechunker.