ERA5-Utils

Utilities to preprocess ERA5 forecast data into clean, consistent formats for modeling, machine learning, or climate analysis.

ERA5-Utils is a toolkit for downloading, processing, and converting 4D ERA5 reanalysis forecast data into standardized formats for use in modeling, atmospheric simulations, and reinforcement learning environments. It supports ECMWF ERA5 reanalysis forecasts in both pressure-level and model-level data (Complete ERA5) and includes utilities for converting hybrid sigma-level NetCDF/GRIB files to pressure or altitude levels. Designed for geoscience workflows, this package automates and streamlines common preprocessing steps using ECMWF's CDS API, Climate Data Operators (CDO), and Python tools like xarray. Currently focused on atmospheric data, this repo supports preprocessing of temperature, humidity, wind, and geopotential fields for use in Earth system models, stratospheric balloon simulations, and custom analysis workflows. In the future we would like to add support for GFS and ICON.

Some of our simulation frameworks that use this include:

ℹ️ Note As of September 26, 2024, the legacy CDS and legacy ADS are decommissioned and no longer accessible. CDS-Beta and ADS-Beta have officially become the new CDS and the new ADS. The new CDS and ADS also use an updated GRIB to netCDF conversion, which causes slight differences in both formatting and data. ERA5 reanalysis forecasts downloaded for identical regions before and after September 2024 will have slightly different numbers of variables as well as some differences in structure.

Installation

This package is designed to work on Ubuntu, WSL, or any Unix-like system with climate data tools.

pip3 install -r requirements.txt

Additional command line netcdf tools to install on local machine:

conda install cdsapi
sudo apt-get install netcdf-bin
sudo apt-get install nco
Climate Data Operator (CDO) command line tool
- CDO User Guide

ERA5 Reanalysis Data Structures

There are three primary sources for downloading ERA5 Data:

Pre Sep-2024 Climate Data Store
Post Sep-2024 Climate Data Store
Complete ERA5 data on model levels
- Also see: (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)

Each data source formats and structures the multi-dimensional data in slightly different ways. We prefer the Pre-Sep 24 format because many of our other codebases were built around this structure, it is faster to read, and it doesn't include any unnecessary coordinates or variables.

Data Structure Variations between Data sources:

The following outputs in command line were generated using explore_netcdf.py (Note: the screenshots below are not from the exact same regions/timeframes).

Pre Sep 2024 ERA5 data structure:

Post Sep 2024 ERA5 data structure:

ERA5-Complete data structure:

Differences between Pre and Post September 2024:

Post introduces expver and number coordinates which we remove. (see below what these do)
time was renamed to valid_time
level was renamed to pressure_level and changed from type:int32 to type:float64
latitude and longitude were changed from type:float32 to type:float64
Data variables were changed from type:float64 to type:float32
A hidden difference is that Pre-Sep-24-ERA5 is type:64bit offset format while Post-Sep-24-ERA5 is type:NETCDF4
- This can be checked using ncdump -k your_file.nc
- NetCDF-3 (64-bit offset) stores data contiguously, which typically makes sequential reads faster, whereas NetCDF-4 chunks data, so accessing small portions of large datasets can be slower due to extra read operations.
See Unidata for differences in backend netcdf filetypes
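If ncdump is not installed, the container format can also be guessed from the first bytes of the file. The sketch below is only an approximation of `ncdump -k`: the function name `netcdf_kind` is hypothetical, and any HDF5 file (not just NetCDF-4) will match the last signature, so it cannot tell "netCDF-4" apart from "netCDF-4 classic model".

```python
# Leading bytes that identify each on-disk container format.
_MAGIC = {
    b"CDF\x01": "classic",
    b"CDF\x02": "64-bit offset",
    b"CDF\x05": "cdf5",
    b"\x89HDF": "netCDF-4",  # HDF5 signature; NetCDF-4 files are HDF5 files
}


def netcdf_kind(path):
    """Return a label similar to `ncdump -k`, or 'unknown'."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    return _MAGIC.get(head, "unknown")
```

Under the differences listed above, a Pre-Sep-24 file would be expected to report "64-bit offset" and a Post-Sep-24 file "netCDF-4".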

(I thought there was a difference in Lat order and/or Long degrees?)

Differences between Pre September 2024 and processed Complete ERA5 converted to pressure levels (see tutorial below):

level was renamed to plev and changed from type:int32 to type:float64
level is 100 x larger (mb to Pa)
latitude and longitude were renamed to lat and lon
latitude and longitude were changed from type:float32 to type:float64
Data variables were changed from type:float64 to type:float32
longitude range is changed from (-180, 180) to (0,360)
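Two of the differences above are plain arithmetic: the pressure axis changes units (Pa instead of mb/hPa) and the longitude axis changes convention. A minimal sketch of both conversions follows; the helper names are hypothetical and this is not necessarily how complete2pre.py does it.

```python
def wrap_longitude(lon_deg):
    """Map a longitude from the 0..360 convention onto -180..180."""
    return (lon_deg + 180.0) % 360.0 - 180.0


def pa_to_hpa(plev_pa):
    """Convert a pressure level from Pa back to hPa (mb)."""
    return plev_pa / 100.0


# Longitudes must be re-sorted after wrapping, because values above 180
# move to the negative side of the axis.
lons_0_360 = [0.0, 90.0, 180.0, 270.0, 359.75]
lons_wrapped = sorted(wrap_longitude(lon) for lon in lons_0_360)
levels_hpa = [pa_to_hpa(p) for p in (50000.0, 85000.0)]
```

Note that with this formula a longitude of exactly 180 maps to -180, so the wrapped axis is half-open. Any data arrays need to be reordered along longitude in the same way as the coordinate.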

ℹ️ Note on removed variables expver and number:

expver stands for "experiment version" and is used by ECMWF to differentiate between dataset versions. In ERA5, this indicates whether the data comes from the final, validated ERA5 dataset (expver=0001) or from ERA5T, the preliminary near-real-time release (expver=0005).

The number coordinate is used in ensemble datasets where multiple forecasts are run with slightly different initial conditions. In ERA5, number=0 typically refers to the deterministic (high-resolution) reanalysis. If you're using the ERA5 ensemble members (EDA), number ranges from 0 to 9 (10 ensemble members).

Convert Post and Complete to Pre Sep 2024 netcdf4 formatting

If ERA5 Post Sep 2024 on pressure levels has been downloaded, run post2pre.py and change the filename arguments.

If ERA5-Complete is already downloaded and converted to pressure levels (see tutorial below), run complete2pre.py and change the filename arguments.
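For orientation, the Post-to-Pre differences listed earlier in this README can be undone with a few xarray calls. This is only a sketch of that idea, not the contents of post2pre.py: the function name `post_to_pre` is hypothetical, and it assumes expver and number are scalar (non-dimension) coordinates.

```python
import numpy as np
import xarray as xr


def post_to_pre(ds):
    """Return a copy of `ds` laid out like a pre-Sep-2024 CDS file."""
    # Drop the coordinates that only the new CDS adds.
    ds = ds.drop_vars(["expver", "number"], errors="ignore")
    # Restore the old dimension/coordinate names where present.
    renames = {"valid_time": "time", "pressure_level": "level"}
    ds = ds.rename({old: new for old, new in renames.items() if old in ds.variables})
    # Restore the old coordinate dtypes.
    if "level" in ds.coords:
        ds = ds.assign_coords(level=ds["level"].astype(np.int32))
    for name in ("latitude", "longitude"):
        if name in ds.coords:
            ds = ds.assign_coords({name: ds[name].astype(np.float32)})
    return ds
```

A typical call would be `post_to_pre(xr.open_dataset("your_file.nc"))`. Writing the result with `to_netcdf(..., format="NETCDF3_64BIT")` should also restore the 64-bit offset container, although whether the provided script does so is not stated here.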

Downloading ERA5 Reanalysis on pressure levels:

Download ERA5 hourly data on pressure levels from 1940 to present from the ECMWF CDS with desired coordinates and variables.

Downloading Complete ERA5 Reanalysis on Model Levels and converting to pressure levels:

The complete ERA5 reanalysis on model levels provides atmospheric data on the model’s native hybrid sigma-pressure vertical coordinate system, offering higher vertical resolution and more physically consistent fields, particularly near the surface and tropopause. Unlike pressure level data—which is interpolated post-processing onto standard pressure surfaces (e.g., 850 hPa, 500 hPa)—model level data is reported on 137 hybrid levels that transition from terrain-following sigma coordinates near the surface to pressure-based levels higher in the atmosphere. This hybrid system more accurately represents topographically complex regions and ensures mass conservation in vertical transport, making it especially valuable for research involving vertical dynamics, surface–atmosphere interactions, and numerical modeling.
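To make the hybrid coordinate concrete: the pressure at each half level is p = a + b * p_surface, where a and b are the tabulated level coefficients and p_surface is recovered from lnsp, and the full-level pressure is the mean of the two bounding half levels. The sketch below uses made-up coefficients for a 3-layer toy column, not the real L137 table, and the function names are hypothetical.

```python
import math


def half_level_pressures(a_half, b_half, lnsp):
    """Pressure (Pa) at each half level: p = a + b * p_surface."""
    p_surface = math.exp(lnsp)  # lnsp is the natural log of surface pressure in Pa
    return [a + b * p_surface for a, b in zip(a_half, b_half)]


def full_level_pressures(p_half):
    """Full-level pressure as the mean of the two bounding half levels."""
    return [0.5 * (upper + lower) for upper, lower in zip(p_half[:-1], p_half[1:])]


# Toy coefficients (NOT the real L137 values), ordered from model top to surface.
a_half = [0.0, 5000.0, 2000.0, 0.0]
b_half = [0.0, 0.0, 0.5, 1.0]
p_half = half_level_pressures(a_half, b_half, math.log(100000.0))
p_full = full_level_pressures(p_half)
```

Levels with b = 0 are pure pressure levels, while levels with a = 0 and b = 1 follow the terrain exactly. This is why lnsp is needed before model-level data can be mapped onto fixed pressure or altitude levels.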

However, the hybrid-sigma structure is not well suited to altitude/depth-based simulations or analysis, so we convert model levels back to pressure levels or altitude levels. This requires at minimum the following 4 variables, split across 2 output files:

z (geopotential) and lnsp (log surface pressure) only at surface level (model level 1)
t (temperature), q (humidity), and any other desired params at all model levels of interest (can be 1-137)

ℹ️ Note While all 4 of these variables can technically be downloaded at the same time, issues arise during netcdf processing to altitude or pressure levels since params z and lnsp only contain data at model level 1.

Before downloading, update config.py with desired params (see provided file for examples)

tp should be "an" for analysis or "fc" for a forward forecast
date can be an individual date or range of dates
time can be 1 hour increments for reanalysis and "00:00:00/12:00:00" for forward forecasts
grid
- 0.25° (HRES)
- 0.1° (ERA5-Land)
- 0.5° (ERA5 Ensemble [EDA])
- You can technically request coarser resolution (e.g. 1.0° x 1.0°), but not higher than the native grid.
area [North, West, South, East] lat (-90 to 90), lon (-180, 180)
levellist 1-137 (all levels required to have embedded hybrid sigma information)
param See ERA5 data documentation for available param short names
step up to 0-18 hours for forward forecasts
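As an illustration of how these parameters fit together, the sketch below builds the two separate requests described above (z and lnsp at model level 1, then t and q on all 137 levels). The dictionary keys follow ECMWF's MARS keywords as used for the reanalysis-era5-complete dataset; the names in the provided config.py may differ, and the helper functions here are hypothetical.

```python
def base_request(date, area, grid="0.25/0.25"):
    """Fields shared by both model-level requests (analysis, hourly)."""
    north, west, south, east = area
    return {
        "class": "ea",
        "stream": "oper",
        "expver": "1",
        "type": "an",          # "fc" would request a forward forecast instead
        "levtype": "ml",
        "date": date,          # single date or "YYYY-MM-DD/to/YYYY-MM-DD"
        "time": "00/to/23/by/1",
        "grid": grid,
        "area": f"{north}/{west}/{south}/{east}",
    }


def split_requests(date, area):
    """Return (surface_request, levels_request) for the two output files."""
    # 129 = z (geopotential), 152 = lnsp: only needed at model level 1.
    surface = dict(base_request(date, area), levelist="1", param="129/152")
    # 130 = t (temperature), 133 = q (specific humidity): all model levels.
    levels = dict(base_request(date, area), levelist="1/to/137", param="130/133")
    return surface, levels


surface_req, levels_req = split_requests("2022-01-01/to/2022-01-31", (50, -125, 25, -65))
```

Each dictionary would then be passed to `cdsapi.Client().retrieve("reanalysis-era5-complete", request, target_file)`. The area and dates above are placeholder values.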

Additional Resources from ECMWF

Produce and verify the MARS CDS API request
Model Level Definitions
L137 Model Level Definitions

compute_geopotential_on_ml_updated.py and conversion_from_ml_to_pl_updated.py are updated versions of the scripts ECMWF provides for computing geopotential on model levels, both of which have reported bugs.

We have found the output of conversion_from_ml_to_pl_updated.py to be less accurate when compared against ECMWF reanalysis forecasts on pressure levels, and also significantly slower to calculate than using CDO's ml2pl command.

Bash Scripts for Easy Monthly and Annual Processing:

process-complete-ERA5.sh gives a quick example of how to download and convert raw Complete ERA5 reanalysis on model levels to higher resolution pressure levels. Alternatively, cdo's ml2al command can be used in place of ml2pl to convert to altitude instead of pressure levels.

aggregate-annual-complete-ERA5.sh runs process-complete-ERA5.sh for every month in a user defined year to make a final annual Complete ERA5, outputting in both grib and netcdf format
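The per-month loop needs one date range per calendar month. A small sketch of generating those ranges in the MARS "start/to/end" date syntax is shown here; the function name is hypothetical and the shell script may build its dates differently.

```python
import calendar


def monthly_date_ranges(year):
    """Return twelve 'YYYY-MM-DD/to/YYYY-MM-DD' strings, one per month."""
    ranges = []
    for month in range(1, 13):
        # monthrange returns (weekday of day 1, number of days in month).
        last_day = calendar.monthrange(year, month)[1]
        ranges.append(f"{year}-{month:02d}-01/to/{year}-{month:02d}-{last_day:02d}")
    return ranges


ranges_2022 = monthly_date_ranges(2022)
```

Using calendar.monthrange keeps leap years correct without hard-coding month lengths.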

Steps for the bash script (per month due to 10Gb download limit)

Download data. Set the config params first (download_Complete_ERA5.py)
compute geopotential (python3 compute_geopotential_on_ml_updated.py tq_ml2.grib zlnsp_ml.grib -o z.grib)
remove z from lnsp before merge, since only at level 1, and not to conflict with new z (cdo delname,z zlnsp_ml.grib lnsp.grib)
Merge z, lnsp, and tq_ml2 for monthly forecast with all variables (cdo merge tq_ml2.grib z.grib lnsp.grib Jan-2022.grib)
Convert to new user_defined pressure levels (cdo ml2l...)
Remove unnecessary variables (cdo delname,q,t,lnsp Jan-2022_pres_temp.grib Jan-2022_pres.grib)
Remove all intermediate processing files. (rm Jan-2022_pres_temp.grib z.grib lnsp.grib)
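The post-download steps above can also be expressed as a list of commands parameterized by month, which makes it easier to see which file each step consumes and produces. This is a sketch rather than a replacement for the bash script: `monthly_commands` is a hypothetical helper, and the level-conversion command is passed in by the caller because its exact arguments are not spelled out above.

```python
import shlex


def monthly_commands(tag, convert_cmd):
    """Return the post-download commands for one month as argument lists.

    `tag` is a label such as "Jan-2022"; `convert_cmd` is the caller's own
    model-level conversion command, passed through unchanged.
    """
    merged = f"{tag}.grib"
    temp = f"{tag}_pres_temp.grib"
    final = f"{tag}_pres.grib"
    return [
        ["python3", "compute_geopotential_on_ml_updated.py",
         "tq_ml2.grib", "zlnsp_ml.grib", "-o", "z.grib"],
        ["cdo", "delname,z", "zlnsp_ml.grib", "lnsp.grib"],
        ["cdo", "merge", "tq_ml2.grib", "z.grib", "lnsp.grib", merged],
        list(convert_cmd),
        ["cdo", "delname,q,t,lnsp", temp, final],
        ["rm", temp, "z.grib", "lnsp.grib"],
    ]


# Print the commands instead of running them; the echo is a stand-in only.
commands = monthly_commands("Jan-2022", ["echo", "conversion step goes here"])
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the steps, each list could be handed to `subprocess.run(cmd, check=True)` so the pipeline stops at the first failing command.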

Additional Scripts for formatting and processing:

check_corruption.py gives an example of how to check if a netcdf4 forecast is corrupted. Often, when a netcdf4 file is corrupted, the file can still be imported without throwing any errors, but causes problems later when trying to access the data.
explore_netcdf.py examine the output of a processed netcdf using python's xarray library
split.py gives an example on how to subset a larger forecast to a smaller region and/or timescale
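As a rough idea of what subsetting looks like with xarray, the sketch below cuts a dataset down to a lat/lon box and an optional time window. It assumes the Pre-Sep-2024 coordinate names (time, latitude, longitude) and is not taken from split.py; the `subset` function and the synthetic dataset are purely illustrative.

```python
import numpy as np
import xarray as xr


def subset(ds, north, west, south, east, start=None, end=None):
    """Cut `ds` down to a lat/lon box and, optionally, a time window."""
    # ERA5 latitudes usually run north to south; handle either order.
    lat = ds["latitude"].values
    descending = lat[0] > lat[-1]
    lat_slice = slice(north, south) if descending else slice(south, north)
    out = ds.sel(latitude=lat_slice, longitude=slice(west, east))
    if start is not None or end is not None:
        out = out.sel(time=slice(start, end))
    return out


# Tiny synthetic dataset standing in for a real forecast file.
demo = xr.Dataset(
    {"t": (("time", "latitude", "longitude"), np.zeros((3, 4, 4)))},
    coords={
        "time": np.array(
            ["2022-01-01T00", "2022-01-01T01", "2022-01-01T02"],
            dtype="datetime64[ns]",
        ),
        "latitude": [10.0, 9.75, 9.5, 9.25],
        "longitude": [0.0, 0.25, 0.5, 0.75],
    },
)
small = subset(demo, north=9.75, west=0.25, south=9.5, east=0.5, start="2022-01-01T01")
```

Label-based slicing is inclusive at both ends, so the box edges are kept in the result.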

Authors

Tristan Schuler - U.S. Naval Research Laboratory

Tips/Tricks

Grib files are smaller than netcdf files, by almost half.
The hybrid sigma variables are only present if full model level list is downloaded (potentially will work if surface pressure is included for a subset of levels?)
You can't combine multi level geopotential data (z) with a single surface level z file, they interfere

TODO:

Clean and Fix Forward Forecast processing
Write a script that autochecks differences between formats
GFS Support
