Utilities to preprocess ERA5 forecast data into clean, consistent formats for modeling, machine learning, or climate analysis.
ERA5-Utils is a toolkit for downloading, processing, and converting 4D ERA5 reanalysis forecast data into standardized formats for use in modeling, atmospheric simulations, and reinforcement learning environments. It supports ECMWF ERA5 reanalysis forecasts in both pressure-level and model-level data (Complete ERA5) and includes utilities for converting hybrid sigma-level NetCDF/GRIB files to pressure or altitude levels. Designed for geoscience workflows, this package automates and streamlines common preprocessing steps using ECMWF's CDS API, Climate Data Operators (CDO), and Python tools like xarray. Currently focused on atmospheric data, this repo supports preprocessing of temperature, humidity, wind, and geopotential fields for use in Earth system models, stratospheric balloon simulations, and custom analysis workflows. In the future we would like to add support for GFS and ICON.
Some of our simulation frameworks that use this include:
ℹ️ Note As of September 26, 2024, the legacy CDS and legacy ADS are decommissioned and no longer accessible. CDS-Beta and ADS-Beta have officially become the new CDS and the new ADS. The new CDS and ADS also use an updated GRIB to netCDF conversion, which causes slight differences in both formatting and data. An ERA5 reanalysis forecast downloaded for an identical region before and after September 2024 will have slightly different data values as well as some differences in structure.
This package is designed to work on Ubuntu, WSL, or any Unix-like system with climate data tools.
pip3 install -r requirements.txt
Additional command line netcdf tools to install on local machine:
- conda install cdsapi
sudo apt-get install netcdf-bin
sudo apt-get install nco
- Climate Data Operator (CDO) command line tool
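A quick way to confirm that the command line tools above are installed is to look them up on PATH from Python. This is a minimal sketch, not part of the package: the function name find_missing_tools is hypothetical, and it assumes the usual executable names (ncdump from netcdf-bin, ncks from nco, and cdo).

```python
import shutil

# Executables the workflow shells out to: ncdump ships with netcdf-bin,
# ncks with nco, and cdo with the Climate Data Operators.
REQUIRED_TOOLS = ("ncdump", "ncks", "cdo")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing command line tools:", ", ".join(missing))
    else:
        print("All command line tools found.")
```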
There are three primary sources for downloading ERA5 Data:
- Pre Sep-2024 Climate Data Store
- Post Sep-2024 Climate Data Store
- Complete ERA5 data on model levels
Each data source formats and structures the multi-dimensional data in slightly different ways. We prefer the Pre-Sep 24 format because many of our other codebases were built using this structure, it is faster to read, and it doesn't include any unnecessary coordinates or variables.
The following outputs in command line were generated using explore_netcdf.py
(Note: the screenshots below are not from the exact same regions/timeframes).
Pre Sep 2024 ERA5 data structure:
Post Sep 2024 ERA5 data structure:
ERA5-Complete data structure:
- Post introduces expver and number coordinates, which we remove (see the note below on what these do).
- time was renamed to valid_time.
- level was renamed to pressure_level and changed from type:int32 to type:float64.
- latitude and longitude were changed from type:float32 to type:float64.
- Data variables were changed from type:float64 to type:float32.
- A hidden difference is that Pre-Sep-24-ERA5 is type:64bit offset format while Post-Sep-24-ERA5 is type:NETCDF4.
  - This can be checked using ncdump -k your_file.nc
  - NetCDF-3 (64-bit offset) stores data contiguously, typically making sequential reads faster, whereas NetCDF-4 chunks data, meaning that accessing small portions of large datasets can be slower due to extra read operations.
  - See Unidata for differences in backend netcdf filetypes
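If ncdump is not at hand, the file format can also be inferred from the first few bytes of the file. The sketch below is an illustration rather than a replacement for ncdump -k: the function names are hypothetical, the labels are our own, and it assumes the HDF5 signature sits at the very start of the file (the usual case for netCDF-4 output). It cannot tell NETCDF4 apart from the NETCDF4 classic model, since both are HDF5 files.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# In the classic family, the byte after "CDF" selects the variant.
CLASSIC_VARIANTS = {1: "classic", 2: "64-bit offset", 5: "cdf5"}


def classify_header(head: bytes) -> str:
    """Classify a NetCDF file from its leading bytes."""
    if head.startswith(HDF5_SIGNATURE):
        return "netCDF-4 (HDF5)"
    if len(head) >= 4 and head[:3] == b"CDF":
        return CLASSIC_VARIANTS.get(head[3], "unknown")
    return "unknown"


def netcdf_kind(path: str) -> str:
    """Read the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(8))


# Usage (your_file.nc is a placeholder):
#   print(netcdf_kind("your_file.nc"))
```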
(I thought there was a difference in Lat order and/or Long degrees?)
Differences between Pre September 2024 and processed Complete ERA5 converted to pressure levels (see tutorial below):
- level was renamed to plev and changed from type:int32 to type:float64.
- level is 100 x larger (mb to Pa).
- latitude and longitude were renamed to lat and lon.
- latitude and longitude were changed from type:float32 to type:float64.
- Data variables were changed from type:float64 to type:float32.
- longitude range is changed from (-180, 180) to (0, 360).
ℹ️ Note on removed variables expver and number:
- expver stands for "experiment version" and is used by ECMWF to differentiate between different dataset versions. In ERA5, this typically indicates whether the data comes from the main, validated ERA5 dataset (expver=0001) or from ERA5T (expver=0005), the preliminary near-real-time release that covers the most recent months.
- The number coordinate is used in ensemble datasets where multiple forecasts are run with slightly different initial conditions. In ERA5, number=0 typically refers to the deterministic (high-resolution) reanalysis. If you're using the ERA5 ensemble members (EDA), number will range from 0 to 9 (10 ensemble members).
If ERA5 Post Sep 2024 on pressure levels has been downloaded, run post2pre.py and change the filename arguments.
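For orientation, the Post to Pre differences listed earlier can be undone with a handful of xarray calls. This is a minimal sketch under stated assumptions, not the contents of post2pre.py: the function name post_to_pre and the constant names are hypothetical, and it assumes the dataset uses the coordinate names described above.

```python
import numpy as np
import xarray as xr

# Post Sep 2024 name -> Pre Sep 2024 name.
RENAMES = {"valid_time": "time", "pressure_level": "level"}
# Coordinates that only exist in the Post format.
DROPPED = ["expver", "number"]
# Coordinate dtypes used by the Pre format.
COORD_DTYPES = {"level": "int32", "latitude": "float32", "longitude": "float32"}


def post_to_pre(ds: xr.Dataset) -> xr.Dataset:
    """Return a copy of a Post-format dataset laid out like the Pre format."""
    ds = ds.drop_vars(DROPPED, errors="ignore")
    # Only rename what is actually present, so the call is safe to repeat.
    present = {old: new for old, new in RENAMES.items() if old in ds.variables}
    ds = ds.rename(present)
    casts = {
        name: ds[name].astype(dtype)
        for name, dtype in COORD_DTYPES.items()
        if name in ds.coords
    }
    ds = ds.assign_coords(**casts)
    # Data variables go back from float32 to float64.
    return ds.astype("float64")


# Usage (file names are placeholders; writing needs a netCDF backend):
#   ds = post_to_pre(xr.open_dataset("post_sep_2024.nc"))
#   ds.to_netcdf("pre_style.nc", format="NETCDF3_64BIT")
```

Writing with format="NETCDF3_64BIT" reproduces the 64-bit offset file type of the Pre format.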
If ERA5-Complete has already been downloaded and converted to pressure levels (see the tutorial below), run complete2pre.py and change the filename arguments.
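Two of the Complete to Pre differences are plain arithmetic: pressure levels go from Pa back to mb (hPa), and longitudes go from (0, 360) back to (-180, 180). The sketch below shows that arithmetic on plain Python lists. The function names are hypothetical, this is not the code in complete2pre.py, and rounding to whole hPa is an assumption based on the Pre format storing level as int32.

```python
def pa_to_hpa(levels_pa):
    """Convert pressure levels from Pa to whole hPa (mb)."""
    return [int(round(p / 100.0)) for p in levels_pa]


def wrap_longitude(lon_deg):
    """Map a longitude given in [0, 360) onto [-180, 180)."""
    return ((lon_deg + 180.0) % 360.0) - 180.0


def rewrap_and_sort(lons_0_360):
    """Wrap longitudes and return them in ascending order.

    Also returns the index order, which can be used to reorder the
    longitude axis of the data so it stays aligned with the coordinate.
    """
    wrapped = [wrap_longitude(x) for x in lons_0_360]
    order = sorted(range(len(wrapped)), key=wrapped.__getitem__)
    return [wrapped[i] for i in order], order


if __name__ == "__main__":
    print(pa_to_hpa([50000.0, 85000.0]))
    print(rewrap_and_sort([0.0, 90.0, 180.0, 270.0]))
```

The same expressions apply unchanged to an xarray coordinate, where the wrapped longitudes would typically be followed by a sort along the longitude dimension.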
Download ERA5 hourly data on pressure levels from 1940 to present from the ECMWF CDS with desired coordinates and variables.
The complete ERA5 reanalysis on model levels provides atmospheric data on the model’s native hybrid sigma-pressure vertical coordinate system, offering higher vertical resolution and more physically consistent fields, particularly near the surface and tropopause. Unlike pressure level data—which is interpolated post-processing onto standard pressure surfaces (e.g., 850 hPa, 500 hPa)—model level data is reported on 137 hybrid levels that transition from terrain-following sigma coordinates near the surface to pressure-based levels higher in the atmosphere. This hybrid system more accurately represents topographically complex regions and ensures mass conservation in vertical transport, making it especially valuable for research involving vertical dynamics, surface–atmosphere interactions, and numerical modeling.
However, the hybrid sigma structure is poorly suited to altitude- or depth-based simulations and analysis, so we convert model levels back to pressure levels or altitude levels. This requires at minimum the following 4 variables, downloaded as 2 separate output files:
- z (geopotential) and lnsp (log of surface pressure), which are surface fields stored only at model level 1
- t (temperature), q (specific humidity), and any other desired params at all model levels of interest (can be 1-137)
ℹ️ Note While all 4 of these variables can technically be downloaded at the same time, issues arise during netcdf processing to altitude or pressure levels, since params z and lnsp only contain data at model level 1.
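To see why lnsp is needed, recall that pressure on a hybrid level boundary (half level) is a + b * sp, where sp = exp(lnsp) is the surface pressure and a, b are the per-level coefficients from the model level definitions. The pressure on a full model level is the mean of the two half levels that bound it. The sketch below illustrates this with a toy 3-level column; the a and b values are made up for illustration and are not the L137 coefficients.

```python
import math


def half_level_pressures(a, b, lnsp):
    """Pressure (Pa) on each half level: p = a + b * exp(lnsp)."""
    surface_pressure = math.exp(lnsp)
    return [a_k + b_k * surface_pressure for a_k, b_k in zip(a, b)]


def full_level_pressures(half_levels):
    """Pressure (Pa) on each full level: mean of the bounding half levels."""
    return [
        (upper + lower) / 2.0
        for upper, lower in zip(half_levels[:-1], half_levels[1:])
    ]


if __name__ == "__main__":
    # Toy coefficients (NOT the real L137 table): the top boundary is at
    # zero pressure and the bottom boundary equals the surface pressure.
    a = [0.0, 2000.0, 6000.0, 0.0]
    b = [0.0, 0.0, 0.5, 1.0]
    lnsp = math.log(100000.0)  # surface pressure of 1000 hPa

    ph = half_level_pressures(a, b, lnsp)
    print("half levels (Pa):", ph)
    print("full levels (Pa):", full_level_pressures(ph))
```

Levels with b = 0 are pure pressure levels, while levels with a = 0 follow the terrain, which is the transition described above.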
Before downloading, update config.py with the desired params (see the provided file for examples):
- tp should be "an" for analysis or "fc" for a forward forecast
- date can be an individual date or a range of dates
- time can be in 1-hour increments for reanalysis, or "00:00:00/12:00:00" for forward forecasts
- grid
  - 0.25° (HRES)
  - 0.1° (ERA5-Land)
  - 0.5° (ERA5 Ensemble [EDA])
  - You can technically request coarser resolution (e.g. 1.0° x 1.0°), but not higher than the native grid.
- area [North, West, South, East] lat (-90 to 90), lon (-180 to 180)
- levellist 1-137 (all levels are required to have embedded hybrid sigma information)
- param See ERA5 data documentation for available param short names
- step 0 to 18 hours for forward forecasts
- Produce and verify the MARS CDS API request
- Model Level Definitions
- L137 Model Level Definitions
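As a rough illustration of how these params map onto the two required downloads, the sketch below builds one MARS-style request for z and lnsp at model level 1 and one for t and q on all 137 levels. The helper name build_requests and its defaults are hypothetical, and the keys in the repository's config.py may be named differently (for example levellist and tp above, versus the MARS keywords levelist and type used here). Param codes 129, 152, 130, and 133 are the ECMWF codes for z, lnsp, t, and q.

```python
def build_requests(date, area, grid="0.25/0.25", time="00/to/23/by/1"):
    """Return {target file: request dict} for the two model-level downloads.

    `date` is a single date or a "start/to/end" range, and `area` is
    "North/West/South/East" in degrees.
    """
    common = {
        "class": "ea",      # ERA5
        "expver": "1",
        "stream": "oper",
        "type": "an",       # analysis; "fc" would select forward forecasts
        "levtype": "ml",    # model levels
        "date": date,
        "time": time,
        "grid": grid,
        "area": area,
    }
    # Surface fields that are stored only at model level 1.
    zlnsp = dict(common, levelist="1", param="129/152")
    # Temperature and specific humidity on every model level.
    tq = dict(common, levelist="1/to/137", param="130/133")
    return {"zlnsp_ml.grib": zlnsp, "tq_ml2.grib": tq}


if __name__ == "__main__":
    requests = build_requests("2022-01-01/to/2022-01-31", "50/-125/25/-65")
    for target, request in requests.items():
        print(target, request)

# Submitting the requests needs CDS credentials and the cdsapi package:
#   import cdsapi
#   client = cdsapi.Client()
#   for target, request in requests.items():
#       client.retrieve("reanalysis-era5-complete", request, target)
```

Keeping the shared keys in one dictionary guarantees that both files cover the same dates, times, grid, and area, which the later merge step relies on.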
compute_geopotential_on_ml_updated.py and conversion_from_ml_to_pl_updated.py are both updated versions of the scripts provided by ECMWF to compute geopotential on model levels, both of which have reported bugs.
We have found the output of conversion_from_ml_to_pl_updated.py to be less accurate when compared against ECMWF reanalysis forecasts on pressure levels, and also significantly slower to calculate, than CDO's ml2pl command.
process-complete-ERA5.sh gives a quick example of how to download and convert raw complete ERA5 reanalysis on model levels to higher-resolution pressure levels. Alternatively, cdo's ml2al command can be used in place of ml2pl to convert to altitude levels instead of pressure levels.
aggregate-annual-complete-ERA5.sh runs process-complete-ERA5.sh for every month in a user-defined year to make a final annual Complete ERA5, output in both grib and netcdf format.
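The per-month loop needs one date range per calendar month. A minimal sketch of generating those ranges in the "start/to/end" form used by MARS requests follows; the function name is hypothetical and this is not taken from the bash script.

```python
import calendar


def monthly_date_ranges(year):
    """Return twelve "YYYY-MM-01/to/YYYY-MM-DD" strings, one per month."""
    ranges = []
    for month in range(1, 13):
        # monthrange returns (weekday of first day, number of days in month).
        last_day = calendar.monthrange(year, month)[1]
        ranges.append(
            f"{year:04d}-{month:02d}-01/to/{year:04d}-{month:02d}-{last_day:02d}"
        )
    return ranges


if __name__ == "__main__":
    for date_range in monthly_date_ranges(2022):
        print(date_range)
```

Using calendar.monthrange handles leap years, so February of a leap year ends on the 29th.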
Steps for the bash script (per month due to 10Gb download limit)
- Download data. Set the config params first (download_Complete_ERA5.py)
- Compute geopotential (python3 compute_geopotential_on_ml_updated.py tq_ml2.grib zlnsp_ml.grib -o z.grib)
- Remove z from the lnsp file before the merge, since it exists only at level 1 and would conflict with the new z (cdo delname,z zlnsp_ml.grib lnsp.grib)
- Merge z, lnsp, and tq_ml2 into a monthly forecast with all variables (cdo merge tq_ml2.grib z.grib lnsp.grib Jan-2022.grib)
- Convert to new user-defined pressure levels (cdo ml2l...)
- Remove unnecessary variables (cdo delname,q,t,lnsp Jan-2022_pres_temp.grib Jan-2022_pres.grib)
- Remove all intermediate processing files (rm Jan-2022_pres_temp.grib z.grib lnsp.grib)
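The same sequence can be driven from Python, which makes it easy to stop at the first failing step. The sketch below only assembles and prints the commands from the list above. The function name monthly_commands is hypothetical, and because the level conversion command is elided in the text, it is passed in by the caller rather than guessed here.

```python
import shlex


def monthly_commands(label, conversion_command):
    """Return the processing steps for one month as argv lists.

    `label` is the month tag used in file names (e.g. "Jan-2022") and
    `conversion_command` is the caller-supplied argv for the level
    conversion step, which should read <label>.grib and write
    <label>_pres_temp.grib.
    """
    merged = f"{label}.grib"
    temp = f"{label}_pres_temp.grib"
    final = f"{label}_pres.grib"
    return [
        ["python3", "compute_geopotential_on_ml_updated.py",
         "tq_ml2.grib", "zlnsp_ml.grib", "-o", "z.grib"],
        ["cdo", "delname,z", "zlnsp_ml.grib", "lnsp.grib"],
        ["cdo", "merge", "tq_ml2.grib", "z.grib", "lnsp.grib", merged],
        list(conversion_command),
        ["cdo", "delname,q,t,lnsp", temp, final],
        ["rm", temp, "z.grib", "lnsp.grib"],
    ]


if __name__ == "__main__":
    # Placeholder for the conversion step; substitute the real command.
    placeholder = ["echo", "level conversion step goes here"]
    for argv in monthly_commands("Jan-2022", placeholder):
        print(shlex.join(argv))

# To execute instead of print, each step could be run with
#   subprocess.run(argv, check=True)
# so that a failing step raises and the later steps are skipped.
```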
- check_corruption.py gives an example of how to check whether a netcdf4 forecast is corrupted. Often, when a netcdf4 file is corrupted, it can still be imported without throwing any errors, but causes problems later when trying to access the data.
- explore_netcdf.py examines the output of a processed netcdf using python's xarray library
- split.py gives an example of how to subset a larger forecast to a smaller region and/or timescale
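Because a corrupted file can open cleanly and only fail when values are read, one way to surface the problem early is to force a read of every variable right after opening. The sketch below illustrates that idea and is not necessarily how check_corruption.py works; the function name is hypothetical. Note that it pulls each variable into memory in turn, so it can be slow on large forecasts.

```python
import xarray as xr


def unreadable_variables(ds: xr.Dataset) -> dict:
    """Try to read every variable; return {name: error} for those that fail."""
    failures = {}
    for name, variable in ds.variables.items():
        try:
            # Accessing .values forces the lazily loaded data to be read.
            variable.values
        except Exception as exc:  # any read error counts as a failure here
            failures[name] = repr(exc)
    return failures


# Usage (forecast.nc is a placeholder):
#   with xr.open_dataset("forecast.nc") as ds:
#       problems = unreadable_variables(ds)
#       print(problems or "all variables readable")
```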
- Tristan Schuler - U.S. Naval Research Laboratory
- Grib files are smaller than netcdf files, by almost half.
- The hybrid sigma variables are only present if full model level list is downloaded (potentially will work if surface pressure is included for a subset of levels?)
- You can't combine multi-level geopotential data (z) with a single surface-level z file, because they interfere
- Clean and Fix Forward Forecast processing
- Write a script that autochecks differences between formats
- GFS Support