Chunking input files #1951
-
I have been comparing runs with an unchunked and a chunked input file:

With the unchunked file: [timings/plot not shown]
With the chunked file: [timings/plot not shown]
I am really confused about what is happening here, because the chunked file with more cores makes the run slower. Of course, I understand that this can happen depending on how you chunk and parallelize the work: for example, if you chunk too finely, communication between workers can cost more than the work itself, or there may simply not be enough work to parallelize. But neither seems obviously the case here. I could work on a better MWE if you think it is useful, but this is the script I have been running for these benchmarks:

```python
import os
import sys
from datetime import datetime, timedelta
from glob import glob

import numpy as np
import pandas as pd
import xarray as xr
from tqdm import tqdm

from parcels import logger
import parcels

from tools import SmagorinskyDiff

# Set logger level to WARNING
logger.setLevel("WARNING")

### DEFINE THE LOCATION OF THE PARTICLES ###
# This is basically just to define all possible options for particle release

# Define the central coordinates (Icapuí, CE - Brasil)
clat, clon = -4.645110, -37.316935

# Define the bounding box dimensions (degrees)
dx, dy = 8, 6
box = [
    clon - dx * 2 / 3, clon + dx / 3,
    clat - dy / 3, clat + dy * 2 / 3
]

# Load and preprocess the bathymetry dataset
bat = xr.open_dataset("../../data/external/gebco/gebco_subset.nc").elevation
bat = bat.sel(lon=slice(*box[:2]), lat=slice(*box[2:]))
window = 30  # Smoothing window size
bat = bat.rolling(lon=window, lat=window, center=True, min_periods=1).mean()

# Calculate distances from the central point
dist = np.sqrt((bat.lat - clat)**2 + (bat.lon - clon)**2) * 112

# Define conditions for particle placement
where = (bat > -50) & (bat < 0) & (dist < 50)
plon, plat = xr.broadcast(bat.lon, bat.lat)
plon = plon.where(where).stack(p=["lon", "lat"]).dropna("p").values
plat = plat.where(where).stack(p=["lon", "lat"]).dropna("p").values

### THE LAGRANGIAN SIMULATION ###
# Simulation parameters
nparticles = 50  # number of particles to be released every 5 days
runtime = 30  # days
total_time = 100  # days

# Load the ocean current dataset
ds = xr.open_dataset("../../data/external/nemo/cmems_reanalysis.nc")
ds = ds.sel(time=slice(None, ds.time.min() + np.timedelta64(total_time, 'D')))

# Create output directory
path = "../../data/processed/parcels"
os.makedirs(path, exist_ok=True)

# Define particle behavior functions
# We are excluding particles after the predefined maximum number of days
def Age(particle, fieldset, time):
    """Update particle age."""
    particle.age += particle.dt / 86400

def DeleteParticle(particle, fieldset, time):
    """Delete particle if its age exceeds the runtime."""
    if particle.age > fieldset.runtime:
        particle.delete()

# Add an 'age' variable to the particle class
AgeParticle = parcels.JITParticle.add_variable(parcels.Variable("age", initial=0))

# Define fieldset variables and dimensions
variables = {"U": "uo", "V": "vo"}
dimensions = {"lat": "latitude", "lon": "longitude", "time": "time"}

# Define chunk size for fieldset
cs = 128
if cs not in ["auto", False]:
    cs = {"time": ("time", 1), "lat": ("latitude", cs), "lon": ("longitude", cs)}

# Create the fieldset
fieldset = parcels.FieldSet.from_xarray_dataset(
    ds, variables, dimensions, allow_time_extrapolation=True, chunksize=cs
)

# Add constants and fields to the fieldset
fieldset.add_constant("runtime", runtime)
x = fieldset.U.grid.lon
y = fieldset.U.grid.lat
cell_areas = parcels.Field(name="cell_areas", data=fieldset.U.cell_areas(), lon=x, lat=y)
fieldset.add_field(cell_areas)
fieldset.add_constant("Cs", 0.1)

# Generate initial particle positions and times
ploni, plati, timei = [], [], []
for di in np.arange(0, total_time, 5):
    ind = np.random.randint(0, plon.size - 1, nparticles)
    ploni.append(plon[ind])
    plati.append(plat[ind])
    timei.append([timedelta(days=float(di)).total_seconds()] * nparticles)
ploni, plati, timei = np.concatenate(ploni), np.concatenate(plati), np.concatenate(timei)

# Create the particle set
pset = parcels.ParticleSet(
    fieldset=fieldset, pclass=AgeParticle, lon=ploni, lat=plati, time=timei
)

# Define output file for particle trajectories
output_fname = "parcels.zarr"
output_file = pset.ParticleFile(
    name=os.path.join(path, output_fname), outputdt=timedelta(hours=3)
)

# Execute the particle simulation
pset.execute(
    [Age, parcels.AdvectionRK4, SmagorinskyDiff, DeleteParticle],
    endtime=timei.max() + runtime * 86400,
    dt=timedelta(hours=0.5),
    output_file=output_file,
    verbose_progress=True
)
```
-
@iuryt Interesting. I presume you double checked with … My guess is that your problem is much, much smaller than mine. My grid of (time, depth, y, x) is (365, 36, 3059, 4322), so even a single time point is about 2627 times larger than yours ((36 x 3059 x 4322)/(1 x 383 x 473)). Another sign of the difference in scale is that I gain considerably from running in parallel, while you lose. I think our problems lie on different ends of the spectrum when trading off the complexity of parallelization and chunking against the potential gain. I guess this underscores the importance of benchmarking problems at the scale relevant to our own work. Jamie
-
To confirm with some benchmarks what I said above: running with a NEMO grid from the Mercator GLORYS V12R3 model, global domain, on a (time, depth, y, x) dataset of size (365, 36, 3059, 4322), with about 1,549,325 start locations, each released 30 times for a total of 46,479,750 particles, chunking both the input data files and explicitly providing a chunksize to the fieldset makes a big difference. In this case, a 30-day run on 8 cores takes 204 minutes with chunking and 280 minutes without it, so chunking reduces the run time by about 27%. For runs that often take a week, this can bring the run time down from 7 days to 5 days. In general, the more cores I use, the bigger the difference chunking makes. But again, this depends on the details of your system and the parcels run. Also, if the particles are released in a much smaller domain than the model domain, chunking can have a very big impact. Jamie
-
I really don't understand what is going on, and I am not sure how to proceed with investigating this. My simulation is small because I am only running it for 100 days; however, if I change that to 20 years of reanalysis, it does not scale linearly. For example, if 100 days takes about 2.5 minutes to run, I would expect 20 years to take about 3 hours, but the …
-
I was asked whether chunking the input circulation fields could make a difference, and whether that entailed converting the netCDF data into a zarr data store. In short, chunking and compressing the circulation data can give a very substantial speedup, especially if the model is run in parallel with MPI, but even otherwise. There is no need to write the data as zarr, and I usually find it better to leave it as netCDF; this also lets the input circulation data be used by any other analysis tools without altering their code. I have found that appropriate chunking and compression of the data decreases run times by a factor of 1.4 to 4, depending on circumstances.
In what follows, I will assume that your system is typical and that IO speed is a more important bottleneck than the CPU overhead of uncompressing the data. This has almost always been the case for me, but in all of this, a certain amount of benchmarking can be very, very useful.
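As one example of the kind of benchmarking that helps here, a quick sketch like the one below (the file and variable names are placeholders) times how long it takes just to read a single time slice of a velocity field from disk; running it against both the original and a rechunked/compressed copy gives a feel for the raw IO difference:

```python
import time

import xarray as xr

# Time the read of one time slice of one velocity component from disk.
fname = "cmems_reanalysis.nc"   # placeholder: your circulation file
varname = "uo"                  # placeholder: your zonal-velocity variable

t0 = time.perf_counter()
with xr.open_dataset(fname) as ds:
    _ = ds[varname].isel(time=0).load()
print(f"Read one time slice in {time.perf_counter() - t0:.2f} s")
```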
First, you want to see if your circulation data is already chunked and compressed. For a netCDF file, run
```
ncdump -h -s YourFile.nc
```
and look at the output for each variable. You should see something like the example below. The important data are in the last five lines: "_Storage" is "chunked", and the chunk sizes are given on the following line, which here shows that a chunk of data is 1 time, 1 depth, 256 in y and 256 in x. In general you can use smaller chunk sizes in netCDF than in zarr, because all the data are stored in a single file. You can also see that the "_DeflateLevel", i.e. the compression, is 1, which indicates that the data are compressed. It is very important that if you compress data you also chunk it, since it is harder for netCDF to seek to arbitrary locations in compressed data.
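For reference, here is an illustrative reconstruction of the relevant part of that output for one variable (the variable name, dimensions and chunk sizes below are made up, but the five storage attributes at the end are the ones to look for):

```
    float vozocrtx(time_counter, deptht, y, x) ;
        vozocrtx:long_name = "Zonal velocity" ;
        vozocrtx:units = "m/s" ;
        vozocrtx:_Storage = "chunked" ;
        vozocrtx:_ChunkSizes = 1, 1, 256, 256 ;
        vozocrtx:_DeflateLevel = 1 ;
        vozocrtx:_Shuffle = "true" ;
        vozocrtx:_Endianness = "little" ;
```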
If your data is not compressed and chunked, you should experiment with doing so. Start with a scheme like the one above, but do experiment for your own case. In the run with this data, chunking and compressing the input netCDF sped up the particle tracking by a factor of two and left the netCDF files usable by all other programs. I would set the chunking in parcels to match the chunking of the netCDF file, or some multiple of it; for some mysterious reason, I have found that my code runs fastest when the chunksize in parcels is 1, 1, 512, 512. Neither the netCDF nor the parcels chunk sizes need to divide evenly into the array dimensions.
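For reference, a sketch of what passing such a chunksize to parcels can look like, reusing the dictionary form and the `ds`, `variables` and `dimensions` objects from the script earlier in this thread (the 512 x 512 horizontal chunking is only an example; pick the on-disk chunk size or a multiple of it):

```python
# Match the parcels chunksize to (a multiple of) the on-disk netCDF chunking,
# e.g. 512 x 512 in the horizontal when the file is chunked 256 x 256.
cs = {"time": ("time", 1), "lat": ("latitude", 512), "lon": ("longitude", 512)}
fieldset = parcels.FieldSet.from_xarray_dataset(
    ds, variables, dimensions, allow_time_extrapolation=True, chunksize=cs
)
```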
So how do you chunk and compress the data? If you are downloading it with xarray and then saving it to disk with `.to_netcdf()`, or if you want to process existing files in Python, you can do something like the sketch below, where `varName` is the name of the 4D variable, `nav_lon` and `nav_lat` are the names of 2D variables, and all the variables are in the xarray dataset `dataIn`.
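Here is a minimal sketch of that step; the chunk sizes, compression level and file names are illustrative, and `dataIn` is assumed to be an `xarray.Dataset` opened or downloaded as described above:

```python
import xarray as xr

# Open (or download) the original dataset; the file name is a placeholder.
dataIn = xr.open_dataset("original.nc")

# The per-variable "encoding" controls the on-disk netCDF4 chunking and
# deflate compression when the dataset is written back out.
encoding = {
    "varName": {
        "zlib": True,                    # enable deflate compression
        "complevel": 1,                  # light compression is usually enough
        "chunksizes": (1, 1, 256, 256),  # (time, depth, y, x) chunks
    },
    "nav_lon": {"zlib": True, "complevel": 1},
    "nav_lat": {"zlib": True, "complevel": 1},
}
dataIn.to_netcdf("rechunked_compressed.nc", encoding=encoding)
```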
However, it is often easier to use a command-line tool to rechunk and compress the data. Here the netCDF kitchen-sink command, `ncks`, is very useful. You can run a command like the one below.
Here `deptht`, `time_counter`, `x` and `y` are the dimensions you want to chunk, and the number after the comma is the chunk size. The `-4` makes sure the output is netCDF4, so that it can be compressed. `-O` overwrites existing files, so use it with care. `-L 4` sets the compression level. I hope this is helpful, @iuryt and @erikvansebille