Why is jax failing to allocate memory despite significant memory being free? #30276
-
It would be useful if you could provide a complete minimal example (a self-contained script that another user can run to reproduce the behavior you're seeing). I gave up trying to answer on StackOverflow because there was nothing more I could do with the information you provided. I still suspect that intermediate allocations are to blame. In this line:

`self.ne = n_e0 * (1.0 + s1 * self.XX / self.x_length) * (1 + s2 * jnp.cos(2 * jnp.pi * self.YY / Ly))`

I count 11 intermediate arrays of the same size as the output.
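As an illustration of the intermediate-allocation point, here is a minimal sketch; the function and argument names mirror the expression above, but the shapes are placeholders, not the original program:

```python
import jax
import jax.numpy as jnp

def ne_field(XX, YY, n_e0, s1, s2, x_length, Ly):
    # Evaluated eagerly, each arithmetic op here dispatches separately and
    # materializes its own full-size temporary before the next op runs.
    return n_e0 * (1.0 + s1 * XX / x_length) * (1 + s2 * jnp.cos(2 * jnp.pi * YY / Ly))

# Under jit, XLA can fuse this elementwise chain into a single kernel, so
# peak memory is closer to one output buffer than to many temporaries.
ne_field_jit = jax.jit(ne_field)
```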
-
@nouiz maybe?
-
I have been having issues allocating memory on the GPU with JAX. I am running on a cluster that gives me access to an RTX 6000 (24 GB of VRAM), on which JAX is attempting to allocate memory.
The output of `jax.print_environment_info()` includes an `nvidia-smi` table showing that memory is free on the GPU, and yet JAX fails with a `RESOURCE_EXHAUSTED` error when attempting a 3.64 GB allocation, despite over 16 GB being free on the card. I have attempted to fix this issue with JAX's memory configuration settings.
My initial ideas were that either there was an allocation limit of 4 GB due to 32-bit addressing (hence why I have now ensured 64-bit is enabled), or that it was having difficulty allocating above the default `XLA_PYTHON_CLIENT_MEM_FRACTION=0.75` memory reservation. Hence my increasing the memory fraction to 0.9 and trying to ensure it could allocate past that by setting `TF_GPU_ALLOCATOR=cuda_malloc_async`.
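For reference, a sketch of how those settings are typically applied; the values are the ones described above, and `XLA_PYTHON_CLIENT_PREALLOCATE` is an additional knob worth knowing about, not something guaranteed to fix this error:

```python
import os

# These must be set before jax is imported, or they have no effect.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"   # reserve 90% of VRAM
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # optional: allocate on demand
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"   # async CUDA allocator

import jax
jax.config.update("jax_enable_x64", True)  # enable 64-bit types
```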
This is failing when trying to allocate `self.ne_nc`, a scalar field on a meshgrid of size 992³. These meshgrids make up the majority of the successfully allocated 7632 MB (two of them have previously been allocated, each of the same 3.64 GB size, as they are 32-bit grids).
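For reference, the 3.64 GB figure matches a single float32 array on a 992³ grid:

```python
n = 992 ** 3            # 976,191,488 elements
nbytes = n * 4          # float32 -> 3,904,765,952 bytes
print(nbytes / 2**30)   # ~3.64 GiB, matching the failed allocation
```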
This error cropped up as part of a batch script running my program with different parameters to test memory usage at different meshgrid sizes. The 992³ case was the first failure, and each subsequent test failed as the size increased, until eventually the meshgrids themselves stopped allocating as they got larger.
Meshgrid only attempts to allocate two of its three output grids, so in theory there should still be enough memory free to allocate them despite them being larger.
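As an aside, one way to avoid materializing full-size coordinate grids at all is `sparse=True`; a sketch only, untested against the original program, with placeholder grid spacing:

```python
import jax.numpy as jnp

n = 992
x = jnp.linspace(0.0, 1.0, n)
# With sparse=True the outputs have shapes (n, 1, 1), (1, n, 1), (1, 1, n)
# and broadcast to the full (n, n, n) grid only inside later arithmetic.
XX, YY, ZZ = jnp.meshgrid(x, x, x, sparse=True, indexing="ij")
```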
Beyond that point it continues to fail, but this was expected, as it would be attempting to allocate more memory than is available on the card at those domain sizes.
Does anyone have any suggestions as to why it's failing? There is no reason why it should fail to allocate: both the `nvidia-smi` report and my own estimates of memory usage based on the program input show that there is more than enough memory free on the card for these allocations. I suspect that changing the JAX memory configuration flags should fix it, but as of yet I have been unable to find a setup that works. Many thanks.
Complete working example, as requested by @jakevdp:
Link to a Stack Overflow post about this issue: https://stackoverflow.com/questions/79701945/jax-unable-to-allocate-memory-despite-memory-being-free