Possible regression in jaxlib 0.4.25+ causing training deadlocks on GPU #25453
-
Unfortunately it's impossible to say what's going wrong with only this information. It looks like a deadlock, but we don't know how or why without knowing more. I can think of two things that might help: b) a reproducer that we could run would also help.
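A minimal sketch of one way to gather that kind of information from a hung process (illustrative only, assuming a Unix host where you can send the process a signal):

```python
# Illustrative sketch: let a hung process dump all Python thread stacks on
# demand, e.g. via `kill -USR1 <pid>`, as a lightweight complement to py-spy.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```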
-
Thank you for your help @hawkinsp and @mattjj. Shortly after our initial post, JAX 0.4.38 was released, which we believe fixed the issue. (JAX 0.4.36 and 0.4.37 did not deadlock, but produced many warnings; we can provide logs if desired.) That said, we're posting this follow-up so that it may help some other poor soul and/or keep this issue from resurfacing in the future. We finally managed to isolate the deadlock reproducibly in the attached script. Unfortunately, it requires specific data files; the issue is a very rare edge case, and we were unable to simplify the code to something with an obvious source. Here's what we know about triggering the deadlock: it is GPU-agnostic (we reproduced it on more than one GPU model), it only occurs when the tolerance passed to the solver exactly equals the first computed diff, and inserting a jax.debug.callback into the while-loop condition avoids it.
We're sorry we were unable to further isolate this MWE, but we hope the result is reproducible for you. If it is not, please let us know and we will do what we can to ensure reproducibility. If JAX 0.4.38 has fixed this issue for good, please let us know!

import equinox as eqx  # Any version >= 0.11.09
import jax
import jax.numpy as jnp
from jax import vmap
class Frame(eqx.Module):
    rotation: jnp.ndarray
    translation: jnp.ndarray

    def __matmul__(self, v: jnp.ndarray):
        assert v.shape == (3,)
        return self.rotation @ v + self.translation


# Weighted Kabsch alignment: returns the rigid transform (rotation + translation)
# that maps the weighted points A onto B.
def kabsch(w: jax.Array, A: jax.Array, B: jax.Array, eps: float = 1e-8) -> Frame:
    A_mean = (A * w[:, None]).sum(axis=0) / (w.sum() + eps)
    B_mean = (B * w[:, None]).sum(axis=0) / (w.sum() + eps)
    A_centered = A - A_mean[None, :]
    B_centered = B - B_mean[None, :]

    def _project_to_SO3(m):
        u, _, vt = jnp.linalg.svd(m, full_matrices=False)
        d = jnp.sign(jnp.linalg.det(vt.T @ u.T))
        return u @ jnp.diag(jnp.array([1, 1, d])) @ vt

    S = A_centered.T @ (w[:, None] * B_centered)
    R = _project_to_SO3(S).T
    t = B_mean - R @ A_mean
    return Frame(R, t)

def prox_kabsch_solver(
    p,
    displacements: jax.Array,
    F_anchor,
    *,
    max_iters: int,
    tol: float,
    lam: float,
    disable_hang: bool,
):
    # Parallel block-coordinate descent over per-frame Kabsch subproblems,
    # iterating until the change between successive iterates is no larger than
    # `tol` or `max_iters` is reached.
    n = p.shape[0]
    assert displacements.shape == (n, n, 3)
    p = p.at[jnp.diag_indices(n)].set(0)

    def _solve_subproblem(F_targ_i, F, displacements_row, displacements_col, p_row, p_col):
        A_1 = displacements_row
        B_1 = F.translation
        A_2 = jnp.zeros((n, 3))
        B_2 = vmap(lambda F_j, d_ji: F_j @ d_ji)(F, displacements_col)
        X = jnp.eye(3)
        Y = vmap(lambda v: F_targ_i @ v)(X)
        F_i = kabsch(
            jnp.concatenate((p_row, p_col, lam * jnp.ones(3))),
            jnp.concatenate((A_1, A_2, X)),
            jnp.concatenate((B_1, B_2, Y)),
        )
        return F_i

    def _parallel_block_coordinate_descent_step(F):
        F = vmap(
            lambda targ, dr, dc, pr, pc: _solve_subproblem(targ, F, dr, dc, pr, pc),
            in_axes=(0, 0, 1, 0, 1),
        )(F_anchor, displacements, displacements, p, p)
        return Frame(F.rotation, F.translation - F.translation.mean(axis=0, keepdims=True))

    def _compute_diff(F_old, F):
        diff_t = jnp.linalg.norm(F_old.translation - F.translation)
        diff_r = jnp.linalg.norm(F_old.rotation - F.rotation)
        diff = (diff_t + diff_r) / n
        return diff

    def body_fn(carry):
        i, _, F_old, _, diff_hist = carry
        F = _parallel_block_coordinate_descent_step(F_old)
        diff = _compute_diff(F_old, F)
        diff_hist = diff_hist.at[i + 1].set(diff)
        return i + 1, F_old, F, diff, diff_hist

    def cond_fn(carry):
        i, F_old, F, diff_carry, _ = carry
        diff = _compute_diff(F_old, F)
        # Can use a callback to disable the hang
        if disable_hang:
            jax.debug.callback(lambda _: None, diff)
        return (i < max_iters) & (diff > tol)

    # NOTE: We perform one iteration of the while loop manually.
    # This ensures that `diff == tol` is achieved in `cond_fn` explicitly at the first iteration.
    F_old, F = F_anchor, _parallel_block_coordinate_descent_step(F_anchor)
    diff = _compute_diff(F_old, F)
    diff_hist = jnp.zeros(max_iters, dtype=jnp.float32)
    diff_hist = diff_hist.at[0].set(diff)
    it, _, F, diff, diff_hist = jax.lax.while_loop(cond_fn, body_fn, (0, F_old, F, diff, diff_hist))
    return F, it, diff_hist

@eqx.filter_jit
def single_step(w, d, R, t, max_iters=4, tol=1e-1, lam=2.0, disable_hang=False):
    return vmap(
        lambda w, d, f: prox_kabsch_solver(
            w, d, f, max_iters=max_iters, tol=tol, lam=lam, disable_hang=disable_hang
        )
    )(w, d, Frame(R, t))

# Load data from npz file and convert to jax arrays
path_to_data = "/path/to/data"
data = jnp.load(f"{path_to_data}/data.npz")
data = {k: jnp.array(v) for k, v in data.items()}

print("single step: Non-exact tolerance -- Does not hang")
F, it, diff_hist = single_step(**data, tol=1e-1 - 1e-8, disable_hang=False)
print(it)
print(diff_hist)

# NOTE: On an H100 a tolerance of 1e-1 will work, but on other hardware the tolerance may need to be modified
new_tol = 1e-1
# Use the exact diff recorded at the first iteration (first batch element) as the tolerance
new_tol = diff_hist[0, 0].item()

print("single step: Exact tolerance with disable_hang=True -- Does not hang")
F, it, diff_hist = single_step(**data, tol=new_tol, disable_hang=True)
print(it)
print(diff_hist)

print("single step: Exact tolerance -- hangs")
F, it, diff_hist = single_step(**data, tol=new_tol, disable_hang=False)
print(it)
print(diff_hist)

Data file to run mwe.py: data.npz.zip
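For anyone skimming the script above, the control-flow pattern it exercises boils down to roughly the following distilled sketch (hypothetical and for illustration only; we have not verified that this stripped-down form hangs on its own):

```python
# Hypothetical distillation of the pattern in mwe.py (not verified to hang on its own):
# a lax.while_loop whose condition compares a computed diff against a tolerance
# that can be bitwise-equal to it, with an optional host callback in the condition.
import jax
import jax.numpy as jnp


def run(tol: float, disable_hang: bool):
    def cond_fn(carry):
        i, x = carry
        diff = jnp.linalg.norm(x)
        if disable_hang:
            # In mwe.py, inserting a jax.debug.callback here avoids the hang.
            jax.debug.callback(lambda _: None, diff)
        return (i < 4) & (diff > tol)

    def body_fn(carry):
        i, x = carry
        return i + 1, x * 0.5

    return jax.lax.while_loop(cond_fn, body_fn, (0, jnp.ones(8, dtype=jnp.float32)))
```

The full reproducer still matters, since the hang only shows up with the real data and the Kabsch solver; the sketch just names the ingredients: an exact tolerance match in `cond_fn` and the presence or absence of a `jax.debug.callback`.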
-
We have a model training script that began to experience deadlocks during GPU computation after upgrading from jax 0.4.13 to 0.4.25+. In particular, the issue appears with jaxlib 0.4.25, disappears with 0.4.26, and is present again from jaxlib 0.4.27 onwards. We'd appreciate any insights into how we can better understand what's going on. We're attempting to create an MRE in the meantime, but our training code is quite complicated and we're still bisecting the issue. The deadlocks occur in both single- and multi-GPU training runs, and when testing the single-GPU case, removing all sharding-related code does not resolve the issue.
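(For clarity, each run can confirm which builds and devices it is actually using with something like the following; this is a hypothetical snippet, not our actual logging:)

```python
# Hypothetical snippet: record the exact jax/jaxlib builds and backend devices
# for a run, to keep the version bisection honest.
import jax

print("jax:", jax.__version__)
jax.print_environment_info()  # prints jax/jaxlib versions, devices, and platform details
```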
Regression description:
At some point during training we call into a jitted `single_step` function (computing loss and gradients) and this function never exits (nor does it crash), as evidenced by a py-spy trace. This happens non-deterministically, minutes to hours into training runs. We're using Weights & Biases for logging, and from the system resource logs we can see that at the time of the deadlock our GPU power usage decreases to a nontrivial level and stays there with extremely low variation (image below); looking at the Python process, we can see that it's waiting for control to return. To reiterate, this appears to be a regression: our training runs just fine on jaxlib 0.4.24 and below. Here's what the GPU power usage looks like, with the hang occurring at around ~26k on the x-axis:

For reference, our H100s idle at ~100W, so something is happening.
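In case it helps anyone instrumenting a similar hang, below is a minimal sketch of the kind of watchdog that could be wrapped around a training step so a stuck step dumps Python stacks instead of hanging silently (hypothetical; not the code we actually run):

```python
# Hypothetical watchdog: arm a timer around each training step; if the step
# (including the device work) does not finish within `timeout` seconds, dump
# all Python thread stacks to stderr.
import faulthandler

import jax


def guarded_step(step_fn, *args, timeout=600.0):
    faulthandler.dump_traceback_later(timeout, exit=False)
    try:
        out = step_fn(*args)
        jax.block_until_ready(out)  # wait for device execution, not just dispatch
        return out
    finally:
        faulthandler.cancel_dump_traceback_later()
```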
Attempt at diagnostics:
When I exec into the training pod after the hang has occurred, I see that the training Python process is alive (PID 1), but it's waiting on a futex. I obtained a backtrace from `gdb`; I can provide the rest of the `bt` if anyone thinks it would be helpful.
Obligatory environment dump:
Any insights into what's happening or suggestions for debugging this issue would be massively appreciated!