Performance regression while scaling up GPUs when computing determinants. #28675
-
Hi everyone, I am currently trying to parallelize some code I have written over multiple GPUs. I followed the "Distributed arrays and automatic parallelization" guide in the documentation, and for the example network provided there I get a significant speedup when working with up to four GPUs. However, when I tried to do the same thing with my own code the speedup was much less significant, and it actually slowed down when going from 2 GPUs to 4 GPUs. From testing with a small example, the issue seems to be with parallelizing over determinants. Is there any known issue with parallelizing over determinants? Thanks in advance for any help. I have attached the script I was testing with, along with its output, below.
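
For context, a minimal sketch of the kind of setup involved is below. This is illustrative only and not the attached script: the mesh axis name, batch size, matrix size, and function name are all made up.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1D mesh over however many GPUs are visible (1, 2, or 4 in the tests described above).
mesh = Mesh(np.array(jax.devices()), axis_names=("batch",))

# A batch of random square matrices, sharded along the batch axis.
batch, n = 8192, 64  # illustrative sizes only
x = jax.random.normal(jax.random.PRNGKey(0), (batch, n, n))
x = jax.device_put(x, NamedSharding(mesh, P("batch", None, None)))

@jax.jit
def batched_slogdet(m):
    # Each determinant is independent, so the batch dimension should,
    # in principle, scale across devices.
    return jnp.linalg.slogdet(m)

signs, logdets = batched_slogdet(x)
logdets.block_until_ready()
```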
-
Good question! It looks like `slogdet` is backed by either a QR or LU decomposition. These decompositions are themselves backed by calls to cuSOLVER on NVIDIA GPUs. Until recently (JAX v0.5.3, I think), these library calls didn't support sharding like this out of the box. If you try your experiment with the latest version of JAX, I predict that you will see the scaling you expect. For older versions of JAX like the one you're using, the recommendation would be to use `shard_map`.
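
In case it helps, a rough sketch of what the `shard_map` route could look like is below. It is only a sketch: it assumes the `jax.experimental.shard_map` import path used by older JAX releases, and the mesh axis name, batch size, and matrix size are placeholders.

```python
from functools import partial

import numpy as np
import jax
import jax.numpy as jnp
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("batch",))

@jax.jit
@partial(shard_map, mesh=mesh,
         in_specs=P("batch", None, None),  # split the batch of matrices across devices
         out_specs=P("batch"))             # sign and logdet both come back batch-sharded
def sharded_slogdet(m):
    # Each device calls the cuSOLVER-backed slogdet on its local slice of the
    # batch; no cross-device communication is needed.
    return jnp.linalg.slogdet(m)

x = jax.random.normal(jax.random.PRNGKey(0), (8192, 64, 64))
signs, logdets = sharded_slogdet(x)
```

Since each matrix's determinant is computed independently, every device just runs its local slice of the batch through the usual cuSOLVER path, which sidesteps the lack of sharded linear-algebra support in older releases.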