GPU implementation of birth-death sampling million times slower? #18461
Replies: 2 comments
-
I think the issue here is that your algorithm is ill-suited to run on GPU, no matter how you express it. GPUs in general are very good at running implicitly parallelized vector operations over arrays; modern CPUs are OK in this regime, but will not match the speed of GPUs for such problems. GPUs are very bad at sequential operations over single values stored in memory, which is exactly the regime where CPUs excel.

Your computation deals sequentially with individual array elements, with no possible parallelization, because the input state of each step depends explicitly on the output state of the previous step. It falls squarely within the latter regime, so I'd expect that no matter how you express it (even by writing a custom-tuned CUDA kernel) you'll never get this to run as fast on GPU as it does on CPU.

You may still find a use for GPUs in this problem if, say, you hope to run many such procedures at once. Then each step of the sequential procedure could be parallelized across the batch to take advantage of the GPU hardware, and you could likely do much better than the CPU, which would essentially have to run each of the many sequences individually (see the sketch below). Does that make sense?
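To make the "many procedures at once" idea concrete, here is a minimal sketch of batching independent chains with `jax.vmap` over a `jax.lax.scan`. The update function is a toy stand-in, not the actual birth-death step, and all names are illustrative:

```python
import jax
import jax.numpy as jnp
from functools import partial

@partial(jax.jit, static_argnums=2)
def run_chains(keys, x0s, n_steps):
    """Run a batch of independent sequential chains in lockstep."""
    def one_chain(key, x0):
        def step(carry, _):
            x, key = carry
            key, subkey = jax.random.split(key)
            # Toy sequential update standing in for one birth-death step:
            # each step depends on the previous state, so time stays serial.
            x = x + 0.1 * jax.random.normal(subkey, x.shape)
            return (x, key), None
        (x, _), _ = jax.lax.scan(step, (x0, key), None, length=n_steps)
        return x
    # vmap parallelizes every time step across all chains at once.
    return jax.vmap(one_chain)(keys, x0s)

keys = jax.random.split(jax.random.PRNGKey(0), 1024)  # one key per chain
x0s = jnp.zeros((1024, 100))                          # 1024 chains, 100 particles each
finals = run_chains(keys, x0s, 500)
```

The scan stays sequential in time, but each of its steps now operates on the full `(1024, 100)` batch, which is the vectorized workload GPUs are built for.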
-
Yes it does! Thank you so much for your fast response. One strategy would then be to identify the parallelizable computations and run them on GPU, while performing the sequential logic on CPU. Are there any good practices for mixing computations in this way? Or do you think it shouldn't matter?
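One common way to express this split in JAX (a sketch, not from this thread; the toy computation and variable names are illustrative) is to commit arrays to a device with `jax.device_put`, since JAX runs each operation on the device its inputs are committed to, and to pull results back to the host for the sequential part:

```python
import numpy as np
import jax
import jax.numpy as jnp

gpu = jax.devices("gpu")[0]   # assumes a GPU is available

# Parallel phase: commit the data to the GPU; subsequent operations
# on this array execute there.
x = jax.device_put(jnp.arange(1_000_000, dtype=jnp.float32), gpu)
weights = jnp.exp(-0.5 * x**2)   # runs on the GPU

# Sequential phase: transfer the result to the host once, then run the
# step-by-step logic in plain NumPy/Python, where per-step latency is low.
w = np.asarray(weights)          # single device-to-host transfer
acc = 0.0
for wi in w[:10]:                # stand-in for the sequential logic
    acc += float(wi)
```

The main cost to watch is the host-device transfer itself, so it generally pays to move data between phases in a few large copies rather than many small ones.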
-
Hi all,
I am new to JAX and was hoping to get some feedback on my implementation of a birth-death sampling algorithm. My JIT-compiled GPU version is a million times slower than a standard NumPy implementation, according to my benchmarks!
The logic is fairly simple:
Input: the particle locations and, for each particle $n$, a quantity $\Lambda_n$ measuring the local excess or deficit of mass.

If there is an excess of mass, then $\Lambda_n > 0$ and that particle will "teleport" to another randomly chosen particle. On the other hand, if a particle is in a "deficit" region, $\Lambda_n < 0$, then another randomly chosen particle will teleport to that location.
Note: The algorithm requires keeping track of which particles have already teleported.
In standard NumPy, the logic is a sequential loop over the particles, where I use a mask to keep track of which particles have already been teleported.
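A minimal sketch of that kind of loop (the exact original code is not shown here; the names `x`, `Lam`, and `alive` are illustrative):

```python
import numpy as np

def birth_death_step(x, Lam, rng):
    """One sequential pass over the particles.

    Lam[n] > 0 (excess): particle n teleports to a randomly chosen live particle.
    Lam[n] < 0 (deficit): a randomly chosen live particle teleports onto x[n].
    `alive` tracks which particles have not yet teleported this pass.
    """
    n_particles = len(x)
    alive = np.ones(n_particles, dtype=bool)
    for n in range(n_particles):
        if not alive[n]:
            continue
        # Candidate partners: live particles other than n.
        candidates = np.flatnonzero(alive)
        candidates = candidates[candidates != n]
        if candidates.size == 0:
            break
        m = rng.choice(candidates)
        if Lam[n] > 0:          # excess: n jumps to m's location
            x[n] = x[m]
            alive[n] = False
        elif Lam[n] < 0:        # deficit: m jumps to n's location
            x[m] = x[n]
            alive[m] = False
    return x
```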
This is my best attempt at writing a compilable version of the previous code. I keep track of which particles are alive/dead with a binary array of 1s and 0s, and use it as an unnormalized probability mass function in `jax.random.choice` to select particles. This is computationally inefficient compared to the first approach, but I am stuck on how else to perform this operation. Any thoughts or suggestions?
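A sketch of this kind of compiled loop (again, the exact original code is not shown; the update rule and names are illustrative):

```python
import jax
import jax.numpy as jnp

@jax.jit
def birth_death_step(key, x, lam):
    """Compilable analogue of the NumPy loop. `alive` is a 1/0 float mask
    turned into a probability mass function for jax.random.choice."""
    n = x.shape[0]

    def body(i, carry):
        key, x, alive = carry
        key, subkey = jax.random.split(key)
        # Exclude particle i itself, then normalize the mask into a pmf.
        p = alive.at[i].set(0.0)
        p = jnp.where(p.sum() > 0, p / p.sum(), jnp.ones(n) / n)
        m = jax.random.choice(subkey, n, p=p)
        active = alive[i] > 0
        excess = active & (lam[i] > 0)
        deficit = active & (lam[i] < 0)
        # excess: x[i] <- x[m]; deficit: x[m] <- x[i]; else unchanged.
        x = x.at[i].set(jnp.where(excess, x[m], x[i]))
        x = x.at[m].set(jnp.where(deficit, x[i], x[m]))
        alive = alive.at[i].set(jnp.where(excess, 0.0, alive[i]))
        alive = alive.at[m].set(jnp.where(deficit, 0.0, alive[m]))
        return key, x, alive

    _, x, _ = jax.lax.fori_loop(0, n, body, (key, x, jnp.ones(n)))
    return x
```

Because each `fori_loop` iteration depends on the previous one, the whole pass executes as a long chain of tiny sequential operations on the device.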