Poor Jax Sparse Performance with GPU #8531

brosand · 2021-11-12T22:06:35Z

brosand
Nov 12, 2021

We are using Jax for some simulations, often involving highly sparse matrices. On cpus Jax has a clear performance jump with sparse matrices and sparse evaluation. However when transferred to the GPU we see a slowdown, with sparse jax evaluation slower than dense jax. Is this expected? I imagine there may be some optimization at the lower levels that has not taken place yet. We have tested with square and rectangular matrices, as well as 1d arrays, all with similar results. I have attached the code for the rectangular matrix as an example. Is there any expectation that gpu sparse evaluation will improve?

jax_discussion_upload.txt

Answered by jakevdp

Nov 12, 2021

Hi - thanks for the question. In general, unless you have very sparse matrices, I would not expect sparse versions of matrix products to be faster than dense versions of matrix products, particularly on accelerators like GPU and TPU. This is not just a statement about JAX – I would expect this to hold for virtually any sparse and dense matrix algebra libraries.

Why? Accelerators like GPU and TPU are specifically designed for dense linear algebra, and take advantage of decades of engineering best practices for those specific operations. These optimizations rely on things like data locality guarantees & look aheads, the ability to blindly scan over standard data layouts in parallel. Sparse …

View full answer

jakevdp · 2021-11-12T22:30:58Z

jakevdp
Nov 12, 2021
Maintainer

Hi - thanks for the question. In general, unless you have very sparse matrices, I would not expect sparse versions of matrix products to be faster than dense versions of matrix products, particularly on accelerators like GPU and TPU. This is not just a statement about JAX – I would expect this to hold for virtually any sparse and dense matrix algebra libraries.

Why? Accelerators like GPU and TPU are specifically designed for dense linear algebra, and take advantage of decades of engineering best practices for those specific operations. These optimizations rely on things like data locality guarantees & look aheads, the ability to blindly scan over standard data layouts in parallel. Sparse methods don't satisfy any of these constraints (e.g. they tend to be very non-local), and so will be much slower. Again, this has nothing to do with JAX: this will be true of any sparse algorithm implemented on hardware fundamentally designed for dense operations.

So why does JAX have experimental sparse support at all? Well, there are occasions when it is useful, particularly for extremely sparse matrices, or for situations where the dense computation could not be done because of memory constraints. If you're in that regime, I'd suggest using JAX sparse. If you're mainly concerned with 1000x1000 diagonal matrices, just use dense representations and let the hardware do its thing.

That being said, you can expect JAX sparse operations on GPU to be an order of magnitude faster in the very near future. We're actively working on GPU lowerings to cusparse for supported operations; the first part of that work will hopefully land in the main branch this afternoon (see #8514)

1 reply

brosand Nov 15, 2021
Author

Thanks Jake! This makes a lot of sense.

shailesh1729 · 2021-11-18T08:53:24Z

shailesh1729
Nov 18, 2021

@brosand, looking at your code, it appears that your sparse matrices are structured in the sense that they contain only a single sub/super diagonal which have non-zero entries while the rest of the matrix is all 0.

Have you considered a different approach where structured matrices can be represented as linear operators which can provide custom and fast implementation of A x and A^H x operations? Then such linear operators can be easily used within larger JAX based programs. This approach works well if you don't need to make any changes in A once the linear operator has been constructed.

The CR-Sparse library contains a good set of linear operators built on top of JAX. They can be JIT compiled and passed as static arguments to other functions which can be JIT compiled. You can check out the code for a diagonal operator which represents computations with a diagonal matrix efficiently. I am sure, custom logic for sub-diagonal and super-diagonal matrices can also be written. The examples gallery has several examples for these linear operators in action. The design of this idea is pretty simple and you can easily build something like this on your own for your specific needs.

5 replies

brosand Nov 18, 2021
Author

This is very interesting, I will look into this. Does the CR-Sparse library enable us to take advantage of the massive GPU speedups for denser matrices?

jakevdp Nov 18, 2021
Maintainer

The performance discussion in my answer is relevant to unstructured sparse matrices. If you have highly-structured sparse matrices (like diagonals), then you can express operations much more efficiently: for example matrix-vector multiplication with a diagonal matrix can be expressed as a pointwise product of the diagonal and the vector, which would be very efficient on a GPU; I think that sort of thing is what @shailesh1729 is referring to.

shailesh1729 Nov 18, 2021

Yes. For many sparse problems, the matrices tend to be structured and this can be exploited in coming up with efficient code. For unstructured, sparse matrices, jax.experimental.sparse.bcoo is the solution. There is no single solution that is the best always.

@brosand, you can see some of my Juypter notebooks for benchmarking CR-Sparse on GPU vs CPU here.
I have tabulated some results from different experiments here. You can try the code on your own too.

I would also mention that in one of the problems I was working on (SSC-OMP), the jax.experimental.sparse.bcoo formatted unstructured sparse matrices helped me solve huge problems on a very small GPU. Here is the example Juypter notebook fully using JAX sparse matrices.

brosand Nov 18, 2021
Author

This is great. A lot of our matrices are sparse so if we can take advantage of both sparse computation and gpu acceleration it would be valuable. I can try to upload some of our benchmarking as well in the future to show some of the other speedups we have seen. Is there any interest from the Jax team in having some lists of projects and relevant benchmarks for various uses of Jax? Looping in my project lead @DanPuzzuoli so he can be updated on this conversation.

shailesh1729 Nov 19, 2021

There is awesome-jax list. I guess that might fit?

Poor Jax Sparse Performance with GPU #8531

Uh oh!

brosand Nov 12, 2021

Replies: 2 comments · 6 replies

Uh oh!

Uh oh!

jakevdp Nov 12, 2021 Maintainer

Uh oh!

brosand Nov 15, 2021 Author

Uh oh!

shailesh1729 Nov 18, 2021

Uh oh!

brosand Nov 18, 2021 Author

Uh oh!

Uh oh!

jakevdp Nov 18, 2021 Maintainer

Uh oh!

shailesh1729 Nov 18, 2021

Uh oh!

brosand Nov 18, 2021 Author

Uh oh!

shailesh1729 Nov 19, 2021

brosand
Nov 12, 2021

Replies: 2 comments 6 replies

jakevdp
Nov 12, 2021
Maintainer

brosand Nov 15, 2021
Author

shailesh1729
Nov 18, 2021

brosand Nov 18, 2021
Author

jakevdp Nov 18, 2021
Maintainer

brosand Nov 18, 2021
Author