DRAM Bandwidth Compression #148
-
Hey @electrojustin - cool work! I took a brief look at it. Here's my perspective (Austin has some thoughts as well): it could be useful, but keep a few things in mind.
So just keep in mind that unrolling and decoding in your hot loop each have a cost; we just found ours worth paying, given the advantages of inlining / immediates.
There are some different opinions on this, so feel free to ask around (or just hack our kernel generator, or build your own). We used Jinja, a Python templating engine, to quickly rig a JIT compiler. No substitute for benchmarking :)
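To make the inlining / immediates point concrete, here is a hedged sketch (not OpenEquivariance's actual generated code; the kernel names, signatures, and coefficient values are made up for illustration) contrasting a generic kernel that decodes coefficient/index data in its hot loop with the kind of fully unrolled body a Jinja template can emit:

```cuda
// Generic version: coefficients/indices are data, loaded and decoded in the hot loop.
__global__ void tp_generic(const float* __restrict__ in1,
                           const float* __restrict__ in2,
                           float* __restrict__ out,
                           const float* __restrict__ cg_vals,  // coefficients in DRAM
                           const int3* __restrict__ cg_idx,    // (i, j, k) index triples
                           int nnz, int batch,
                           int in1_dim, int in2_dim, int out_dim) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // one batch element per thread
    if (b >= batch) return;
    const float* x = in1 + b * in1_dim;
    const float* y = in2 + b * in2_dim;
    float* z = out + b * out_dim;
    for (int n = 0; n < nnz; n++) {                  // decode in the hot loop
        int3 idx = cg_idx[n];
        z[idx.z] += cg_vals[n] * x[idx.x] * y[idx.y];
    }
}

// Unrolled version: the template stamps out one FMA per nonzero coefficient,
// so indices become fixed offsets and coefficients become immediates.
__global__ void tp_unrolled(const float* __restrict__ in1,
                            const float* __restrict__ in2,
                            float* __restrict__ out,
                            int batch, int in1_dim, int in2_dim, int out_dim) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= batch) return;
    const float* x = in1 + b * in1_dim;
    const float* y = in2 + b * in2_dim;
    float* z = out + b * out_dim;
    // {% for (i, j, k, c) in nonzeros %}   <- roughly what the Jinja loop would look like
    z[2] += 0.70710678f * x[0] * y[1];      // emitted lines; values illustrative only
    z[1] += 0.70710678f * x[0] * y[2];
    z[0] += -0.57735027f * x[1] * y[1];
    // {% endfor %}
}
```

The tradeoff being pointed at: the unrolled version eliminates the coefficient/index loads entirely, at the cost of code size (and JIT time) growing with the number of nonzeros.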
-
Thank you so much for taking a look and responding, @vbharadwaj-bk! Re benchmarking: sorry, I should have led with some details about my benchmarks. Some thoughts on your technical comments:
Of course, that analysis is predicated on only writing to the output vector after the accumulations are done, which is tricky to do without branching. If you use "+=" after every multiplication, that's certainly going to generate a lot of bus traffic with the extra load/store. "atomicAdd()" seems to perform significantly better despite the atomicity constraints, which makes me think Nvidia's DDR controller has some special dedicated support for that operation and it's effectively just a store from the bus's perspective. My little opportunistic clustering algorithm seems to cut the number of atomicAdd() calls roughly in half compared to the naive approach for my rank 2 benchmark, but again, ideally we should be cutting that by a factor of 3.
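For what it's worth, here is a minimal sketch of the register-accumulate-then-flush pattern described above (illustrative only; `Term`, `clusters`, and the launch layout are assumptions, not the prototype's actual data structures):

```cuda
// Threads cooperate on one batch element: each thread owns a run ("cluster") of
// nonzero terms that share the same output index, accumulates in a register, and
// flushes with a single atomicAdd instead of a global "+=" per multiplication.

struct Term { int i, j, k; float c; };   // hypothetical term: inputs (i, j) -> output k, coefficient c

__global__ void tp_cluster_flush(const float* __restrict__ in1,
                                 const float* __restrict__ in2,
                                 float* __restrict__ out,
                                 const Term* __restrict__ terms,
                                 const int2* __restrict__ clusters,  // (start, len), one per thread
                                 int in1_dim, int in2_dim, int out_dim) {
    int b = blockIdx.x;                       // one batch element per block
    const float* x = in1 + b * in1_dim;
    const float* y = in2 + b * in2_dim;
    float* z = out + b * out_dim;

    int2 span = clusters[threadIdx.x];        // assumes blockDim.x == number of clusters
    float acc = 0.0f;
    int k = terms[span.x].k;                  // every term in the cluster shares this output index
    for (int n = span.x; n < span.x + span.y; n++)
        acc += terms[n].c * x[terms[n].i] * y[terms[n].j];

    // One read-modify-write on the bus per cluster, instead of one per term.
    atomicAdd(&z[k], acc);
}
```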
Did you find that encoding the coefficients and indices as immediates negatively impacted your icache? I kind of assumed that this would effectively shift the bulk of your data loads from dcache to icache, but if Nvidia uses, for example, a fixed-width instruction set, then maybe mixing data and code is effectively a compression algorithm in its own right, since you're already paying the cost of loading a full instruction word regardless of whether the operand is an immediate or a register. Unless, of course, 32-bit immediates need to be loaded with multiple instructions à la ARM or RISC-V...
-
Pretty cool! The benchmarking looks solid. I'll be in and out of responding since this week is a bit busy. Interesting... you might want to try the experiment with some longer input irreps and the Channelwise TPs in MACE / Nequip rather than FullTensorProduct. Another question: why not store input_indices, the object you are decoding, in shared memory? It seems to be the same for every batch element, so I'm not sure you have to load it from global memory each time. Or am I wrong?
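A rough sketch of that suggestion (the name `input_indices` is from the thread, but its size and layout here are assumptions): the block cooperatively stages the index stream into shared memory once, then every thread decodes from on-chip memory instead of re-reading global memory per batch element.

```cuda
__global__ void tp_shared_indices(const float* __restrict__ in1,
                                  const float* __restrict__ in2,
                                  float* __restrict__ out,
                                  const unsigned char* __restrict__ input_indices,
                                  int indices_bytes,
                                  int in1_dim, int in2_dim, int out_dim) {
    extern __shared__ unsigned char s_indices[];   // sized at launch: indices_bytes

    // Cooperative copy: the whole block loads the index stream exactly once.
    for (int i = threadIdx.x; i < indices_bytes; i += blockDim.x)
        s_indices[i] = input_indices[i];
    __syncthreads();

    // ... decode from s_indices[] and do the contraction for this block's batch
    // elements; the per-element work then only touches in1/in2/out in global memory.
}
```

The dynamic shared memory size is passed at launch, e.g. `tp_shared_indices<<<grid, block, indices_bytes>>>(...)`.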
-
Yes, that's a good description of it. The self-interaction / mixing in Nequip or MACE comes later, from a Linear layer applied after the TP + scatter. The weights for this latter Linear layer are shared, which improves the efficiency of the overall operation (i.e. the weights in the UVU TP are unique to each batch element, since they come from atomic distances; the weights of the Linear layer following the tensor product are common to all batch elements, improving efficiency and allowing linear mixing). Looking at the interaction block code for Nequip and MACE was helpful for us in understanding this design pattern. For MACE-large we clearly have room for improvement through kernel tuning, etc. - but yeah, we (and cuE) optimize fairly extensively for UVU kernels by inlining. The backward pass (the dominant cost even at inference time) is a bit more interesting, since the instruction count can be much larger than in the forward case.
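For readers following along, a schematic sketch of that pattern (not NequIP/MACE code; the per-edge "TP" is reduced to an elementwise weighted product purely to show where the weights live):

```cuda
// Step 1: per-edge TP + scatter. edge_weights are unique to each edge (from distances).
__global__ void edge_tp_scatter(const float* __restrict__ node_feat,    // [n_nodes * F]
                                const float* __restrict__ edge_weights, // [n_edges * F], per edge
                                const int2*  __restrict__ edges,        // (src, dst) per edge
                                float* __restrict__ node_acc,           // [n_nodes * F], zeroed
                                int n_edges, int F) {
    int e = blockIdx.x;
    int f = threadIdx.x;
    if (e >= n_edges || f >= F) return;
    int src = edges[e].x, dst = edges[e].y;
    float msg = edge_weights[e * F + f] * node_feat[src * F + f];  // stand-in for the UVU TP
    atomicAdd(&node_acc[dst * F + f], msg);                        // scatter onto destination node
}

// Step 2: self-interaction. One [F x F] weight matrix shared by every node, so this is
// just a plain GEMM and amortizes well over the batch.
__global__ void node_linear(const float* __restrict__ node_acc,  // [n_nodes * F]
                            const float* __restrict__ weight,    // [F * F], shared weights
                            float* __restrict__ node_out,        // [n_nodes * F]
                            int n_nodes, int F) {
    int n = blockIdx.x;
    int f = threadIdx.x;
    if (n >= n_nodes || f >= F) return;
    float acc = 0.0f;
    for (int g = 0; g < F; g++)
        acc += node_acc[n * F + g] * weight[g * F + f];
    node_out[n * F + f] = acc;
}
```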
-
Hi!
I've been experimenting a little with implementing DRAM bandwidth compression algorithms for tensor products to try to alleviate some of the bus pressure, since as I understand it most implementations are memory bound. I have a little proof of concept here which seems to run 2-3X faster than e3nn's implementation. The Clebsch-Gordan coefficients and indices are pretty low in Shannon entropy, so I was able to find some low-hanging fruit.
Do you folks think these kinds of strategies might be worth exploring further? If I understand these templates correctly, OpenEquivariance takes a totally different approach and just inlines everything, encoding the CG coefficients and metadata as immediates? But I'm wondering if DRAM bandwidth compression might complement some of the other optimization strategies here. For example, my prototype delta-compresses the input indices in groups of exactly three to avoid introducing branches in the kernel. This often results in either wasteful padding bytes when adjacent indices can't be compressed together, or in encoding indices in uncompressed form needlessly. But the exact pattern of which indices can be compressed together and which cannot is determined solely by the irreps, so there should be a way to JIT compile an ideal compression scheme.
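For concreteness, here is a hypothetical decode for a fixed-width delta scheme like the one described (the prototype's actual byte layout isn't given here, so the 6-byte-per-group format below is an assumption): each group packs one full 32-bit base index plus two signed 8-bit deltas, so three indices cost 6 bytes instead of 12 and the decode is branch-free.

```cuda
__device__ __forceinline__ void decode_group(const unsigned char* __restrict__ stream,
                                             int group, int idx[3]) {
    const unsigned char* p = stream + group * 6;              // fixed-width groups: no branching
    unsigned int base = (unsigned int)p[0]
                      | ((unsigned int)p[1] << 8)
                      | ((unsigned int)p[2] << 16)
                      | ((unsigned int)p[3] << 24);           // little-endian 32-bit base index
    idx[0] = (int)base;
    idx[1] = idx[0] + (signed char)p[4];                      // deltas fit in one byte each; groups
    idx[2] = idx[1] + (signed char)p[5];                      // that don't fit would be emitted
                                                              // padded/uncompressed by the encoder
}
```

A JIT-compiled scheme could go further and specialize the group layout per irrep pattern, since the encoder knows at compile time exactly which adjacent indices are compressible.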