DRAM Bandwidth Compression #148
-
Hey @electrojustin - cool work! I took a brief look at it. Here's my perspective (Austin has some thoughts as well): it could be useful, but keep a few things in mind.
So just keep in mind that unrolling and decoding in your hot loop each have a cost; we just found ours worth paying, given the advantages of inlining / immediates.
There are some different opinions on this, so feel free to ask around (or just hack our kernel generator, or build your own). We used Jinja, a Python templating engine, to quickly rig a JIT compiler. No substitute for benchmarking :)
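To make the inlining / immediates point concrete, here is a hedged sketch (not OpenEquivariance's actual generated code; the kernel names, signatures, and coefficient values are made up for illustration) contrasting a generic kernel that decodes coefficient/index data in its hot loop with the kind of fully unrolled body a Jinja template can emit:

```cuda
// Generic version: coefficients/indices are data, loaded and decoded in the hot loop.
__global__ void tp_generic(const float* __restrict__ in1,
                           const float* __restrict__ in2,
                           float* __restrict__ out,
                           const float* __restrict__ cg_vals,  // coefficients in DRAM
                           const int3* __restrict__ cg_idx,    // (i, j, k) index triples
                           int nnz, int batch,
                           int in1_dim, int in2_dim, int out_dim) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // one batch element per thread
    if (b >= batch) return;
    const float* x = in1 + b * in1_dim;
    const float* y = in2 + b * in2_dim;
    float* z = out + b * out_dim;
    for (int n = 0; n < nnz; n++) {                  // decode in the hot loop
        int3 idx = cg_idx[n];
        z[idx.z] += cg_vals[n] * x[idx.x] * y[idx.y];
    }
}

// Unrolled version: the template stamps out one FMA per nonzero coefficient,
// so indices become fixed offsets and coefficients become immediates.
__global__ void tp_unrolled(const float* __restrict__ in1,
                            const float* __restrict__ in2,
                            float* __restrict__ out,
                            int batch, int in1_dim, int in2_dim, int out_dim) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= batch) return;
    const float* x = in1 + b * in1_dim;
    const float* y = in2 + b * in2_dim;
    float* z = out + b * out_dim;
    // {% for (i, j, k, c) in nonzeros %}   <- roughly what the Jinja loop would look like
    z[2] += 0.70710678f * x[0] * y[1];      // emitted lines; values illustrative only
    z[1] += 0.70710678f * x[0] * y[2];
    z[0] += -0.57735027f * x[1] * y[1];
    // {% endfor %}
}
```

The tradeoff being pointed at: the unrolled version eliminates the coefficient/index loads entirely, at the cost of code size (and JIT time) growing with the number of nonzeros.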
-
Thank you so much for taking a look and responding, @vbharadwaj-bk! Re benchmarking: sorry, I should have led with some details about my benchmarks. Some thoughts on your technical comments:
Of course, that analysis is predicated on only writing to the output vector after the accumulations are done, which is tricky to do without branching. If you use "+=" after every multiplication, that's certainly going to generate a lot of bus traffic with the extra load/store. "atomicAdd()" seems to perform significantly better despite the atomicity constraints, which makes me think Nvidia's DDR controller has some special dedicated support for that operation and it's effectively just a store from the bus's perspective. My little opportunistic clustering algorithm seems to cut the number of atomicAdd() calls roughly in half compared to the naive approach for my rank 2 benchmark, but again, ideally we should be cutting that by a factor of 3.
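For what it's worth, here is a minimal sketch of the register-accumulate-then-flush pattern described above (illustrative only; `Term`, `clusters`, and the launch layout are assumptions, not the prototype's actual data structures):

```cuda
// Threads cooperate on one batch element: each thread owns a run ("cluster") of
// nonzero terms that share the same output index, accumulates in a register, and
// flushes with a single atomicAdd instead of a global "+=" per multiplication.

struct Term { int i, j, k; float c; };   // hypothetical term: inputs (i, j) -> output k, coefficient c

__global__ void tp_cluster_flush(const float* __restrict__ in1,
                                 const float* __restrict__ in2,
                                 float* __restrict__ out,
                                 const Term* __restrict__ terms,
                                 const int2* __restrict__ clusters,  // (start, len), one per thread
                                 int in1_dim, int in2_dim, int out_dim) {
    int b = blockIdx.x;                       // one batch element per block
    const float* x = in1 + b * in1_dim;
    const float* y = in2 + b * in2_dim;
    float* z = out + b * out_dim;

    int2 span = clusters[threadIdx.x];        // assumes blockDim.x == number of clusters
    float acc = 0.0f;
    int k = terms[span.x].k;                  // every term in the cluster shares this output index
    for (int n = span.x; n < span.x + span.y; n++)
        acc += terms[n].c * x[terms[n].i] * y[terms[n].j];

    // One read-modify-write on the bus per cluster, instead of one per term.
    atomicAdd(&z[k], acc);
}
```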
Did you find that encoding the coefficients and indices as immediates negatively impacted your icache? I kind of assumed that this would effectively shift the bulk of your data loads from dcache to icache, but if Nvidia uses, for example, a fixed-width instruction set, then maybe mixing data and code is effectively a compression algorithm in its own right, since you're already paying the cost of loading a full instruction word regardless of whether the operand is an immediate or a register. Unless, of course, 32-bit immediates need to be loaded with multiple instructions à la ARM or RISC-V...
-
Pretty cool! The benchmarking looks solid. I'll be in and out of responding since this week is a bit busy. Interesting... you might want to try the experiment with some longer input irreps and the Channelwise TPs in MACE / Nequip rather than FullTensorProduct. Another question: why not store input_indices, the object you are decoding, in shared memory? It seems to be the same for every batch element, so I'm not sure you have to load it from global memory each time. Or am I wrong?
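A rough sketch of that suggestion (the name `input_indices` is from the thread, but its size and layout here are assumptions): the block cooperatively stages the index stream into shared memory once, then every thread decodes from on-chip memory instead of re-reading global memory per batch element.

```cuda
__global__ void tp_shared_indices(const float* __restrict__ in1,
                                  const float* __restrict__ in2,
                                  float* __restrict__ out,
                                  const unsigned char* __restrict__ input_indices,
                                  int indices_bytes,
                                  int in1_dim, int in2_dim, int out_dim) {
    extern __shared__ unsigned char s_indices[];   // sized at launch: indices_bytes

    // Cooperative copy: the whole block loads the index stream exactly once.
    for (int i = threadIdx.x; i < indices_bytes; i += blockDim.x)
        s_indices[i] = input_indices[i];
    __syncthreads();

    // ... decode from s_indices[] and do the contraction for this block's batch
    // elements; the per-element work then only touches in1/in2/out in global memory.
}
```

The dynamic shared memory size is passed at launch, e.g. `tp_shared_indices<<<grid, block, indices_bytes>>>(...)`.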
-
Yes, that's a good description of it. The self-interaction / mixing in Nequip or MACE comes later, from a Linear layer applied after the TP + scatter. The weights for this latter Linear layer are shared, which improves the efficiency of the overall operation (i.e. the weights in the UVU TP are unique to each batch element, since they come from atomic distances; the weights of the Linear layer following the tensor product are common to all batch elements, improving efficiency and allowing linear mixing). Looking at the interaction block code for Nequip and MACE was helpful for us in understanding this design pattern. For MACE-large we clearly have room for improvement through kernel tuning, etc. - but yeah, we (and cuE) optimize fairly extensively for UVU kernels by inlining. The backward pass (the dominant cost even at inference time) is a bit more interesting, since the instruction count can be much larger than in the forward case.
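For readers following along, a schematic sketch of that pattern (not NequIP/MACE code; the per-edge "TP" is reduced to an elementwise weighted product purely to show where the weights live):

```cuda
// Step 1: per-edge TP + scatter. edge_weights are unique to each edge (from distances).
__global__ void edge_tp_scatter(const float* __restrict__ node_feat,    // [n_nodes * F]
                                const float* __restrict__ edge_weights, // [n_edges * F], per edge
                                const int2*  __restrict__ edges,        // (src, dst) per edge
                                float* __restrict__ node_acc,           // [n_nodes * F], zeroed
                                int n_edges, int F) {
    int e = blockIdx.x;
    int f = threadIdx.x;
    if (e >= n_edges || f >= F) return;
    int src = edges[e].x, dst = edges[e].y;
    float msg = edge_weights[e * F + f] * node_feat[src * F + f];  // stand-in for the UVU TP
    atomicAdd(&node_acc[dst * F + f], msg);                        // scatter onto destination node
}

// Step 2: self-interaction. One [F x F] weight matrix shared by every node, so this is
// just a plain GEMM and amortizes well over the batch.
__global__ void node_linear(const float* __restrict__ node_acc,  // [n_nodes * F]
                            const float* __restrict__ weight,    // [F * F], shared weights
                            float* __restrict__ node_out,        // [n_nodes * F]
                            int n_nodes, int F) {
    int n = blockIdx.x;
    int f = threadIdx.x;
    if (n >= n_nodes || f >= F) return;
    float acc = 0.0f;
    for (int g = 0; g < F; g++)
        acc += node_acc[n * F + g] * weight[g * F + f];
    node_out[n * F + f] = acc;
}
```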
-
Hi!
I've been experimenting a little with implementing DRAM bandwidth compression algorithms for tensor products to try to alleviate some of the bus pressure, since as I understand it most implementations are memory bound. I have a little proof of concept here which seems to run 2-3X faster than e3nn's implementation. The Clebsch-Gordan coefficients and indices are pretty low in Shannon entropy, so I was able to find some low-hanging fruit.
Do you folks think these kinds of strategies might be worth exploring further? If I understand these templates correctly, OpenEquivariance takes a totally different approach and just inlines everything, encoding the CG coefficients and metadata as immediates? But I'm wondering if DRAM bandwidth compression might complement some of the other optimization strategies here. For example, my prototype delta-compresses the input indices in groups of exactly three to avoid introducing branches in the kernel. This often results in either wasteful padding bytes when adjacent indices can't be compressed together, or in encoding indices in uncompressed form needlessly. But the exact pattern of which indices can be compressed together and which cannot is determined solely by the irreps, so there should be a way to JIT compile an ideal compression scheme.
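For concreteness, here is a hypothetical decode for a fixed-width delta scheme like the one described (the prototype's actual byte layout isn't given here, so the 6-byte-per-group format below is an assumption): each group packs one full 32-bit base index plus two signed 8-bit deltas, so three indices cost 6 bytes instead of 12 and the decode is branch-free.

```cuda
__device__ __forceinline__ void decode_group(const unsigned char* __restrict__ stream,
                                             int group, int idx[3]) {
    const unsigned char* p = stream + group * 6;              // fixed-width groups: no branching
    unsigned int base = (unsigned int)p[0]
                      | ((unsigned int)p[1] << 8)
                      | ((unsigned int)p[2] << 16)
                      | ((unsigned int)p[3] << 24);           // little-endian 32-bit base index
    idx[0] = (int)base;
    idx[1] = idx[0] + (signed char)p[4];                      // deltas fit in one byte each; groups
    idx[2] = idx[1] + (signed char)p[5];                      // that don't fit would be emitted
                                                              // padded/uncompressed by the encoder
}
```

A JIT-compiled scheme could go further and specialize the group layout per irrep pattern, since the encoder knows at compile time exactly which adjacent indices are compressible.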