Sharded Matrices and How to Multiply Them | How To Scale Your Model #5

jacobaustin123 · 2025-02-03T02:22:12Z

jacobaustin123
Feb 3, 2025
Maintainer

Sharded matrix multiplications galore!

mitchellgoffpc · 2025-02-05T17:34:37Z

mitchellgoffpc
Feb 5, 2025 — with giscus

In the solution for pop quiz 2, the bidirectional ICI bandwidth for a TPU v5e is given as 9e10 bytes/s, which doesn't quite match the value of 1e11 bytes/s given in the table in part 2. Looking at https://cloud.google.com/tpu/docs/v5e, it appears that the value in the table is the correct one.

1 reply

jacobaustin123 Feb 5, 2025 — with giscus
Maintainer Author

Good catch. I think 9e10 is closer to the real value. I've updated the answer with that figure.

karunreddy30 · 2025-02-09T03:43:42Z

karunreddy30
Feb 9, 2025 — with giscus

In the Section - "A quick aside: how would we describe this in code?"

"For instance, in the above example, the local shape of A is [4, 1024] and for B is [2048, 4096]"

I think local shape of A is [2, 1024]?

1 reply

jacobaustin123 Feb 9, 2025
Maintainer Author

You're right, fixed.

burichh · 2025-02-09T20:13:30Z

burichh
Feb 9, 2025 — with giscus

On the first picture, you state that the shape of matrix A before sharding is [ $I$, $J$ ]. My immediate interpretation was that it means that it has $I$ rows and $J$ columns, but seemingly you invert the meaning of $I$ and $J$ on the sharded plot: $J_Y$ spans on the row dimension, and $I_X$ spans on the column dimension. For me this gets confusing with the Computation With Sharded Arrays chapter, when you explain for example in Case 1 that

$$A[I_X, J] \cdot B[J, K_Y] \rightarrow C[I_X, K_Y]$$

For me, this would mean that A is sharded across its rows, and B is sharded across its columns, thus we have everything to calculate a single element of the result C, because the contracting dimensions are not sharded. But because you reversed the meaning if $I$, and it means the columns, for me it gets a bit unintuitive how the matrix multiplication is trivial to execute.

Could you enlight this with an image?

I think I get the point, but visually it would help a lot what you mean exactly with these $I_X$, $J$ and $K_Y$.

6 replies

levskaya Feb 9, 2025
Collaborator

Do you recall which way did I have it before? I rows, J cols sure, but did I also have X vertical, Y horizontal to make things line up?

I was probably following matrix multiplication (row, col) ordering conventions when I first made the figures since people tend to have that ordering burned into their minds during school.

burichh Feb 10, 2025 — with giscus

So I created an image of what I believe you wanted to depict in the "Computation With Sharded Arrays Case 1: neither multiplicand has a sharded contracting dimension" with the notation above:

I changed the meaning of $I$ and $J$, so now $I$ is the row and $J$ is the column index for the $Data$ matrix. On the bottom I draw the matmul(Data, Weights) in a sharded way, where Data is sharded along its rows across the mesh dimension $Y$ and Weights is sharded along its columns across the mesh dimension $X$. In this notation:

$$Data[I_Y, J] \cdot Weights[J, K_X] = Out[I_Y, K_X]$$

.
I still feel that somehow this could be improved, because $J$ in $Data[I_Y, J]$ stands for the column index, while $J$ in $Weights[J, K_X]$ stands for the row index, which still might be confusing, looking only at the picture. Not even mentioning that the meaning of $K$ is not defined on the picture, although one can deduce that it must be the number of columns in $Weights$. Not easy to make this clear, I admit:D

manavgarg Feb 20, 2025 — with giscus

+1. I am a bit confused with the (I,J) notation. Does I refer to rows of a matrix? If that is the case, then why does (Ix,J) split the columns across the axis instead of rows.

Jiminator Mar 3, 2025

Wait yeah, I got lost on this one, because I feel like the diagrams are inconsistent with the text. If we use these images,
it seems that A[Ix,J]*B[J,Ky] is shared over the dimension we are summing over (contracting dimension), which goes against "neither A nor B has a sharded contracting dimension". If we go by standard row,col notation rather than x,y notation, the text makes sense, but obv doesn't align with the images. From the code, it seems jax follows row, col notation so I think the diagrams should resemble the ones made by @burichh. Thoughts?

findmyway Mar 13, 2025 — with giscus

+1

It would be much more clear to keep it consistent in either row major or column major.

kerrickstaley · 2025-02-11T03:54:22Z

kerrickstaley
Feb 11, 2025 — with giscus

Some issues with question 2:

In part 2's solution, I think you mean for X to be in the denominator. The result is the same because X = Y in this case.

In part 3's solution, you mention TPU v5e, but the question asks about v4p.

In part 4, I'm not sure what AllGather with a {U_Z} dimension means. I believe this is not addressed in the text of the chapter. Also, the solution again mentions v5e.

1 reply

jacobaustin123 Feb 11, 2025
Maintainer Author

Apologies for the less than ideal answer. I've updated this. I've removed part 2 since it's identical to part 1, and changed (4) to an AllReduce.

Shua1 · 2025-02-11T07:08:50Z

Shua1
Feb 11, 2025 — with giscus

Creates a jax.Mesh that maps our 4 TPUs into a 2x2 grid with names ‘X’ and ‘Y’ assigned to the two axes.

The code below it and the in the code really says: 8 TPUs into 4x2 grid.

1 reply

jacobaustin123 Feb 11, 2025
Maintainer Author

Fixed, thanks!

Zasder3 · 2025-02-15T22:10:57Z

Zasder3
Feb 15, 2025 — with giscus

I believe question 4 may have miscalculated the comms overhead for $AllGather_XB[D_X,F]$. In the solution it's stated as $2BD$ when it should be $2DF$ (e.g. gathered over the A matrix instead of the B matrix)?

1 reply

jacobaustin123 Feb 16, 2025 — with giscus
Maintainer Author

I think this is now fixed. Yes it should be 2DF here.

chipturner · 2025-02-16T05:49:59Z

chipturner
Feb 16, 2025 — with giscus

The flow in this chapter is a little jarring when it drops into the four cases without defining the term "contracting dimensions" or doing other setup to smooth the transition. Maybe an external reference or a bit more connective flow would help?

1 reply

jacobaustin123 Feb 16, 2025 — with giscus
Maintainer Author

I agree, I've done a very mild rewrite of the transition. Can you tell me if you think this is better or if there are other changes you think I should make?

jesse7chen · 2025-02-22T08:02:06Z

jesse7chen
Feb 22, 2025 — with giscus

In the solution to question 4, I believe it should be D < C / W_ici instead of F < C / W_ici when calculating when we are comms bound in strategy 1. The wording is also a bit confusing because it says "In the second case (baseline)", but it appears to be talking about strategy 1 if I'm not mistaken? Also a small grammatical error at the end of the solution - "we'll shard our parameters" instead of "we'll sharded our parameters".

1 reply

jacobaustin123 Feb 24, 2025
Maintainer Author

Yes, fixed. Thank you for catching these.

batterseapower · 2025-02-23T11:07:20Z

batterseapower
Feb 23, 2025 — with giscus

The text says "For example, A[IX,J]⋅B[J,K]→C[IX,K] can be multiplied without any communication because the contracting dimension (J, the one we’re actually summing over) is unsharded. However, if we wanted the output unsharded (i.e. A[IX,J]⋅B[J,K]→C[IX,K]), we would need to copy A or C to every device.". Presumably the last "C[IX,K]" should actually be "C[I,K]"

1 reply

jacobaustin123 Feb 24, 2025
Maintainer Author

Thanks for catching this. Yes, absolutely, fixed.

AakashKumarNain · 2025-02-28T14:41:51Z

AakashKumarNain
Feb 28, 2025 — with giscus

As someone who is fairly familiar with sharding and JAX, I think the flow of this chapter can be refined and the details (along with the notations) can be improved a lot. I am happy to contribute if you guys are open to contributions? I mean it when I say this is confusing and can be simplified

5 replies

chipturner Feb 28, 2025

I think perhaps einsum notation would help? It's also serves as a breadcrumb to get more context if the reader needs to learn or refresh some knowledge

jacobaustin123 Mar 3, 2025
Maintainer Author

Certainly happy to accept contributions, but maybe run the central ideas by me before spending too much time? I think in particular the subscript sharding notation here is important to the rest of the document and shouldn't be changed

AakashKumarNain Mar 3, 2025

Sure thing. The collectives part needs more refinement and there should be code corresponding to each of the four cases. It is easy to tell people why a certain assertion does not work, but a code makes it much clear when that assertion does not work. For example, in the (only) code sample, we sharded one of the arrays over the non-contracting dimension, and then proceeds to say JAX/XLA will automatically add communication across these arrays as necessary to perform the final multiplication.

But the end user, especially the ones who are new to JAX, will keep thinking about the case where this does not work automatically. Explicitly providing examples for both scenarios is a must, and this should be done for all the four cases. Given code is folded, I do not think, it will make the chapter extra long, but it will definitely set the agenda of when to think about collectives and when not in a much better way.

Re notation: I won't suggest to change the notation, but I would add a line stating that the sharding names does not have any meaning like rows and columns, it is just a lookup for partitioning the data along a certain data dim by that value

jacobaustin123 Mar 3, 2025
Maintainer Author

Cool. What would code look like in each case? I guess the thing is, XLA does add this communication, so I'm not sure what we'd show. We could obviously write jax.lax.psum for AllReduce, but I don't know how illustrative that is. Or do you just mean sharding the arrays and then running the matmul?

sshkhr May 1, 2025 — with giscus

@AakashKumarNain check out the shard_map tutorial on the JAX official documentation, it addresses some of the stuff that you wanted to see added in this chaper:

Manual parallelism with shard_map

Lhongpei · 2025-03-01T15:02:44Z

Lhongpei
Mar 1, 2025 — with giscus

Can you explain more about AllReduce? Because I think I misunderstand what this actually do in Question2, Part 3.

In my opinion, after we do $\text{AllReduce}_Z (B_X, D_Y) {U_Z}$, the shape of data is still $(B_X, D_Y)$ but the data will be reduced (like summed), so my answer is

$$ \text{T} = \frac{2 \times B \times D \times 2}{|X| \times |Y| \times W} $$

because there is no communication between X and Y?

1 reply

jacobaustin123 Mar 3, 2025
Maintainer Author

Absolutely, you're right. I've fixed the answer. This was a silly oversight on my part.

Lhongpei · 2025-03-02T01:03:30Z

Lhongpei
Mar 2, 2025 — with giscus

In Question 3, why the answer says "Since we have an axis of size 4 on a TPU v4p, we have a wraparound link, so we can do the AllGather by sending half the bytes in each direction". In the GIF above, I think each device sending the whole bytes in each direction? Is there any difference?

1 reply

jacobaustin123 Mar 3, 2025
Maintainer Author

Good question, I think this was a bit of a red herring. We can do an AllGather by sending half the array in each direction for 4 hops, or the full array in both directions for 2 hops. I've updated this to do the latter.

amoudgl · 2025-03-03T08:29:31Z

amoudgl
Mar 3, 2025 — with giscus

Thanks for the great work! I have a question in bi-directional all-gather case: since each hop sends $V / |X|$ bytes in both directions in parallel, shouldn't the time per hop be $T_{hop} = \frac{V}{|X|.W_{ICI}}$? The article mentions $T_{hop} = \frac{2.V}{|X|.W_{ICI}}$ and it's unclear to me why the time would double in bidirectional case for each hop.

7 replies

harisjavaid85 Apr 12, 2025

@jacobaustin123 Thanks for the write-up. To me, the W_bidriectional term is essentially the aggregate unidirectional bandwidth for a TPU in a particular axis and not the bidirectional bandwidth of a single ICI link. In this particular scenario, we happen to have only two links per TPU in a particular axis, so the math works out. Is my understanding correct? (the bidirectional bandwidth of a link has a somewhat different meaning in the traditional networking field where it means that a link can send and receive at the same time at that bandwidth aka full-duplex).

jacobaustin123 Apr 12, 2025
Maintainer Author

Yes I think that's roughly correct. Per-link, the TPU can send and receive at the unidirectional bandwidth. The links are full-duplex in the sense that you can send and receive at that bandwidth simultaneously. Because you have two links per axis, you can send and receive at the bidirectional rate in aggregate, but not per-link.

harisjavaid85 Apr 13, 2025

Thanks. I have one more question regarding AllGather on multiple axes (V / W_ICI * n_axes). Assume we have sharded an array on both X and Y axes on a mesh of 4x4 TPUs. We should do an AllGather on X axis first, followed by an AllGather on the Y axis instead of AllGather being executed on both the axes simultaneously (as assumed in your equation). For AllGather_X, V/4 data will be involved, while for AllGather_Y V data will be involved. So, I think the T_comms should be (V/4) / W_ICI + V/W_ICI? Can you provide more details on how an AllGather is scheduled when multiple axes are involved. Thanks.

jacobaustin123 Apr 13, 2025
Maintainer Author

No, you do want to do it at the same time. By doing one and then the other you waste bandwidth by duplicating some of the values across the sharded dimension. We dont have an animation for the algorithm but you can imagine sending chunks gradually along both axes simultaneously to avoid ever duplicating comms.

harisjavaid85 Apr 14, 2025

I see, got it, thanks!

gitnicos · 2025-03-05T19:54:42Z

gitnicos
Mar 5, 2025 — with giscus

this is a fantastic book! Kudos to the authors and big THANK YOU!

I think this section is super critical in appreciation of TPU differentiation vs GPU but needs quite substantial rework:

Often visuals are supposed to help to understand complex underlying concepts. Here though in the very first diagram that lays out data rows as J and features as I trying to superimpose on Y,X as it is accepted in graphics industry is confusing. In data science I are always Rows, J is always Columns. Unless I am missing something. Continuous mental acrobatics to transpose while reading challenging material is not helping.
This gets exacerbated by "exhaustive" diagram of possible combinations with not intuitive indexing, and what actually gets sharded and The replicated in each case. I would stick to the aforementioned four typical cases with proper diagrams.
Color scheme doesn't work as cue hinting what happens. Trying to discern color shades and exactly where and how splitting in shards happens is another dimension of unnecessary mental acrobatics. Instead I would enumerated blocks - before and after sharding - as boxes and numbers in circle as identifiers of the boxes. Keep simple. Uncluttered.
Interestingly, the text itself with standard mathematical eisum notation is much clearer and intuitive. Maybe indeed we'd rather stick to it.
Now, this one is more important - maybe What needs to precede How? I would move Transformer and Training before this section to posit the problem area first, and then elaborate how optimization is done with TPU in mind. Otherwise, trying to dissect intricacies of TPU/systolic arrays without firm footing in exactly what they solve (which is restatement of Transformer details) becomes dry academic curriculum causing unnecessary frustrations. I think the whole book's ethos is a practical guide in complex area.

I hope my feedback is not misconstrued. I feel this book overall is phenomenal in its objectives and style, and definitely stands out in the crowd of similar efforts. Thank you again!

2 replies

jacobaustin123 Mar 5, 2025 — with giscus
Maintainer Author

Thank you for the very helpful and detailed feedback. Firstly, let me say that I agree that, of all the sections, this may be the weakest in its flow – introducing only the concepts that are necessary and using visualizations to support understanding rather than add confusion. Let me address a few of your comments, but I basically plan to do a rewrite of this section as soon as I have the time.

1/2/3. I agree with all of this, and will try to fix.

You might disagree but I feel like the training section in particular might be difficult without this section at least defining the notation that we'll use for sharding/communication. I think the user needs to be able to reason about e.g. the cost of an AllGather to do the Training section.

I'll try to do a rewrite this weekend and I'll send a draft your way if you'd be willing to glance at it.

gitnicos Mar 6, 2025 — with giscus

That will be awesome, Jacob! I truly appreciate how much passion and efforts you guys are putting in conveying non trivial yet much needed knowledge of the topic. I am here to help however I can.

Here is how I would attack this particular dilemma (5):

Section 2. "All About TPUs" itself is a good teaser and sets right tone of what the architecture is about, and how it addresses the problem of parallelization in general terms. It very well builds on "Intro to Rooflines" narrative.
after that a phenomenal section 4 "Transformers", "accounting" including would come in play as laying out the whole landscape and challenges at hand. This is where I would introduce the principles of parallelization to alleviate the challenge. Including setting firm conceptual understanding of a) splitting across batches, b) splitting across model dimensions, c) splitting across ff, d) mix of them and gently introducing mental picture of axises of sharding and main primitives AllGather, AllReduce, ReduceScatter etc. All schematically, nothing very involved. Just simple illustrative corroboration of the main idea ahead. GEMM needs to be covered here too as precondition to appreciate the rest down the road.
after having firm idea of possibilities and having armed the reader with elementary idea of challenges of parallelization, contracting sides etc. and notation of axises and sharding bare minimum we can go full steam ahead into "Section 5. Training"
this can be succeeded by "Sharded Matmuls revisited" to provide further inner workings and deeper analysis
having being solidly armed with that knowledge I think "Section 6. Training LLaMA" will be particularly rewarding experience.

I agree, it is challenging aspect of fusing classical academic bottom-up approach with practitioner's top down guide as this gem book is vying to become.

kostyaby · 2025-03-09T23:38:08Z

kostyaby
Mar 9, 2025 — with giscus

In question 4, I believe some math + reasoning for All-Gather being the preferred strategy is incorrect.

At the beginning, T_total_(strategy_1) is correctly defined to be max(2 * B * D * F / C, 2 * D * F / W_ici), but later this term is incorrectly evaluated to say that "we're generally compute-bound as the condition for that is D > C / W_ici". In reality, the correct condition is B > C / W_ici, as the following math shows:

T_total_(strategy_1) = max(2 * B * D * F / C, 2 * D * F / W_ici) = 2 * D * F * max(B / C, 1 / W_ici)
Compute-bound when B / C > 1 / W_ici => B > C / W_ici (~ 2550)
ICI-bound when B / C < 1 / W_ici => B < C / W_ici ~ 2550

So for reasonably common batch sizes, we're ICI-bound for strategy 1, as we are for strategy 2. In that case, need to compare ICI times for both strategies to decide which one is best. Strategy 2 is best when:

4 * B * F / W_ici < 2 * D * F / W_ici => 2 * B < D

So basically, for reasonable batch sizes (~1-2K) and D (~4K) strategy 2 is better than strategy 1. I also built a bunch of plots in this Colab, which showed that for certain large values of D & F it's never even beneficial to do strategy 1 (for example, when D=8K, F=16K) while for other values (D=4K, F=16K) it's better to do strategy 2 for B<2K and then it's slightly better to do strategy 1 for larger values of B

Unless I screwed up doing my math above, I believe the recommendation that the "All-Gather" strategy is better for Case 2 should be reconsidered. At smaller batches, the "All-Reduce" strategy seems to be much better. It also makes sense when reasoning about it at the high level: when you have a giant weight matrix (i.e. B[D, F]) and relatively smaller activation matrices (i.e.A[B, D] + C[B, F]), it makes sense that we would prefer to do less matmul FLOPs + do comms for a smaller activations matrix (i.e. C) rather than doing more matmul FLOPs + more comms for the giant weight matrix (i.e. B)

--

A small nit re. the same question: it never mentions we want to do everything in bfloat16, would be great to add that info.

--

Thank you for reading and also thank you for providing such a great learning resource for the community!

2 replies

jacobaustin123 Mar 12, 2025
Maintainer Author

I went back and double checked this and I think you're basically right – there was a typo in one part of the answer. Either way, I think you're right that the AllReduce strategy is often better for reasonable models. I fixed the math, so lmk if it seems more correct to you.

Either way, I think the takeaway is that this AllReduce strategy is better in most cases, except for the caveat at the end which is that it's rare to be in a situation where $X$ isn't sharded across the axis that $Y$ is being contracted over. So it's good only in somewhat contrived cases. For instance in FSDP we have this situation but we choose to gather because $X$ is sharded over the batch dim

kostyaby Mar 15, 2025

Yep, looks good to me. Thanks for the fix!

Nielius · 2025-03-10T15:13:47Z

Nielius
Mar 10, 2025 — with giscus

Thanks for this book -- I've learnt a lot!

I have a question about the calculation for the time it takes to do an AllGather, where the conclusion was that the time does not depend on $|X|$. I think that if you do it carefully, you conclude that it actually does depend on $|X|$, by a factor of $\frac{|X| - 1}{|X|}$:

if $|X|$ is odd, you only need to do $\frac{|X| - 1}{2}$ hops (not $|X|/2$ hops). This leads to a total time of $\frac{|X| - 1}{|X|} \cdot \frac{V}{W}$
if $|X|$ is even, you need to do $\frac{|X|}{2}$ hops, but the last hop is only half the size, as the animation (which I love!) shows. Again, you end up with a total time of $\frac{|X| - 1}{|X|} \cdot \frac{V}{W}$

Of course, if $|X|$ is really big, $\frac{|X| - 1}{|X|}$ is very close to 1, so this might just be a nitpick. But I thought $|X|$ could very well be 2, 4, or 8? Or did I misunderstand that?

I also like this other way of reasoning about how much time it should take: each of the $|X|$ shards needs to receive $\frac{|X| - 1}{|X|} \cdot V$ bytes; because we can fully utilize the ICIs, that's going to take $\frac{|X| - 1}{|X|} \cdot V/W$ seconds.

2 replies

jacobaustin123 Mar 10, 2025
Maintainer Author

Yes this is all correct. In general it depends weakly on X, especially when X is small. I'll add a comment trying to make this clear but in general, I think the moral truth is that, although it has some dependence, it quickly becomes <20%, which means it's safe to mostly ignore

tb5874 Apr 8, 2025

@Nielius @jacobaustin123
i have a same question, so i just try more detail of even case.
in here, unidirectional and bidirectional bandwidth assumption is important.
unidirectional is bidirectional (/ 2) or (* 2)
and |X| more than 8, becomes < 20%
this mean ... general GPU system is not profit. TPU is better choice.
( single server or desktop have 2 ~ 4 gpu or DGX H100 x 8 )

( * if i have miss or wrong assumption, please let me know what is it )

gitnicos · 2025-03-12T18:42:04Z

gitnicos
Mar 12, 2025 — with giscus

Thanks for the visuals clarity fix, Jacob! much appreciated

2 replies

jacobaustin123 Mar 12, 2025
Maintainer Author

Credit to @levskaya for this!

gitnicos Mar 12, 2025 — with giscus

true that! Thank you @levskaya and the whole team!!

gitnicos · 2025-03-12T19:15:24Z

gitnicos
Mar 12, 2025 — with giscus

In A Deeper Dive into TPU Communication Primitives I would add intuition behind "mechanics" of matrix/shards juggling. The Whys
...

AllGather: removes a subscript from a sharding, gathering the shards.
Remember, before non-linear function can be applied the whole matrix should be materialized (reevaluated). Hence we need to gather all shards (merge/concatenate/un-shard being synonymous)
ReduceScatter: removes an “un-reduced” suffix from an array by summing shards over that axis, leaving the array sharded over a second axis.
In TPU like topologies it is beneficial to do partial sums on respective shards (e.g. gradients) and simultaneously re-shard those partial sums for future use. The future use could be summing all gradients, which is accomplished by AllReduce and AllGather in order to update weights before next training step. ReduceScatter significantly cuts down on communication necessitated by intermediate partial sums messaging.
AllReduce: removes an “un-reduced” suffix, leaving the array unsharded along that axis.
Reduces all intermediate sums before final updates to weights can happen. It uses "intermediate" results of ReduceScatter above
AllToAll: I am a bit undecided articulating intuition behind this :-) i.e. exactly how and why it is used in training machinery is a bit nebulous for me

1 reply

findmyway Mar 13, 2025 — with giscus

... exactly how and why it is used in training machinery is a bit nebulous for me

AllToAll is core operation when training MoE. You may search for "expert parallelism" + "all to all" to understand it better.

findmyway · 2025-03-13T10:33:21Z

findmyway
Mar 13, 2025 — with giscus

This article is very useful for me coming from GPU world without any TPU background before!

I think it worth pointing out that the AllToAll operation mentioned here assumes the number of data to send & receive is the same on each device. In real world cases (especially with MoE), this is not always the same. Thus the runtime is somewhat data dependent.

1 reply

jacobaustin123 Mar 13, 2025
Maintainer Author

True, although it's worth noting that the core TPU AllToAll operation enforces that the data sent/received is the same on all devices. On a GPU, you can do dynamic AllToAlls much more easily than on TPUs. TPUs require fairly complicated control flow to do dynamic comms like that

parkji30 · 2025-03-18T15:46:14Z

parkji30
Mar 18, 2025

Hi Jacob, new poster here, thank you for this blog post. This might be a stupid question but it seems to me that online resources (yours included) suggest when we shard our data across various devices, the dimension of the data is always (Batch size, Length). It seems that the embedding of the data is done after it has been distributed across the devices.

I was wondering what if in the case, how could we shard a batch of pre-embedded multidimensional data, i.e. (Batch Size, d1, d2, d3)?

I'm currently working on training an equivarient neural network that ingests crystal data to learn some latent space, however, the runtime complexity is quite heavy so I am trying to distribute the training across a GPU cluster. My data is represented as a 3D incidence matrix where the channels are embedding representation of Nodes and Edges so the dimensions would be (Node Len, Edge Len, Embedding Dim). When I batch this data, I would get (Batch, Node, Edge, Embed).

Is the best practice to just embed the data during the training iteration or is there a workaround to distributed these multidimensional batches. I'm relatively new to JAX and distributed training as a whole so it doesn't seem there are a lot of resources around this.

1 reply

jacobaustin123 Mar 18, 2025
Maintainer Author

I'm going to interpret the question as: if I have data of the form (batch, node, edge, embed_dim), how do I distribute it? I think the basic answer is, it depends on what computation you're doing across nodes/edges.

Likely, you can do basic data parallelism, and that ought to get you pretty far assuming you don't communication between batches. Similarly, you can probably do tensor parallelism by sharding along the embed_dim. Beyond that, you'll have to do more complicate communication, but it may also be fine. Still, those two alone will probably get you far enough.

TLDR: shard along batch and maybe embed_dim/

WellyZhang · 2025-03-21T11:32:45Z

WellyZhang
Mar 21, 2025 — with giscus

I'm thinking about the equivalent of TPU axis in a GPU server. For a fully connected GPU clique of N devices (say 8 or 16 GPUs connected by NVLink / NVSwitch), is the bandwidth basically N * unidirectional link speed? Since a TPU is connected at most to 6 neighbours (max speed 6 * uni-W_ICI), it seems to me for communication operations, TPUs would be much slower? (Ignoring cost)

3 replies

jacobaustin123 Apr 12, 2025
Maintainer Author

The bandwidth between any two devices is still just the link speed (because you're bottlenecked by the actual link connecting to each device). The advantage is that you can send data between any of those 8 or 16 GPUs directly instead of hopping through intermediate devices.

WellyZhang Apr 14, 2025 — with giscus

Hi Jacob, thanks for the clarification. But does that suggest to AllGather a tensor of V in a TPU setup with N chips and 3 axies would take V / (3 * W_ICI), but on a fully-connected GPU clique, that only takes V / N / uni_W_ICI (each chip holds V / N of data and it can talk to each other simultaneously withou intermediate hops)?

jacobaustin123 Apr 14, 2025
Maintainer Author

Caveat: I'm not a GPU expert. Firstly, in an all-to-all AllGather, each device sends V / N to all other hosts, so the total bandwidth usage on the central switch is V * N, so the total time would be V * N / switch bandwidth. I think in general the switch bandwidth is what bottlenecks you. You can imagine instead doing something hierarchical in stages that would be less expensive.

zpcore · 2025-03-21T17:56:46Z

zpcore
Mar 21, 2025 — with giscus

Hi Jacob, thanks for the great knowledge sharing! Regarding the A note on ICI latency, should it be $T_{hop}=T_{min}+\frac{2v}{|X|W_{ICI}}$ instead of $T_{hop}=max[T_{min}, \frac{2v}{|X|W_{ICI}}]$? My understanding is that the overhead should always exist.

0 replies

nishanthdikkala · 2025-06-24T00:01:01Z

nishanthdikkala
Jun 24, 2025 — with giscus

In question 10.1, why is the number of floats communicated by a ReduceScatter the same as that of AllGather? Doesn't ReduceScatter need to communicate less since the partial sums remain scattered and don't need to be gathered?

1 reply

jacobaustin123 Jun 24, 2025 — with giscus
Maintainer Author

It does half as much as an AllReduce but the same as an AllGather. A ReduceScatter starts with an array of the full size (unsharded) but can send only a piece around each time.

elzino · 2025-06-27T15:29:24Z

elzino
Jun 27, 2025 — with giscus

In Pop quiz 2 Part 1, I wonder if we should use unidirectional bandwidth (which is 4.5e10) because Y axis size is smaller than 16. IIUC, the answer should be Tcomms=34e6/4.5e10=756μs. I'm curious if I'm missing something.

1 reply

elzino Jun 27, 2025 — with giscus

Ah, nvm. I should've read the rest of the explanation. Tcomms=3∗8.4e6/4.5e10=560μs makes sense!

sabujlaskar · 2025-06-30T07:22:16Z

sabujlaskar
Jun 30, 2025 — with giscus

Hi, thank you for the great explanation. I have a question regarding 10.2. Why is the data size considered to be $\frac{N^2}{D^2}$? Since the matrix size is $N \times N$ and there are a total of $D$ devices, I would expect each device to send $\frac{N^2}{D}$ data to other devices for the AllToAll communication as well. Why are we assuming the data size to be $\frac{N^2}{D^2}$ in this case? Could you please clarify what I might be missing?

0 replies

SimonBerens · 2025-07-06T02:28:43Z

SimonBerens
Jul 6, 2025 — with giscus

Could you clarify the $n$ dimensional AllReduce? If you have $V$ bytes sharded across $X_1,\dots, X_n$, my understanding is every machine will start with $B = \frac{V}{\prod |X_i|}$ bytes. Then per ring per hop each machine will be sending $2B$ bytes, which will take $T_{hop} = \frac{2B}{W_{ICI}}$ time. The longest path in the mesh is $\frac{1}{2}\sum|X_i|$ so $T_{total} = \sum|X_i| \cdot T_{hop} = \frac{V \cdot \sum|X_i|}{W_{ICI} \cdot \prod |X_i|}$, which is different from the stated $\frac{V}{W_{ICI} \cdot n}$. Where is my mistake?

I also don't have a great intuition of how an $n$ dim AllReduce is actually implemented on TPUs. One way would be running an AllGather per dimension sequentially, essentially reducing the dimensionality of the sharding/mesh by 1 until it's 1. But this feels like it's leaving bandwidth on the table since the $n$ ICI rings per chip are independent? Though o3 says in practice for large tensors a TPU doesn't actually have enough VMEM to do hold the $2n$ tensors per hop (across all rings). And then even if you can do all $n$ rings concurrently you still need multiple iterations to deal with hamming distance $n$ indices.

5 replies

SimonBerens Jul 6, 2025 — with giscus

Sorry I meant AllGather

jacobaustin123 Jul 7, 2025 — with giscus
Maintainer Author

Where is 2B coming from? You're right about the longest path, but most nodes are much closer than that longest path, so as parts of the AllGather complete, you can use the additional bandwidth to send stuff faster. The heuristic argument is that in a 2D mesh you have roughly O(|X| * |Y|) total bandwidth to send V bytes, so it ought to go like V / W_ICI * n. The actual algorithm is pretty complicated, but you can assume it will send the fewest possible bytes in the throughput-bound regime.

SimonBerens Jul 9, 2025

$2B$ because per ring you are sending out in both directions. The heuristic makes sense, because my result would be faster than the aggregate bandwidth.

Do you have a link to an implementation of a 2d/3d AllGather?

Hwhitetooth Jul 21, 2025 — with giscus

I have no idea how the canonical algorithm works but I have a guess for the 2D case. Basically we can construct two rings using the links. For example, you can draw the first ring as follows: start from the top-left node, go all the way right, go down, go all the way left, go down, go all the way right, ..., and on the last row go back to the starting node after you go all the way left (hopefully you get the idea...). Then grab a pen with a different color and draw another ring with the remaining links. Now we have two rings, each connecting all nodes together. We just need to divide the chunks into two groups, one is reduced via one ring and the other via the other ring. This way we basically doubled the overall bandwidth of the system.

Hwhitetooth Jul 21, 2025 — with giscus

By "reduced", I meant "gathered".

yejingxin · 2025-07-07T18:01:17Z

yejingxin
Jul 7, 2025 — with giscus

In Case 3: both multiplicands have sharded contracting dimensions, is the reduce scatter done via bf16 or f32? typically matmul accumulation we need to do with f32, does it mean the communication cost for reduce scatter will be higher?

0 replies

AndrewZhaoLuo · 2025-07-08T05:39:35Z

AndrewZhaoLuo
Jul 8, 2025 — with giscus

For question 7 I believe I might be misunderstanding the notation.

We want to multiple matrices C and B, and take the result and multiply by matrix x correct? In this case, it appears the shapes are incompatible, the result of C * B is [F, F] which is incompatible with the shape x of [B, D].

0 replies

eitanturok · 2025-07-14T22:40:13Z

eitanturok
Jul 14, 2025 — with giscus

In the first pop quiz, you write that

128 * 2048 * 2=512kiB

but this is wrong because 128 * 2048 * 2 = 524,288 = 524kiB.

2 replies

watate Jul 16, 2025 — with giscus

128 * 2048 * 2 / 1024 = 512kiB = 524 kB

eitanturok Jul 24, 2025 — with giscus

oh it's kiB, not mB, got it!

watate · 2025-07-16T13:33:34Z

watate
Jul 16, 2025 — with giscus

This isn't working in the Colab:

!pip install tensorboard-plugin-profile

Should be changed to this:

!pip install tensorboard tensorboard-plugin-profile

0 replies

Sharded Matrices and How to Multiply Them | How To Scale Your Model #5

Uh oh!

jacobaustin123 Feb 3, 2025 Maintainer

Replies: 30 comments · 51 replies

Uh oh!

mitchellgoffpc Feb 5, 2025 — with giscus

Uh oh!

Uh oh!

jacobaustin123 Feb 5, 2025 — with giscus Maintainer Author

Uh oh!

karunreddy30 Feb 9, 2025 — with giscus

Uh oh!

jacobaustin123 Feb 9, 2025 Maintainer Author

Uh oh!

burichh Feb 9, 2025 — with giscus

Uh oh!

levskaya Feb 9, 2025 Collaborator

Uh oh!

Uh oh!

burichh Feb 10, 2025 — with giscus

Uh oh!

manavgarg Feb 20, 2025 — with giscus

Uh oh!

Jiminator Mar 3, 2025

Uh oh!

findmyway Mar 13, 2025 — with giscus

Uh oh!

kerrickstaley Feb 11, 2025 — with giscus

Uh oh!

jacobaustin123 Feb 11, 2025 Maintainer Author

Uh oh!

Shua1 Feb 11, 2025 — with giscus

Uh oh!

jacobaustin123 Feb 11, 2025 Maintainer Author

Uh oh!

Zasder3 Feb 15, 2025 — with giscus

Uh oh!

jacobaustin123 Feb 16, 2025 — with giscus Maintainer Author

Uh oh!

chipturner Feb 16, 2025 — with giscus

Uh oh!

jacobaustin123 Feb 16, 2025 — with giscus Maintainer Author

Uh oh!

jesse7chen Feb 22, 2025 — with giscus

Uh oh!

jacobaustin123 Feb 24, 2025 Maintainer Author

Uh oh!

batterseapower Feb 23, 2025 — with giscus

Uh oh!

jacobaustin123 Feb 24, 2025 Maintainer Author

Uh oh!

AakashKumarNain Feb 28, 2025 — with giscus

Uh oh!

chipturner Feb 28, 2025

Uh oh!

jacobaustin123 Mar 3, 2025 Maintainer Author

Uh oh!

AakashKumarNain Mar 3, 2025

Uh oh!

jacobaustin123 Mar 3, 2025 Maintainer Author

Uh oh!

sshkhr May 1, 2025 — with giscus

Uh oh!

Lhongpei Mar 1, 2025 — with giscus

Uh oh!

jacobaustin123 Mar 3, 2025 Maintainer Author

Uh oh!

Lhongpei Mar 2, 2025 — with giscus

Uh oh!

jacobaustin123 Mar 3, 2025 Maintainer Author

Uh oh!

jacobaustin123
Feb 3, 2025
Maintainer

Replies: 30 comments 51 replies

mitchellgoffpc
Feb 5, 2025 — with giscus

jacobaustin123 Feb 5, 2025 — with giscus
Maintainer Author

karunreddy30
Feb 9, 2025 — with giscus

jacobaustin123 Feb 9, 2025
Maintainer Author

burichh
Feb 9, 2025 — with giscus

levskaya Feb 9, 2025
Collaborator

kerrickstaley
Feb 11, 2025 — with giscus

jacobaustin123 Feb 11, 2025
Maintainer Author

Shua1
Feb 11, 2025 — with giscus

jacobaustin123 Feb 11, 2025
Maintainer Author

Zasder3
Feb 15, 2025 — with giscus

jacobaustin123 Feb 16, 2025 — with giscus
Maintainer Author

chipturner
Feb 16, 2025 — with giscus

jacobaustin123 Feb 16, 2025 — with giscus
Maintainer Author

jesse7chen
Feb 22, 2025 — with giscus

jacobaustin123 Feb 24, 2025
Maintainer Author

batterseapower
Feb 23, 2025 — with giscus

jacobaustin123 Feb 24, 2025
Maintainer Author

AakashKumarNain
Feb 28, 2025 — with giscus

jacobaustin123 Mar 3, 2025
Maintainer Author

jacobaustin123 Mar 3, 2025
Maintainer Author

Lhongpei
Mar 1, 2025 — with giscus

jacobaustin123 Mar 3, 2025
Maintainer Author

Lhongpei
Mar 2, 2025 — with giscus

jacobaustin123 Mar 3, 2025
Maintainer Author