Replies: 1 comment
- It seems that llama.cpp actually does a column-parallel approach, which is misleading since the option passed is `-sm row`.
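For anyone else digging into this, here is a small standalone C++ sketch of the two partitionings (my own illustration, not llama.cpp code). Splitting the weight along the inner/shared dimension leaves each device with a full-length partial output that must be summed element-wise across devices, i.e. an all-reduce, while splitting along the output dimension gives each device a disjoint slice of the result that only needs to be gathered/concatenated:

```cpp
// Standalone sketch (not llama.cpp code): y = W*x with W split two ways
// across two hypothetical "devices" (plain loops stand in for GPUs).
#include <cstdio>
#include <vector>

int main() {
    const int M = 4, K = 6;                       // y[M] = W[M][K] * x[K]
    std::vector<float> W(M * K), x(K), y_ref(M, 0.0f);
    for (int i = 0; i < M * K; ++i) W[i] = 0.1f * i;
    for (int k = 0; k < K; ++k)     x[k] = 1.0f + k;
    for (int m = 0; m < M; ++m)
        for (int k = 0; k < K; ++k) y_ref[m] += W[m * K + k] * x[k];

    // (a) split the inner/shared dimension K across 2 devices:
    //     each device computes a full-length partial y -> partials must be
    //     summed across devices (this is the all-reduce step).
    std::vector<float> y_sum(M, 0.0f);
    for (int dev = 0; dev < 2; ++dev) {
        int k0 = dev * (K / 2), k1 = k0 + K / 2;
        for (int m = 0; m < M; ++m)
            for (int k = k0; k < k1; ++k)
                y_sum[m] += W[m * K + k] * x[k];  // accumulate partials
    }

    // (b) split the output dimension M across 2 devices:
    //     each device owns a disjoint slice of y -> only a gather is needed.
    std::vector<float> y_cat(M, 0.0f);
    for (int dev = 0; dev < 2; ++dev) {
        int m0 = dev * (M / 2), m1 = m0 + M / 2;
        for (int m = m0; m < m1; ++m)
            for (int k = 0; k < K; ++k)
                y_cat[m] += W[m * K + k] * x[k];  // writes its own rows only
    }

    for (int m = 0; m < M; ++m)
        std::printf("y_ref=%6.2f  inner-split=%6.2f  output-split=%6.2f\n",
                    y_ref[m], y_sum[m], y_cat[m]);
    return 0;
}
```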
Hi, I am struggling to find where the partial results from a matrix multiply are reduced together. My understanding is that when I use `-sm row`, a row-wise tensor-parallel approach is employed, where the result of every single matrix multiply needs to be all-reduced. However, I only really see a `warp_reduce_sum`, which I assume is used for the tiling that happens across the threads of a warp on a single GPU, not an operation between GPUs that reduces the whole matrix.
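For reference, a `warp_reduce_sum`-style helper is normally just an intra-warp shuffle reduction: it sums one value per lane across the 32 threads of a single warp on a single GPU and never moves data between devices, so finding it says nothing about whether or where an inter-GPU all-reduce happens. Below is a minimal CUDA sketch of that pattern (a generic approximation, not the actual llama.cpp implementation):

```cuda
// Generic warp-level sum reduction (an assumption about what a helper like
// warp_reduce_sum typically looks like, not the llama.cpp source). It combines
// one value per lane within a single warp via register shuffles.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float warp_reduce_sum_sketch(float v) {
    // halve the active offset each step; lane 0 ends up with the warp-wide sum
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

__global__ void sum32(const float *in, float *out) {
    float v = in[threadIdx.x];          // one value per lane
    v = warp_reduce_sum_sketch(v);
    if (threadIdx.x == 0) *out = v;     // only lane 0 holds the full sum
}

int main() {
    float h_in[32], h_out = 0.0f, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;   // expected sum: 32
    cudaMalloc((void **)&d_in, 32 * sizeof(float));
    cudaMalloc((void **)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    sum32<<<1, 32>>>(d_in, d_out);                  // one warp, one GPU
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("warp sum = %f\n", h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```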