Replies: 1 comment
-
I guess GGUF will need to support it, given Hugging Face's per-file upload size limit and ever-growing model sizes. A temporary solution is a simple splitting/recombination utility, so you only ever work with a single GGUF file locally.
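For reference, a byte-level split works for any file, GGUF included, since recombination restores the exact original bytes. Below is a minimal sketch of such a utility in Python; the part size, block size, and `.partNNN` naming are illustrative choices of mine, not a llama.cpp or Hugging Face convention.

```python
# Minimal sketch of byte-level splitting and recombination for one GGUF file.
# Part size, block size, and ".partNNN" naming are illustrative, not standard.
import os

PART_SIZE = 40 * 1024**3   # stay under Hugging Face's per-file upload limit
BLOCK = 1 << 20            # copy in 1 MiB blocks to keep memory usage flat

def split_file(path: str, part_size: int = PART_SIZE) -> list[str]:
    """Split `path` into numbered parts; returns the part filenames in order."""
    parts = []
    with open(path, "rb") as src:
        idx = 0
        while True:
            part = f"{path}.part{idx:03d}"
            written = 0
            with open(part, "wb") as dst:
                while written < part_size:
                    chunk = src.read(min(BLOCK, part_size - written))
                    if not chunk:
                        break
                    dst.write(chunk)
                    written += len(chunk)
            if written == 0:          # source exhausted; drop the empty part
                os.remove(part)
                break
            parts.append(part)
            idx += 1
    return parts

def recombine(parts: list[str], out_path: str) -> None:
    """Concatenate the parts back into a single local GGUF file."""
    with open(out_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                while chunk := src.read(BLOCK):
                    dst.write(chunk)

# Usage: split before uploading, recombine after downloading.
# parts = split_file("llama-70b-q4_0.gguf")
# recombine(parts, "llama-70b-q4_0-rejoined.gguf")
```

On Unix, GNU `split -b 40G model.gguf model.gguf.part` and `cat model.gguf.part* > model.gguf` do the same job without any code.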
-
If we want to distribute shards of a larger model (say Llama 70B or bigger) across several machines, we can cut the architecture and weights at the end of a specified transformer block, output the intermediate activations, and feed them into the next shard.
How easy or hard would it be to generate per-shard .gguf files without losing performance? Is there any work being done on that?
If not, I would love to help get this working; I already have it running on tinygrad.
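In case it helps the discussion, here is a toy sketch of the handoff in plain NumPy, nothing llama.cpp-specific: "machine A" runs blocks up to the cut point, serializes the activations, and "machine B" resumes from there. The block function, sizes, and file-based handoff are placeholders for real transformer blocks and a network transport.

```python
# Toy sketch of pipeline sharding at a transformer-block boundary.
# All names, shapes, and the cut point are hypothetical placeholders.
import numpy as np

D_MODEL, N_BLOCKS, SPLIT_AT = 64, 8, 4       # hypothetical sizes and cut point

rng = np.random.default_rng(0)
# Stand-in per-block weights; a real shard would load only its own blocks
# from its own .gguf file.
weights = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(N_BLOCKS)]

def block(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stand-in for one transformer block (residual + a nonlinearity)."""
    return x + np.tanh(x @ w)

def run_shard(x: np.ndarray, first: int, last: int) -> np.ndarray:
    """Run blocks [first, last) on this machine's shard."""
    for i in range(first, last):
        x = block(x, weights[i])
    return x

tokens = rng.standard_normal((1, 16, D_MODEL))   # fake embedded input batch

# "Machine A": run the first shard and ship the intermediate activations.
acts = run_shard(tokens, 0, SPLIT_AT)
np.save("activations.npy", acts)                 # stand-in for a network send

# "Machine B": receive the activations and finish the forward pass.
out = run_shard(np.load("activations.npy"), SPLIT_AT, N_BLOCKS)

# Sanity check: sharded execution matches the single-machine forward pass.
assert np.allclose(out, run_shard(tokens, 0, N_BLOCKS))
```

The interesting GGUF-side question then seems to be how each shard's metadata would record the cut point and the activation dtype/shape, so that shards stay self-describing.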