Replies: 2 comments
-
Common sense would suggest that it depends on your use case. If you were loading different models (say, a fine-tuned llama2 for x, then switching to a fine-tuned llama2 for y), then clearly the faster model load times would be beneficial. If you plan to use the same model all the time and only start up occasionally, then you wouldn't see much benefit after the initial startup. Personally I'd push for the faster I/O. It could happen that new LLM innovations deliver new functionality that would benefit from better I/O - perhaps hot loading of LoRAs, chat session switching, multi-part models that can be hot reloaded (Stable Diffusion 1.5 & 2 had a single model; SDXL now has two). If you want to know what a given drive actually delivers, a quick sequential-read check is sketched below.
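Here's a minimal throughput-check sketch, with assumptions: the path is a placeholder for any large file you have handy (a multi-GB GGUF works well), and the OS page cache will inflate the numbers on repeated runs, so prefer a cold read or a file larger than RAM.

```python
# Rough sequential-read benchmark (a sketch; PATH below is a
# hypothetical placeholder -- point it at any large local file).
import time

PATH = "/Volumes/FastSSD/models/llama-2-70b.Q4_K_M.gguf"  # placeholder
CHUNK = 64 * 1024 * 1024  # read in 64 MiB chunks

total = 0
start = time.perf_counter()
with open(PATH, "rb", buffering=0) as f:
    while True:
        data = f.read(CHUNK)
        if not data:
            break
        total += len(data)
elapsed = time.perf_counter() - start
print(f"read {total / 1e9:.1f} GB in {elapsed:.1f} s "
      f"-> {total / 1e6 / elapsed:,.0f} MB/s")
```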
-
I tried loading llama 7b on 64GB just for giggles alongside 70b, and here are my thoughts so far: (1) I ended up putting llama 7b on the CPU because the GPU started thrashing too much as it flipped between the models - presumably that won't be an issue with 96GB, but since my CPUs are idle anyway it may not be an entirely crazy idea; (2) speculative sampling seems useless unless both models have temp = 0, so it's really only viable for doing instruct chain-of-thought stuff, maybe. A toy sketch of why temp = 0 matters is below.
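To unpack (2) a little, here is a toy model of the acceptance step. This is a sketch under an assumption: a simple exact-match rule where a draft token is kept only if the target model would produce the same token (real implementations, including llama.cpp's, may use a more forgiving rule). At temp = 0 both picks are deterministic argmaxes over similar logits, so they often agree; with sampling at higher temperatures, two independent draws over a large vocabulary rarely coincide.

```python
# Toy acceptance-rate demo for exact-match speculative decoding.
# Assumption: a draft token is accepted iff the target model picks the
# same token (a simplified rule, not necessarily llama.cpp's scheme).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 2000  # stand-in vocabulary size, kept small for speed

def pick(logits: np.ndarray, temp: float) -> int:
    """Deterministic argmax at temp == 0, softmax sampling otherwise."""
    if temp == 0.0:
        return int(np.argmax(logits))
    p = np.exp((logits - logits.max()) / temp)
    return int(rng.choice(len(logits), p=p / p.sum()))

def acceptance_rate(temp: float, steps: int = 500) -> float:
    hits = 0
    for _ in range(steps):
        # Correlated fake logits stand in for the draft/target models.
        shared = rng.normal(size=VOCAB)
        draft = shared + 0.3 * rng.normal(size=VOCAB)
        target = shared + 0.3 * rng.normal(size=VOCAB)
        hits += pick(draft, temp) == pick(target, temp)
    return hits / steps

for t in (0.0, 0.8):
    print(f"temp={t}: draft tokens accepted ~ {acceptance_rate(t):.0%}")
```

At temp = 0 the agreement comes entirely from how correlated the two models' outputs are, which is the whole bet behind pairing a 7b draft with a 70b target.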
-
Hello everyone,
I'm currently leaning towards purchasing the Mac Studio M2 with 192GB RAM, but I've also been considering the Mac Pro M2 with the same memory configuration.
From my research, it seems there's minimal difference in computational power between these two devices. However, what has caught my attention is the potential of using the high-speed PCIe SSD to expedite model loading times. Specifically, PCIe SSDs can achieve speeds up to 25,000 MB/s, which is vastly superior to the 5,000 MB/s of the Mac's internal SSD. To put this into perspective, a 100GB model could be loaded in roughly 4 seconds on the PCIe SSD, compared to the 20 seconds it would take on the Mac SSD. That's a 5x speed difference.
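For reference, that arithmetic as a tiny helper (a sketch: the throughput figures are the nominal ones quoted above, and these are lower bounds - real loading also pays for memory-mapping and weight setup, not just raw reads):

```python
def load_time_s(model_gb: float, throughput_mb_s: float) -> float:
    """Lower bound on load time from raw sequential read speed alone."""
    return model_gb * 1000 / throughput_mb_s

for name, mbps in [("Mac internal SSD", 5_000), ("PCIe SSD", 25_000)]:
    print(f"{name}: {load_time_s(100, mbps):.0f} s for a 100GB model")
# Mac internal SSD: 20 s for a 100GB model
# PCIe SSD: 4 s for a 100GB model
```

One caveat: llama.cpp memory-maps models by default, so some of that up-front wait can turn into lazy page-ins during the first evaluations rather than a single blocking load.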
That said, I also understand that once a model is loaded into memory, these loading times become irrelevant, as there's no need to reload the model.
I'm at a crossroads trying to decide between the two. Has anyone here had experience using quantized models on the Mac Pro, and if so, could you share your insights?