Replies: 2 comments
-
Common sense would suggest that it depends on your use case. If you were loading different models (say, a fine-tuned llama2 for x, then switching to a fine-tuned llama2 for y), then clearly the faster model load times would be beneficial. If you plan to use the same model all the time and only start up occasionally, then you wouldn't see much benefit after the initial startup. Personally I'd push for the faster I/O. It could happen that new LLM innovations deliver new functionality that would benefit from better I/O - perhaps hot loading of LoRAs, chat session switching, multi-part models that can be hot reloaded (Stable Diffusion 1.5 & 2 had a single model; SDXL now has two). If you want to know what a given drive actually delivers, a quick sequential-read check is sketched below.
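Here's a minimal throughput-check sketch, with assumptions: the path is a placeholder for any large file you have handy (a multi-GB GGUF works well), and the OS page cache will inflate the numbers on repeated runs, so prefer a cold read or a file larger than RAM.

```python
# Rough sequential-read benchmark (a sketch; PATH below is a
# hypothetical placeholder -- point it at any large local file).
import time

PATH = "/Volumes/FastSSD/models/llama-2-70b.Q4_K_M.gguf"  # placeholder
CHUNK = 64 * 1024 * 1024  # read in 64 MiB chunks

total = 0
start = time.perf_counter()
with open(PATH, "rb", buffering=0) as f:
    while True:
        data = f.read(CHUNK)
        if not data:
            break
        total += len(data)
elapsed = time.perf_counter() - start
print(f"read {total / 1e9:.1f} GB in {elapsed:.1f} s "
      f"-> {total / 1e6 / elapsed:,.0f} MB/s")
```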
-
I tried loading llama 7b on 64GB just for giggles alongside 70b, and here are my thoughts so far: (1) I ended up putting llama 7b on the CPU because the GPU started thrashing too much as it flipped between the models - presumably that won't be an issue with 96GB, but since my CPUs are idle anyway it may not be an entirely crazy idea; (2) speculative sampling seems useless unless both models have temp = 0, so it's really only viable for doing instruct chain-of-thought stuff, maybe. A toy sketch of why temp = 0 matters is below.
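To unpack (2) a little, here is a toy model of the acceptance step. This is a sketch under an assumption: a simple exact-match rule where a draft token is kept only if the target model would produce the same token (real implementations, including llama.cpp's, may use a more forgiving rule). At temp = 0 both picks are deterministic argmaxes over similar logits, so they often agree; with sampling at higher temperatures, two independent draws over a large vocabulary rarely coincide.

```python
# Toy acceptance-rate demo for exact-match speculative decoding.
# Assumption: a draft token is accepted iff the target model picks the
# same token (a simplified rule, not necessarily llama.cpp's scheme).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 2000  # stand-in vocabulary size, kept small for speed

def pick(logits: np.ndarray, temp: float) -> int:
    """Deterministic argmax at temp == 0, softmax sampling otherwise."""
    if temp == 0.0:
        return int(np.argmax(logits))
    p = np.exp((logits - logits.max()) / temp)
    return int(rng.choice(len(logits), p=p / p.sum()))

def acceptance_rate(temp: float, steps: int = 500) -> float:
    hits = 0
    for _ in range(steps):
        # Correlated fake logits stand in for the draft/target models.
        shared = rng.normal(size=VOCAB)
        draft = shared + 0.3 * rng.normal(size=VOCAB)
        target = shared + 0.3 * rng.normal(size=VOCAB)
        hits += pick(draft, temp) == pick(target, temp)
    return hits / steps

for t in (0.0, 0.8):
    print(f"temp={t}: draft tokens accepted ~ {acceptance_rate(t):.0%}")
```

At temp = 0 the agreement comes entirely from how correlated the two models' outputs are, which is the whole bet behind pairing a 7b draft with a 70b target.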
-
Hello everyone,
I'm currently leaning towards purchasing the Mac Studio M2 with 192GB RAM, but I've also been considering the Mac Pro M2 with the same memory configuration.
From my research, it seems there's minimal difference in computational power between these two devices. However, what has caught my attention is the potential of using the high-speed PCIe SSD to expedite model loading times. Specifically, PCIe SSDs can achieve speeds up to 25,000 MB/s, which is vastly superior to the 5,000 MB/s of the Mac's internal SSD. To put this into perspective, a 100GB model could be loaded in roughly 4 seconds on the PCIe SSD, compared to the 20 seconds it would take on the Mac SSD. That's a 5x speed difference.
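For reference, that arithmetic as a tiny helper (a sketch: the throughput figures are the nominal ones quoted above, and these are lower bounds - real loading also pays for memory-mapping and weight setup, not just raw reads):

```python
def load_time_s(model_gb: float, throughput_mb_s: float) -> float:
    """Lower bound on load time from raw sequential read speed alone."""
    return model_gb * 1000 / throughput_mb_s

for name, mbps in [("Mac internal SSD", 5_000), ("PCIe SSD", 25_000)]:
    print(f"{name}: {load_time_s(100, mbps):.0f} s for a 100GB model")
# Mac internal SSD: 20 s for a 100GB model
# PCIe SSD: 4 s for a 100GB model
```

One caveat: llama.cpp memory-maps models by default, so some of that up-front wait can turn into lazy page-ins during the first evaluations rather than a single blocking load.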
That said, I also understand that once a model is loaded into memory, these loading times become irrelevant, as there's no need to reload the model.
I'm at a crossroads trying to decide between the two. Has anyone here had experience using quantized models on the Mac Pro, and if so, could you share your insights?