Fast model loading #209
Replies: 4 comments 3 replies
-
Yes, model caching at both the cluster and node level is in scope and part of the roadmap. Let's evolve a design that plays well with the rest of llm-d.
-
vLLM actually has a few different extensions for model loaders. One of them is the Run:ai Model Streamer, which uses multiple threads to read tensors concurrently from a file in file or object storage into a dedicated buffer in CPU memory.
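For reference, this is roughly how that loader gets selected in vLLM as I understand its current options; the concurrency value is purely illustrative and the bucket path is hypothetical:

```python
from vllm import LLM

# Pick the Run:ai Model Streamer instead of the default safetensors loader.
# "concurrency" is the number of threads streaming tensors into the CPU buffer;
# 16 is just an example value, not a recommendation.
llm = LLM(
    model="s3://example-bucket/llama-3-8b",   # hypothetical object-storage location
    load_format="runai_streamer",
    model_loader_extra_config={"concurrency": 16},
)
```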
-
A lesser-known model loader extension is fastsafetensors, which can be used when NVMe drives are available.
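If I remember correctly it is wired into vLLM through the same load_format switch; something along these lines, though the exact option string is worth double-checking against the vLLM docs:

```python
from vllm import LLM

# Use the fastsafetensors loader so weights sitting on local NVMe can be moved
# to the GPU directly (e.g. via GPU Direct Storage) instead of being staged
# through CPU memory. The load_format value here is my recollection, not gospel.
llm = LLM(
    model="/mnt/nvme/models/llama-3-8b",  # hypothetical node-local NVMe path
    load_format="fastsafetensors",
)
```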
-
I'd like to start a discussion on adding a fast model loading capability to llm-d. The model service seems like the right starting point.
The goal is to ensure that safetensors model files are close to where vLLM instances are created (i.e., on the worker nodes). When worker nodes have NVMe local storage, these files should be stored on it to enable direct transfer from storage to GPU.
GIE defines the concept of an InferenceModel, and whatever solution we come up with should play nicely with that concept.
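To make the intent concrete, here is a rough sketch of the node-level behaviour I have in mind. The cache path and helper are hypothetical, and huggingface_hub is used purely for illustration; designing the actual mechanism is exactly what this discussion is for:

```python
import os
from huggingface_hub import snapshot_download

# Hypothetical node-local NVMe mount used as a model cache.
NVME_CACHE = "/mnt/nvme/model-cache"

def ensure_local_copy(model_id: str) -> str:
    """Return a node-local directory holding the model's safetensors files,
    downloading them onto NVMe first if they are not already present."""
    local_dir = os.path.join(NVME_CACHE, model_id.replace("/", "--"))
    if not os.path.isdir(local_dir):
        # Fetch only the weights and config files we actually need.
        snapshot_download(
            repo_id=model_id,
            local_dir=local_dir,
            allow_patterns=["*.safetensors", "*.json", "*.model"],
        )
    return local_dir

# vLLM on that node is then pointed at the local path, so loading can go
# straight from NVMe (e.g. with fastsafetensors) rather than remote storage.
model_path = ensure_local_copy("meta-llama/Llama-3.1-8B-Instruct")
```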
@sriumcp what's your opinion?
/cc @fabolive @manoelmarques @aavarghese