ModelService creates compute and non-compute resources. Under compute resources, it can create prefill and decode Deployments, each of which can have multiple replicas. The default scheduler tends to spread these pods across nodes, so we need a mechanism to pack them onto a node. Here is a scenario: assume the cluster has two nodes with two GPUs each. If the prefill and decode Deployments, with one replica each, place one pod on each node, each node is left with only one free GPU. The next workload replica that requests two GPUs per pod then cannot start, even though the cluster still has two free GPUs in total.
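For reference, one Kubernetes-level way to get pack-instead-of-spread behavior is a scheduler profile that scores nodes with the `NodeResourcesFit` plugin's `MostAllocated` strategy. The sketch below is only illustrative of that mechanism, not a proposal for how ModelService should expose it; the profile name `bin-packing-scheduler` and the weights are made up for the example.

```yaml
# Illustrative KubeSchedulerConfiguration: score nodes so that pods are
# packed onto the most-allocated node (bin packing) instead of spread.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler   # hypothetical profile name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              # Weight GPUs most heavily so GPU pods are packed first.
              - name: nvidia.com/gpu
                weight: 5
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

Pods created by the prefill and decode Deployments would opt in by setting `spec.schedulerName: bin-packing-scheduler` in their pod template. An alternative would be required `podAffinity` between prefill and decode pods with `topologyKey: kubernetes.io/hostname`, which forces co-location rather than relying on scoring.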