Recommendations to avoid model thrashing? #7937
Unanswered · TimothySeah asked this question in Q&A · 0 replies
I have a cluster of Triton servers. Each server loads a different model depending on the request it receives. However, because these models are large, there is a lot of "model thrashing": we waste time loading and unloading models that cannot all fit in GPU memory at once. Is there a general/canonical solution to this? For example, is there an easy way to route requests that require a specific model to pods that already have that model loaded? Thanks.
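One common pattern for this kind of model-affinity routing (offered here as a sketch, not something Triton provides out of the box) is to put a thin routing layer in front of the pods that consistently hashes the model name to a pod. Requests for the same model then always land on the same pod, so that pod keeps the model resident while other pods never load it. The `ConsistentHashRouter` class and the pod addresses below are hypothetical, purely for illustration:

```python
# A minimal sketch of model-affinity routing via consistent hashing.
# All names here (ConsistentHashRouter, pod addresses) are illustrative
# assumptions, not part of Triton itself.
import bisect
import hashlib


class ConsistentHashRouter:
    """Maps each model name to a stable pod, so repeated requests for
    the same model land on a pod that likely already has it loaded."""

    def __init__(self, pods, replicas=100):
        # Place `replicas` virtual nodes per pod on the hash ring to
        # smooth the distribution of models across pods.
        self._ring = sorted(
            (self._hash(f"{pod}#{i}"), pod)
            for pod in pods
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def pod_for(self, model_name: str) -> str:
        # Walk clockwise to the first virtual node at or after the
        # model's hash position; wrap around at the end of the ring.
        idx = bisect.bisect(self._keys, self._hash(model_name)) % len(self._ring)
        return self._ring[idx][1]


# Example with three hypothetical Triton pods:
router = ConsistentHashRouter(["triton-0:8000", "triton-1:8000", "triton-2:8000"])
print(router.pod_for("resnet50"))    # always the same pod for this model
print(router.pod_for("bert-large"))  # possibly a different pod, but stable
```

A nice property of consistent hashing over a plain `hash(model) % num_pods` scheme is that adding or removing a pod remaps only a small fraction of models, so a scaling event does not trigger a cluster-wide wave of reloads. In Kubernetes this routing logic could live in a small gateway service or be approximated with an ingress/service mesh that supports hash-based load balancing on a request header carrying the model name; either way, this is one possible design, not a canonical Triton solution.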