Mistral-7B-Instruct-v0.2 inference slows down when implemented on the offline_inference.py example, possibly due to improper usage #4813
Closed · Alf-Z-SymphoMe announced in General · Replies: 0 comments
I am trying to use the Mistral-7B model with an input prompt that consists of a fixed part plus some variables produced by other Python scripts.

I currently run everything with `python run_model.py`: this script generates the variable part of the prompt (extracted from data provided by the user) and then runs the Mistral-7B model itself. However, the model is loaded from scratch on every run, and that loading accounts for most of the total inference time. I would like to avoid this, and I am trying to figure out how the vLLM library could help.
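For reference, this is roughly the reuse pattern I am aiming for: load the model once and serve many prompts from the same process, as in the offline_inference.py example. This is only a sketch; `build_variable_prompt` and the fixed prompt text stand in for my own preprocessing code.

```python
from vllm import LLM, SamplingParams

# Load the model once; this is the expensive step I want to avoid repeating.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

FIXED_PART = "You are a helpful assistant. Answer based on the data below.\n"

def build_variable_prompt(user_data: str) -> str:
    # Placeholder for my own scripts that derive the variable part of the prompt.
    return f"Data: {user_data}\nQuestion: summarize the data."

# As long as this process stays alive, generate() reuses the already-loaded weights.
for user_data in ["sample input 1", "sample input 2"]:
    prompt = FIXED_PART + build_variable_prompt(user_data)
    outputs = llm.generate([prompt], sampling_params)
    print(outputs[0].outputs[0].text)
```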
I tried adapting `offline_inference.py` from the vllm-project repo. While it does shorten the model loading time, the overall inference actually becomes slower. Which of the examples in the vllm-project repo would be a good fit for my case? Maybe `api_client.py`?
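To frame the question, my understanding of the `api_client.py` route is a client/server split like the sketch below: keep a vLLM server running in one process (so the weights are loaded a single time) and have `run_model.py` only send HTTP requests. The endpoint and payload fields here are assumptions based on the api_server/api_client example and may differ between vLLM versions.

```python
# Start the server once in a separate terminal, e.g.:
#   python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2
# run_model.py then only sends lightweight HTTP requests.
import requests

def generate(prompt: str, max_tokens: int = 256) -> str:
    # Endpoint and JSON fields follow the api_client.py example (assumed here);
    # adjust if the server version expects a different schema.
    response = requests.post(
        "http://localhost:8000/generate",
        json={"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7},
    )
    response.raise_for_status()
    return response.json()["text"][0]

print(generate("Fixed instructions plus the variable part of the prompt"))
```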