Expected speed up over HuggingFace #662
deepakdalakoti announced in Q&A
Replies: 1 comment
-
Hi,
As per the vLLM homepage, the results show a huge speedup over the HuggingFace API. However, my benchmarking results show only a modest improvement over HuggingFace (~15%). I am using Llama-2-7b-hf for testing. I used the code and prompt described here for testing. Is anyone seeing similar results? Are there some settings which should improve the results significantly? I am using the following versions:
CUDA: 11.8
vllm: 0.1.3
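For reference, a single-prompt latency comparison looks roughly like the sketch below. This is a minimal illustration, not the benchmark code linked above: the model name, prompt, and token budget are placeholders, and it assumes a local GPU with the Llama-2-7b-hf weights available.

```python
# Minimal single-prompt latency comparison: HuggingFace generate() vs. vLLM.
# MODEL, PROMPT, and MAX_NEW_TOKENS are illustrative placeholders.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-2-7b-hf"
PROMPT = "Explain the difference between latency and throughput."
MAX_NEW_TOKENS = 256

# --- HuggingFace baseline (greedy decoding by default) ---
tokenizer = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16
).to("cuda")
inputs = tokenizer(PROMPT, return_tensors="pt").to("cuda")
start = time.perf_counter()
# Note: HF may stop early at EOS, so the two runs are not guaranteed to
# produce the same number of tokens.
hf_model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
print(f"HF latency:   {time.perf_counter() - start:.2f}s")

# Free GPU memory before vLLM allocates its KV cache.
del hf_model
torch.cuda.empty_cache()

# --- vLLM (temperature=0.0 is greedy; ignore_eos forces the full budget) ---
llm = LLM(model=MODEL)
params = SamplingParams(temperature=0.0, max_tokens=MAX_NEW_TOKENS, ignore_eos=True)
start = time.perf_counter()
llm.generate([PROMPT], params)
print(f"vLLM latency: {time.perf_counter() - start:.2f}s")
```

For a single stream like this, both backends spend most of their time in the same token-by-token decode loop, which is consistent with seeing only a modest gap.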
-
Same here, not seeing much improvement in latency, although batch inferencing is much faster with vLLM compared to HuggingFace.
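For context, vLLM's headline numbers come from throughput under many concurrent requests (continuous batching) rather than single-stream latency. A sketch of the batched case, with placeholder prompts and sampling settings, might look like:

```python
# Batched generation with vLLM: passing the whole prompt list in one
# generate() call lets the scheduler pack requests onto the GPU. Looping
# over prompts one at a time would measure single-stream latency instead.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.8, max_tokens=128)

# 64 illustrative prompts; real benchmarks would use a request trace.
prompts = [f"Write a one-line summary of topic {i}." for i in range(64)]
outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text)
```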