How to achieve faster prompt eval time with a single GPU: NVIDIA A100-PCIE-40GB? #6480
Right now I'm getting these results with
In general, my goal is to make an interactive application where the user uploads a document and then asks questions about it. The document usually contains at most 5000 tokens. With RAG, I automatically find the most relevant sentences in the document based on what the user asks, so I cannot cache the prompt beforehand. Is there a way to achieve faster prompt eval time (a shorter time to first token) so the user doesn't have to wait a few seconds before they start seeing output? I tried Exllamav2 with
However, the quality of the output is worse than with
These are my settings:
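To make the question above concrete, here is a minimal Python sketch of how time to first token relates to prompt length and prompt eval speed, plus a small timer that works with any streaming token iterator. The function names `estimate_ttft_seconds` and `measure_ttft` are hypothetical helpers, and the throughput figure in the usage comment is an illustrative assumption, not a measurement from this thread.

```python
import time
from typing import Iterable, Tuple


def estimate_ttft_seconds(prompt_tokens: int, prompt_eval_tokens_per_s: float) -> float:
    """Rough lower bound on time to first token: the whole prompt must be
    evaluated before the first output token can be sampled."""
    if prompt_eval_tokens_per_s <= 0:
        raise ValueError("prompt_eval_tokens_per_s must be positive")
    return prompt_tokens / prompt_eval_tokens_per_s


def measure_ttft(stream: Iterable[str]) -> Tuple[float, str]:
    """Measure the wall-clock time until the first item of a token stream.

    `stream` is assumed to be lazy (for example a generator), so the prompt
    is only processed when the first item is requested.
    """
    start = time.perf_counter()
    first_token = next(iter(stream))
    return time.perf_counter() - start, first_token


# Illustrative only: a 5000-token prompt at an assumed 2500 tokens/s
# of prompt eval throughput.
expected_wait = estimate_ttft_seconds(5000, 2500.0)
```

Because the estimate scales linearly with prompt length, sending fewer retrieved sentences or raising prompt eval throughput are the two levers this arithmetic exposes.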
#6505 seems related