how can I get the best inference speed in my situation #9503
FranzKafkaYu asked this question in Q&A · Unanswered
Hello guys, I am working with llama.cpp on my Android device, and every inference begins with the same pattern: the `prompt.prefix` and `prompt.suffix` are both constant and never change; the only thing that changes is the user input. Currently I am using code based on `simple.cpp` from the examples. Two questions here:

1. `llama_decode` costs 1000 ms+ on every call.
2. The `input_prefix` and `input_suffix` are tokenized/decoded repeatedly each time. Is there any way to reuse the output from tokenizing/decoding the `input_prefix` and `input_suffix`?

Hoping you guys can give me some advice, thanks!
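One common approach to avoid re-decoding a constant prefix is KV-cache reuse: tokenize and decode the prefix once right after creating the context, and for every new request remove only the KV entries that came after it, then decode just the user input plus suffix starting at that position. Below is a minimal sketch against the llama.cpp C API; the exact signatures of `llama_tokenize`, `llama_batch_get_one`, and `llama_kv_cache_seq_rm` differ between llama.cpp versions, so treat this as an outline rather than drop-in code.

```cpp
// Sketch: cache the constant prefix once, then reuse its KV entries for each request.
// Assumption: this targets a llama.cpp build where llama_tokenize takes the model,
// llama_batch_get_one takes (tokens, n_tokens, pos0, seq_id), and
// llama_kv_cache_seq_rm(ctx, seq_id, p0, p1) removes KV entries in [p0, p1).
#include "llama.h"

#include <string>
#include <vector>

static std::vector<llama_token> tokenize(const llama_model * model, const std::string & text, bool add_bos) {
    std::vector<llama_token> tokens(text.size() + 16);
    const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                 tokens.data(), (int) tokens.size(),
                                 add_bos, /*parse_special=*/true);
    tokens.resize(n > 0 ? n : 0);
    return tokens;
}

// Decode the constant prefix exactly once, right after creating the context.
// Returns the number of prefix tokens so later calls know where to resume.
static int prime_prefix(llama_context * ctx, const llama_model * model, const std::string & prefix) {
    std::vector<llama_token> toks = tokenize(model, prefix, /*add_bos=*/true);
    llama_batch batch = llama_batch_get_one(toks.data(), (int) toks.size(), /*pos0=*/0, /*seq_id=*/0);
    llama_decode(ctx, batch); // prefix KV entries now live at positions [0, toks.size())
    return (int) toks.size();
}

// For every new request: drop only the KV entries after the cached prefix,
// then decode just "user input + suffix" starting at position n_prefix.
static void run_request(llama_context * ctx, const llama_model * model, int n_prefix,
                        const std::string & user_input, const std::string & suffix) {
    // p1 = -1 means "remove everything from p0 to the end of the sequence"
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/n_prefix, /*p1=*/-1);

    std::vector<llama_token> toks = tokenize(model, user_input + suffix, /*add_bos=*/false);
    llama_batch batch = llama_batch_get_one(toks.data(), (int) toks.size(), /*pos0=*/n_prefix, /*seq_id=*/0);
    llama_decode(ctx, batch);

    // ... sample and generate as in simple.cpp, continuing from position n_prefix + toks.size()
}
```

If the cached prefix should also survive a process restart, the state API (`llama_state_save_file` / `llama_state_load_file`) and the `--prompt-cache` option of the CLI examples cover the same idea, if they are available in your build.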