Trying to optimise the prompt eval time for a fixed input token size for llava.cpp #4292

Answered by cmp-nct
jjiteshh asked this question in Q&A

When you run LLaVA, three batches are processed in sequence before any output is generated:

  1. system prompt
  2. image embeddings
  3. your question prompt

In addition, time is spent computing the CLIP/ViT embeddings, which currently happens on the CPU.

I guess this could be optimized by converting the two text prompts into embeddings first and then evaluating everything together, which would allow larger batches in a single run. But I doubt the gains would be significant.
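To get a feel for why the gain is probably small, here is a rough sketch that only counts evaluation calls. It assumes each segment is evaluated in chunks of at most `n_batch` tokens; the segment lengths and the `n_batch` value are made-up illustration numbers (576 image embeddings is typical for a LLaVA-1.5 style encoder, but treat it as an assumption too). It does not call llama.cpp at all.

```python
import math

def eval_calls_separate(segment_lengths, n_batch):
    """Calls needed when every segment is evaluated on its own,
    each split into chunks of at most n_batch tokens."""
    return sum(math.ceil(n / n_batch) for n in segment_lengths)

def eval_calls_combined(segment_lengths, n_batch):
    """Calls needed when all segments are concatenated first
    and the whole sequence is split into chunks of at most n_batch."""
    return math.ceil(sum(segment_lengths) / n_batch)

# Hypothetical lengths: system prompt, image embeddings, question prompt.
segments = [40, 576, 20]
n_batch = 512  # hypothetical batch size

separate = eval_calls_separate(segments, n_batch)
combined = eval_calls_combined(segments, n_batch)
print(f"separate: {separate} calls, combined: {combined} calls")
```

With these numbers the combined run saves a couple of calls, but the total number of tokens to evaluate is unchanged, so the saving is limited to per-call overhead.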

Looking at your overall speed, you do not have a batch-processing problem but a general performance problem.
I assume you run this on very low-end hardware? With a good GPU you can get thousands of tokens/second in batch processing, but you are sitting at 74.
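As a back-of-the-envelope check, the sketch below turns a throughput figure into a prompt eval time. The 74 tokens/second is the figure from this thread; the prompt length of 636 tokens and the 1000 tokens/second comparison point are assumptions picked only for illustration.

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput from a token count and an elapsed time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)

def prompt_eval_seconds(n_tokens, tok_per_s):
    """Time to evaluate n_tokens at a given batch throughput."""
    return n_tokens / tok_per_s

n_prompt_tokens = 636  # hypothetical: system prompt + image embeddings + question

slow = prompt_eval_seconds(n_prompt_tokens, 74.0)    # throughput reported in this thread
fast = prompt_eval_seconds(n_prompt_tokens, 1000.0)  # assumed GPU-class throughput

print(f"at 74 tok/s:   {slow:.2f} s")
print(f"at 1000 tok/s: {fast:.2f} s")
```

The gap between the two results is much larger than anything that merging the three batches could recover, which is why the overall throughput is the thing to look at first.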

When y…

Answer selected by jjiteshh