Trying to optimise the prompt eval time for a fixed input token size for llava.cpp #4292

jjiteshh · 2023-12-02T09:18:02Z

I am trying to read and modify llava-cli.cpp, llava.cpp, and llama.cpp in the hope that I can improve the prompt eval time.

My total input is limited to 644 tokens. This includes the image context and the text context.

As you can see, the prompt eval time takes the largest share in my case, and I plan to keep the input at a fixed length.

Even after setting batch_size to the token length (644) or higher, the tokens are processed in batches smaller than that value.

Why is the batch processed in steps of 35, 576, 33 and 1? I would like this to happen in one go, to maybe speed up the process.

Answered by cmp-nct

cmp-nct · 2023-12-02T14:36:49Z

When you process llava you have 3 separate batch evaluations in sequence before the output is generated:

- system prompt
- image embeddings
- your question prompt

In addition there is time spent to process the CLIP/ViT embeddings, currently on CPU.
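
The step sizes in the question line up with these three segments: 35 + 576 + 33 = 644. Here is a minimal Python sketch of that scheduling, assuming each segment is evaluated on its own and split into chunks of at most the batch size. The segment names and helper functions are illustrative, not llava.cpp identifiers, and the token counts are the ones reported in the question. The trailing step of 1 is presumably the single-token evaluation during generation; that part is an assumption and is not modelled here.

```python
# Sketch only: models how sequentially evaluated segments turn into batch steps.
# Names are hypothetical; token counts come from the question (35, 576, 33).

def chunk_sizes(n_tokens, n_batch):
    """Split n_tokens into consecutive chunks of at most n_batch tokens."""
    sizes = []
    remaining = n_tokens
    while remaining > 0:
        step = min(remaining, n_batch)
        sizes.append(step)
        remaining -= step
    return sizes


def eval_schedule(segments, n_batch):
    """Return (segment_name, chunk_size) pairs in evaluation order."""
    schedule = []
    for name, n_tokens in segments:
        for size in chunk_sizes(n_tokens, n_batch):
            schedule.append((name, size))
    return schedule


SEGMENTS = [
    ("system prompt", 35),
    ("image embeddings", 576),
    ("question prompt", 33),
]

if __name__ == "__main__":
    for name, size in eval_schedule(SEGMENTS, n_batch=644):
        print(f"{name}: {size}")
```

Under these assumptions a batch size of 644 still gives steps of 35, 576 and 33, because a larger batch size never merges separate segments. A batch size of 512 would only split the image segment further, into 512 + 64.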

I guess this could be optimized by converting the two text prompts into embeddings first and then combining the evaluation, allowing for larger batch processing in one run. But I am doubtful about the gains.

Looking at your general speed, you do not have a batch processing problem but a general performance problem.
I assume you run this on very low-end hardware? With a good GPU you can get thousands of tokens/second batch speed, but you sit at 74.

When your hardware is this weak you should first optimize the configuration; you can likely gain a lot more from that.
Start with a smaller batch size: try 32 or 64 instead of 512. Look at other processes competing for your hardware, at the thread count, at the number of GPU layers offloaded, and maybe reduce the context size to shrink the memory footprint.
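
One way to try these settings systematically is to generate a small grid of command lines and time each run. The Python sketch below only builds the commands. It assumes the usual llama.cpp options (`-m`, `--mmproj`, `--image`, `-p`, `-b`, `-t`, `-ngl`, `-c`) are accepted by your llava-cli build, so check `--help` first. The binary path, model files, image, and the values in the grid are all hypothetical placeholders.

```python
# Sketch only: builds llava-cli command lines for a small configuration sweep.
# Flag names are assumed to match your build; all paths are placeholders.
from itertools import product


def build_commands(binary, model, mmproj, image, prompt,
                   batch_sizes, thread_counts, gpu_layers, ctx_size):
    """Return one argument list per (batch size, thread count) combination."""
    commands = []
    for n_batch, n_threads in product(batch_sizes, thread_counts):
        commands.append([
            binary,
            "-m", model,
            "--mmproj", mmproj,
            "--image", image,
            "-p", prompt,
            "-b", str(n_batch),       # batch size
            "-t", str(n_threads),     # CPU threads
            "-ngl", str(gpu_layers),  # layers offloaded to the GPU
            "-c", str(ctx_size),      # context size
        ])
    return commands


if __name__ == "__main__":
    grid = build_commands(
        binary="./llava-cli",            # placeholder path
        model="model.gguf",              # placeholder file
        mmproj="mmproj.gguf",            # placeholder file
        image="test.jpg",                # placeholder file
        prompt="Describe the image.",
        batch_sizes=[32, 64, 512],
        thread_counts=[4, 8],
        gpu_layers=1,
        ctx_size=2048,
    )
    for cmd in grid:
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` and the reported prompt eval time compared across runs.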

6 replies

cmp-nct Dec 2, 2023

Ok I lack experience with Apple, you might want to look here: #4167

In general, you should try different quantizations. It's not just about the "bits" in a quantization; a lot depends on how the kernel unpacks it, so some quants will be faster than others.
You can also look at the 7B model; it's not that much worse than the 13B. llava is a lot about getting image details out and not about reasoning, so the smaller models are sufficient.
It looks like you can expect 200+ tokens/sec performance on your box when choosing 7B.
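
To put those rates in perspective for the 644-token input from the question, here is a back-of-the-envelope Python sketch. It assumes prompt eval time is simply token count divided by throughput. It ignores the CLIP/ViT encoding time and any per-batch overhead, so treat the results as rough estimates only.

```python
# Sketch only: rough prompt eval time from token count and throughput.
# 644 tokens, 74 tokens/s and 200 tokens/s are the figures quoted in this thread.

def prompt_eval_seconds(n_tokens, tokens_per_second):
    """Estimated prompt eval time in seconds, ignoring fixed overheads."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    for rate in (74, 200):
        seconds = prompt_eval_seconds(644, rate)
        print(f"{rate} tokens/s: {seconds:.2f} s")
```

At 74 tokens/s this comes to about 8.7 seconds, and at 200 tokens/s to about 3.2 seconds.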

bleedingfight Dec 8, 2023

@cmp-nct Hi, I want to use llava to generate descriptions of images, but I'm not sure whether llava can output descriptions for all of the given images. I have modified clip to handle batches of images, but can llama_decode be used for batched embeddings? Thanks for your reply.

cmp-nct Dec 8, 2023

Just feed the images one by one, in sequence.
Batch processing, in the current context, means that the ggml library will process multiple embeddings/tokens at once.

When you input an image and ask for one description you already have a thousand "tokens" to batch-evaluate.
So the best you can do is to feed one image after another (without restarting the program, of course) and batch process each image task, not all images at once.
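
A minimal Python sketch of that pattern: load the model once, then handle the images one after another. The `load_model` and `describe_one` callables are hypothetical stand-ins for whatever binding or wrapper you use; they are not llama.cpp or llava functions.

```python
# Sketch only: sequential image processing with a single model load.
# load_model and describe_one are hypothetical callables supplied by the caller.
from typing import Callable, Dict, Iterable


def describe_images(load_model: Callable[[], object],
                    describe_one: Callable[[object, str, str], str],
                    image_paths: Iterable[str],
                    prompt: str) -> Dict[str, str]:
    """Load the model once, then describe each image in turn."""
    model = load_model()  # paid once, not once per image
    results = {}
    for path in image_paths:
        # Each call evaluates one image plus the prompt as its own batch task.
        results[path] = describe_one(model, path, prompt)
    return results
```

The point of the design is that model loading happens outside the loop, so only the per-image evaluation cost repeats.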

To batch process images you'd need much more complex code and a multi-GPU server. That's something I'd not use llama.cpp for at this point; Python has more support for large-scale use.

jjiteshh Dec 9, 2023
Author

@cmp-nct Do you think the prompt processing + eval processing time will be faster using PyTorch or llava.cpp? I know that llava still has the issue of the CLIP model running on the CPU. But if we consider running this on something like an RTX 4090 or 3090 Ti, should I use PyTorch or the C++ implementation?

bleedingfight Dec 12, 2023

@cmp-nct Thanks. I was planning to try modifying the code to support batched inference over multiple images, but it seems very troublesome. I tried to solve it with the Python llava, but the speed is still very slow (RTX 3090, batch=4). Do you have any good suggestions on how to accelerate batched inference?
