How to pass in multiple inputs at once? #3222
-
Hello, I'm trying to use llama.cpp for text summarization on my dataset of >100,000 .txt files. I see that there is an option (-f) which lets the model read input from a file. Is it possible to process multiple files at once? How does this relate to the batch size option (-b)?
-
Not at the moment, but most models have a context limit of around 4096 tokens - and that includes both the prompt and the output. You're not thinking you can feed all those files to a model and get an overall summary, right? If not, you can just make a simple script that calls `main` on each file.
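For a single file, that script would just wrap a call like this (a minimal sketch; `model.gguf` is a placeholder model path, `-n` caps the number of generated tokens, and in practice you'd prepend a summarization instruction to the file's text):

```sh
# Read the prompt from input.txt and print the model's output to stdout.
./main -m model.gguf -f input.txt -n 256
```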
-
I'm interested in batch inference as well. 4096 / 384 ≈ 10.7
@novice03
On Unix-type OSes this is really easy. You wouldn't want to do it quite this simply, but just as an example:
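A minimal sketch (assuming `main` sits in the current directory and `model.gguf` is a placeholder for your actual model file):

```sh
# Run main once per .txt file in the current directory.
# stdout (the generated text) goes to "filename.txt.out",
# stderr (logs and timing stats) goes to "filename.txt.err".
for f in *.txt; do
    ./main -m model.gguf -f "$f" > "$f.out" 2> "$f.err"
done
```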
That'll work with `sh`-compatible shells like bash, zsh, etc. and just runs `main` on every text file in the current directory, saving the output from `stdout` to "filename.txt.out" and output from `stderr` to "filename.txt.err".

GPU is generally going to be a lot faster than CPU. Also, even GPUs without a lot of memory can still speed up prompt processing a lot. Assuming you…