Parallel Requests support #8567
Unanswered
akhilreddy0703 asked this question in Q&A
Replies: 2 comments 9 replies
-
Hard to say. Can you run the same test without using the Docker stuff?

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# provide log from this command
LLAMA_DISABLE_LOGS=1 make -j
./llama-server \
  --port 8082 \
  -m $model_path \
  -c 40960 \
  --no-mmap \
  --threads 48 \
  --parallel 100

Also run the following benchmark and post the results:

./llama-bench \
  -m $model_path \
  -p 1,2,4,8,10,16,32,64,100,128
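
To drive parallel load against the server started above, something along these lines can be used; the prompt, n_predict, and the number of concurrent requests are illustrative assumptions, and the port matches the command above.

# Fire 10 concurrent requests at the server's /completion endpoint (assumed values).
# Each curl runs in the background; `wait` blocks until all of them finish and
# `time` reports the total wall clock for the batch.
time (
  for i in $(seq 1 10); do
    curl -s http://localhost:8082/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Write a short story about a robot.", "n_predict": 128}' \
      -o /dev/null &
  done
  wait
)

Dividing the total generated tokens (here 10 x 128) by that wall-clock time gives a rough aggregate tokens/sec figure comparable to the numbers discussed in this thread.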
-
Hi @akhilreddy0703 @ggerganov,
-
Hi @ggerganov and community,
I ran an experiment comparing how TGI and llama.cpp handle parallel requests, to understand how many parallel users a single instance can serve. I did a Docker deployment of the llama.cpp server with the configuration below for Llama3-8B (int4).
Test details
Additional Info:
With a TGI deployment of the Llama3-8B (bf16) model, I also tested (1, 3, 10, 30, 100) parallel requests and got approximately (10, 9, 7, 5, 3) tokens/sec for the respective request counts.
It is obvious that a lower-precision (int4) model should ideally get better throughput than a half-precision (bf16) model.
But what I observed is that TGI serves better: its throughput decreases gradually as the number of parallel requests increases, while llama.cpp does not show the same behaviour.
The question is:
Why do I see a drastic fall in throughput on the llama.cpp server hosting a quantized (int4) model compared to the TGI server hosting a bf16 model? Is it a problem with how llama.cpp handles parallelization?
Has anyone explored parallelization with llama.cpp? I would like to hear your thoughts, and please suggest good practices to get the best out of it.
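
For concreteness, a minimal sketch of the kind of bare-metal launch in question; the flag values are illustrative assumptions, not the Docker configuration used in the test above. As far as I understand, the server divides the -c context window across the --parallel slots, so each concurrent request gets roughly c/parallel tokens of context.

# Illustrative launch, not the actual test configuration:
# 8 slots share a 32768-token context (~4096 tokens per concurrent request),
# with continuous batching so slots are filled as requests arrive.
./llama-server \
  -m $model_path \
  --port 8082 \
  -c 32768 \
  --parallel 8 \
  --cont-batching \
  --threads 48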
References:
Thanks in advance,
Akhilreddy G.