Replies: 1 comment
-
The parallel processing doesn't work with the latest code. I wonder if there was a point at which it was working. During handling of the above exception, another exception occurred: Traceback (most recent call last):
-
Hello,
I am currently using llama.cpp, and I have encountered an issue with running tasks in parallel. When I attempt to run 4 separate tasks in parallel, I notice a significant decrease in speed compared to running a single task: it takes about 2 minutes for 4 questions with codellama-13b-instruct.Q2_K.gguf.
(In fact, it looked like only 3 of them were processed in parallel.)
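For context, here is roughly how I drive the server. This is a minimal sketch assuming llama.cpp's HTTP server example and its /completion endpoint; the launch flags, host, port, and prompts below are placeholders, not my exact setup:

```python
# Minimal sketch: issue 4 requests in parallel against llama.cpp's
# HTTP server example. Assumes the server was started with parallel
# slots enabled, e.g.:
#   ./server -m codellama-13b-instruct.Q2_K.gguf -c 4096 -np 4 -cb
# (-np sets the number of slots, -cb enables continuous batching; as I
# understand it, the shared context -c is split across the slots.)
import concurrent.futures
import json
import time
import urllib.request

SERVER = "http://127.0.0.1:8080/completion"  # placeholder host/port

def ask(prompt: str) -> str:
    payload = json.dumps({"prompt": prompt, "n_predict": 128}).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["content"]

prompts = [f"Question {i}: explain binary search." for i in range(4)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(ask, prompts))
print(f"4 parallel requests took {time.time() - start:.1f}s")
```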
Here is a description of my working environment and measurements:
Single-task runtime: the average time from request to completion of generation was 29 s.
The prompt eval speed was 52.884 tokens/second.
The eval speed was 6.498 tokens/second.
Parallel-task runtime: 10 requests were submitted. The first one was handled on its own; once it had been completely answered, the next three questions were processed and answered in parallel. For the final six questions, the connections were closed.
Note that, judging from the server output, only questions 2 to 4 appear to have been processed in parallel, whereas the first question was not.
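To check which requests actually overlap, I use a small timing harness along these lines. Again a sketch: the endpoint, payload, and prompts are the same assumptions as in the block above.

```python
# Record per-request start/end timestamps to see which of the 10
# requests actually overlap on the server.
import concurrent.futures
import json
import time
import urllib.request

SERVER = "http://127.0.0.1:8080/completion"  # placeholder host/port

def timed_ask(i: int):
    t0 = time.time()
    payload = json.dumps(
        {"prompt": f"Question {i}: what is a mutex?", "n_predict": 64}
    ).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            resp.read()
        status = "ok"
    except Exception as e:  # e.g. the server closed the connection
        status = f"failed: {e}"
    return i, t0, time.time(), status

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    for i, t0, t1, status in pool.map(timed_ask, range(10)):
        print(f"req {i}: start={t0:.1f}s end={t1:.1f}s ({status})")
```

If at most four start/end windows overlap, that would match 4 slots; I would have expected the requests beyond the slot count to queue, which is why the closed connections for the last six surprise me.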
Could there be an issue with the parallel processing feature of the program, or are there configuration settings I might be missing for optimal parallel execution? I would appreciate your guidance in resolving this performance issue.
Thank you
Best regards