Response time issue #2166
07prashantg asked this question in Q&A · Unanswered · 0 replies
I’ve deployed a private GPT model on an instance with the following configuration:
GPU: NVIDIA L4 (24 GB VRAM)
CPU: 25 vCPUs
RAM: 110 GB
CUDA Version: 12
I’m using the v1/chat/completions endpoint to generate MCQ questions (including options and the correct answer) from content provided in the request body.
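For context, each request looks roughly like the sketch below. The host, port, model name, prompt, and generation parameters are placeholders, not the actual deployment values:

```python
import requests

# All values below are placeholders; the real host, model name, and
# prompt come from the private deployment.
API_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "private-gpt",  # placeholder model identifier
    "messages": [
        {
            "role": "system",
            "content": "Generate one MCQ with four options and indicate the correct answer.",
        },
        {"role": "user", "content": "<source content from the request body>"},
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}

resp = requests.post(API_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```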
Issue:
Each individual request takes about 3 seconds to process. When I perform a load test at 1 request per second (RPS), some requests succeed, but about half of them result in Gateway Timeout errors.
Upon investigation, I found that:
The response is eventually generated, but it takes longer than the timeout threshold.
CPU utilization stays at around 4% and memory usage at about 3%, so CPU and memory are clearly not the bottleneck.
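For reference, the 1 RPS load test can be reproduced with a sketch like the one below. The URL, payload, and 30 s client timeout are assumptions (any load-testing tool would do); it fires one request per second without waiting for earlier ones and counts how many time out:

```python
import asyncio
import aiohttp

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
PAYLOAD = {
    "model": "private-gpt",  # placeholder model identifier
    "messages": [{"role": "user", "content": "<MCQ generation prompt>"}],
}

async def fire(session: aiohttp.ClientSession) -> str:
    # Send one request and classify the outcome by status or timeout.
    try:
        timeout = aiohttp.ClientTimeout(total=30)  # assumed gateway-like threshold
        async with session.post(API_URL, json=PAYLOAD, timeout=timeout) as resp:
            await resp.read()
            return str(resp.status)
    except asyncio.TimeoutError:
        return "timeout"

async def main(duration_s: int = 60) -> None:
    async with aiohttp.ClientSession() as session:
        # Launch one request per second, concurrently, for duration_s seconds.
        tasks = [asyncio.create_task(fire(session))]
        for _ in range(duration_s - 1):
            await asyncio.sleep(1)
            tasks.append(asyncio.create_task(fire(session)))
        results = await asyncio.gather(*tasks)
        print({outcome: results.count(outcome) for outcome in set(results)})

asyncio.run(main())
```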
Request:
Could someone please suggest:
What specific changes I can make to the current setup, architecture, or inference strategy to reliably handle at least 1 RPS?
Is there a way to optimize the model, inference pipeline, or request handling to reduce latency and prevent timeouts?
Any recommendations would be greatly appreciated.