Response time issue #2166
07prashantg asked this question in Q&A · Unanswered · 0 replies
I’ve deployed a private GPT model on an instance with the following configuration:
GPU: NVIDIA L4 (24 GB VRAM)
CPU: 25 vCPUs
RAM: 110 GB
CUDA Version: 12
I’m using the v1/chat/completions endpoint to generate MCQ questions (including options and the correct answer) from content provided in the request body.
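For context, each request looks roughly like the sketch below. The host, port, model name, prompt, and generation parameters are placeholders, not the actual deployment values:

```python
import requests

# All values below are placeholders; the real host, model name, and
# prompt come from the private deployment.
API_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "private-gpt",  # placeholder model identifier
    "messages": [
        {
            "role": "system",
            "content": "Generate one MCQ with four options and indicate the correct answer.",
        },
        {"role": "user", "content": "<source content from the request body>"},
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}

resp = requests.post(API_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```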
Issue:
Each individual request takes about 3 seconds to process. When I perform a load test at 1 request per second (RPS), some requests succeed, but about half of them result in Gateway Timeout errors.
Upon investigation, I found that:
The response is eventually generated, but it takes longer than the timeout threshold.
CPU utilization stays at around 4% and memory usage at about 3%, so CPU and memory are clearly not the bottleneck.
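For reference, the 1 RPS load test can be reproduced with a sketch like the one below. The URL, payload, and 30 s client timeout are assumptions (any load-testing tool would do); it fires one request per second without waiting for earlier ones and counts how many time out:

```python
import asyncio
import aiohttp

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
PAYLOAD = {
    "model": "private-gpt",  # placeholder model identifier
    "messages": [{"role": "user", "content": "<MCQ generation prompt>"}],
}

async def fire(session: aiohttp.ClientSession) -> str:
    # Send one request and classify the outcome by status or timeout.
    try:
        timeout = aiohttp.ClientTimeout(total=30)  # assumed gateway-like threshold
        async with session.post(API_URL, json=PAYLOAD, timeout=timeout) as resp:
            await resp.read()
            return str(resp.status)
    except asyncio.TimeoutError:
        return "timeout"

async def main(duration_s: int = 60) -> None:
    async with aiohttp.ClientSession() as session:
        # Launch one request per second, concurrently, for duration_s seconds.
        tasks = [asyncio.create_task(fire(session))]
        for _ in range(duration_s - 1):
            await asyncio.sleep(1)
            tasks.append(asyncio.create_task(fire(session)))
        results = await asyncio.gather(*tasks)
        print({outcome: results.count(outcome) for outcome in set(results)})

asyncio.run(main())
```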
Request:
Could someone please suggest:
What specific changes I can make to the current setup, architecture, or inference strategy to reliably handle at least 1 RPS?
Is there a way to optimize the model, inference pipeline, or request handling to reduce latency and prevent timeouts?
Any recommendations would be greatly appreciated.