Help with delay in llama-server #11005

indie-ai · 2024-12-28T20:50:47Z

indie-ai
Dec 28, 2024

I'm developing myself a front-end for Llama.cpp (and others) that allows me to switch models dynamically. I've created a custom process server in the background using execve() to run any program on demand, currently I have the server running on the local machine. When my front end needs to use model X, it instructs the backend server to launch llama-server with specific arguments for that model. If another model Y is required later, it signals the server to terminate the current instance of llama-server (model X) and load model Y instead. This setup has been working well until I recently updated Llama.cpp by redownloading and recompiling the latest version (4393 (d79d8f3)).

Now, I'm encountering an issue where the first POST request to /completion takes up to 15 seconds to start inference, as indicated by monitoring GPU utilization. Subsequent requests are much faster. This delay only occurs on the initial post after starting llama-server via execve().

Here's a sample log for the first POST:

0.15.774.026 I slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 67, n_tokens = 67, progress = 1.000000
0.15.774.054 I slot update_slots: id  0 | task 0 | prompt done, n_past = 67, n_tokens = 67
0.15.774.055 D srv  update_slots: decoding batch, n_tokens = 67   <--14 seconds? ************************************
0.29.030.566 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token:    35 'D'
0.29.030.569 D srv  update_slots: run slots completed
...
prompt eval time =   13256.58 ms /    67 tokens (  197.86 ms per token,     5.05 tokens per second)
       eval time =    7969.70 ms /   561 tokens (   14.21 ms per token,    70.39 tokens per second)
      total time =   21226.28 ms /   628 tokens

And for the second POST:

0.41.991.717 I slot update_slots: id  0 | task 562 | prompt done, n_past = 67, n_tokens = 1
0.41.991.718 D srv  update_slots: decoding batch, n_tokens = 1
0.42.003.444 D slot process_toke: id  0 | task 562 | n_decoded = 1, n_remaining = 999, next token:    35 'D'
0.42.003.446 D srv  update_slots: run slots completed
...
prompt eval time =      11.84 ms /     1 tokens (   11.84 ms per token,    84.47 tokens per second)
       eval time =    7693.20 ms /   547 tokens (   14.06 ms per token,    71.10 tokens per second)
      total time =    7705.04 ms /   548 tokens

When I run the same command directly from the command line, there's no such delay:

Command used:

/media/TData/AI/llama.cpp/build/bin/./llama-server -v --log-file /media/ram/log2.txt --no-kv-offload --log-prefix --log-timestamps -t 2 -ngl 70 -c 8196 -m /media/TData/AI/llm/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --host 192.168.55.9

Sample log from command line:

0.08.124.116 I slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 67, n_tokens = 67, progress = 1.000000
0.08.124.134 I slot update_slots: id  0 | task 0 | prompt done, n_past = 67, n_tokens = 67
0.08.124.135 D srv  update_slots: decoding batch, n_tokens = 67
0.08.246.368 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token:    35 'D'
0.08.246.373 D srv  update_slots: run slots completed
...
prompt eval time =   13256.58 ms /    67 tokens (  197.86 ms per token,     5.05 tokens per second)
       eval time =    7969.70 ms /   561 tokens (   14.21 ms per token,    70.39 tokens per second)
      total time =   21226.28 ms /   628 tokens

I'm not sure what's causing this discrepancy when running via execve(). I suspect it might be related to environment variables or some changes in the latest version of the server, or more then likely something stupid I'm doing. Any ideas on how to resolve this issue? I switch models often and that delay is killing me. Would this have any thing to do with the prompt caching? I've tried turning it off with --no-kv-offload but that did not help.

On the old version (3772 (23e0d70)), running through execve() on the first POST:

0.11.084.865 I slot update_slots: id  0 | task 0 | prompt done, n_past = 67, n_tokens = 67
0.11.084.866 D srv  update_slots: decoding batch, n_tokens = 67
0.11.177.557 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token: 'D'
0.11.177.561 D srv  update_slots: run slots completed
...
prompt eval time =      92.89 ms /    67 tokens (    1.39 ms per token,   721.30 tokens per second)
       eval time =   11212.95 ms /   746 tokens (   15.03 ms per token,    66.53 tokens per second)
      total time =   11305.84 ms /   813 tokens

Don't know if this will be helpful at all, but when the server runs llama-server the code is below.

struct Process {
    pid_t pid;
    uint32_t cpid; // client pid
    std::string pathtocmd; // path to the cmd
    std::string pathtocd; // path to change directory to before running cmd
    std::string cmd;
    std::string name;
    std::vector<std::string> env;
    std::vector<std::string> args;

    int pipefd[2];        // For reading output from child
    int write_pipefd[2];  // For writing to the child

    std::string err;
};

pid_t runprocess(Process *prc) {
    if (pipe(prc->pipefd) == -1 || pipe(prc->write_pipefd) == -1) {
        perror("Pipe creation failed");
        return -1;
    }

    pid_t cpid = fork();
    if (cpid < 0) { // Error occurred
        std::cerr << "Error starting fork\n";
        close(prc->pipefd[0]);
        close(prc->pipefd[1]);
        close(prc->write_pipefd[0]);
        close(prc->write_pipefd[1]);
        return -1;
    } else if (cpid == 0) { // Child process
        std::string fcmd = prc->pathtocmd + prc->cmd;

        // Close unused ends of pipes
        close(prc->pipefd[0]);       // Read end of pipe for child's output
        close(prc->write_pipefd[1]); // Write end of write pipe

        // Redirect stdout and stderr to read end of pipe
        dup2(prc->pipefd[1], STDOUT_FILENO);
        dup2(prc->pipefd[1], STDERR_FILENO);

        // Close the write end after duplication
        close(prc->pipefd[1]);

        // Set up stdin from write_pipefd if needed
        dup2(prc->write_pipefd[0], STDIN_FILENO);
        close(prc->write_pipefd[0]); // No longer need this

        const char **envv = new const char* [prc->env.size() + 1];
        const char **argv = new const char* [prc->args.size() + 2];

        argv[0] = prc->cmd.c_str();
        envv[(int)prc->env.size()] = NULL;
        if (!prc->args.empty()) {
            for (size_t j = 0; j < prc->args.size(); ++j) {
                argv[j + 1] = prc->args[j].c_str();
            }
        }
        argv[prc->args.size() + 1] = NULL;

        int r = chdir(prc->pathtocd.c_str());
        if (r != 0) {
            std::cerr << "Error: Change Directory Failed" << r << "\n";
            exit(-1);
        }

        execve(fcmd.c_str(), const_cast<char**>(argv), const_cast<char**>(envv));

        // If exec fails
        perror("Exec failed");
        exit(EXIT_FAILURE);

    } else { // Parent process
        int retval = fcntl(prc->pipefd[0], F_SETFL, O_NONBLOCK);
        if (retval == -1) {
            std::cerr << "Error setting pipe to non-blocking\n";
            close(prc->pipefd[0]);
            close(prc->write_pipefd[1]); // Close write end of write pipe
            return -1;
        }

        // Close unused ends in parent
        close(prc->pipefd[1]);
        close(prc->write_pipefd[0]);

        prc->pid = cpid;

        return cpid;
    }
}

Any help or ideas would be appreciated.

edit to add:
I found this, seems like it is kinda related #9492 I missed it when searched before.

Answered by indie-ai

Apr 13, 2025

I recently got back into my ai project and figured I would build the latestest llamacpp. After rebuilding this delay problem was gone. Just wanted to report that the problem was fixed. Just as an FYI I went back and found what commit fixed the problem:

Jan 4 b56f079 TIME=14801
Feb 1 ecef206 TIME=14939
Feb 13 e437627 TIME=15224
Feb 14 38e32eb TIME=15231
Feb 14 dbc2ec5 TIME=15180
Feb 14 94b87f8 TIME=767
Feb 15 fc1b0d0 TIME=1097
Feb 16 c2ea16f TIME=1088
Feb 17 2eea03d TIME=1081
Feb 20 0d55958 TIME=659
Mar 1 80c41dd TIME=1081
Latest TIME=1071

Seems Feb 14 commit 94b87f8 the times are back down to around a second.

So, I guess this can be marked as answered/solved?

View full answer

ggerganov · 2024-12-31T11:24:56Z

ggerganov
Dec 31, 2024
Maintainer

Indeed #9492 seems related. Back then, I was able to reproduce the issue but didn't find what is causing the slowdown. If you could pinpoint the exact commit at which this starts happening, it would be very helpful.

1 reply

ggerganov Dec 31, 2024
Maintainer

Would this have any thing to do with the prompt caching? I've tried turning it off with --no-kv-offload but that did not help.

Btw, --no-kv-offload is only useful if you cannot fit the model in VRAM - it will keep the KV cache buffers in CPU RAM. This makes overall generation slower for cases where you have enough VRAM.

To disable prompt caching, add "cache_prompt": false to your requests, although I doubt it has anything to do with this problem.

indie-ai · 2024-12-31T20:30:29Z

indie-ai
Dec 31, 2024
Author

No problem, will work on that tomorrow.

0 replies

indie-ai · 2025-01-02T01:02:51Z

indie-ai
Jan 2, 2025
Author

Well, I worked on it a little bit today, figured out a few things. When I compiled with make it would work every time with no delays at least up until make was depreciated sometime in nov. So then I switched to cmake, and the delays started. Which got me thinking. I went back to the version that I was originally using and compiled with cmake and that commit was now delaying. I don't know much about make and cmake to be of any help with that area, but I can try any flags you want me too. The cmake compiles started to increase delay times in June. I'm still going to work on finding the actual commit where a decent jump in time is, but for now, I will just throw up what I have so far and maybe it will help. All of the tests used the same model, Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. I'm running a 4090 24gb, Ryzen 7 5800X, 128gb ram, Linux Mint 21.3, Linux 5.15.0-130-generic.

I'm not that familiar with github so I looked up how to build an older commit and I found this, hope its right.

git clone https://github.com/ggerganov/llama.cpp
git reset --hard <commit #>

Using make:
make LLAMA_CUDA=1 LLAMA_CUDA_DMMV_X=32 LLAMA_CUDA_MMV_Y=1 LLAMA_CUDA_FA_ALL_QUANTS=1
--------------------------------------------------------------------------------------------------------
Sep 17, 2024: **NO delay**
version: 3773 (37f3a381)
0.02.122.523 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.02.122.523 D srv  update_slots: decoding batch, n_tokens = 37
0.02.155.150 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token: 'Hello'

Sep 23, 2024: **NO delay**
0.00.000.607 I build: 3809 (f3979df7) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
0.02.142.658 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.02.142.659 D srv  update_slots: decoding batch, n_tokens = 37
0.02.176.467 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token: 'Hello'

Sep 29, 2024: **NO delay**
0.00.000.455 I build: 3844 (0de8b203) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
0.02.172.025 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.02.172.026 D srv  update_slots: decoding batch, n_tokens = 37
0.02.206.576 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token: 'Hello'

Oct 4, 2024: **NO delay**
0.00.000.375 I build: 3879 (133c7b46) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
0.02.072.088 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.02.072.088 D srv  update_slots: decoding batch, n_tokens = 37
0.02.106.705 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token: 'Hello'

Oct 13, 2024: **NO delay**
0.00.000.380 I build: 3914 (c7181bd2) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
0.02.056.048 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.02.056.049 D srv  update_slots: decoding batch, n_tokens = 37
0.02.090.362 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token:  9906 'Hello'

Oct 28, 2024: **NO delay**
0.00.000.444 I build: 3984 (8125e6cb) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
0.02.084.777 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.02.084.777 D srv  update_slots: decoding batch, n_tokens = 37
0.02.119.323 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token:  9906 'Hello'

Nov 3, 2024: **NO delay**
0.00.000.379 I build: 4019 (08828a6d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
0.02.088.791 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.02.088.792 D srv  update_slots: decoding batch, n_tokens = 37
0.02.125.602 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token:  9906 'Hello'

Make depreciated.

Using cmake:
cmake -B build -DGGML_CUDA=ON (unless otherwise stated)
cmake --build build --config Release
-------------------------------------------------------------------------------------------------------
Apr 30, 2024: **NO delay** **TIME=183ms** (timer starts before sending request to socket and ends with a "stop:true" in the json response)
cmake -B build -DLLAMA_CUDA=ON
using ./server
Commit: a8f9b076316e16aadd0791015b3bfd446fe1e904 (--version not implemented)

May 31, 2024: **NO delay** **TIME=174ms**
cmake -B build -DLLAMA_CUDA=ON
using ./server
Commit: a323ec60af14a33d560df98f2cc41b4112cb4f80 (--version not implemented)

Jun 15, 2024: **slight delay** **TIME=883ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3153 (0c7b3595)

Jun 16, 2024: **slight delay** **TIME=870ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3164 (df68d4fa)

Jun 17, 2024: **slight delay** **TIME=872ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3173 (a94e6ff8)

Jun 22, 2024: **TIME=2813ms** 
cmake -B build -DLLAMA_CUDA=ON
version: 3203 (b5a5f34e)

Jun 24, 2024: **TIME=2757ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3214 (9a590c82)

Jun 30, 2024: **delay**
version: 3268 (d0a7145b)

Jul 31, 2024: **delay**
version: 3499 (c8a00909)

Aug 27, 2024: **delay**
version: 3639 (20f1789d)

Sep 1, 2024: **delay**
version: 3651 (8f1d81a0)

Sep 13, 2024: **delay**
version: 3751 (feff4aa8)

Sep 16, 2024: **delay**
0.00.000.404 I build: 3772 (23e0d70b) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
0.06.233.887 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.06.233.887 D srv  update_slots: decoding batch, n_tokens = 37
0.11.272.119 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token: 'Hello'

Nov 15, 2024: **delay**
0.00.000.384 I build: 4089 (cbf5541a) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
0.06.382.417 I slot update_slots: id  0 | task 0 | prompt done, n_past = 37, n_tokens = 37
0.06.382.417 D srv  update_slots: decoding batch, n_tokens = 37
0.11.450.186 D slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 999, next token:  9906 'Hello'

latest commit delay time TIME=5457ms

I didn't add the timer until I had seen a gradual increase in delay times and had no way to time it. So I didn't get the times for the other results, but they were slow enough to know.

I will work more on this in my spare time. Thanks for helping me with this.

0 replies

indie-ai · 2025-01-02T03:52:50Z

indie-ai
Jan 2, 2025
Author

Ok, I found one jump in delay on June 20'th.

Jun 15, 2024: **TIME=883ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3153 (0c7b3595)

Jun 16, 2024: **TIME=870ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3164 (df68d4fa)

Jun 17, 2024: **TIME=872ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3173 (a94e6ff8)

Jun 18, 2024: **TIME=872ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3179 (84f6de17)

Jun 19, 2024: **TIME=869ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3186 (ba589931)

Jun 20, 2024: **TIME=883ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3187 (2075a66a)

Jun 20, 2024: **TIME=2766ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3188 (d50f8897)

Jun 20, 2024: **TIME=2780ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3189 (de391e4c)

Jun 20, 2024: **TIME=2773ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3190 (abd894ad)

Jun 20, 2024: **TIME=2780ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3191 (17b291a6)

Jun 20, 2024: **TIME=2771ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3192 (b1ef562b)

Jun 22, 2024: **TIME=2813ms** 
cmake -B build -DLLAMA_CUDA=ON
version: 3203 (b5a5f34e)

Jun 24, 2024: **TIME=2757ms**
cmake -B build -DLLAMA_CUDA=ON
version: 3214 (9a590c82)

0 replies

indie-ai · 2025-01-02T18:54:13Z

indie-ai
Jan 2, 2025
Author

The first jump happens on Jun 5.

Apr 30, 2024: **TIME=183ms**  Commit: a8f9b076316e16aadd0791015b3bfd446fe1e904
May 31, 2024: **TIME=174ms** Commit: a323ec60af14a33d560df98f2cc41b4112cb4f80
Jun 4, 2024: **TIME=164ms** version: 3089 (c90dbe02)
Jun 5, 2024: **TIME=163ms** version: 3091 (2b338967)
Jun 5, 2024: **TIME=850ms** version: 3092 (7d1a378b)
Jun 5, 2024: **TIME=845ms** version: 3093 (7672adee)
Jun 6, 2024: **TIME=852ms** version: 3101 (ee459f40)
Jun 8, 2024: **TIME=849ms* version: 3113 (5795b941)
Jun 15, 2024: **TIME=883ms** version: 3153 (0c7b3595)
Jun 16, 2024: **TIME=870ms** version: 3164 (df68d4fa)
Jun 17, 2024: **TIME=872ms** version: 3173 (a94e6ff8)
Jun 18, 2024: **TIME=872ms** version: 3179 (84f6de17)
Jun 19, 2024: **TIME=869ms** version: 3186 (ba589931)
Jun 20, 2024: **TIME=883ms** version: 3187 (2075a66a)
Jun 20, 2024: **TIME=2766ms** version: 3188 (d50f8897)
Jun 20, 2024: **TIME=2780ms** version: 3189 (de391e4c)
Jun 20, 2024: **TIME=2773ms** version: 3190 (abd894ad)

Both jumps reference MMQ. There is still at least one more time it jumps up because the latest commits time is around 5.5 seconds. The prompt I used is just "hello". Later commits compile times are long, so it will take a while to find them. I went through and looked for references of MMQ in later commits but there is many.

11 replies

indie-ai Jan 3, 2025
Author

The way my front end works is like an advanced rag (I wouldn't even call it rag). Running through a node tree, using different models. Depending on what the tree is doing, running a lua script, running inference, or other types of nodes. It might switch between a coding model, chat model, large models, small models, image models, etc. When there is a 5 second wait time, the node tree will take a while to finish. My current work around is using an older version from November compiled with make. I was thinking it would just be a simple flag seeing as how make works and cmake delays. Weird.

slaren Jan 3, 2025
Maintainer

If the delay can be reproduced reliably, it should be possible to see what's the cause by attaching a debugger while it is running and looking at the call stack. However, considering the symptoms, I think it is quite clear that this is due to the CUDA runtime initialization. We have too many kernels for the runtime to preload all of them, so it must do it on demand. Maybe the addition of more kernels caused an internal threshold to be hit that switches to this behavior. Maybe there is some optimization step that could be avoided by compiling for the real architecture of the hardware. This should already be the case if building with GGML_NATIVE on the same hardware and on a recent version of nvcc. In any case, I don't think there is much we can do about it.

indie-ai Jan 3, 2025
Author

I could just find a very small model and load that when I need to "unload" the model and leave llama-server running. I haven't tried, I'm assuming model load works through the api?

indie-ai Apr 13, 2025
Author

I recently got back into my ai project and figured I would build the latestest llamacpp. After rebuilding this delay problem was gone. Just wanted to report that the problem was fixed. Just as an FYI I went back and found what commit fixed the problem:

Jan 4 b56f079 TIME=14801
Feb 1 ecef206 TIME=14939
Feb 13 e437627 TIME=15224
Feb 14 38e32eb TIME=15231
Feb 14 dbc2ec5 TIME=15180
Feb 14 94b87f8 TIME=767
Feb 15 fc1b0d0 TIME=1097
Feb 16 c2ea16f TIME=1088
Feb 17 2eea03d TIME=1081
Feb 20 0d55958 TIME=659
Mar 1 80c41dd TIME=1081
Latest TIME=1071

Seems Feb 14 commit 94b87f8 the times are back down to around a second.

So, I guess this can be marked as answered/solved?

Answer selected by ggerganov

ggerganov Apr 14, 2025
Maintainer

Thank you for the follow up and the detailed report.

Help with delay in llama-server #11005

Uh oh!

Uh oh!

indie-ai Dec 28, 2024

Replies: 5 comments · 12 replies

Uh oh!

ggerganov Dec 31, 2024 Maintainer

Uh oh!

ggerganov Dec 31, 2024 Maintainer

Uh oh!

indie-ai Dec 31, 2024 Author

Uh oh!

indie-ai Jan 2, 2025 Author

Uh oh!

indie-ai Jan 2, 2025 Author

Uh oh!

indie-ai Jan 2, 2025 Author

Uh oh!

indie-ai Jan 3, 2025 Author

Uh oh!

slaren Jan 3, 2025 Maintainer

Uh oh!

indie-ai Jan 3, 2025 Author

Uh oh!

indie-ai Apr 13, 2025 Author

Uh oh!

ggerganov Apr 14, 2025 Maintainer

indie-ai
Dec 28, 2024

Replies: 5 comments 12 replies

ggerganov
Dec 31, 2024
Maintainer

ggerganov Dec 31, 2024
Maintainer

indie-ai
Dec 31, 2024
Author

indie-ai
Jan 2, 2025
Author

indie-ai
Jan 2, 2025
Author

indie-ai
Jan 2, 2025
Author

indie-ai Jan 3, 2025
Author

slaren Jan 3, 2025
Maintainer

indie-ai Jan 3, 2025
Author

indie-ai Apr 13, 2025
Author

ggerganov Apr 14, 2025
Maintainer