-
I'm developing a front-end for Llama.cpp (and others) that lets me switch models dynamically. I've built a custom background process server that uses execve() to run any program on demand; currently the server runs on the local machine. When my front-end needs model X, it instructs the backend server to launch llama-server with that model's arguments. If another model Y is needed later, it signals the server to terminate the current llama-server instance (model X) and load model Y instead.

This setup was working well until I recently updated Llama.cpp by redownloading and recompiling the latest version (4393 (d79d8f3)). Now the first POST request to /completion takes up to 15 seconds before inference starts, as indicated by monitoring GPU utilization. Subsequent requests are much faster. The delay only occurs on the initial POST after starting llama-server via execve(). Here's a sample log for the first POST:
And for the second POST:
When I run the same command directly from the command line, there's no such delay.

Command used: /media/TData/AI/llama.cpp/build/bin/./llama-server -v --log-file /media/ram/log2.txt --no-kv-offload --log-prefix --log-timestamps -t 2 -ngl 70 -c 8196 -m /media/TData/AI/llm/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --host 192.168.55.9

Sample log from the command line:
I'm not sure what's causing this discrepancy when running via execve(). I suspect it might be related to environment variables or some change in the latest version of the server, or more than likely something stupid I'm doing. Any ideas on how to resolve this? I switch models often and that delay is killing me. Could this have anything to do with prompt caching? I tried turning it off with --no-kv-offload, but that did not help. For comparison, here's the old version (3772 (23e0d70)) running through execve() on the first POST:
I don't know if this will be helpful at all, but the code the server uses to run llama-server is below.

    #include <fcntl.h>    // fcntl, O_NONBLOCK
    #include <unistd.h>   // pipe, fork, dup2, chdir, execve, close
    #include <cstdio>     // perror
    #include <cstdlib>    // exit, EXIT_FAILURE
    #include <cstdint>    // uint32_t
    #include <iostream>
    #include <string>
    #include <vector>

    struct Process {
        pid_t pid;
        uint32_t cpid;            // client pid
        std::string pathtocmd;    // path to the cmd
        std::string pathtocd;     // path to change directory to before running cmd
        std::string cmd;
        std::string name;
        std::vector<std::string> env;
        std::vector<std::string> args;
        int pipefd[2];            // for reading output from the child
        int write_pipefd[2];      // for writing to the child
        std::string err;
    };
    pid_t runprocess(Process *prc) {
        if (pipe(prc->pipefd) == -1 || pipe(prc->write_pipefd) == -1) {
            perror("Pipe creation failed");
            return -1;
        }
        pid_t cpid = fork();
        if (cpid < 0) { // fork failed
            std::cerr << "Error starting fork\n";
            close(prc->pipefd[0]);
            close(prc->pipefd[1]);
            close(prc->write_pipefd[0]);
            close(prc->write_pipefd[1]);
            return -1;
        } else if (cpid == 0) { // child process
            std::string fcmd = prc->pathtocmd + prc->cmd;
            // Close the parent's ends of the pipes
            close(prc->pipefd[0]);       // read end of the output pipe
            close(prc->write_pipefd[1]); // write end of the stdin pipe
            // Redirect stdout and stderr into the write end of the output pipe
            dup2(prc->pipefd[1], STDOUT_FILENO);
            dup2(prc->pipefd[1], STDERR_FILENO);
            close(prc->pipefd[1]); // close the original fd after duplication
            // Feed the child's stdin from the read end of the stdin pipe
            dup2(prc->write_pipefd[0], STDIN_FILENO);
            close(prc->write_pipefd[0]); // no longer needed after dup2
            // Build the argv and envp arrays for execve
            const char **envv = new const char *[prc->env.size() + 1];
            const char **argv = new const char *[prc->args.size() + 2];
            argv[0] = prc->cmd.c_str();
            for (size_t j = 0; j < prc->args.size(); ++j) {
                argv[j + 1] = prc->args[j].c_str();
            }
            argv[prc->args.size() + 1] = NULL;
            // Bug fix: the env entries were never copied into envv, so execve
            // received uninitialized pointers (or an empty environment)
            for (size_t j = 0; j < prc->env.size(); ++j) {
                envv[j] = prc->env[j].c_str();
            }
            envv[prc->env.size()] = NULL;
            int r = chdir(prc->pathtocd.c_str());
            if (r != 0) {
                std::cerr << "Error: Change Directory Failed " << r << "\n";
                exit(-1);
            }
            execve(fcmd.c_str(), const_cast<char **>(argv), const_cast<char **>(envv));
            // Only reached if exec fails
            perror("Exec failed");
            exit(EXIT_FAILURE);
        } else { // parent process
            // Make the child's output pipe non-blocking for polled reads
            if (fcntl(prc->pipefd[0], F_SETFL, O_NONBLOCK) == -1) {
                std::cerr << "Error setting pipe to non-blocking\n";
                close(prc->pipefd[0]);
                close(prc->pipefd[1]);
                close(prc->write_pipefd[0]);
                close(prc->write_pipefd[1]);
                return -1;
            }
            // Close the child's ends in the parent
            close(prc->pipefd[1]);
            close(prc->write_pipefd[0]);
            prc->pid = cpid;
            return cpid;
        }
    }
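For reference, here's a minimal sketch of how a Process might be populated and launched with runprocess. The paths and arguments mirror the command line quoted above; the env entries are placeholders, not my actual configuration.

    // Usage sketch: values mirror the command line above; env entries are placeholders.
    Process prc{};
    prc.pathtocmd = "/media/TData/AI/llama.cpp/build/bin/";
    prc.pathtocd  = "/media/TData/AI/llama.cpp/build/bin/";
    prc.cmd       = "llama-server";
    prc.args = { "-t", "2", "-ngl", "70", "-c", "8196",
                 "-m", "/media/TData/AI/llm/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
                 "--host", "192.168.55.9" };
    prc.env  = { "HOME=/home/me", "PATH=/usr/local/bin:/usr/bin:/bin" }; // placeholders

    if (runprocess(&prc) == -1) {
        std::cerr << "Failed to launch " << prc.cmd << "\n";
    }

Since environment variables are one of my suspects, a quick way to rule them out would be to make the child inherit exactly what an interactive shell would see, e.g. with a small helper like this (hypothetical, not part of my server):

    extern char **environ;

    // Hypothetical helper: copy the parent's environment into prc->env so the
    // execve()'d child sees the same variables as a shell-launched process.
    void inherit_environment(Process *prc) {
        prc->env.clear();
        for (char **e = environ; *e != nullptr; ++e) {
            prc->env.push_back(*e);
        }
    }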
Any help or ideas would be appreciated.

edit to add:
-
Indeed #9492 seems related. Back then I was able to reproduce the issue, but didn't find what was causing the slowdown. If you could pinpoint the exact commit at which this starts happening, it would be very helpful.
-
No problem, will work on that tomorrow.
-
Well, I worked on it a little bit today and figured out a few things. When I compiled with make, it worked every time with no delays, at least up until make was deprecated sometime in November. So I switched to cmake, and the delays started. Which got me thinking: I went back to the version I was originally using, compiled it with cmake, and that commit was now delaying too. I don't know enough about make and cmake to be of much help in that area, but I can try any flags you want me to. The cmake builds started showing increased delay times in June. I'm still going to work on finding the actual commit where a decent jump in time happens, but for now I'll just post what I have so far and maybe it will help.

All of the tests used the same model, Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. I'm running a 4090 24GB, Ryzen 7 5800X, 128GB RAM, Linux Mint 21.3, Linux 5.15.0-130-generic. I'm not that familiar with GitHub, so I looked up how to build an older commit and found this; hope it's right.
Make deprecated.
I didn't add the timer until I had already seen a gradual increase in delay times and had no way to time it, so I didn't get times for the earlier results, but they were slow enough to tell. I will work more on this in my spare time. Thanks for helping me with this.
-
OK, I found one jump in delay on June 20th.
-
The first jump happens on June 5.

Both jumps reference MMQ. There is still at least one more jump, because the latest commit's time is around 5.5 seconds. The prompt I used is just "hello". Later commits take a long time to compile, so it will take a while to find them. I went through and looked for references to MMQ in later commits, but there are many.
I recently got back into my AI project and figured I would build the latest llama.cpp. After rebuilding, this delay problem was gone. Just wanted to report that the problem is fixed. As an FYI, I went back and found the commit that fixed it:
    Date    Commit   TIME (ms)
    Jan 4   b56f079  14801
    Feb 1   ecef206  14939
    Feb 13  e437627  15224
    Feb 14  38e32eb  15231
    Feb 14  dbc2ec5  15180
    Feb 14  94b87f8    767
    Feb 15  fc1b0d0   1097
    Feb 16  c2ea16f   1088
    Feb 17  2eea03d   1081
    Feb 20  0d55958    659
    Mar 1   80c41dd   1081
    Latest            1071
It looks like, as of the Feb 14 commit 94b87f8, the times are back down to around a second.
So, I guess this can be marked as answered/solved?