-
If the server is creating a thread for every request and reusing the same llama objects in all of them (…
-
I found and fixed this bug in my local fork. The problem is that the mutex handling in the chunked-response path is wrong.

My fix was to detach the mutex from the unique_lock using ::release and manually unlock it when the chunked content provider finishes. set_chunked_content_provider takes an optional callback that fires when the stream finishes for any reason, so I hooked into that. This fix isn't technically safe, since one should generally unlock a mutex from the thread that locked it. It works because httplib processes a request from start to finish on the same thread, but it could break horribly if that behavior changes upstream.

I'm seeing all manner of odd behavior in the llama.cpp server code, so I'm going to investigate further and get to the bottom of it before I submit a PR cleaning things up. I need a production-quality server implementation wrapping llama.cpp, and I'm on the fence about whether to refactor what's here or design a custom protocol so I can more easily communicate between llama.cpp instances and a host application in Golang.

But, I digress... the most troubling issue I'm seeing is that token-generation behavior seemingly changes based on network performance. I have a server deployed and, during testing, I connected from my mobile phone over the cell network. A significant percentage of the generated tokens were corrupted or lost for just that client. I'm not convinced it's a network issue; it smells like another race condition. I'll do a deep dive today and figure out how best to resolve it.
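A minimal sketch of the pattern described above, assuming a recent cpp-httplib where set_chunked_content_provider(content_type, provider, resource_releaser) accepts a releaser invoked once the stream ends. The mutex name and handler are illustrative, not llama.cpp's actual code:

```cpp
// Sketch: keep the mutex locked across the streamed response, unlock it
// in the stream-finished callback. `llama_mutex` is illustrative.
#include <mutex>
#include "httplib.h"

std::mutex llama_mutex;  // guards the shared llama state

void handle_completion(const httplib::Request &req, httplib::Response &res) {
    std::unique_lock<std::mutex> lock(llama_mutex);

    // ... set up generation state while holding the lock ...

    // Detach the mutex from the RAII guard so it stays locked after this
    // handler returns and streaming begins.
    lock.release();

    res.set_chunked_content_provider(
        "text/event-stream",
        [](size_t /*offset*/, httplib::DataSink &sink) {
            // ... generate the next token and write it to `sink` ...
            // return false (or call sink.done()) once generation ends
            return true;
        },
        // Resource releaser: runs when the stream finishes for any reason.
        // Unlocking here is technically unsafe (the unlocking thread must
        // own the mutex), but httplib currently runs the whole request on
        // one thread, so it holds in practice.
        [](bool /*success*/) { llama_mutex.unlock(); });
}
```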
-
My PR with the fixes: #2391
-
Hi!
I opened two browser tabs and started two concurrent requests, and ./server crashes 100% of the time. Here is the gdb backtrace from the generated core dump:
Is this happening only for me? I'm running the latest code.
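In case it helps others reproduce it, here is a minimal sketch of the two-tab scenario as a standalone client; the host, port, endpoint, and JSON payload are assumptions based on the default server setup, not the exact test:

```cpp
// Sketch: issue two overlapping completion requests, like two browser tabs.
#include <thread>
#include "httplib.h"

int main() {
    auto fire = [] {
        httplib::Client cli("localhost", 8080);
        cli.set_read_timeout(600, 0);  // generation can take a while
        cli.Post("/completion",
                 R"({"prompt": "Hello", "n_predict": 64})",
                 "application/json");
    };
    std::thread t1(fire), t2(fire);  // two concurrent requests
    t1.join();
    t2.join();
}
```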
Thanks!