Feature Request: Transition llma-server to FastCGI Instead of Using a Web Server #9653

mesibo · 2024-09-26T10:16:34Z

mesibo
Sep 26, 2024

Currently, llma-server is implemented using httplib. While it functions well, we believe that transitioning to a FastCGI server would be a more effective solution than continuing with a web server implementation. This change would simplify the codebase, allowing llama.cpp to focus solely on the API implementation without needing to manage the UI (HTML code) or edge functions like TLS, access control, load balancing, security, etc. Additionally, it would enable users to configure their preferred web server and llma-server serves as the FastCGI server. That way llama.cpp does not need to worry about web security issues when hosted on the public endpoint.

This simplifies coding as well. For example, fixed mappings like the following will be removed, as llama.cpp would no longer need to handle frontend:

svr->Get("/",                           handle_static_file(index_html, index_html_len, "text/html; charset=utf-8"));
svr->Get("/index.js",                   handle_static_file(index_js, index_js_len, "text/javascript; charset=utf-8"));
svr->Get("/completion.js",              handle_static_file(completion_js, completion_js_len, "text/javascript; charset=utf-8"));

We have taken the same approach for the mesibo on-premise server, and it has proven successful.

Thank you for considering!

ngxson · 2024-09-26T11:56:44Z

ngxson
Sep 26, 2024
Collaborator

IMO I don't see any benefit of using FastCGI. This proposal sounds like a marketing to me.

The frontend code that you mention is there for convenient. Our goal is to have one single binary that contains everything, so the html code is built into the cpp code.

This simplifies coding as well. For example, fixed mappings like the following will be removed, as llama.cpp would no longer need to handle frontend:

In reality, frontend code is non-important part of llama-server. Have you ever look at the code of the other handlers, for example /completions or /chat/completions?

0 replies

mesibo · 2024-09-26T12:27:56Z

mesibo
Sep 26, 2024
Author

Using FastCGI is not about marketing at all. FastCGI is a widely used, open protocol that allows separation between the app (llama-server in this case) and the web server. You can read more about it here: https://en.wikipedia.org/wiki/FastCGI

The benefit of the FastCGI approach is that it keeps application logic clean and separates it from web-serving. Instead of bundling everything (HTML, logic, server handling) into a single binary, Moreover, httplib lacks the scalability and security required for deployment on public endpoints, so it's better to use reliable and secure web servers while letting llama-server focus on core API functions.

In reality, frontend code is a non-important part of llama-server.

That is exactly why maintaining non-essential code in llama-server unnecessarily complicates things. This isn’t about adding complexity but about future-proofing the design and keeping the core logic focused on API.

The frontend code that you mention is there for convenient.

Hardcoding filenames like "index.js" is certainly not a good design practice. It’s unclear how this is "convenient." A much clearer approach is to let users decide their front-end implementation while using FastCGI to interface with the API. Given the prominence of llama.cpp and its growing user base, adopting such a flexible approach is crucial to accommodate a wider audience (hard coding is definitely not convenient).

Have you ever looked at the code of other handlers, for example, /completions or /chat/completions?

Yes, we're very familiar with the code. We've even implemented loading and unloading multiple models simultaneously for RAG, so we understand the setup well.

4 replies

slaren Sep 26, 2024
Maintainer

I think there is always going to be a place in llama.cpp for a small, self-contained server that does not require additional software to use, since that is something that many llama.cpp users are interested in. I can see how FastCGI support could be of interest to larger users, especially those that already have their own infrastructure in place and may find that adding llama.cpp via FastCGI more convenient for them, while also providing the security and scalability features they need. However, this should not be done at the expense of making the llama.cpp server harder to use for smaller users. I think the approach to support both uses cases would have to be to abstract the server code into a library that can be used to implement both types of servers, and the result would likely be a significant increase to the overall complexity and size of the code.

mesibo Sep 26, 2024
Author

Many small users are already familiar with setting up FastCGI—it’s widely used with PHP, for example. So, incorporating FastCGI into llama.cpp won’t create a barrier for them.

I agree with you about having a small self-contained server in llama.cpp, and supporting both approaches wouldn’t necessarily increase the complexity or size of the codebase. FastCGI can be the core API server, and the self-contained server for local experiments would simply use the FastCGI API to send requests. This approach keeps the system modular without code duplication.

ngxson Sep 26, 2024
Collaborator

Many small users are already familiar with setting up FastCGI—it’s widely used with PHP

I disagree with this. In the world of LLM, most server implementations are based on python (vllm, fastchat), rust (tgi) or go (ollama), which simply expose a http server. So contrary to what you said, I believe that not many users are familiar with FastCGI or PHP.

I understand that FastCGI could help separating the code that handle inference and HTTP code. But to be honest, I don't think HTTP of llama.cpp is that much. httplib is just a small part of server.cpp, most complexity is for the task scheduling, so reducing http code even more doesn't really add any real benefits.

mesibo Sep 26, 2024
Author

I believe the bigger picture is being overlooked here. This isn't about PHP, Python, Rust, Go, or just reducing HTTP code; it’s about building a clean, scalable, and secure architecture for deploying the llama.cpp APIs. No scalable architecture should tightly couple an HTTP server with core functionality, regardless of past implementations.

Implementing llama-server as a FastCGI server would provide a strong foundation for public deployment. In contrast, relying on a limited, tightly coupled HTTP server significantly restricts options and exposes the system to vulnerabilities, such as DoS. This either confines it to local use or forces llama.cpp to implement additional functionalities for public deployment, which distracts from its core focus—neither scenario is ideal.

A secure, load-balanced, multi-homed cloud deployment can instead use an internal load-balanced FastCGI server, ensuring security and scalability. This approach keeps llama.cpp adaptable and future-proof for various use cases in the rapidly evolving LLM landscape.

It's time we do it now, or there will be a time when we have to do it in the future. Anyway, we rest our case here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature Request: Transition llma-server to FastCGI Instead of Using a Web Server #9653

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 4 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Feature Request: Transition llma-server to FastCGI Instead of Using a Web Server #9653

Uh oh!

Uh oh!

mesibo Sep 26, 2024

Replies: 2 comments · 4 replies

Uh oh!

ngxson Sep 26, 2024 Collaborator

Uh oh!

mesibo Sep 26, 2024 Author

Uh oh!

slaren Sep 26, 2024 Maintainer

Uh oh!

mesibo Sep 26, 2024 Author

Uh oh!

Uh oh!

ngxson Sep 26, 2024 Collaborator

Uh oh!

mesibo Sep 26, 2024 Author

mesibo
Sep 26, 2024

Replies: 2 comments 4 replies

ngxson
Sep 26, 2024
Collaborator

mesibo
Sep 26, 2024
Author

slaren Sep 26, 2024
Maintainer

mesibo Sep 26, 2024
Author

ngxson Sep 26, 2024
Collaborator

mesibo Sep 26, 2024
Author