Repository for Local LLM API Server solution.
This project aims to facilitate the local deployment of LLM model APIs for various purposes, with a primary focus on optimized CPU execution.
Supported OS: Linux
- Easy path: run the setup.sh script:
  source setup.sh
  or run it with sudo:
  sudo ./setup.sh
NOTE: Remember to activate the virtual environment before running the server if it's not already activated.
source .venv/bin/activate
- Manual installation:
- Set up a virtual environment:
sudo apt install python3.10-venv
python3 -m venv .venv
source .venv/bin/activate
- Install the build dependencies:
sudo apt-get install gcc
export CC=gcc
sudo apt-get install g++
export CXX=g++
sudo apt-get install make
- Install the requirements:
pip install -r requirements.txt
- Create a folder named "models" in the root path of the project if it does not exist:
  mkdir models
- Download a .gguf model (the latest format supported by LlamaCPP):
  - Recommended repository: https://huggingface.co/TheBloke Credit: Tom Jobbins (TheBloke)
  - The model has to be GGML for llama-cpp-python==0.1.65 and lesser, or GGUF for newer versions.
- Put the model in the models folder.
  NOTE: You can download a model directly into the folder through the console with:
  wget [download_model url]
- Configure a profile for a model:
  a. Go to the data/config.json file.
  b. A profile contains the configuration values of a particular LLM, allowing you to have different profiles set up for various purposes. For instance, a profile for a chatbot looks like:
  {
    "SelectedModel": "Vicuna-13b",
    "Vicuna-13b": {
      "model": "vicuna-13b-v1.5.Q5_K_M.gguf",
      "n_ctx": 4096,
      "temperature": 0.7,
      "max_tokens": 16000,
      "top_p": 0.8,
      "top_k": 40,
      "verbose": true,
      "repeat_penalty": 1.1,
      "template": "You are a chatbot. \n\n{chat_history} \n\nPROMPT: {input} \n\nBOT:"
    }
  }
Where:
- SelectedModel is the profile in use
- Vicuna-13b is the name of the profile
- model is the name of the model's .gguf file in the models folder
- n_ctx determines the maximum context length for the model.
- temperature controls the randomness of the model's output. Higher values make the output more random, while lower values make it more deterministic.
- max_tokens sets a limit on the number of tokens in the model's response.
- top_p is used for nucleus sampling, where the model only considers the most likely tokens that make up a certain portion of the cumulative probability distribution.
- top_k limits the number of the most likely tokens to consider during response generation. Higher values result in more diversity, but excessive values may lead to incoherent output.
- verbose indicates whether the model will print detailed or additional information alongside the responses.
- repeat_penalty discourages the model from repeating the same tokens in the response.
- template is where the system prompt (the LLM's mission and features) and the variables are located. You can modify this parameter to achieve a specific goal for your LLM.
NOTE: Learn more about quantized LLMs here: What are Quantized LLMs?
c. To add a new profile, put a new key-value segment after "SelectedModel" or after another profile:
  {
    "SelectedModel": "Vicuna-13b",
    "Vicuna-13b": {
      "model": "vicuna-13b-v1.5.Q5_K_M.gguf",
      "n_ctx": 4096,
      "temperature": 0.7,
      "max_tokens": 16000,
      "top_p": 0.8,
      "top_k": 40,
      "verbose": true,
      "repeat_penalty": 1.1,
      "template": "You are a chatbot. \n\n{chat_history} \n\nPROMPT: {input} \n\nBOT:"
    },
    "Other-profile": {
      "Your configuration here ..."
    }
  }
- Configure the Master Key:
  In order to make requests to the API, you need to configure a Master Key. This master key will let you make API calls and generate new API keys.
  - Go to data/api_keys.json and put a secure API key there. Another option we recommend is to use the generate_api_key() function within auth/authentication.py to generate a secure key and then place it in the api_keys.json file.
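If you prefer to generate the key from code, a quick way is to call the helper from a Python shell in the project root. This sketch assumes generate_api_key() takes no arguments and returns the key as a string; check auth/authentication.py for its actual signature.

# Run from the project root with the virtual environment activated
from auth.authentication import generate_api_key

key = generate_api_key()
print(key)  # copy this value into data/api_keys.json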
- Run the server:
python3 main.py
Sessions are used to store the history of interactions in a conversation locally for later resumption. Some considerations include:
- Sessions will store the number of interactions specified in the k parameter of the ConversationBufferWindowMemory in the memory variable in services/answer_services.py, in the post function of the Answer class. This parameter causes the last k interactions to be stored in memory.
session = load_session(session_id=session_id)
memory = ConversationBufferWindowMemory(
k=5, # Number of interactions to store in memory
memory_key="chat_history",
chat_memory=session
)
- If no session is specified in the body of the request, the session will be taken as None and will not be saved locally, so the session will only live for the k interactions specified above in memory.
- The session_id can take any numerical value. If it does not exist in data/sessions.json on the server side, a new session will be generated; if it does exist, that existing session will be continued and updated.
- Requests use the endpoint http://localhost:5000/ by default.
- See the examples in the examples folder.
- Get an answer from the LLM in the LLM Local API Server.
  - Request body: {"prompt": "YOUR-PROMPT-HERE", "session_id": id(int)}
  - Request header: {"Authorization": api_key}
  - Where:
    - prompt is the prompt to be sent to the LLM
    - session_id is an optional parameter to use an existing session or create a new one.
    - Authorization is the master key or an API key generated with the master key.
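As an illustration, a call to the answer endpoint from Python could look like the sketch below. The "/answer" route name is a placeholder (the real path is defined in main.py and used in the examples folder); the body and header follow the description above.

import requests

BASE_URL = "http://localhost:5000"  # default server address
API_KEY = "YOUR-API-KEY"            # master key or a key generated from it

# NOTE: "/answer" is a placeholder route; check main.py / the examples folder for the real one.
response = requests.post(
    f"{BASE_URL}/answer",
    json={"prompt": "YOUR-PROMPT-HERE", "session_id": 1},  # session_id is optional
    headers={"Authorization": API_KEY},
)
print(response.json())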
- Get a dict of the existing sessions from the LLM Local API Server.
  - Request header: {"Authorization": api_key}
- Get a dict of the available models in the models folder of the LLM Local API Server.
  - Request header: {"Authorization": api_key}
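These two listings only need the authorization header. A hedged sketch, assuming GET requests and placeholder route names ("/sessions", "/models"); check main.py for the actual methods and paths.

import requests

BASE_URL = "http://localhost:5000"
HEADERS = {"Authorization": "YOUR-API-KEY"}

# Placeholder routes for the sessions and models listings
sessions = requests.get(f"{BASE_URL}/sessions", headers=HEADERS).json()
models = requests.get(f"{BASE_URL}/models", headers=HEADERS).json()
print(sessions, models)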
- Set the model and profile to be used in the LLM Local API Server.
  - Request body: {"model": "gguf_model", "model_config": "profile_name"}
  - Request header: {"Authorization": api_key}
  - Where:
    - model is the name of the model .gguf file in the models folder
    - model_config is the name of the profile in config.json
    - Authorization is the master key or an API key generated with the master key.
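A sketch of switching models from Python, using the body and header described above. The "/set_model" route name and the POST method are assumptions (see main.py for the real ones), and the example values are taken from the sample profile earlier in this README.

import requests

BASE_URL = "http://localhost:5000"
HEADERS = {"Authorization": "YOUR-API-KEY"}

# Placeholder route; the body keys match the description above
payload = {"model": "vicuna-13b-v1.5.Q5_K_M.gguf", "model_config": "Vicuna-13b"}
response = requests.post(f"{BASE_URL}/set_model", json=payload, headers=HEADERS)
print(response.status_code, response.text)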
- Check the authorization (API key) of the LLM Local API Server.
  - Request header: {"Authorization": api_key}
- Generate a new API key for the LLM Local API Server. Requires the master key.
  - Request header: {"Authorization": master_api_key}