Repository for Local LLM API Server solution.
This project aims to facilitate the local deployment of LLM model APIs for various purposes, with a primary focus on optimized CPU execution.
Supported OS: Linux
- Easy path: run the setup.sh script:
  source setup.sh
  or run it with sudo:
  sudo ./setup.sh
NOTE: Remember to activate the virtual environment before running the server if it's not already activated.
source .venv/bin/activate
- Manual installation:
- Set up a virtual environment:
sudo apt install python3.10-venv
python3 -m venv .venv
source .venv/bin/activate
- Install the build dependencies:
sudo apt-get install gcc
export CC=gcc
sudo apt-get install g++
export CXX=g++
sudo apt-get install make
- Install the requirements:
pip install -r requirements.txt
- Create a folder named "models" in the root path of the project if it does not exist:
  mkdir models
- Download a .gguf model (the latest format supported by LlamaCPP):
  - Recommended repository: https://huggingface.co/TheBloke Credit: Tom Jobbins (TheBloke)
  - The model has to be GGML for llama-cpp-python==0.1.65 and lesser, or GGUF for newer versions.
- Put the model in the models folder.
  NOTE: You can download a model directly into the folder through the console with:
  wget [download_model url]
- Configure a profile for a model:
  a. Go to the data/config.json file.
  b. A profile contains the configuration values of a particular LLM, allowing you to have different profiles set up for various purposes. For instance, a profile for a chatbot looks like:
  {
    "SelectedModel": "Vicuna-13b",
    "Vicuna-13b": {
      "model": "vicuna-13b-v1.5.Q5_K_M.gguf",
      "n_ctx": 4096,
      "temperature": 0.7,
      "max_tokens": 16000,
      "top_p": 0.8,
      "top_k": 40,
      "verbose": true,
      "repeat_penalty": 1.1,
      "template": "You are a chatbot. \n\n{chat_history} \n\nPROMPT: {input} \n\nBOT:"
    }
  }
Where:
- SelectedModel is the profile in use
- Vicuna-13b is the name of the profile
- model is the name of the model's .gguf file in the models folder
- n_ctx determines the maximum context length for the model.
- temperature controls the randomness of the model's output. Higher values make the output more random, while lower values make it more deterministic.
- max_tokens sets a limit on the number of tokens in the model's response.
- top_p is used for nucleus sampling, where the model only considers the most likely tokens that make up a certain portion of the cumulative probability distribution.
- top_k limits the number of the most likely tokens to consider during response generation. Higher values result in more diversity, but excessive values may lead to incoherent output.
- verbose indicates whether the model will print detailed or additional information alongside the responses.
- repeat_penalty discourages the model from repeating the same tokens in the response.
- template is where the system prompt (the LLM's mission and features) and the variables are located. You can modify this parameter to achieve a specific goal for your LLM.
NOTE: Learn more about quantized LLMs here: What are Quantized LLMs?
c. To add a new profile, put a new key-value segment after "SelectedModel" or after another profile:
  {
    "SelectedModel": "Vicuna-13b",
    "Vicuna-13b": {
      "model": "vicuna-13b-v1.5.Q5_K_M.gguf",
      "n_ctx": 4096,
      "temperature": 0.7,
      "max_tokens": 16000,
      "top_p": 0.8,
      "top_k": 40,
      "verbose": true,
      "repeat_penalty": 1.1,
      "template": "You are a chatbot. \n\n{chat_history} \n\nPROMPT: {input} \n\nBOT:"
    },
    "Other-profile": {
      "Your configuration here ..."
    }
  }
- Configure the Master Key:
  In order to make requests to the API, you need to configure a Master Key. This master key will let you make API calls and generate new API keys.
  - Go to data/api_keys.json and put a secure API key there. Another option we recommend is to use the generate_api_key() function within auth/authentication.py to generate a secure key and then place it in the api_keys.json file.
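If you prefer to generate the key from code, a quick way is to call the helper from a Python shell in the project root. This sketch assumes generate_api_key() takes no arguments and returns the key as a string; check auth/authentication.py for its actual signature.

# Run from the project root with the virtual environment activated
from auth.authentication import generate_api_key

key = generate_api_key()
print(key)  # copy this value into data/api_keys.json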
- Run the server:
python3 main.py
Sessions are used to store the history of interactions in a conversation locally for later resumption. Some considerations include:
- Sessions will store the number of interactions specified in the k parameter of the ConversationBufferWindowMemory in the memory variable in services/answer_services.py, in the post function of the Answer class. This parameter causes the last k interactions to be stored in memory.
session = load_session(session_id=session_id)
memory = ConversationBufferWindowMemory(
k=5, # Number of interactions to store in memory
memory_key="chat_history",
chat_memory=session
)
- If no session is specified in the body of the request, the session will be taken as None and will not be saved locally, so the session will only live for the k interactions specified above in memory.
- The session_id can take any numerical value. If it does not exist in data/sessions.json on the server side, a new session will be generated; if it does exist, that existing session will be continued and updated.
- Requests use the endpoint http://localhost:5000/ by default.
- See the examples in the examples folder.
- Get an answer from the LLM in the LLM Local API Server.
  - Request body: {"prompt": "YOUR-PROMPT-HERE", "session_id": id(int)}
  - Request header: {"Authorization": api_key}
  - Where:
    - prompt is the prompt to be sent to the LLM
    - session_id is an optional parameter to use an existing session or create a new one.
    - Authorization is the master key or an API key generated with the master key.
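As an illustration, a call to the answer endpoint from Python could look like the sketch below. The "/answer" route name is a placeholder (the real path is defined in main.py and used in the examples folder); the body and header follow the description above.

import requests

BASE_URL = "http://localhost:5000"  # default server address
API_KEY = "YOUR-API-KEY"            # master key or a key generated from it

# NOTE: "/answer" is a placeholder route; check main.py / the examples folder for the real one.
response = requests.post(
    f"{BASE_URL}/answer",
    json={"prompt": "YOUR-PROMPT-HERE", "session_id": 1},  # session_id is optional
    headers={"Authorization": API_KEY},
)
print(response.json())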
- Get a dict of the existing sessions from the LLM Local API Server.
  - Request header: {"Authorization": api_key}
- Get a dict of the available models in the models folder of the LLM Local API Server.
  - Request header: {"Authorization": api_key}
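These two listings only need the authorization header. A hedged sketch, assuming GET requests and placeholder route names ("/sessions", "/models"); check main.py for the actual methods and paths.

import requests

BASE_URL = "http://localhost:5000"
HEADERS = {"Authorization": "YOUR-API-KEY"}

# Placeholder routes for the sessions and models listings
sessions = requests.get(f"{BASE_URL}/sessions", headers=HEADERS).json()
models = requests.get(f"{BASE_URL}/models", headers=HEADERS).json()
print(sessions, models)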
- Set the model and profile to be used in the LLM Local API Server.
  - Request body: {"model": "gguf_model", "model_config": "profile_name"}
  - Request header: {"Authorization": api_key}
  - Where:
    - model is the name of the model .gguf file in the models folder
    - model_config is the name of the profile in config.json
    - Authorization is the master key or an API key generated with the master key.
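A sketch of switching models from Python, using the body and header described above. The "/set_model" route name and the POST method are assumptions (see main.py for the real ones), and the example values are taken from the sample profile earlier in this README.

import requests

BASE_URL = "http://localhost:5000"
HEADERS = {"Authorization": "YOUR-API-KEY"}

# Placeholder route; the body keys match the description above
payload = {"model": "vicuna-13b-v1.5.Q5_K_M.gguf", "model_config": "Vicuna-13b"}
response = requests.post(f"{BASE_URL}/set_model", json=payload, headers=HEADERS)
print(response.status_code, response.text)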
- Check the authorization (API key) of the LLM Local API Server.
  - Request header: {"Authorization": api_key}
- Generate a new API key for the LLM Local API Server. Requires the master key.
  - Request header: {"Authorization": master_api_key}