
gollamas

A per-model "reverse proxy" which redirects requests to multiple ollama servers.


This is a reverse proxy for ollama. It accepts mainly chat and generation requests, reads each request, and forwards the payload to the server specifically assigned to run the model referred to in the request. Refer to API for the list of endpoints currently supported.

run binary

Binaries are automatically compiled and made available in the latest github release.

gollamas --level=warn \
    --listen 0.0.0.0:11434 \
    --proxy=tinyllama=http://server-01:11434 \
    --proxy=llama3.2-vision=http://server-01:11434 \
    --proxy=deepseek-r1:14b=http://server-02:11434
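
With this configuration, requests are routed by their model field: a generation request for tinyllama is forwarded to server-01, while deepseek-r1:14b goes to server-02. As a quick check (assuming the proxy runs locally on 11434):

curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Why is the sky blue?"
}'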

run on docker

Images are automatically built for amd64, arm, arm64, riscv64, s390x and ppc64le. Issues for other architectures are welcome.

Official images are automatically made available on docker hub and ghcr.io. You can run the latest image from either.

from docker hub

The main images are on docker hub.

docker run -it \
  -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" \
  slawoc/gollamas:latest
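
To reach the proxy from the host you will typically also want to publish the listen port; a sketch with the port published and the listen address set explicitly:

docker run -it \
  -p 11434:11434 \
  -e GOLLAMAS_LISTEN=0.0.0.0:11434 \
  -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" \
  slawoc/gollamas:latest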

github

Alternatively images are published to ghcr.io.

docker run -it \
  -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" \
  ghcr.io/slawo/gollamas:latest

run code locally

go run ./*.go --level=trace \
    --listen 0.0.0.0:11434 \
    --proxy=tinyllama=http://server-02:11434 \
    --proxy=llama3.2-vision=http://server-02:11434 \
    --proxy=deepseek-r1:14b=http://server-01:11434

kubernetes

Example of a Kubernetes deployment.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gollamas
  namespace: ai
spec:
  replicas: 3
  selector:
    matchLabels:
      name: gollamas
  template:
    metadata:
      labels:
        name: gollamas
    spec:
      containers:
        - name: gollamas
          image: slawoc/gollamas:latest
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
          env:
            - name: GOLLAMAS_LISTEN
              value: 0.0.0.0:11434
            - name: GOLLAMAS_PROXIES
              value: qwen2.5-coder:14b=http://ollama.ai.svc.cluster.local,gemma3:12b=http://f-01-ai.example.com:11434,llama3.2-vision=http://f-02-ai.example.com:11434
            - name: GOLLAMAS_ALIASES
              value: ""
            - name: GOLLAMAS_LIST_ALIASES
              value: "true"
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: gollamas
  namespace: ai
spec:
  type: LoadBalancer
  selector:
    name: gollamas
  ports:
    - port: 80
      name: http
      targetPort: http
      protocol: TCP
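
In-cluster clients can then treat the service as a regular ollama endpoint. For instance, using the ai namespace from the manifest above and the standard OLLAMA_HOST variable:

export OLLAMA_HOST=http://gollamas.ai.svc.cluster.local:80
ollama list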

Usage

parameters

The existing flags should remain fairly stable going forward. If flags are renamed, a best effort will be made to keep both the new and old names, as well as the existing behaviour, until the final release.

  • --listen (env: GOLLAMAS_LISTEN, LISTEN): address on which the router will listen, e.g. "localhost:11434"
  • --proxy: assigns a destination to a model; the destination can be a URL or a connection id. Ex: --proxy 'llama3.2-vision=http://server:11434' or --proxy 'llama3.2-vision=c1' --connection 'c1=http://server:11434'
  • --proxies (env: GOLLAMAS_PROXIES, PROXIES): assigns destinations for the models as a comma-separated list of model=destination pairs. Ex: --proxies 'llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434'
  • --connection: assigns an identifier to a connection which can be referred to by proxy declarations. Ex: --connection c1=http://server:11434 --proxy llama=c1
  • --connections (env: GOLLAMAS_CONNECTIONS, CONNECTIONS): provides a list of connections which can be referred to by id. Ex: --connections c1=http://server:11434,c2=http://server2:11434
  • --alias: assigns an alias to an existing model name passed in the proxy configuration ('alias=concrete_model'). Ex: --alias gpt-3.5-turbo=llama3.2
  • --aliases (env: GOLLAMAS_ALIASES, ALIASES): sets aliases for the given model names. Ex: --aliases 'gpt-3.5-turbo=llama3.2,deepseek=deepseek-r1:14b'
  • --list-aliases (env: GOLLAMAS_LIST_ALIASES, LIST_ALIASES): show aliases which match a model when listing models
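
The singular flags compose, so a connection, its proxied models and an alias can be declared in one invocation; a sketch, with placeholder hostnames:

gollamas --listen 0.0.0.0:11434 \
    --connection main=http://server-01:11434 \
    --proxy llama3.2=main \
    --proxy llama3.2-vision=main \
    --alias gpt-3.5-turbo=llama3.2 \
    --list-aliases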

plural flags

Prefer the singular flags --alias, --connection and --proxy over providing a comma-separated list to the plural flags --aliases, --connections and --proxies. Usage of the plural flags is discouraged: they were added as a temporary solution to permit passing the associated environment variables in docker containers. Those flags might be removed in future versions, while the environment variables will be retained.

Setting both singular and plural flags will not result in errors, but it will result in undefined behaviour which can change with future versions. Use only one type of flag, preferably the singular versions.

environment variables

For each option you can set either the flag or the environment variable; setting both results in undefined behaviour which can change with future versions.

Use the GOLLAMAS_ prefixed environment variables.
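
For example, the flag-based invocations above can be expressed purely through environment variables (a sketch, assuming gollamas is on your PATH):

export GOLLAMAS_LISTEN=0.0.0.0:11434
export GOLLAMAS_PROXIES=llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434
gollamas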

connections

You can assign ids to connections like so: --connection CID1=http://main-ai:11434 --connection CID2=http://mini-ai-01:11434, and refer to each connection by id when listing the models to be proxied: --proxy deepseek-r1:70b=CID1 --proxy tinyllama=CID2.

When a connection is given an id, the id will be used instead of the URL string in any responses or logs.

Since 0.4.1, when multiple models are proxied to the same URL only one connection is created for that URL. It is still possible to create two connections to the same URL using the --connection flag (--connection C1=http://server1 --connection C2=http://server1).

Features

There are various scenarios this project attempts to resolve. Here is a list of features currently implemented and being considered for implementation:

Usecases

  • Manage models
    • Map model aliases to existing model names (some tools only allow a pre-defined set of models)
    • Return only the configured models by default when listing models
    • Set a flag to also return aliases when listing models
    • Set an option to allow requests to currently running models (i.e. the server has an additional model running)
  • Allow access to models currently running on an instance #19
  • Allow multiple routes to a given model #20
  • Preload/keep models in memory #22
    • Preload models (ensure the model is loaded upon startup)
    • Ping models (keep the model loaded)
    • Add config to enforce model keep alive globally "keep_alive": -1 (if it is worth adding functionality for servers without OLLAMA_KEEP_ALIVE=-1)
    • Add config to override model keep alive per model/server "keep_alive": -1
  • Enable fixed context size for models #21
    • Add config to set a default context size (if missing) in each request "options": { "num_ctx": 4096 }
    • Add config to set a default context size (if missing) per model/server "options": { "num_ctx": 4096 }
    • Add config to enforce context size in each request "options": { "num_ctx": 4096 }
    • Add config to enforce context size per model/server "options": { "num_ctx": 4096 } (the fields involved are sketched in the request below)
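
Several of the planned configs above map onto standard fields of the ollama request body; a proxy-injected default would be equivalent to the client sending them itself, e.g.:

curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Hello",
  "keep_alive": -1,
  "options": { "num_ctx": 4096 }
}'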

API

Not all endpoints are covered; in particular, endpoints which deal with customisation and creation of models are not supported until there is a clear use case for this.

  • Supported endpoints

    • GET /
    • GET /api/tags
    • GET /api/ps
    • GET /api/version
    • GET /v1/models
    • GET /v1/models/:model
    • HEAD /
    • HEAD /api/tags
    • HEAD /api/version
    • POST /api/chat
    • POST /api/embed
    • POST /api/embeddings
    • POST /api/generate
    • POST /api/pull
    • POST /api/show
    • POST /v1/chat/completions
    • POST /v1/completions
    • POST /v1/embeddings
  • Not supported

    • DELETE /api/delete
    • HEAD /api/blobs/:digest
    • POST /api/blobs/:digest
    • POST /api/copy
    • POST /api/create
    • POST /api/push

Internals

The server relies on existing ollama models and middlewares to speed up the development of the initial implementation. Only requests which have a model (or the deprecated name) field are transferred to the right server.

When possible, other endpoints hit all configured servers and either select one answer (e.g. the lowest version available) or combine the responses into a single one (e.g. lists of models).
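
For instance, GET /api/version answers with a single version selected across the configured servers, and GET /api/tags returns the configured models as one merged list:

curl http://localhost:11434/api/version
curl http://localhost:11434/api/tags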
