A very simple server to run guidance programs over http.
Supports health checking and reflection.

## Goals

Run guidance programs over http in a reliable and performant way.
- Run simple programs consisting of `gen` + prompt text
- Streaming
- Logging (no idea why this is not working.)
- Error handling
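The "simple programs" in scope look like guidance's handlebars-style templates: literal prompt text plus a `gen` call. A purely illustrative sketch (this string is not taken from this repository):

```python
# A hypothetical "simple program" in guidance's handlebars-style template
# syntax: literal prompt text plus a single `gen` call. Purely illustrative;
# it only shows the shape of programs the server targets.
program_source = "The best thing about the beach is {{gen 'best' max_tokens=7}}"
```

The server's job is to execute templates like this against the configured model and stream the generated text back.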
## Non-Goals

- Guidance programs with `async` steps
- Support for non-hugging-face models (including openai)
- Support for windows (use wsl/docker/podman)
- CPU support (fixes going this direction are fine, but it should not add complexity)
## Acceptable Contributions

- Improving my awful python
- Improving the Dockerfile
- Adding docker examples
- Bug fixes
- Documentation
- Tests
- Performance improvements (startup speed on larger models is a big one)
- Increasing the number of guidance programs that can be run
## Usage

```shell
podman run -e MODEL_NAME=gpt2 -p 50051:50051 --init --device=nvidia.com/gpu=all ghcr.io/utilityai/guidance-rpc:latest
```
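Because the server supports health checking and reflection, a tool like grpcurl can be pointed at the running container. These commands assume the standard gRPC reflection and health services and a plaintext listener on 50051 (they require the server above to be up):

```shell
# List the services exposed via server reflection.
grpcurl -plaintext localhost:50051 list

# Query the standard gRPC health checking service.
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check
```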
## Development

Requires poetry to be installed.

```shell
poetry install
poetry run python src/main.py
```
This should work almost 1-1 with docker:

- the `device` flag in `run` may be different
- the suffix `,z` on the `--mount` will not be required
```shell
podman run \
  -p 50051:50051 \
  -e MODEL_NAME=meta-llama/Llama-2-7b-hf \
  -e HF_TOKEN=hf_aaaaaaaaaaaaaaaaaaaaaaaaaa \
  --mount type=bind,src=$XDG_CONFIG_HOME/.cache/huggingface,dst=/root/.cache/huggingface,z \
  --init \
  --device=nvidia.com/gpu=all \
  ghcr.io/utilityai/guidance-rpc:latest
```
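As a docker example, the same run can be expressed as a compose file. This is a sketch derived only from the flags above; the service name and file layout are assumptions, and the `,z` mount suffix is dropped per the docker notes:

```yaml
# Hypothetical docker-compose.yml equivalent of the podman command above.
services:
  guidance-rpc:
    image: ghcr.io/utilityai/guidance-rpc:latest
    init: true
    ports:
      - "50051:50051"
    environment:
      MODEL_NAME: meta-llama/Llama-2-7b-hf
      HF_TOKEN: hf_aaaaaaaaaaaaaaaaaaaaaaaaaa
    volumes:
      # No ,z suffix needed under docker (see the notes above).
      - $XDG_CONFIG_HOME/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```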
### Building locally

Build the image:

```shell
podman build -t guidance-rpc .
```

Then run it:

```shell
podman run \
  -p 50051:50051 \
  -e MODEL_NAME=TheBloke/Llama-2-7b-Chat-GPTQ \
  -e CACHE=False \
  --mount type=bind,src=$HOME/.cache/huggingface,dst=/root/.cache/huggingface,z \
  --init \
  --device=nvidia.com/gpu=all \
  guidance-rpc
```
## Contributing

See Acceptable Contributions and Non-Goals above.
Generate grpc files with

```shell
python -m grpc_tools.protoc -I protos --python_out=src --pyi_out=src --grpc_python_out=src protos/guidance.proto
```
If you update dependencies, run

```shell
poetry update
```

and then

```shell
poetry export -f requirements.txt --output requirements.txt
```