diff --git a/docker/README.md b/docker/README.md
new file mode 100644
index 0000000000..b1d0248bb3
--- /dev/null
+++ b/docker/README.md
@@ -0,0 +1,61 @@
+## Containers for the mlc_llm REST API and instant command line interface (CLI) chat with LLMs
+
+A set of docker container templates for scaled production deployment of GPU-accelerated mlc_llm. They are based on the recent work on the SLM JIT flow and the OpenAI-compatible APIs, including function calling.
+
+These containers are designed to be:
+
+* minimalist - nothing non-essential is included; you can layer on your own security policy, for example
+* non-opinionated - use CNCF Kubernetes, docker compose, swarm, or whatever you have for orchestration
+* adaptive and composable - nobody knows what you intend to do with these containers, and we don't guess
+* compatible - with multi-GPU support maturing and batching still in testing, these containers should survive upcoming changes without needing to be severely revamped
+* practical NOW - usable and deployable TODAY on 2024/2025 workstation/consumer hardware with mlc-ai
+
+### Structure
+
+Base containers are segregated by GPU acceleration stack. See the README in each sub-folder for more information.
+```
+cuda
+|-- cuda122
+
+rocm
+|-- rocm57
+
+bin
+
+test
+```
+
+The `bin` folder contains the template scripts that start the containers.
+
+The `test` folder contains the tests.
+
+#### Community contribution
+
+This structure makes it easy for the greater community to contribute new tested templates, for example for other CUDA and ROCm releases.
+
+#### Greatly enhanced out-of-box UX
+
+Managing the huge physical size of LLM weights is a major hurdle when deploying modern LLMs in production or experimental environments at any scale. Couple this with the need to compile neural network support libraries for every combination and permutation of supported GPU hardware and OS - and an _impossibly frustrating_ out-of-box user experience is guaranteed.
+
+The latest improvements in the JIT and SLM flow for MLC_LLM specifically address this, and these docker container templates further enhance the out-of-box UX, down to a single easy-to-use command line (with automatic management of cached LLM weights).
+
+Users of such images can simply decide to run "Llama 2 7B on CUDA 12.2" and, with one single command, pull an image onto their workstation and have AI apps served by an already GPU-accelerated Llama 2. The weights are downloaded directly from Hugging Face and converted _specifically for their GPU hardware and OS_ the first time the command is executed; any subsequent invocation starts _instantly_ using the already converted weights.
+
+As an example, the command to start an interactive chat with this LLM on a CUDA 12.2 accelerated Linux system is:
+
+```
+sh ./startcuda122chat.sh Llama-2-7b-chat-hf-q4f32_1
+```
+
+One container template is supplied for REST API serving, and another for interactive command line chat, with any supported LLM.
+
+##### Compatibility with future improvements
+
+There is no loss of flexibility in using these containers: the REST API implementation already supports batching - the ability to handle multiple concurrent inference requests at the same time. Any future improvements in MLC_LLM can be picked up by simply rebuilding the base images against the latest nightlies, without reworking these templates.
+
+#### Tests
+
+Tests are kept global since they apply to mlc_llm running across any supported GPU configuration.
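+
+Once a `serve` container is up (see the `bin` folder), a quick manual check of the OpenAI-compatible endpoint looks like the sketch below - it assumes the default `0.0.0.0:8000` binding used by the `serve` scripts; substitute the model name you started the server with:
+
+```
+curl -s http://127.0.0.1:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "HF://mlc-ai/Llama-2-7b-chat-hf-q4f32_1-MLC",
+        "messages": [{"role": "user", "content": "write a haiku"}],
+        "stream": false
+      }'
+```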
diff --git a/docker/bin/README.md b/docker/bin/README.md
new file mode 100644
index 0000000000..00df9921dc
--- /dev/null
+++ b/docker/bin/README.md
@@ -0,0 +1,26 @@
+### container startup scripts
+
+> NOTE: Please make sure you are in the `bin` directory when starting these scripts, and that you have write permission to the `cache` directory. The scripts use that `cache` folder to cache all model weights and custom compiled libraries.
+
+The `<model name>` values that are supported (at any time) can be obtained from [MLC AI's Huggingface Repo](https://huggingface.co/mlc-ai). There are *88 supported models at the time of writing*, with hundreds more to come.
+
+![image](https://github.com/Sing-Li/dockertest/assets/122633/e1068b42-cfe1-4385-8c71-0791d2987d8b)
+
+Some currently popular `model names` that our community is actively exploring include:
+
+* `Llama-2-7b-chat-hf-q4f16_1`
+* `Mistral-7B-Instruct-v0.2-q4f16_1`
+* `gemma-7b-it-q4f16_2`
+* `phi-1_5-q4f32_1`
+
+Try using these `<model name>` values when parameterizing the scripts.
+
+You can modify the `serve` scripts directly to bind a specific network interface (on a multi-homed system; the default `0.0.0.0` means all interfaces) and to change the listening port (defaults to port `8000`); see the sketch after the table below.
+
+|Command | Description | Usage|
+|-------|------|------|
+|`startcuda122chat.sh` | starts a command line interactive chat with the specified LLM on a CUDA 12.2 Linux system | `sh ./startcuda122chat.sh <model name>`|
+|`startcuda122serve.sh` | runs a server handling multiple concurrent REST API calls to the specified LLM on a CUDA 12.2 Linux system | `sh ./startcuda122serve.sh <model name>`|
+|`startrocm57chat.sh` | starts a command line interactive chat with the specified LLM on a ROCm 5.7 Linux system | `sh ./startrocm57chat.sh <model name>`|
+|`startrocm57serve.sh` | runs a server handling multiple concurrent REST API calls to the specified LLM on a ROCm 5.7 Linux system | `sh ./startrocm57serve.sh <model name>`|
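+
+For example, to serve on a different port you would change the last argument of the `docker run` line in `startcuda122serve.sh` as in the sketch below (`8080` here is only an arbitrary example port; the same edit applies to `startrocm57serve.sh`):
+
+```
+docker run --gpus all --rm --network host -v ./cache:/root/.cache \
+  mlcllmcuda122:v0.1 serve HF://mlc-ai/$1-MLC --host 0.0.0.0 --port 8080
+```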
diff --git a/docker/bin/startcuda122chat.sh b/docker/bin/startcuda122chat.sh
new file mode 100755
index 0000000000..87c260b962
--- /dev/null
+++ b/docker/bin/startcuda122chat.sh
@@ -0,0 +1 @@
+docker run --gpus all --rm -it --network host -v ./cache:/root/.cache mlcllmcuda122:v0.1 chat HF://mlc-ai/$1-MLC
diff --git a/docker/bin/startcuda122serve.sh b/docker/bin/startcuda122serve.sh
new file mode 100755
index 0000000000..6395eb1c23
--- /dev/null
+++ b/docker/bin/startcuda122serve.sh
@@ -0,0 +1 @@
+docker run --gpus all --rm --network host -v ./cache:/root/.cache mlcllmcuda122:v0.1 serve HF://mlc-ai/$1-MLC --host 0.0.0.0 --port 8000
diff --git a/docker/bin/startrocm57chat.sh b/docker/bin/startrocm57chat.sh
new file mode 100755
index 0000000000..11ec324f1d
--- /dev/null
+++ b/docker/bin/startrocm57chat.sh
@@ -0,0 +1 @@
+docker run --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --rm -it --network host -v ./cache:/root/.cache mlcllmrocm57:v0.1 chat HF://mlc-ai/$1-MLC
diff --git a/docker/bin/startrocm57serve.sh b/docker/bin/startrocm57serve.sh
new file mode 100755
index 0000000000..818c6ddb09
--- /dev/null
+++ b/docker/bin/startrocm57serve.sh
@@ -0,0 +1 @@
+docker run --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --rm --network host -v ./cache:/root/.cache mlcllmrocm57:v0.1 serve HF://mlc-ai/$1-MLC --host 0.0.0.0 --port 8000
diff --git a/docker/cuda/cuda122/.dockerignore b/docker/cuda/cuda122/.dockerignore
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/docker/cuda/cuda122/Dockerfile b/docker/cuda/cuda122/Dockerfile
new file mode 100644
index 0000000000..95e387b9a8
--- /dev/null
+++ b/docker/cuda/cuda122/Dockerfile
@@ -0,0 +1,22 @@
+FROM nvidia/cuda:12.2.2-devel-ubuntu22.04
+
+ENV MLC_PATH /mlcllm
+
+# set up python 3 and pip, load the mlc-ai nightlies
+RUN apt update && \
+    apt install --yes python3.11 pip git git-lfs && \
+    pip install --pre -U -f https://mlc.ai/wheels \
+        mlc-llm-nightly-cu122 mlc-ai-nightly-cu122 && \
+    mkdir -p $MLC_PATH
+
+VOLUME ${MLC_PATH}
+
+WORKDIR ${MLC_PATH}
+
+ENTRYPOINT ["mlc_llm"]
+
+CMD ["chat", "HF://mlc-ai/Llama-2-7b-chat-hf-q4f32_1-MLC"]
diff --git a/docker/cuda/cuda122/README.md b/docker/cuda/cuda122/README.md
new file mode 100644
index 0000000000..94251b8107
--- /dev/null
+++ b/docker/cuda/cuda122/README.md
@@ -0,0 +1,7 @@
+## Base mlc_llm docker image for CUDA 12.2 systems
+
+Make sure you perform:
+
+`sh ./buildimage.sh`
+
+This will build the base docker image for CUDA 12.2 from the latest nightly. The resulting image will be in your local registry; you can further push it to any deployment registry. The image is very large (about 18.4GB) since it includes the full CUDA toolkit and support libraries.
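+
+For example, pushing the built image to your own registry is a normal `docker tag` / `docker push` cycle (a sketch only - `registry.example.com/myorg` is a placeholder for your actual registry and repository):
+
+```
+docker tag mlcllmcuda122:v0.1 registry.example.com/myorg/mlcllmcuda122:v0.1
+docker push registry.example.com/myorg/mlcllmcuda122:v0.1
+```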
diff --git a/docker/cuda/cuda122/buildimage.sh b/docker/cuda/cuda122/buildimage.sh
new file mode 100755
index 0000000000..f58d3278ee
--- /dev/null
+++ b/docker/cuda/cuda122/buildimage.sh
@@ -0,0 +1,3 @@
+docker build --no-cache -t mlcllmcuda122:v0.1 -f ./Dockerfile .
+
+
diff --git a/docker/rocm/rocm57/Dockerfile b/docker/rocm/rocm57/Dockerfile
new file mode 100644
index 0000000000..d917c268a8
--- /dev/null
+++ b/docker/rocm/rocm57/Dockerfile
@@ -0,0 +1,21 @@
+# NOTE: This Dockerfile is based on ROCm 5.7
+FROM rocm/dev-ubuntu-22.04:5.7-complete
+
+ENV MLC_PATH /mlcllm
+
+# set up python 3 and pip, load the mlc-ai nightlies
+RUN apt update && \
+    apt install --yes python3.11 pip git git-lfs && \
+    pip install --pre -U -f https://mlc.ai/wheels \
+        mlc-llm-nightly-rocm57 mlc-ai-nightly-rocm57 && \
+    mkdir -p $MLC_PATH
+
+VOLUME ${MLC_PATH}
+
+WORKDIR ${MLC_PATH}
+
+ENTRYPOINT ["mlc_llm"]
+
+CMD ["chat", "HF://mlc-ai/Llama-2-7b-chat-hf-q4f32_1-MLC"]
diff --git a/docker/rocm/rocm57/README.md b/docker/rocm/rocm57/README.md
new file mode 100644
index 0000000000..1613c34589
--- /dev/null
+++ b/docker/rocm/rocm57/README.md
@@ -0,0 +1,7 @@
+## Base mlc_llm docker image for ROCm 5.7 systems
+
+Make sure you perform:
+
+`sh ./buildimage.sh`
+
+This will build the base docker image for ROCm 5.7 from the latest nightly. The resulting image will be in your local registry; you can further push it to any deployment registry. The image is very large (about 28.1GB) since it includes the full ROCm toolkit and support libraries.
diff --git a/docker/rocm/rocm57/buildimage.sh b/docker/rocm/rocm57/buildimage.sh
new file mode 100644
index 0000000000..9fd401d819
--- /dev/null
+++ b/docker/rocm/rocm57/buildimage.sh
@@ -0,0 +1 @@
+docker build --no-cache -t mlcllmrocm57:v0.1 -f ./Dockerfile .
diff --git a/docker/test/README.md b/docker/test/README.md
new file mode 100644
index 0000000000..c8584a1857
--- /dev/null
+++ b/docker/test/README.md
@@ -0,0 +1,8 @@
+## Tests for mlc_llm `serve`
+
+Simple test programs for REST API serving (including the function calling / tools pattern) when using mlc_llm `serve` with any of the supported models.
+
+|Test name|Description|
+|------------|---------------|
+|`sample_client_for-testing.py`|Calls the chat completion REST API once without streaming, then again with streaming, and displays the output. Make sure you modify the `model` field of the `payload` to match the actual LLM you are testing.|
+|`functioncall.py`|Actual function calling example using the OpenAI-compatible API _tools_ field. Make sure you modify the `model` field of the `payload` to match the actual LLM you are testing. This example only works with models fine-tuned for function calling, including many Mixtral/Mistral derivatives.|
diff --git a/docker/test/functioncall.py b/docker/test/functioncall.py
new file mode 100644
index 0000000000..41faead01b
--- /dev/null
+++ b/docker/test/functioncall.py
@@ -0,0 +1,41 @@
+import requests
+import json
+
+tools = [
+    {
+        "type": "function",
+        "function": {
+            "name": "get_current_weather",
+            "description": "Get the current weather in a given location",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "location": {
+                        "type": "string",
+                        "description": "The city and state, e.g. San Francisco, CA",
+                    },
+                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
+                },
+                "required": ["location"],
+            },
+        },
+    }
+]
+
+payload = {
+    "model": "HF://mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC",
+    # "model": "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",
+    "messages": [
+        {
+            "role": "user",
+            "content": "What is the current weather in Pittsburgh, PA in fahrenheit?",
+        }
+    ],
+    "stream": False,
+    "tools": tools,
+}
+
+r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
+print(f"{r.json()['choices'][0]['message']['tool_calls'][0]['function']}\n")
+
+# Output: {'name': 'get_current_weather', 'arguments': {'location': 'Pittsburgh, PA', 'unit': 'fahrenheit'}}
diff --git a/docker/test/sample_client_for-testing.py b/docker/test/sample_client_for-testing.py
new file mode 100644
index 0000000000..4466a3fe43
--- /dev/null
+++ b/docker/test/sample_client_for-testing.py
@@ -0,0 +1,45 @@
+import requests
+import json
+
+class color:
+    PURPLE = '\033[95m'
+    CYAN = '\033[96m'
+    DARKCYAN = '\033[36m'
+    BLUE = '\033[94m'
+    GREEN = '\033[92m'
+    YELLOW = '\033[93m'
+    RED = '\033[91m'
+    BOLD = '\033[1m'
+    UNDERLINE = '\033[4m'
+    END = '\033[0m'
+
+# Get a response using a prompt without streaming
+payload = {
+    # "model": "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",
+    "model": "HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
+    "messages": [{"role": "user", "content": "write a haiku"}],
+    "stream": False
+}
+r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
+print(f"{color.BOLD}Without streaming:{color.END}\n{color.GREEN}{r.json()['choices'][0]['message']['content']}{color.END}\n")
+
+# Get a response using a prompt with streaming
+payload = {
+    # "model": "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",
+    "model": "HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
+    "messages": [{"role": "user", "content": "Write a 500 word essay about the civil war"}],
+    "stream": True
+}
+
+print(f"{color.BOLD}With streaming:{color.END}")
+with requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, stream=True) as r:
+    for chunk in r.iter_content(chunk_size=None):
+        # each SSE chunk is prefixed with "data: "; strip it before parsing the JSON
+        chunk = chunk.decode("utf-8")
+        if "[DONE]" in chunk[6:]:
+            break
+        response = json.loads(chunk[6:])
+        content = response["choices"][0]["delta"].get("content", "")
+        print(f"{color.GREEN}{content}{color.END}", end="", flush=True)
+
+print("\n")
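+
+# To run this test manually (it assumes a `serve` container is already listening on
+# http://127.0.0.1:8000 and that the `requests` package is installed locally):
+#
+#     python3 sample_client_for-testing.py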