The SSIS-Dispatcher project is a subproject branched from the SSIS (Scalable Serving Inference System for Language Models with NVIDIA MIG) project. It serves as the serving-manager component of the system. SSIS-Dispatcher receives model inference requests and launches inference pods under the Knative framework while leveraging the GPU sharing features provided by NVIDIA Multi-Instance GPU (MIG) or Multi-Process Service (MPS), which allow fine-grained utilization of GPU resources and enhance system efficiency.
- Check out the K-SSIS repository for additional autoscaler and performance-monitor support.
- Requires a Kubernetes cluster with version > 1.28
- By default, this demo project runs all Knative services and pods in the `nthulab` namespace
- You should have MIG or MPS Kubernetes resources registered on your cluster (a verification sketch follows this list)
- For MIG environment setup, refer to the NVIDIA GPU Operator documentation
- For MPS setup, the Nebuly GPU device plugin is recommended
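As a quick sanity check, a minimal sketch for verifying these prerequisites is below; `<your-gpu-node>` is a placeholder for an actual node name:

```bash
# Server version should be newer than 1.28
kubectl version

# The GPU node should advertise MIG (nvidia.com/mig-*) or Nebuly MPS
# (nvidia.com/gpu-*gb) resources in its allocatable list
kubectl describe node <your-gpu-node> | grep -E 'nvidia.com/(mig|gpu)'
```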
- Run `make setup_knative`
- Run `kubectl get po -n kourier-system` to check that the Kourier gateway is running
- Run `kubectl get svc -n kourier-system` to check that the `kourier` and `kourier-internal` services are established
- You can use `curl <kourier service external ip>` to test the Kourier external gateway, or run a pod on the cluster that executes `curl http://kourier-internal.kourier-system.svc.cluster.local` to check that the in-cluster gateway is operating; a sketch of these checks follows this list
- Use `kn service list` to find the URL of the dispatcher, e.g. `http://dispatcher.nthulab.192.168.1.10.sslip.io`
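Putting those checks together, a minimal sketch (assuming the `kourier` service received a LoadBalancer IP, and using `curlimages/curl` as a throwaway test image):

```bash
# Check that the Kourier pods and services are up
kubectl get po -n kourier-system
kubectl get svc -n kourier-system

# Test the external gateway, if a LoadBalancer IP was assigned
EXTERNAL_IP=$(kubectl get svc kourier -n kourier-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -v "http://${EXTERNAL_IP}"

# Test the in-cluster gateway from a temporary pod
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://kourier-internal.kourier-system.svc.cluster.local
```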
- If you want to build your own dispatcher image, run `make build`
- Run `make deploy` to deploy your own dispatcher image, or run `kubectl apply -f https://raw.githubusercontent.com/deeeelin/SSIS-Dispatcher/main-deployment/configuration.yaml` to deploy the prebuilt image from the main branch; see the verification step below
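Either way, you can confirm that the dispatcher came up as a Knative service (assuming the default `nthulab` namespace):

```bash
# The dispatcher should show READY, along with the URL used in later steps
kn service list -n nthulab
```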
- Run `kubectl edit configmap dispatcher-config`
- Edit the `data` section to set the service namespace, inference image, and GPU resource names that apply to your system environment; a hedged example follows this list
- The MIG resources defined on a node may use resource names like the following examples: `nvidia.com/mig-1g.5gb`, `nvidia.com/mig-2g.10gb`, `nvidia.com/mig-3g.20gb`, `nvidia.com/mig-4g.20gb`, `nvidia.com/mig-7g.40gb`
- The Nebuly MPS resources defined on a node may use resource names like the following examples: `nvidia.com/gpu-1gb`, `nvidia.com/gpu-2gb`, `nvidia.com/gpu-3gb`, `nvidia.com/gpu-4gb`, ..., `nvidia.com/gpu-30gb`, `nvidia.com/gpu-31gb`, `nvidia.com/gpu-32gb`
- Restart the dispatcher pod (by deleting it) to reload the configuration
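A hedged sketch of this configuration step is below. The key names (`namespace`, `inference_image`, `gpu_resources`) and their values are hypothetical placeholders, not the project's actual schema — use whatever keys `kubectl edit configmap dispatcher-config` actually shows. The pod-deletion step assumes the dispatcher runs in the `nthulab` namespace:

```bash
# Hypothetical example only: the real keys in dispatcher-config are defined by
# the project; adjust key names and values to match your environment
kubectl patch configmap dispatcher-config --type merge -p '
data:
  namespace: "nthulab"
  inference_image: "ghcr.io/huggingface/text-generation-inference:latest"
  gpu_resources: "nvidia.com/mig-1g.5gb,nvidia.com/mig-2g.10gb"
'

# Reload the configuration by deleting the dispatcher pod so it is recreated
kubectl delete pod -n nthulab -l serving.knative.dev/service=dispatcher
```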
- Assuming the cluster's external IP is unavailable, we run the test through the in-cluster IP, which is likely available in most cases
- Open another terminal window, then run `make forward`
- Export your Hugging Face token: `export HF_TOKEN="<Your token>"`
- Change directory to `/test` and install the required Python packages via `pip install -r requirements.txt`
- Run `python test.py` to send a sample request to the dispatcher; a rough manual equivalent is sketched below
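For reference, `make forward` presumably wraps a `kubectl port-forward`, and `test.py` then posts the JSON payload through that tunnel. A minimal manual sketch is below; the service name, ports, endpoint path, and Host header are all assumptions, so check the Makefile and `test.py` for the real values:

```bash
# Assumed equivalent of `make forward`: expose the in-cluster Kourier gateway
# on localhost:8080 (run this in its own terminal; the actual target may differ)
kubectl port-forward -n kourier-system svc/kourier-internal 8080:80

# Assumed curl equivalent of `python test.py`: POST the payload through the
# forwarded gateway, with a Host header so Knative routes to the dispatcher
curl -X POST http://localhost:8080 \
  -H "Host: dispatcher.nthulab.svc.cluster.local" \
  -H "Content-Type: application/json" \
  -d @payload.json
```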
- Make sure you have completed all of the steps above.
- You can send a custom request by modifying `/test/payload.json`:
```json
{
    "token": "What is Deep Learning?",
    "par": {
        "max_new_tokens": "20"
    },
    "env": {
        "MODEL_ID": "openai-community/gpt2",
        "HF_TOKEN": ""
    }
}
```
- Reference for generation parameters (`par`): https://huggingface.co/docs/transformers/main_classes/text_generation
- Reference for environment variables (`env`): https://huggingface.co/docs/text-generation-inference/main/en/reference/launcher
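For example, to try a different model with a longer generation limit, you can edit the file by hand or patch it in place. The model name and values below are purely illustrative, and `jq` is assumed to be installed:

```bash
# Illustrative payload tweak: swap the model and raise max_new_tokens
jq '.env.MODEL_ID = "bigscience/bloom-560m" | .par.max_new_tokens = "50"' \
  test/payload.json > /tmp/payload.json && mv /tmp/payload.json test/payload.json
```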
- Delete all running services
- Run `make clean` to remove the dispatcher
- Run `make remove_knative` to remove Knative
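To double-check the teardown, the listings below should come back empty (or fail once the namespaces are gone); namespace names follow the defaults assumed above:

```bash
kn service list -n nthulab        # no services should remain after make clean
kubectl get po -n kourier-system  # should be empty or error out after remove_knative
```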