This project is for the NYCU Edge AI final, focusing on LLM quantization and performance evaluation with the SGLang server.
Clone the repository:

```bash
git clone https://github.com/KeithLin724/NYCU_Edge_AI_SGLang.git
```

Make sure you have Conda installed. Create the environment with:

```bash
conda env create -f environment.yml
conda activate edge_ai_sglang_stable
```

To install the CUDA Toolkit (which includes nvcc) on Ubuntu 22.04, run:
```bash
# Download the CUDA repository pin file for package priority
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600

# Download and install the CUDA repository local installer
wget https://developer.download.nvidia.com/compute/cuda/12.9.0/local_installers/cuda-repo-ubuntu2204-12-9-local_12.9.0-575.51.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-9-local_12.9.0-575.51.03-1_amd64.deb

# Add the CUDA GPG key to your system keyring
sudo cp /var/cuda-repo-ubuntu2204-12-9-local/cuda-*-keyring.gpg /usr/share/keyrings/

# Update package lists
sudo apt-get update

# Install CUDA Toolkit 12.9 (includes nvcc)
sudo apt-get -y install cuda-toolkit-12-9
```

You can get started quickly by using the pre-built, quantized model.
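After installation, it is worth confirming that `nvcc` is visible on your `PATH`. The install prefix below is the Ubuntu default for this package and may differ on your machine:

```shell
# CUDA 12.9's default install prefix on Ubuntu; adjust if yours differs
export PATH=/usr/local/cuda-12.9/bin:$PATH
nvcc --version   # should report "release 12.9"
```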
Simply run the following command to automatically download and load the stable model for your experiments:

```bash
sh run_server.sh
```

If you prefer to build the model yourself (e.g., for custom training or quantization), please refer to the detailed instructions in BUILD_MODEL.md.
- Start the SGLang server:

  ```bash
  # Start the SGLang server with the default pre-built model (KYLiN724/llama-3.2-1b-KD-V1-W8A8-Dynamic-Per-Token)
  sh run_server.sh

  # Or, specify a custom model path or Hugging Face repo
  # sh run_server.sh <model_name_or_path>
  ```
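Once the server is running, you can sanity-check it from Python before launching the full benchmark. This is a minimal sketch assuming SGLang's native `/generate` endpoint on the default port 30000; adjust the URL if `run_server.sh` configures a different host or port.

```python
import json
import urllib.request

# Assumption: SGLang's native /generate API on the default port 30000.
SERVER_URL = "http://localhost:30000/generate"

def build_request(prompt: str, max_new_tokens: int = 64) -> dict:
    """Build a payload for SGLang's native /generate endpoint."""
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": 0.0,  # greedy decoding for reproducible checks
            "max_new_tokens": max_new_tokens,
        },
    }

def generate(prompt: str) -> str:
    """Send the prompt to the running server and return the generated text."""
    payload = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]

# With the server up, try:
#   print(generate("What is quantization in one sentence?"))
```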
- Run the throughput test script:

  ```bash
  # Run the throughput test with the default pre-built model
  python result-quant-sglang.py

  # Or, specify a custom model path or Hugging Face repo
  # python result-quant-sglang.py --model_name <model_name_or_path>
  ```
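Conceptually, a throughput measurement boils down to generated tokens divided by wall-clock time. The sketch below illustrates that calculation with a hypothetical `generate_fn` callable; it is not the actual logic of `result-quant-sglang.py`, which may count tokens differently (e.g., including prompt tokens).

```python
import time

def measure_throughput(generate_fn, prompts):
    """Rough throughput estimate: total generated tokens / elapsed seconds.

    `generate_fn` is any callable returning (text, num_generated_tokens).
    """
    start = time.perf_counter()
    total_tokens = 0
    for prompt in prompts:
        _, n_tokens = generate_fn(prompt)
        total_tokens += n_tokens
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Example with a stub generator (replace with a real SGLang client call):
stub = lambda prompt: ("ok", 32)
print(f"{measure_throughput(stub, ['a'] * 4):.1f} tokens/s")
```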
- Run the perplexity (PPL) test script. Note: please shut down the SGLang server before running this step.

  ```bash
  # Run the perplexity (PPL) test with the default pre-built model
  python result-quant.py

  # Or, specify a custom model path or Hugging Face repo
  # python result-quant.py --model_name <model_name_or_path>
  ```

- Experiment results will be saved as `result_tput.csv` (throughput) and `result_ppl.csv` (perplexity).
- You can adjust model or dataset parameters at the top of each script to suit your needs.
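For reference, perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch of the computation, independent of the actual implementation in `result-quant.py`:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_nlls` holds per-token negative log-likelihoods (natural log),
    as produced by a causal-LM cross-entropy loss.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 1/2 has NLL ln(2) per token,
# so its perplexity is exactly 2:
assert abs(perplexity([math.log(2)] * 4) - 2.0) < 1e-9
```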
- If you encounter CUDA out-of-memory errors:
  - Try reducing the batch size or sequence length.
  - You can also tune the server flags in `run_server.sh` for better memory management; see the SGLang hyperparameter tuning guide for more details.
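As one starting point for memory tuning, the sketch below lowers the fraction of GPU memory SGLang reserves statically for its KV-cache pool. The flag name follows SGLang's `launch_server` options and the model path matches the default in `run_server.sh`; verify both against your installed version before relying on them.

```shell
# Verify available flags with: python -m sglang.launch_server --help
python -m sglang.launch_server \
    --model-path KYLiN724/llama-3.2-1b-KD-V1-W8A8-Dynamic-Per-Token \
    --mem-fraction-static 0.7  # lower this value if you still hit CUDA OOM
```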
For any questions, please open an issue or contact the project maintainer.
- `meta-llama/Llama-3.2-3B-Instruct`
- `meta-llama/Llama-3.2-1B-Instruct`