NYCU Edge AI Final: SGLang

This project is the NYCU Edge AI final project, focusing on LLM quantization and performance evaluation with the SGLang server.



Download Repo

git clone https://github.com/KeithLin724/NYCU_Edge_AI_SGLang.git

Environment Setup

Make sure you have Conda installed. Create the environment with:

conda env create -f environment.yml
conda activate edge_ai_sglang_stable

CUDA & NVCC Installation

To install CUDA Toolkit (includes nvcc) on Ubuntu 22.04, run:

# Download the CUDA repository pin file for package priority
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600

# Download and install the CUDA repository local installer
wget https://developer.download.nvidia.com/compute/cuda/12.9.0/local_installers/cuda-repo-ubuntu2204-12-9-local_12.9.0-575.51.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-9-local_12.9.0-575.51.03-1_amd64.deb

# Add the CUDA GPG key to your system keyring
sudo cp /var/cuda-repo-ubuntu2204-12-9-local/cuda-*-keyring.gpg /usr/share/keyrings/

# Update package lists
sudo apt-get update

# Install CUDA Toolkit 12.9 (includes nvcc)
sudo apt-get -y install cuda-toolkit-12-9
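After installation, the toolkit may not be on your shell's search path yet. The exports below assume the default install prefix of `/usr/local/cuda-12.9`; adjust the paths if your installation differs:

```shell
# Add CUDA 12.9 binaries and libraries to the environment
# (default install prefix assumed; adjust if yours differs)
export PATH=/usr/local/cuda-12.9/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.9/lib64:$LD_LIBRARY_PATH

# Verify the compiler is visible
nvcc --version
```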

How to Run Experiments

1. Prepare the Model

Option 1: Download Pre-built Model

You can quickly get started by using the pre-built, quantized model.
Simply run the following command to automatically download and load the stable model for your experiments:

sh run_server.sh

Option 2: Build the Model from Scratch

If you prefer to build the model yourself (e.g., for custom training or quantization),
please refer to the detailed instructions in BUILD_MODEL.md.


2. Throughput Test

  1. Start the SGLang server:

    # Start the SGLang server with the default pre-built model (KYLiN724/llama-3.2-1b-KD-V1-W8A8-Dynamic-Per-Token)
    sh run_server.sh
    
    # Or, specify a custom model path or Hugging Face repo
    # sh run_server.sh <model_name_or_path>
  2. Run the throughput test script:

    # Run throughput test with the default pre-built model
    python result-quant-sglang.py
    
    # Or, specify a custom model path or Hugging Face repo
    # python result-quant-sglang.py --model_name <model_name_or_path>
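Under the hood, throughput is simply generated tokens divided by wall-clock time. The sketch below illustrates that measurement against SGLang's OpenAI-compatible completions endpoint, assuming the default `http://localhost:30000` address; `result-quant-sglang.py` may differ in its exact request parameters and batching:

```python
import json
import time
import urllib.request


def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    """Throughput metric: generated tokens divided by wall-clock seconds."""
    return total_tokens / elapsed_s


def run_once(prompt: str, max_tokens: int = 128,
             url: str = "http://localhost:30000/v1/completions") -> float:
    """Send one completion request to the server and return measured tokens/sec."""
    payload = json.dumps({
        "model": "default",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


# Usage (the SGLang server must be running):
#   print(f"{run_once('The capital of Taiwan is'):.1f} tokens/s")
```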

3. Perplexity (PPL) Test

Note: Please shut down the SGLang server before running this step, so its GPU memory is freed for the PPL test.

# Run perplexity (PPL) test with the default pre-built model
python result-quant.py

# Or, specify a custom model path or Hugging Face repo
# python result-quant.py --model_name <model_name_or_path>
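For reference, perplexity is the exponential of the average per-token negative log-likelihood. A minimal sketch of the metric itself (the actual script obtains the per-token NLLs from the model; the helper name here is illustrative):

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: a uniform distribution over 4 tokens gives an NLL of
# ln(4) for every token, so the perplexity is approximately 4.0.
# perplexity([math.log(4)] * 10)
```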

Notes

  • Experiment results will be saved as result_tput.csv (throughput) and result_ppl.csv (perplexity).
  • You can adjust model or dataset parameters at the top of each script to suit your needs.
  • If you encounter CUDA out-of-memory errors:
    • Try reducing the batch size or sequence length.
    • You can also tune server flags in run_server.sh for better memory management. See the SGLang hyperparameter tuning guide for more details.
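For example, run_server.sh could pass SGLang's memory-related launch flags. The values below are illustrative only; check the flag names and defaults against your installed SGLang version:

```shell
# Illustrative launch: cap the static memory pool and concurrent requests
# (flag values are examples, not recommendations)
python -m sglang.launch_server \
    --model-path KYLiN724/llama-3.2-1b-KD-V1-W8A8-Dynamic-Per-Token \
    --mem-fraction-static 0.7 \
    --max-running-requests 16
```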

For any questions, please open an issue or contact the project maintainer.

Model

meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-3.2-1B-Instruct
