Your current environment
Running Llama 4 Maverick on 8x H100
🐛 Describe the bug
Unless --gpu-memory-utilization is lowered, it's easy to get OOM: Inductor and CUDA graph capture themselves may consume a lot of GPU memory. In particular, Inductor may run profiling to search for the best configuration for its kernels.
export LLAMA_DIR=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
export PORT=8081
VLLM_LOGGING_LEVEL=DEBUG VLLM_DISABLE_COMPILE_CACHE=1 SAFETENSORS_FAST_GPU=1 vllm serve $LLAMA_DIR --disable-log-requests -tp 8 --host :: --port $PORT --served-model-name default --no-enable-prefix-caching --max-model-len 4096 --gpu-memory-utilization 0.8 2>&1 | tee marverik_fp8_no_compile.log
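A rough offline-API equivalent of the command above can make it quicker to sweep gpu_memory_utilization values. This is only a sketch using the standard vllm.LLM keyword arguments, not the exact code path that vllm serve takes:

```python
# Sketch: offline-inference analogue of the serve command above,
# for sweeping gpu_memory_utilization (the serving path differs slightly).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    tensor_parallel_size=8,
    max_model_len=4096,
    enable_prefix_caching=False,
    gpu_memory_utilization=0.8,  # raise to 0.9/0.95 to try to trigger the OOM
    # enforce_eager=True,        # control experiment: skip CUDA graphs/compile
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```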
The issue is easy to reproduce on 8x H100 machines with --gpu-memory-utilization set to 0.9 or 0.95; 0.8 may be okay.
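To see where the memory spike lands during startup (weight loading vs. Inductor compilation vs. CUDA graph capture), a small watcher like the following can be run alongside the server. This assumes pynvml (nvidia-ml-py) is installed and is only an illustration:

```python
# Sketch: print per-GPU memory usage every few seconds while the server
# starts, to correlate the spike with the compilation/graph-capture phase.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    for _ in range(120):  # ~10 minutes at 5 s intervals
        used_gib = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**30
                    for h in handles]
        print(" | ".join(f"GPU{i}: {u:6.1f} GiB"
                         for i, u in enumerate(used_gib)))
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```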
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.