
[Bug]: Run Inductor compilation / CUDA graph capture before memory profiling #19480

@houseroad

Description


Your current environment

Running Llama 4 Maverick on 8× H100

🐛 Describe the bug

Inductor compilation and CUDA graph capture should happen before memory profiling; otherwise it's easy to hit OOM. Both can consume a lot of GPU memory themselves — in particular, Inductor may run profiling/autotuning to search for the best kernel configs.

export LLAMA_DIR=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
export PORT=8081

VLLM_LOGGING_LEVEL=DEBUG VLLM_DISABLE_COMPILE_CACHE=1 SAFETENSORS_FAST_GPU=1 \
  vllm serve $LLAMA_DIR \
    --disable-log-requests \
    -tp 8 \
    --host :: \
    --port $PORT \
    --served-model-name default \
    --no-enable-prefix-caching \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.8 \
  2>&1 | tee marverik_fp8_no_compile.log

With --gpu-memory-utilization 0.9 or 0.95, the OOM is easy to reproduce on H100x8 machines; 0.8 may be okay.
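The requested ordering can be sketched as follows. This is a minimal illustration of the idea, not vLLM's actual startup code: the function names are hypothetical stand-ins for the real compile / capture / profiling steps.

```python
# Hypothetical sketch of the requested startup ordering: trigger Inductor
# compilation and CUDA graph capture BEFORE measuring free memory, so that
# KV-cache sizing (driven by --gpu-memory-utilization) already accounts for
# their memory overhead. Stubs stand in for the real steps.

def compile_model(state):
    # Stand-in for torch.compile / Inductor, which may allocate large
    # temporary buffers while autotuning kernel configs.
    state["compiled"] = True
    return state

def capture_cuda_graphs(state):
    # Stand-in for CUDA graph capture, which pins additional device memory.
    state["graphs_captured"] = True
    return state

def profile_available_memory(state):
    # Stand-in for the memory profiling pass that sizes the KV cache.
    # Running it last means the measurement reflects compile/capture
    # overhead, so the KV cache cannot overcommit GPU memory.
    assert state["compiled"] and state["graphs_captured"], \
        "profiling ran before compile/capture: KV cache may overcommit"
    return "kv_cache sized from post-compile free memory"

state = {"compiled": False, "graphs_captured": False}
state = compile_model(state)
state = capture_cuda_graphs(state)
print(profile_available_memory(state))
```

Reversing the order (profiling first) is exactly the failure mode reported here: the profiler sees memory that compilation and graph capture will later consume.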

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
