Motivation.
This issue tracks potential low-hanging fruit for improving vLLM-compile cold start time. @anijain2305, @BoyuanFeng, and I sat down to look at some traces and noticed some things we can improve.
There are also longer-term projects for improving torch.compile cold start time, but those will take a while to land.
Proposed Change.
- vLLM's custom bytecode hook seems to take a long time (~7 seconds on the llama-3.1-70b model). I'm not sure how much of this is actually needed for runtime execution. We should guard the decompilation step behind an envvar: if `VLLM_COMPILE_DEPYF=0` (the default), we write out a `transformed_code.py` that only contains a comment saying "Please set VLLM_COMPILE_DEPYF=1 to populate this file". A sketch of such a guard follows this list.
- In llama-3.1-70b, with piecewise cudagraphs, we split the module into 80 different subgraphs. A lot of these subgraphs are literally the same. However, subgraphs 2-79 (approximately) are cache-hitting in fx_graph_cache but cache-missing in AOTAutogradCache. This needs more investigation into why they miss there.
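Roughly what the first item could look like. This is a minimal sketch, not vLLM's actual bytecode hook; it assumes `depyf.decompile` as the decompilation entry point, and `dump_transformed_code` / `out_path` are hypothetical names standing in for whatever the hook currently uses to write `transformed_code.py`:

```python
import os
import types


def dump_transformed_code(new_code: types.CodeType, out_path: str) -> None:
    """Write the transformed bytecode's source, gated by VLLM_COMPILE_DEPYF."""
    if os.environ.get("VLLM_COMPILE_DEPYF", "0") == "1":
        # Opt-in path: decompilation is the slow part (~7 s on llama-3.1-70b),
        # so both the import and the depyf.decompile call happen only here.
        import depyf

        src = depyf.decompile(new_code)
    else:
        # Default path: skip decompilation entirely and leave a breadcrumb.
        src = "# Please set VLLM_COMPILE_DEPYF=1 to populate this file.\n"
    with open(out_path, "w") as f:
        f.write(src)
```

The design point is just that the default path never pays the decompilation cost; the exact envvar plumbing (e.g. routing it through vLLM's `envs` module) is an implementation detail.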
Feedback Period.
7/2-7/11, but really, anytime until these things are fixed.
CC List.
cc @ProExpertProg @youkaichao @WoosukKwon @jamesjwu @zhxchen17
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.