[RFC]: vLLM-compile low-hanging fruit cold start improvements #20451

@zou3519

Description

Motivation.

This issue tracks potential low-hanging fruit for improving vLLM-compile cold start time. @anijain2305, @BoyuanFeng, and I sat down to look at some traces and noticed some things we can improve.

There are also longer-term projects for improving torch.compile cold start time, but those will probably take a while to land.

Proposed Change.

  • vLLM's custom bytecode hook seems to take a long time (~7 seconds on the llama-3.1-70b model). I'm not sure how much of this is actually needed for runtime execution. We should guard the decompilation step behind an env var: if VLLM_COMPILE_DEPYF=0 (the default), we write out a transformed_code.py containing only a comment that says "Please set VLLM_COMPILE_DEPYF=1 to populate this file" (see the sketch after this list).
  • In llama-3.1-70b, with piecewise cudagraphs, we split the module into ~80 subgraphs, and a lot of these subgraphs are literally identical. However, subgraphs 2-79 (approximately) cache-hit in fx_graph_cache but cache-miss in AOTAutogradCache. This needs more investigation into why they miss there (see the counter-checking sketch below).
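
A minimal sketch of the guard proposed in the first bullet, assuming the bytecode hook's expensive step is a depyf.decompile call that produces transformed_code.py. VLLM_COMPILE_DEPYF and the dump_transformed_code helper are the proposal/illustration here, not existing vLLM flags or functions:

```python
import os

import depyf  # decompiler used when populating transformed_code.py


def dump_transformed_code(code_obj, path: str) -> None:
    """Write transformed_code.py, skipping the slow decompilation by default."""
    if os.environ.get("VLLM_COMPILE_DEPYF", "0") == "1":
        # Decompilation is the expensive part (~7 s on llama-3.1-70b),
        # so only run it when explicitly requested.
        src = depyf.decompile(code_obj)
    else:
        src = ("# Decompilation skipped for faster cold start.\n"
               "# Please set VLLM_COMPILE_DEPYF=1 to populate this file.\n")
    with open(path, "w") as f:
        f.write(src)
```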

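One possible starting point for the second bullet is to compare PyTorch's internal cache counters after a cold compile. The counter names below (fxgraph_cache_hit/miss for fx_graph_cache, autograd_cache_hit/miss for AOTAutogradCache) are an assumption based on recent PyTorch versions and may differ across releases; the toy compiled function is only there to trigger the caches:

```python
import torch
from torch._dynamo.utils import counters


@torch.compile
def f(x):
    # Stand-in for a compiled vLLM subgraph.
    return torch.relu(x) + 1


f(torch.randn(8))  # trigger compilation (caches may need enabling via config)

# Compare how the two cache layers behaved for the compiled graphs.
print("fx_graph_cache:   hits=%d misses=%d" % (
    counters["inductor"]["fxgraph_cache_hit"],
    counters["inductor"]["fxgraph_cache_miss"],
))
print("AOTAutogradCache: hits=%d misses=%d" % (
    counters["aot_autograd"]["autograd_cache_hit"],
    counters["aot_autograd"]["autograd_cache_miss"],
))
```
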
Feedback Period.

7/2-7/11, but really, anytime until these things are fixed.

CC List.

cc @ProExpertProg @youkaichao @WoosukKwon @jamesjwu @zhxchen17

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
