[RFC]: vLLM-compile (minus cudagraphs) warm-start time should be close to zero #20402

Open
@zou3519

Motivation.

@BoyuanFeng benchmarked vLLM cold vs. warm start for a 70B model. In the warm start, compilation (ignoring cudagraphs) took 25 out of 132 seconds, almost 20% of the time. On warm start, all of the hard work (compiling artifacts) should already have been done.

The theoretical minimum time that vLLM-compile needs to spend in warm start is the time it takes to load all of the compiled code.

[Image: chart of warm-start time broken down by category]

Proposed Change.

The following categories correspond to what is in the chart above.

Dynamo:

  • On warm start, vLLM always re-runs Dynamo. We don't need to do this: instead, we can serialize the bytecode that Dynamo produces and reload it directly.
  • Originally I was planning to wait until torch.compile implemented "precompilation", which will skip Dynamo on warm start. It might be worth figuring out how to get a simpler version of this into vLLM, especially because "precompilation" in torch is still some way off. vLLM just needs to serialize the Dynamo-produced bytecode; we don't care about graph breaks or guards.
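To make the "serialize the bytecode and reload it" idea concrete, here is a minimal sketch using only the Python standard library. It is not Dynamo's or vLLM's API: `save_bytecode`, `load_bytecode`, and `compiled_graph` are hypothetical names, and the sketch assumes the function has no closure variables. The point it illustrates is that a code object can be written to disk and later rebound to a fresh globals dict that supplies whatever the bytecode calls into.

```python
import marshal
import os
import tempfile
import types


def save_bytecode(fn, path):
    """Write fn's code object to disk (hypothetical helper, not a vLLM API)."""
    with open(path, "wb") as f:
        marshal.dump(fn.__code__, f)


def load_bytecode(path, globals_dict, name="restored_forward"):
    """Rebuild a callable from saved bytecode.

    globals_dict must provide every global name the bytecode references,
    e.g. the compiled graph callable that the transformed code calls.
    """
    with open(path, "rb") as f:
        code = marshal.load(f)
    return types.FunctionType(code, globals_dict, name)


# Stand-in for Dynamo-transformed bytecode: it only calls a global
# named `compiled_graph` (an assumption made for this illustration).
def transformed_forward(x):
    return compiled_graph(x) + 1


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "forward.bytecode")
    save_bytecode(transformed_forward, path)

    # "Warm start": no re-tracing, just rebind the bytecode to its globals.
    restored = load_bytecode(path, {"compiled_graph": lambda x: x * 2})
    print(restored(3))
```

Note that the `marshal` format is specific to the Python version, so a cache built this way would need to be keyed on the interpreter version, and the globals the bytecode depends on would have to be reconstructed before the function is called.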

Inductor:

  • TL;DR: vLLM is doing some compute when loading the compiled artifacts. It shouldn't need to do this compute. We should be able to fix this in vLLM.
  • Details: With piecewise cudagraphs, there are N compiled artifacts. vLLM loads them by running a full forward pass through the model using FakeTensors. When the forward pass hits one of these "missing compiled artifacts", it loads that artifact from disk.
  • We don't need to run the full forward pass, and the full forward pass on FakeTensors is slow. It should be possible to record all of the compiled artifacts we need, load them all together, and construct the right objects for runtime.
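The "record what we need, then load it all at once" idea from the last bullet could look roughly like the sketch below. Everything here is an assumption made for illustration: `save_artifacts`, `load_all_artifacts`, and the `manifest.json` layout are hypothetical, and plain pickled objects stand in for real Inductor artifacts, which would need their own loader. The cold start writes a manifest listing each piecewise graph's artifact; the warm start reads the manifest and loads every artifact directly, with no forward pass.

```python
import json
import os
import pickle
import tempfile

MANIFEST_NAME = "manifest.json"  # hypothetical file name


def save_artifacts(cache_dir, artifacts):
    """Cold start: persist each artifact and record it in a manifest.

    artifacts maps a piecewise-graph index to a picklable stand-in object.
    """
    entries = []
    for index, artifact in sorted(artifacts.items()):
        file_name = f"artifact_{index}.pkl"
        with open(os.path.join(cache_dir, file_name), "wb") as f:
            pickle.dump(artifact, f)
        entries.append({"index": index, "file": file_name})
    with open(os.path.join(cache_dir, MANIFEST_NAME), "w") as f:
        json.dump(entries, f)


def load_all_artifacts(cache_dir):
    """Warm start: load every recorded artifact without a forward pass."""
    with open(os.path.join(cache_dir, MANIFEST_NAME)) as f:
        entries = json.load(f)
    loaded = {}
    for entry in entries:
        with open(os.path.join(cache_dir, entry["file"]), "rb") as f:
            loaded[entry["index"]] = pickle.load(f)
    return loaded


if __name__ == "__main__":
    cache_dir = tempfile.mkdtemp()
    save_artifacts(cache_dir, {0: {"name": "piece0"}, 1: {"name": "piece1"}})
    print(load_all_artifacts(cache_dir))
```

Because the manifest is written once at cold start, the warm-start cost in this sketch is just the file reads, which is the lower bound described in the motivation.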

Other: this needs some more investigation.

Feedback Period.

7/2 - 7/11

CC List.

@ProExpertProg @youkaichao @WoosukKwon @robertgshaw2-redhat @jamesjwu @zhxchen17

Any Other Things.

thank you
