[RFC]: vLLM-compile (minus cudagraphs) warm-start time should be close to zero

### Motivation.

@BoyuanFeng benchmarked vLLM cold vs. warm start on a 70B model. In the warm start, compilation (ignoring cudagraphs) took 25 out of 132 seconds, almost 20% of the time. On warm start, all of the hard work (compiling artifacts) should already have been done.

The theoretical minimum amount of time that vLLM-compile needs to spend in warm start is the amount of time it takes to load all the compiled code.

![Image](https://github.com/user-attachments/assets/b34204f8-5ad5-49d4-bdc6-6805610ac6be)

### Proposed Change.

The following categories correspond to what is in the chart above.

Dynamo:
- On warm start, vLLM always re-runs Dynamo. We don't need to do this: instead, we can directly serialize the bytecode that Dynamo produces and re-load it.
- Originally I was planning on waiting until torch.compile implemented "precompilation", which will skip Dynamo on warm start. It might be worth figuring out how to get a simpler version of this into vLLM, especially because "precompilation" in torch is still a bit away. vLLM just needs to serialize the Dynamo-produced bytecode; we don't care about graph breaks or guards.
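
As a rough illustration of the "serialize the bytecode and re-load it" idea, here is a minimal sketch using only the Python standard library. It is not how Dynamo or vLLM does it: `forward_stub`, `save_code`, and `load_code` are hypothetical names, and a plain Python function stands in for the Dynamo-produced code object. Real Dynamo output also refers to compiled graph callables through its globals, which would need to be re-bound at load time, and `marshal` data is only valid for the Python version that wrote it.

```python
import marshal
import os
import tempfile
import types


def forward_stub(x):
    # Stand-in for a Dynamo-transformed function (hypothetical).
    return x * 2 + 1


def save_code(fn, path):
    """Cold start: write the function's code object (bytecode) to disk."""
    with open(path, "wb") as f:
        f.write(marshal.dumps(fn.__code__))


def load_code(path, globals_dict, name="restored"):
    """Warm start: rebuild a callable from the saved code object.

    No tracing happens here. The caller supplies the globals the
    bytecode expects to find. Functions with closures would also need
    their cells restored, which this sketch does not handle.
    """
    with open(path, "rb") as f:
        code = marshal.loads(f.read())
    return types.FunctionType(code, globals_dict, name)


def roundtrip_demo():
    with tempfile.TemporaryDirectory() as cache_dir:
        path = os.path.join(cache_dir, "forward.bytecode")
        save_code(forward_stub, path)
        restored = load_code(path, globals())
        return restored(3)
```

The point of the sketch is that the warm-start path only reads bytes and constructs a function object, so its cost is roughly the cost of the file read.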

Inductor:
- TL;DR: vLLM does some compute when loading the compiled artifacts. It shouldn't need to, and we should be able to fix this in vLLM.
- Details: With piecewise cudagraphs, there are N compiled artifacts. vLLM currently loads them by running a full forward pass through the model using FakeTensors. Whenever the forward pass hits one of these "missing compiled artifacts", it loads that artifact from disk.
- We don't need to run the full forward pass, and the full forward pass on FakeTensors is slow. It should be possible to record all of the compiled artifacts we need, load them all together, and construct the right objects for runtime.
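
The "record what we need, then load it all together" idea could look roughly like the sketch below. It is an assumption-laden illustration, not vLLM's cache layout: the manifest file, the `save_artifacts` and `load_all_artifacts` helpers, and the `submod_*` keys are all hypothetical, and `pickle` on raw bytes stands in for whatever Inductor actually uses to load a compiled artifact.

```python
import json
import pickle
import tempfile
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical file name


def save_artifacts(cache_dir, artifacts):
    """Cold start: write each artifact and record it in a manifest."""
    cache_dir = Path(cache_dir)
    entries = []
    for index, (key, payload) in enumerate(artifacts.items()):
        filename = f"artifact_{index}.bin"
        (cache_dir / filename).write_bytes(pickle.dumps(payload))
        entries.append({"key": key, "file": filename})
    (cache_dir / MANIFEST).write_text(json.dumps(entries))


def load_all_artifacts(cache_dir):
    """Warm start: load every recorded artifact directly.

    No forward pass and no FakeTensors are involved. The manifest alone
    says which artifacts exist and which piece each one belongs to.
    """
    cache_dir = Path(cache_dir)
    entries = json.loads((cache_dir / MANIFEST).read_text())
    return {
        entry["key"]: pickle.loads((cache_dir / entry["file"]).read_bytes())
        for entry in entries
    }


def demo():
    with tempfile.TemporaryDirectory() as cache_dir:
        # Stand-ins for the N compiled pieces of a piecewise graph.
        save_artifacts(cache_dir, {"submod_0": b"kernel-a", "submod_1": b"kernel-b"})
        return load_all_artifacts(cache_dir)
```

With a manifest like this, warm-start cost scales with the number and size of artifacts rather than with the cost of tracing the model.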

Other: this needs some more investigation.

### Feedback Period.

7/2 - 7/11

### CC List.

@ProExpertProg @youkaichao @WoosukKwon @robertgshaw2-redhat @jamesjwu @zhxchen17

### Any Other Things.

thank you

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
