Hi authors, thanks a lot for the great work on APEX — it's a very impressive and useful tool for LLM serving optimization.
While experimenting with APEX, I had a few questions I hope you could help clarify:
**1. vLLM version differences**
I noticed that your experiments use vLLM v0.5.4, while the latest release is now v0.9.2, which includes significant changes to memory management and token scheduling. In our tests, TTFT and TPOT vary noticeably between versions even with the same hardware, model, and scripts (a rough probe is sketched below). Would this discrepancy affect the fidelity of APEX's predictions when used with newer versions?
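For concreteness, our measurements used something like the following minimal streaming probe against vLLM's OpenAI-compatible endpoint, timing the first streamed chunk (TTFT) and the mean inter-chunk gap (TPOT). The URL, model name, and prompt are placeholders for our setup, and each SSE chunk is assumed to carry one token:

```python
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumed default vLLM port
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    "prompt": "Explain paged attention in one paragraph.",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
chunk_times = []
with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # vLLM streams OpenAI-style SSE lines of the form "data: {...}".
        if not line.startswith(b"data: "):
            continue
        if line == b"data: [DONE]":
            break
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start
# TPOT: mean inter-token gap, excluding the first token.
tpot = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT={ttft * 1000:.1f} ms  TPOT={tpot * 1000:.1f} ms")
```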
**2. Profiling vs. runtime GPU utilization**
During the profiling phase, GPU usage is typically close to 100% due to intensive op-level runs. However, in real-world serving (especially under latency-sensitive workloads), GPUs are often underutilized. Could this lead to a mismatch between simulated and actual performance?
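To quantify the gap, a simple NVML-based sampler (via the nvidia-ml-py / pynvml bindings; the GPU index and sampling window here are arbitrary) can log SM utilization during a serving run for comparison against the near-100% profiling phase:

```python
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust for your setup

# Sample SM utilization once per second for ~60 s while the server is under load.
samples = []
for _ in range(60):
    samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    time.sleep(1.0)

pynvml.nvmlShutdown()
print(f"SM util  mean={sum(samples) / len(samples):.1f}%  "
      f"min={min(samples)}%  max={max(samples)}%")
```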
**3. Throughput estimation**
I see that APEX focuses on latency metrics (TTFT and TPOT). Is it also suitable for estimating end-to-end throughput (e.g., tokens/sec or requests/sec)? Are there best practices or caveats here?
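To be explicit about what we mean by end-to-end throughput: total generated tokens (and completed requests) over the wall-clock time of a run, as in this trivial sketch (the record type is hypothetical, not an APEX API):

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    generated_tokens: int  # output tokens only; prompt tokens excluded


def throughput(finished: list[CompletedRequest], wall_seconds: float) -> dict:
    """Tokens/sec and requests/sec over one wall-clock window."""
    total_tokens = sum(r.generated_tokens for r in finished)
    return {
        "tokens_per_sec": total_tokens / wall_seconds,
        "requests_per_sec": len(finished) / wall_seconds,
    }


# e.g. three requests finishing within a 10 s window:
print(throughput(
    [CompletedRequest(128), CompletedRequest(96), CompletedRequest(200)], 10.0
))
```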
Would love to hear your thoughts — and again, thanks for open-sourcing APEX!