Questions about simulation fidelity under vLLM version differences, profiling utilization, and throughput estimation #8

Hi authors, thanks a lot for the great work on APEX — it's a very impressive and useful tool for LLM serving optimization.

While experimenting with APEX, I had a few questions I hope you could help clarify:

  1. vLLM version differences
    I noticed that your experiments use vLLM v0.5.4, while the latest release is now v0.9.2, which includes significant changes to memory management and token scheduling. In our tests, TTFT and TPOT vary noticeably between versions even with the same hardware, model, and scripts (a minimal version of how we measured them is sketched after this list). Would this discrepancy affect the fidelity of APEX's predictions when it is used with newer versions?

  2. Profiling vs. runtime GPU utilization
    During the profiling phase, GPU utilization is typically close to 100% because of intensive op-level runs. In real-world serving, however, especially under latency-sensitive workloads, GPUs are often underutilized (see the utilization-sampling sketch below). Could this gap lead to a mismatch between simulated and actual performance?

  3. Throughput estimation
    I see that APEX focuses on latency metrics (TTFT and TPOT). Is it also suitable for estimating end-to-end throughput, e.g., tokens/sec or requests/sec? A back-of-envelope sketch of the kind of estimate I have in mind follows below. Any best practices or caveats here?
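
For context on question 1, here is a minimal sketch of how we measured TTFT and TPOT, assuming a vLLM OpenAI-compatible server started with `vllm serve`. The endpoint, model name, and prompt are placeholders, not our exact benchmark script:

```python
import time
from openai import OpenAI  # pip install openai

# Points at a local vLLM OpenAI-compatible server, e.g. `vllm serve <model>`.
# Endpoint, model name, and prompt are placeholders for our actual script.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
chunk_times = []
stream = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    prompt="Explain speculative decoding in one paragraph.",
    max_tokens=256,
    stream=True,
)
for _chunk in stream:
    # Approximation: vLLM streams roughly one token per chunk.
    chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start  # time to first token
tpot = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms")
```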
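
For question 2, this is roughly how we observe serving-time utilization: sampling SM utilization via pynvml while the server handles a latency-sensitive workload. The device index and sampling window are arbitrary choices on our side:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust as needed

# Sample SM utilization once per second for ~60 s while the serving
# workload runs, to contrast with the near-100% seen during profiling.
samples = []
for _ in range(60):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)  # percent of time SMs were busy
    time.sleep(1.0)
pynvml.nvmlShutdown()

print(f"mean {sum(samples) / len(samples):.0f}%, "
      f"min {min(samples)}%, max {max(samples)}%")
```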
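
On question 3, this is the steady-state conversion from TTFT/TPOT to throughput I was thinking of; the function names and numbers are illustrative on my end, not anything from APEX's API:

```python
def decode_tokens_per_sec(batch_size: int, tpot_s: float) -> float:
    """Decode throughput if `batch_size` requests each emit one token
    every `tpot_s` seconds (ignores prefill and scheduling gaps)."""
    return batch_size / tpot_s

def requests_per_sec(batch_size: int, ttft_s: float,
                     output_len: int, tpot_s: float) -> float:
    """Steady-state request throughput if each request costs TTFT plus
    (output_len - 1) decode steps and the batch stays full."""
    per_request_s = ttft_s + (output_len - 1) * tpot_s
    return batch_size / per_request_s

# Illustrative numbers, not measurements:
print(decode_tokens_per_sec(batch_size=32, tpot_s=0.025))              # 1280 tok/s
print(requests_per_sec(32, ttft_s=0.5, output_len=256, tpot_s=0.025))  # ~4.7 req/s
```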

Would love to hear your thoughts — and again, thanks for open-sourcing APEX!

Best,
Harry
