Hi authors, thanks a lot for the great work on APEX — it's a very impressive and useful tool for LLM serving optimization.
While experimenting with APEX, I had a few questions I hope you could help clarify:
**1. vLLM version differences**
I noticed that your experiments use vLLM v0.5.4, while the latest release is now v0.9.2, which includes significant changes to memory management and token scheduling. In our tests, TTFT and TPOT vary noticeably between versions even with the same hardware, model, and scripts (a rough probe is sketched below). Would this discrepancy affect the fidelity of APEX's predictions when used with newer versions?
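For concreteness, our measurements used something like the following minimal streaming probe against vLLM's OpenAI-compatible endpoint, timing the first streamed chunk (TTFT) and the mean inter-chunk gap (TPOT). The URL, model name, and prompt are placeholders for our setup, and each SSE chunk is assumed to carry one token:

```python
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumed default vLLM port
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    "prompt": "Explain paged attention in one paragraph.",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
chunk_times = []
with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # vLLM streams OpenAI-style SSE lines of the form "data: {...}".
        if not line.startswith(b"data: "):
            continue
        if line == b"data: [DONE]":
            break
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start
# TPOT: mean inter-token gap, excluding the first token.
tpot = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT={ttft * 1000:.1f} ms  TPOT={tpot * 1000:.1f} ms")
```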
**2. Profiling vs. runtime GPU utilization**
During the profiling phase, GPU usage is typically close to 100% due to intensive op-level runs. However, in real-world serving (especially under latency-sensitive workloads), GPUs are often underutilized. Could this lead to a mismatch between simulated and actual performance?
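To quantify the gap, a simple NVML-based sampler (via the nvidia-ml-py / pynvml bindings; the GPU index and sampling window here are arbitrary) can log SM utilization during a serving run for comparison against the near-100% profiling phase:

```python
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust for your setup

# Sample SM utilization once per second for ~60 s while the server is under load.
samples = []
for _ in range(60):
    samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    time.sleep(1.0)

pynvml.nvmlShutdown()
print(f"SM util  mean={sum(samples) / len(samples):.1f}%  "
      f"min={min(samples)}%  max={max(samples)}%")
```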
**3. Throughput estimation**
I see that APEX focuses on latency metrics (TTFT and TPOT). Is it also suitable for estimating end-to-end throughput (e.g., tokens/sec or requests/sec)? Are there best practices or caveats here?
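To be explicit about what we mean by end-to-end throughput: total generated tokens (and completed requests) over the wall-clock time of a run, as in this trivial sketch (the record type is hypothetical, not an APEX API):

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    generated_tokens: int  # output tokens only; prompt tokens excluded


def throughput(finished: list[CompletedRequest], wall_seconds: float) -> dict:
    """Tokens/sec and requests/sec over one wall-clock window."""
    total_tokens = sum(r.generated_tokens for r in finished)
    return {
        "tokens_per_sec": total_tokens / wall_seconds,
        "requests_per_sec": len(finished) / wall_seconds,
    }


# e.g. three requests finishing within a 10 s window:
print(throughput(
    [CompletedRequest(128), CompletedRequest(96), CompletedRequest(200)], 10.0
))
```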
Would love to hear your thoughts — and again, thanks for open-sourcing APEX!