Add unique prefix - increasing counter #217

MML-coder · 2025-07-01T20:54:41Z

The SyntheticTextItemsGenerator was generating prompts that could trigger vLLM's automatic prefix caching, producing prefix cache hit rates of up to 80% in some cases during performance benchmarking.

Implemented unique prefix injection to guarantee a 0% prefix cache hit rate while maintaining realistic prompt characteristics.
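
To illustrate the idea, here is a minimal sketch of counter-based prefix injection. It is not the code from this PR: the `UniquePrefixer` class and its `apply` method are hypothetical names used only for illustration. The sketch relies on prefix caching matching from the start of the token sequence, so prompts that differ at their first tokens should not share cached blocks.

```python
import itertools


class UniquePrefixer:
    """Prepend a strictly increasing counter to each prompt (hypothetical helper)."""

    def __init__(self, start: int = 1) -> None:
        # itertools.count yields start, start + 1, ... without ever repeating
        self._counter = itertools.count(start)

    def apply(self, prompt: str) -> str:
        # The counter goes first so every prompt diverges at its leading tokens
        return f"{next(self._counter)}: {prompt}"


prefixer = UniquePrefixer()
shared_text = "The quick brown fox jumps over the lazy dog."

# Three prompts built from identical text still start differently
prompts = [prefixer.apply(shared_text) for _ in range(3)]
for p in prompts:
    print(p)
```

A real generator would presumably also need to count the prefix tokens against the requested prompt length so that the prompt token distribution stays unchanged.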

Test:
Performing some tests on the H200 target accelerator to confirm the fix.

MML-coder · 2025-07-08T18:58:55Z

I am trying to figure out the lint errors. When I run the checks locally, they all appear to pass. :)

ruff check --fix tests/unit/dataset/test_synthetic.py
All checks passed!

MML-coder · 2025-07-08T19:01:09Z

End-to-end test:

Ran the following command against an inference server running Llama.

Command:
`
guidellm benchmark --target 'http://llama-4-maverick-fp8-c94dbf44-predictor.kserve-e2e-perf.svc.cluster.local:8080/v1' --model RedHatAI/Llama-4-Maverick-17B-128E-Instruct-FP8 --processor RedHatAI/Llama-4-Maverick-17B-128E-Instruct-FP8 --data='{"prompt_tokens":512,"prompt_tokens_stdev":128,"prompt_tokens_min":1,"prompt_tokens_max":1024,"output_tokens":2048,"output_tokens_stdev":64,"output_tokens_min":1,"output_tokens_max":4096}' --rate-type concurrent --rate "100" --warmup-percent 0.2 --max-requests 500 --output-path output.json
`
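
The inline JSON passed to `--data` is easy to mistype on a shell command line. As a small sketch (not part of guidellm itself), the same argument can be built from a Python dict with the standard library; the keys and values are copied from the command above, and the variable names are only illustrative.

```python
import json
import shlex

# Synthetic data settings copied from the benchmark command above
data_config = {
    "prompt_tokens": 512,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 1024,
    "output_tokens": 2048,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 4096,
}

# Compact separators avoid stray spaces inside the JSON string
data_arg = json.dumps(data_config, separators=(",", ":"))

# shlex.quote wraps the JSON so the shell passes it through as one argument
print(f"--data={shlex.quote(data_arg)}")
```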

vLLM output:
INFO 07-08 17:56:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1809.3 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:56:54 [loggers.py:116] Engine 000: Avg prompt throughput: 121.7 tokens/s, Avg generation throughput: 1689.8 tokens/s, Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.7%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:04 [loggers.py:116] Engine 000: Avg prompt throughput: 1136.3 tokens/s, Avg generation throughput: 1267.3 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:14 [loggers.py:116] Engine 000: Avg prompt throughput: 1584.5 tokens/s, Avg generation throughput: 1106.8 tokens/s, Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:24 [loggers.py:116] Engine 000: Avg prompt throughput: 1471.5 tokens/s, Avg generation throughput: 1096.7 tokens/s, Running: 98 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:34 [loggers.py:116] Engine 000: Avg prompt throughput: 611.2 tokens/s, Avg generation throughput: 1518.6 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:44 [loggers.py:116] Engine 000: Avg prompt throughput: 52.7 tokens/s, Avg generation throughput: 1629.9 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:54 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1759.5 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:04 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1769.4 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:14 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1769.2 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:24 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1789.4 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:34 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1799.9 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1839.6 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.0%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:54 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1829.2 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:04 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1724.9 tokens/s, Running: 92 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:14 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1309.3 tokens/s, Running: 46 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:24 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 426.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:34 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
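
Rather than reading the log by eye, the hit rate can be checked with a short script. The sketch below assumes the log line format shown above; the function name `prefix_cache_hit_rates` is hypothetical, and the two sample lines are copied from this run.

```python
import re

# Matches the "Prefix cache hit rate: 0.0%" field as it appears in the log above
HIT_RATE_RE = re.compile(r"Prefix cache hit rate: (\d+(?:\.\d+)?)%")


def prefix_cache_hit_rates(log_text: str) -> list[float]:
    """Return every prefix cache hit rate (in percent) found in the log text."""
    return [float(value) for value in HIT_RATE_RE.findall(log_text)]


sample_log = (
    "INFO 07-08 17:57:04 [loggers.py:116] Engine 000: "
    "Avg prompt throughput: 1136.3 tokens/s, Avg generation throughput: 1267.3 tokens/s, "
    "Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.9%, Prefix cache hit rate: 0.0%\n"
    "INFO 07-08 17:57:14 [loggers.py:116] Engine 000: "
    "Avg prompt throughput: 1584.5 tokens/s, Avg generation throughput: 1106.8 tokens/s, "
    "Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%\n"
)

rates = prefix_cache_hit_rates(sample_log)
# Fail loudly if any interval reported a non-zero hit rate
assert rates and max(rates) == 0.0, f"unexpected prefix cache hits: {rates}"
print(f"checked {len(rates)} intervals, max hit rate {max(rates)}%")
```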

nm-red-hat-upstream-automation-bot · 2025-07-09T19:24:25Z

📦 Build Artifacts Available
The build artifacts (.whl and .tar.gz) have been successfully generated and are available for download: https://github.com/neuralmagic/guidellm/actions/runs/16178342451/artifacts/3498543434.
They will be retained for up to 30 days.

MML-coder · 2025-07-09T21:04:40Z

pre-commit run --all-files
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
run linter...............................................................Passed
run formatter............................................................Passed
mypy.....................................................................Passed

MML-coder added 6 commits July 1, 2025 16:48

Initial commit to add unique prefix - increasing number

c2c72d7

testing the unique prefix by adding the timestamp +request_id

433d39c

remove rand prefix

eb1f241

going back with request_id approach

c3700a2

fixing ruff lint errors in test_synthetic py

230e689

fixing ruff lint errors in test_synthetic py

b3aaa94

MML-coder marked this pull request as ready for review July 8, 2025 18:58

Merge branch 'main' into prefix_cache_invalidate

7f35220

fixed precommit, lint/mypy errrors

54138aa

MML-coder self-assigned this Jul 9, 2025

vllm-project deleted a comment from nm-red-hat-upstream-automation-bot bot Jul 9, 2025
