
Commit 16d2b92

Mcirino1 and gshtras authored
Updated README.md (#546)
* Updated README.md Waiting on benchmark results, do not publish yet
* Changed "OOM" to "Out of memory"
* Added throughput results
* Added latency results
* Trying to fix syntax

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Parent: db892e7

File tree: 1 file changed, +53 −49 lines


docs/dev-docker/README.md

Lines changed: 53 additions & 49 deletions
```diff
@@ -12,22 +12,21 @@ The pre-built image includes:
 
 - ROCm™ 6.3.1
 - HipblasLT 0.15
-- vLLM 0.8.3
-- PyTorch 2.7dev (nightly)
+- vLLM 0.8.5
+- PyTorch 2.7
 
 ## Pull latest Docker Image
 
 Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`
 
 ## What is New
 
-- [Improved DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
-- Initial Gemma-3 enablement
-- Detokenizer disablement
-- Torch.compile support
+- Out of memory bug fix
+- PyTorch fixes
+- Tunable ops fixes
 
 ## Known Issues and Workarounds
-- Mem fault encountered when running the model meta 405 fp8. To workaround this issue, set PYTORCH_TUNABLEOP_ENABLED=0
+- None
 
 ## Performance Results
 
```
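For context, the pull command in the hunk above is all that is needed to fetch the image; launching it on an MI300X host typically looks like the sketch below. The device and group flags follow common ROCm container conventions, and the model mount path is an illustrative assumption, not part of this commit.

```bash
# Pull the validated image named in the hunk above
docker pull rocm/vllm-dev:main

# Launch it; --device/--group-add are the usual ROCm container
# requirements, and the /data/models mount is an assumed example path.
docker run -it --rm \
    --network=host \
    --ipc=host --shm-size 16G \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    -v /data/models:/models \
    rocm/vllm-dev:main
```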
```diff
@@ -40,14 +39,14 @@ The table below shows performance data where a local inference client is fed req
 
 | Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
 |-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16364.9 |
-| | | | 128 | 4096 | 1500 | 1500 | 12171.0 |
-| | | | 500 | 2000 | 2000 | 2000 | 13290.4 |
-| | | | 2048 | 2048 | 1500 | 1500 | 8216.5 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4331.6 |
-| | | | 128 | 4096 | 1500 | 1500 | 3409.9 |
-| | | | 500 | 2000 | 2000 | 2000 | 3184.0 |
-| | | | 2048 | 2048 | 500 | 500 | 2154.3 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16892.6 |
+| | | | 128 | 4096 | 1500 | 1500 | 13916.7 |
+| | | | 500 | 2000 | 2000 | 2000 | 13616.1 |
+| | | | 2048 | 2048 | 1500 | 1500 | 8491.8 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4380.3 |
+| | | | 128 | 4096 | 1500 | 1500 | 3404.2 |
+| | | | 500 | 2000 | 2000 | 2000 | 3251.3 |
+| | | | 2048 | 2048 | 500 | 500 | 2249.3 |
 
 *TP stands for Tensor Parallelism.*
 
```
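The updated throughput rows can be checked against vLLM's bundled benchmark script. The invocation below is a minimal sketch for the first 70B row; the in-container script path and exact flag set are assumptions, not taken from this commit.

```bash
# Sketch (assumed path/flags): Llama 3.1 70B FP8, TP=8,
# input 128 / output 2048, 3200 prompts, max 3200 concurrent sequences.
python /app/vllm/benchmarks/benchmark_throughput.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --input-len 128 \
    --output-len 2048 \
    --num-prompts 3200 \
    --max-num-seqs 3200
```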
```diff
@@ -57,42 +56,42 @@ The table below shows latency measurement, which typically involves assessing th
 
 | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
 |-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.411 |
-| | | | 2 | 128 | 2048 | 18.750 |
-| | | | 4 | 128 | 2048 | 19.059 |
-| | | | 8 | 128 | 2048 | 20.857 |
-| | | | 16 | 128 | 2048 | 22.670 |
-| | | | 32 | 128 | 2048 | 25.495 |
-| | | | 64 | 128 | 2048 | 34.187 |
-| | | | 128 | 128 | 2048 | 48.754 |
-| | | | 1 | 2048 | 2048 | 17.699 |
-| | | | 2 | 2048 | 2048 | 18.919 |
-| | | | 4 | 2048 | 2048 | 19.220 |
-| | | | 8 | 2048 | 2048 | 21.545 |
-| | | | 16 | 2048 | 2048 | 24.329 |
-| | | | 32 | 2048 | 2048 | 29.461 |
-| | | | 64 | 2048 | 2048 | 40.148 |
-| | | | 128 | 2048 | 2048 | 61.382 |
-| Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.601 |
-| | | | 2 | 128 | 2048 | 46.947 |
-| | | | 4 | 128 | 2048 | 48.971 |
-| | | | 8 | 128 | 2048 | 53.021 |
-| | | | 16 | 128 | 2048 | 55.836 |
-| | | | 32 | 128 | 2048 | 64.947 |
-| | | | 64 | 128 | 2048 | 81.408 |
-| | | | 128 | 128 | 2048 | 115.296 |
-| | | | 1 | 2048 | 2048 | 46.998 |
-| | | | 2 | 2048 | 2048 | 47.619 |
-| | | | 4 | 2048 | 2048 | 51.086 |
-| | | | 8 | 2048 | 2048 | 55.706 |
-| | | | 16 | 2048 | 2048 | 61.049 |
-| | | | 32 | 2048 | 2048 | 75.842 |
-| | | | 64 | 2048 | 2048 | 103.074 |
-| | | | 128 | 2048 | 2048 | 157.705 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.591 |
+| | | | 2 | 128 | 2048 | 16.865 |
+| | | | 4 | 128 | 2048 | 17.295 |
+| | | | 8 | 128 | 2048 | 18.939 |
+| | | | 16 | 128 | 2048 | 20.891 |
+| | | | 32 | 128 | 2048 | 23.402 |
+| | | | 64 | 128 | 2048 | 30.633 |
+| | | | 128 | 128 | 2048 | 43.898 |
+| | | | 1 | 2048 | 2048 | 15.678 |
+| | | | 2 | 2048 | 2048 | 16.892 |
+| | | | 4 | 2048 | 2048 | 17.781 |
+| | | | 8 | 2048 | 2048 | 19.536 |
+| | | | 16 | 2048 | 2048 | 22.521 |
+| | | | 32 | 2048 | 2048 | 26.729 |
+| | | | 64 | 2048 | 2048 | 36.794 |
+| | | | 128 | 2048 | 2048 | 56.371 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.446 |
+| | | | 2 | 128 | 2048 | 46.223 |
+| | | | 4 | 128 | 2048 | 47.833 |
+| | | | 8 | 128 | 2048 | 52.085 |
+| | | | 16 | 128 | 2048 | 54.378 |
+| | | | 32 | 128 | 2048 | 63.108 |
+| | | | 64 | 128 | 2048 | 81.764 |
+| | | | 128 | 128 | 2048 | 109.479 |
+| | | | 1 | 2048 | 2048 | 46.001 |
+| | | | 2 | 2048 | 2048 | 46.720 |
+| | | | 4 | 2048 | 2048 | 49.250 |
+| | | | 8 | 2048 | 2048 | 54.495 |
+| | | | 16 | 2048 | 2048 | 59.539 |
+| | | | 32 | 2048 | 2048 | 73.906 |
+| | | | 64 | 2048 | 2048 | 103.847 |
+| | | | 128 | 2048 | 2048 | 151.613 |
 
 *TP stands for Tensor Parallelism.*
 
-Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9554 Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu driver 6.8.5
+Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9575F Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu driver 6.8.5
 
 ## Reproducing Benchmarked Results
 
```
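Similarly, the latency rows measure batched end-to-end generation time. A minimal sketch for the first 70B row using vLLM's benchmark_latency.py follows; the script path and flag spelling are assumptions rather than part of the diff.

```bash
# Sketch (assumed path/flags): Llama 3.1 70B FP8, TP=8,
# batch size 1, input 128 / output 2048.
python /app/vllm/benchmarks/benchmark_latency.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --batch-size 1 \
    --input-len 128 \
    --output-len 2048
```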
````diff
@@ -490,7 +489,7 @@ To reproduce the release docker:
 ```bash
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout b8498bc4a1c2aae1e25cfc780db0eadbc4716c67
+git checkout d60b5a337a552b6f74f511462d4ba67ea0ac4402
 docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
 ```
 
````
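After the build, one quick sanity check is confirming that the vLLM version baked into the new image matches the 0.8.5 noted at the top of the diff. The one-liner below assumes Python and vLLM are on the image's default path; it is not part of the commit.

```bash
# Hypothetical check: print the vLLM version inside the freshly built image.
docker run --rm <your_tag> python -c "import vllm; print(vllm.__version__)"
```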
```diff
@@ -507,6 +506,11 @@ Use AITER release candidate branch instead:
 
 ## Changelog
 
+20250513_aiter:
+- Out of memory bug fix
+- PyTorch fixes
+- Tunable ops fixes
+
 20250410_aiter:
 - 2-stage MoE
 - MLA from AITER
```
