@@ -12,22 +12,21 @@ The pre-built image includes:
- ROCm™ 6.3.1
- HipblasLT 0.15
- - vLLM 0.8.3
- - PyTorch 2.7dev (nightly)
+ - vLLM 0.8.5
+ - PyTorch 2.7
## Pull latest Docker Image
Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`
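For reference, a typical pull-and-run sequence looks like the sketch below; the device, shared-memory, and volume flags are the usual ones for ROCm containers rather than values taken from this note, so adjust them for your system.

```bash
docker pull rocm/vllm-dev:main

# Typical ROCm container invocation; the flags and cache path are
# illustrative assumptions, not requirements stated in this release.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --shm-size 16G \
  --security-opt seccomp=unconfined \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  rocm/vllm-dev:main
```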
## What is New
- - [Improved DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
- - Initial Gemma-3 enablement
- - Detokenizer disablement
- - Torch.compile support
+ - Out of memory bug fix
+ - PyTorch fixes
+ - Tunable ops fixes
## Known Issues and Workarounds
- - A memory fault is encountered when running the Meta Llama 3.1 405B FP8 model. To work around this issue, set PYTORCH_TUNABLEOP_ENABLED=0
+ - None
## Performance Results
@@ -40,14 +39,14 @@ The table below shows performance data where a local inference client is fed req
| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
- | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16364.9 |
- | | | | 128 | 4096 | 1500 | 1500 | 12171.0 |
- | | | | 500 | 2000 | 2000 | 2000 | 13290.4 |
- | | | | 2048 | 2048 | 1500 | 1500 | 8216.5 |
- | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4331.6 |
- | | | | 128 | 4096 | 1500 | 1500 | 3409.9 |
- | | | | 500 | 2000 | 2000 | 2000 | 3184.0 |
- | | | | 2048 | 2048 | 500 | 500 | 2154.3 |
+ | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16892.6 |
+ | | | | 128 | 4096 | 1500 | 1500 | 13916.7 |
+ | | | | 500 | 2000 | 2000 | 2000 | 13616.1 |
+ | | | | 2048 | 2048 | 1500 | 1500 | 8491.8 |
+ | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4380.3 |
+ | | | | 128 | 4096 | 1500 | 1500 | 3404.2 |
+ | | | | 500 | 2000 | 2000 | 2000 | 3251.3 |
+ | | | | 2048 | 2048 | 500 | 500 | 2249.3 |
*TP stands for Tensor Parallelism.*
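As a rough illustration of how one row of the throughput table maps to a benchmark invocation, the sketch below assumes vLLM's standard `benchmarks/benchmark_throughput.py` script and a `/app/vllm` checkout inside the container; the exact commands used for the published numbers are given under Reproducing Benchmarked Results.

```bash
# Illustrative sketch of one 70B throughput row (128 input / 2048 output).
# The script path and flag values are assumptions; see
# "Reproducing Benchmarked Results" for the commands actually used.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --quantization fp8 --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --input-len 128 --output-len 2048 \
    --num-prompts 3200 --max-num-seqs 3200
```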
@@ -57,42 +56,42 @@ The table below shows latency measurement, which typically involves assessing th
| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
|-------|-----------|----------|------------|--------|---------|-------------------|
- | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.411 |
- | | | | 2 | 128 | 2048 | 18.750 |
- | | | | 4 | 128 | 2048 | 19.059 |
- | | | | 8 | 128 | 2048 | 20.857 |
- | | | | 16 | 128 | 2048 | 22.670 |
- | | | | 32 | 128 | 2048 | 25.495 |
- | | | | 64 | 128 | 2048 | 34.187 |
- | | | | 128 | 128 | 2048 | 48.754 |
- | | | | 1 | 2048 | 2048 | 17.699 |
- | | | | 2 | 2048 | 2048 | 18.919 |
- | | | | 4 | 2048 | 2048 | 19.220 |
- | | | | 8 | 2048 | 2048 | 21.545 |
- | | | | 16 | 2048 | 2048 | 24.329 |
- | | | | 32 | 2048 | 2048 | 29.461 |
- | | | | 64 | 2048 | 2048 | 40.148 |
- | | | | 128 | 2048 | 2048 | 61.382 |
- | Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.601 |
- | | | | 2 | 128 | 2048 | 46.947 |
- | | | | 4 | 128 | 2048 | 48.971 |
- | | | | 8 | 128 | 2048 | 53.021 |
- | | | | 16 | 128 | 2048 | 55.836 |
- | | | | 32 | 128 | 2048 | 64.947 |
- | | | | 64 | 128 | 2048 | 81.408 |
- | | | | 128 | 128 | 2048 | 115.296 |
- | | | | 1 | 2048 | 2048 | 46.998 |
- | | | | 2 | 2048 | 2048 | 47.619 |
- | | | | 4 | 2048 | 2048 | 51.086 |
- | | | | 8 | 2048 | 2048 | 55.706 |
- | | | | 16 | 2048 | 2048 | 61.049 |
- | | | | 32 | 2048 | 2048 | 75.842 |
- | | | | 64 | 2048 | 2048 | 103.074 |
- | | | | 128 | 2048 | 2048 | 157.705 |
+ | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.591 |
+ | | | | 2 | 128 | 2048 | 16.865 |
+ | | | | 4 | 128 | 2048 | 17.295 |
+ | | | | 8 | 128 | 2048 | 18.939 |
+ | | | | 16 | 128 | 2048 | 20.891 |
+ | | | | 32 | 128 | 2048 | 23.402 |
+ | | | | 64 | 128 | 2048 | 30.633 |
+ | | | | 128 | 128 | 2048 | 43.898 |
+ | | | | 1 | 2048 | 2048 | 15.678 |
+ | | | | 2 | 2048 | 2048 | 16.892 |
+ | | | | 4 | 2048 | 2048 | 17.781 |
+ | | | | 8 | 2048 | 2048 | 19.536 |
+ | | | | 16 | 2048 | 2048 | 22.521 |
+ | | | | 32 | 2048 | 2048 | 26.729 |
+ | | | | 64 | 2048 | 2048 | 36.794 |
+ | | | | 128 | 2048 | 2048 | 56.371 |
+ | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.446 |
+ | | | | 2 | 128 | 2048 | 46.223 |
+ | | | | 4 | 128 | 2048 | 47.833 |
+ | | | | 8 | 128 | 2048 | 52.085 |
+ | | | | 16 | 128 | 2048 | 54.378 |
+ | | | | 32 | 128 | 2048 | 63.108 |
+ | | | | 64 | 128 | 2048 | 81.764 |
+ | | | | 128 | 128 | 2048 | 109.479 |
+ | | | | 1 | 2048 | 2048 | 46.001 |
+ | | | | 2 | 2048 | 2048 | 46.720 |
+ | | | | 4 | 2048 | 2048 | 49.250 |
+ | | | | 8 | 2048 | 2048 | 54.495 |
+ | | | | 16 | 2048 | 2048 | 59.539 |
+ | | | | 32 | 2048 | 2048 | 73.906 |
+ | | | | 64 | 2048 | 2048 | 103.847 |
+ | | | | 128 | 2048 | 2048 | 151.613 |
*TP stands for Tensor Parallelism.*
- Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9554 Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu driver 6.8.5
+ Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9575F Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu driver 6.8.5
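For context, a single latency row corresponds to a run of the kind sketched below, assuming vLLM's standard `benchmarks/benchmark_latency.py` script inside the container; the path and flags are illustrative, and the exact commands appear in the next section.

```bash
# Illustrative sketch of one 70B latency row (batch 1, 128 input / 2048 output).
# Script path and flag values are assumptions, not the exact release commands.
python3 /app/vllm/benchmarks/benchmark_latency.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --quantization fp8 --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --batch-size 1 --input-len 128 --output-len 2048
```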
## Reproducing Benchmarked Results
@@ -490,7 +489,7 @@ To reproduce the release docker:
```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
- git checkout b8498bc4a1c2aae1e25cfc780db0eadbc4716c67
+ git checkout d60b5a337a552b6f74f511462d4ba67ea0ac4402
docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```
@@ -507,6 +506,11 @@ Use AITER release candidate branch instead:
## Changelog
+ 20250513_aiter:
+ - Out of memory bug fix
+ - PyTorch fixes
+ - Tunable ops fixes
+
20250410_aiter:
- 2-stage MoE
- MLA from AITER