@@ -43,17 +43,17 @@ Previous integration of TornadoVM and Llama2 can be found in <a href="https:
This table shows inference performance across different hardware and precision configurations.
- | Vendor / Backend | Hardware | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
- | :----------------------------:| :------------:| :---------------------:| :---------------------:| :---------------------:| :-------------:|
- | | | **Q8_0** | **Q4_0** | **Q4_0** | **Support** |
- | **NVIDIA / OpenCL-PTX** | RTX 3070 | 52 tokens/s | 50.56 tokens/s | 22.96 tokens/s | ✅ |
- | | RTX 4090 | 66.07 tokens/s | 65.81 tokens/s | 35.51 tokens/s | ✅ |
- | | RTX 5090 | 96.65 tokens/s | 94.71 tokens/s | 47.68 tokens/s | ✅ |
- | | L4 Tensor | 52.96 tokens/s | 52.92 tokens/s | 22.68 tokens/s | ✅ |
- | **Intel / OpenCL** | Arc A770 | 15.65 tokens/s | 15.09 tokens/s | 7.02 tokens/s | (WIP) |
- | **Apple Silicon / OpenCL** | M3 Pro | 14.04 tokens/s | 13.83 tokens/s | 6.78 tokens/s | (WIP) |
- | | M4 Pro | 16.77 tokens/s | 16.67 tokens/s | 8.56 tokens/s | (WIP) |
- | **AMD / OpenCL** | Radeon RX | (WIP) | (WIP) | (WIP) | (WIP) |
+ | Vendor / Backend | Hardware | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
+ | :----------------------------:| :------------:| :---------------------:| :---------------------:| :-------------:|
+ | | | **FP16** | **FP16** | **Support** |
+ | **NVIDIA / OpenCL-PTX** | RTX 3070 | 52 tokens/s | 22.96 tokens/s | ✅ |
+ | | RTX 4090 | 66.07 tokens/s | 35.51 tokens/s | ✅ |
+ | | RTX 5090 | 96.65 tokens/s | 47.68 tokens/s | ✅ |
+ | | L4 Tensor | 52.96 tokens/s | 22.68 tokens/s | ✅ |
+ | **Intel / OpenCL** | Arc A770 | 15.65 tokens/s | 7.02 tokens/s | (WIP) |
+ | **Apple Silicon / OpenCL** | M3 Pro | 14.04 tokens/s | 6.78 tokens/s | (WIP) |
+ | | M4 Pro | 16.77 tokens/s | 8.56 tokens/s | (WIP) |
+ | **AMD / OpenCL** | Radeon RX | (WIP) | (WIP) | (WIP) |
##### ⚠️ Note on Apple Silicon Performance
@@ -118,42 +118,46 @@ source set_paths
make
# Run the model (make sure you have downloaded the model file first - see below)
- ./llama-tornado --gpu --verbose-init --opencl --model Llama-3.2-1B-Instruct-Q4_0.gguf --prompt "tell me a joke"
+ ./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
```
-----------
- The above model can be swapped with one of the other models, such as `Llama-3.2-3B-Instruct-Q4_0.gguf` or `Meta-Llama-3-8B-Instruct-Q4_0.gguf`, depending on your needs.
+ The above model can be swapped with one of the other models, such as `beehive-llama-3.2-3b-instruct-fp16.gguf` or `beehive-llama-3.2-8b-instruct-fp16.gguf`, depending on your needs.
See the models listed below.
## Download Model Files
- Download `Q4_0` and (optionally) `Q8_0` quantized .gguf files from:
- - https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF
- - https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF
- - https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF
- - https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF
+ Download the `FP16` .gguf model files from:
+ - https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16
+ - https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16
+ - https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16
- The `Q4_0` quantized models are recommended, except for the very small models (1B), please be gentle with [huggingface.co](https://huggingface.co) servers:
+ Please be gentle with [huggingface.co](https://huggingface.co) servers:
+
+ **Note:** FP16 models are first-class citizens in the current version.
+ ```
+ # Llama 3.2 (1B) - FP16
+ wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf
+
+ # Llama 3.2 (3B) - FP16
+ wget https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-3b-instruct-fp16.gguf
+
+ # Llama 3 (8B) - FP16
+ wget https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-8b-instruct-fp16.gguf
+ ```
+
+ **[Experimental]** You can also download the Q8_0 and Q4_0 models used in the original Llama3.java implementation, but for now they are dequantized to FP16 for TornadoVM support:
```
# Llama 3.2 (1B) - Q4_0
curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
-
# Llama 3.2 (3B) - Q4_0
curl -L -O https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
-
# Llama 3 (8B) - Q4_0
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf
-
# Llama 3.2 (1B) - Q8_0
curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
-
# Llama 3.1 (8B) - Q4_0
curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf
-
- # Llama 3 (8B) - Q8_0
- # Optionally download the Q8_0 quantized models
- # curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
- # curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
```
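+
+ For reference, here is a minimal sketch of how a single GGUF `Q8_0` block can be dequantized to plain float values, assuming the standard `Q8_0` layout of one FP16 scale followed by 32 signed 8-bit weights; the helper below is hypothetical and is not the project's actual loader code:
+
+ ```java
+ // Hypothetical helper illustrating Q8_0 -> float dequantization (w = d * q).
+ // Not taken from this repository; shown only to clarify what "dequantized to FP16" means above.
+ public final class Q8_0Dequant {
+     public static final int BLOCK_SIZE = 32; // weights per Q8_0 block
+
+     // 'scale' is the block's FP16 scale already widened to float;
+     // 'quants' holds the block's 32 signed 8-bit quantized weights.
+     public static void dequantizeBlock(float scale, byte[] quants, float[] out, int outOffset) {
+         for (int i = 0; i < BLOCK_SIZE; i++) {
+             out[outOffset + i] = scale * quants[i];
+         }
+     }
+ }
+ ```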
-----------
@@ -168,18 +172,13 @@ To execute Llama3 models with TornadoVM on GPUs use the `llama-tornado` script w
Run a model with a text prompt:
``` bash
- ./llama-tornado --gpu --verbose-init --opencl --model Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "Explain the benefits of GPU acceleration."
+ ./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "Explain the benefits of GPU acceleration."
```
- #### GPU Execution (Q8_0 Model)
+ #### GPU Execution (FP16 Model)
- Enable GPU acceleration with Q8_0 quantization:
+ Enable GPU acceleration with the FP16 model:
``` bash
- llama-tornado --gpu --verbose-init --model Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "tell me a joke"
- ```
- #### GPU Execution (Q4_0 Model)
- Run with Q4_0 quantization for lower memory usage:
- ``` bash
- llama-tornado --gpu --verbose-init --model Llama-3.2-1B-Instruct-Q4_0.gguf --prompt "tell me a joke"
+ llama-tornado --gpu --verbose-init --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
```
-----------
@@ -188,7 +187,7 @@ llama-tornado --gpu --verbose-init --model Llama-3.2-1B-Instruct-Q4_0.gguf --pr
### Out of Memory Error
- You may encounter an out of memory error like:
+ You may encounter an out-of-memory error like:
```
Exception in thread "main" uk.ac.manchester.tornado.api.exceptions.TornadoOutOfMemoryException: Unable to allocate 100663320 bytes of memory.
To increase the maximum device memory, use -Dtornado.device.memory=<X>GB
@@ -202,10 +201,10 @@ First, check your GPU specifications. If your GPU has high memory capacity, you
``` bash
# For 3B models, try increasing to 15GB
- ./llama-tornado --gpu --model Llama-3.2-3B-Instruct-Q4_0.gguf --prompt "Tell me a joke" --gpu-memory 15GB
+ ./llama-tornado --gpu --model beehive-llama-3.2-3b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 15GB
# For 8B models, you may need even more (20GB or higher)
- ./llama-tornado --gpu --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --prompt "Tell me a joke" --gpu-memory 20GB
+ ./llama-tornado --gpu --model beehive-llama-3.2-8b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 20GB
```
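+
+ As a rough rule of thumb (an estimate, not a measured figure), an FP16 model needs about 2 bytes per parameter for the weights alone: roughly 2 GB for the 1B model, 6 GB for the 3B model, and 16 GB for the 8B model. The KV cache and TornadoVM buffers need additional headroom on top of that, which is why the examples above raise the limit to 15 GB and 20 GB.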
### GPU Memory Requirements by Model Size
@@ -320,7 +319,7 @@ This flag shows the exact Java command with all JVM flags that are being invoked
This makes it simple to replicate the invocation or embed the flags in any external tool or codebase.
``` bash
- llama-tornado --gpu --model Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "tell me a joke" --show-command
+ llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke" --show-command
```
<details>
@@ -363,7 +362,7 @@ llama-tornado --gpu --model Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "tell me a
--add-modules ALL-SYSTEM,tornado.runtime,tornado.annotation,tornado.drivers.common,tornado.drivers.opencl \
-cp /home/mikepapadim/repos/gpu-llama3.java/target/gpu-llama3-1.0-SNAPSHOT.jar \
com.example.LlamaApp \
- -m Llama-3.2-1B-Instruct-Q8_0.gguf \
+ -m beehive-llama-3.2-1b-instruct-fp16.gguf \
--temperature 0.1 \
--top-p 0.95 \
--seed 1746903566 \