Commit 4607886

Update README to reflect migration to FP16 models
Migrated from Q4/Q8 quantization to FP16 models across documentation and examples for improved compatibility with TornadoVM. Adjusted download links, execution commands, and model references accordingly for consistency.
1 parent 7d23e49 commit 4607886
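In practice, the migration only swaps the model files referenced throughout the README: every documented command now points at an FP16 `.gguf` from the beehive-lab repositories instead of a `Q4_0`/`Q8_0` file from the mukel repositories. A minimal before/after sketch, using only filenames and flags that appear in the diff below:

```bash
# Before this commit: Q4_0 quantized model from the mukel Hugging Face repositories
./llama-tornado --gpu --verbose-init --opencl --model Llama-3.2-1B-Instruct-Q4_0.gguf --prompt "tell me a joke"

# After this commit: FP16 model from the beehive-lab Hugging Face repositories
./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
```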

README.md

Lines changed: 40 additions & 41 deletions
@@ -43,17 +43,17 @@ Previous intergration of TornadoVM and Llama2 it can be found in <a href="https:
 
 This table shows inference performance across different hardware and quantization options.
 
-| Vendor / Backend | Hardware | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
-|:----------------------------:|:------------:|:---------------------:|:---------------------:|:---------------------:|:-------------:|
-| | | **Q8_0** | **Q4_0** | **Q4_0** | **Support** |
-| **NVIDIA / OpenCL-PTX** | RTX 3070 | 52 tokens/s | 50.56 tokens/s | 22.96 tokens/s ||
-| | RTX 4090 | 66.07 tokens/s | 65.81 tokens/s | 35.51 tokens/s ||
-| | RTX 5090 | 96.65 tokens/s | 94.71 tokens/s | 47.68 tokens/s ||
-| | L4 Tensor | 52.96 tokens/s | 52.92 tokens/s | 22.68 tokens/s ||
-| **Intel / OpenCL** | Arc A770 | 15.65 tokens/s | 15.09 tokens/s | 7.02 tokens/s | (WIP) |
-| **Apple Silicon / OpenCL** | M3 Pro | 14.04 tokens/s | 13.83 tokens/s | 6.78 tokens/s | (WIP) |
-| | M4 Pro | 16.77 tokens/s | 16.67 tokens/s | 8.56 tokens/s | (WIP) |
-| **AMD / OpenCL** | Radeon RX | (WIP) | (WIP) | (WIP) | (WIP) |
+| Vendor / Backend | Hardware | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
+|:----------------------------:|:------------:|:---------------------:|:---------------------:|:-------------:|
+| | | **FP16** | **FP16** | **Support** |
+| **NVIDIA / OpenCL-PTX** | RTX 3070 | 52 tokens/s | 22.96 tokens/s ||
+| | RTX 4090 | 66.07 tokens/s | 35.51 tokens/s ||
+| | RTX 5090 | 96.65 tokens/s | 47.68 tokens/s ||
+| | L4 Tensor | 52.96 tokens/s | 22.68 tokens/s ||
+| **Intel / OpenCL** | Arc A770 | 15.65 tokens/s | 7.02 tokens/s | (WIP) |
+| **Apple Silicon / OpenCL** | M3 Pro | 14.04 tokens/s | 6.78 tokens/s | (WIP) |
+| | M4 Pro | 16.77 tokens/s | 8.56 tokens/s | (WIP) |
+| **AMD / OpenCL** | Radeon RX | (WIP) | (WIP) | (WIP) |
 
 ##### ⚠️ Note on Apple Silicon Performance
 
@@ -118,42 +118,46 @@ source set_paths
 make
 
 # Run the model (make sure you have downloaded the model file first - see below)
-./llama-tornado --gpu --verbose-init --opencl --model Llama-3.2-1B-Instruct-Q4_0.gguf --prompt "tell me a joke"
+./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
 ```
 -----------
 
-The above model can we swapped with one of the other models, such as `Llama-3.2-3B-Instruct-Q4_0.gguf` or `Meta-Llama-3-8B-Instruct-Q4_0.gguf`, depending on your needs.
+The above model can we swapped with one of the other models, such as `beehive-llama-3.2-3b-instruct-fp16.gguf` or `beehive-llama-3.2-8b-instruct-fp16.gguf`, depending on your needs.
 Check models below.
 
 ## Download Model Files
 
-Download `Q4_0` and (optionally) `Q8_0` quantized .gguf files from:
-- https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF
-- https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF
-- https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF
-- https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF
+Download `FP16` quantized .gguf files from:
+- https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16
+- https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16
+- https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16
 
-The `Q4_0` quantized models are recommended, except for the very small models (1B), please be gentle with [huggingface.co](https://huggingface.co) servers:
+Please be gentle with [huggingface.co](https://huggingface.co) servers:
+
+**Note** FP16 models are first-class citizens for the current version.
+```
+# Llama 3.2 (1B) - FP16
+wget https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-1b-instruct-fp16.gguf
+
+# Llama 3.2 (3B) - FP16
+wget https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-3b-instruct-fp16.gguf
+
+# Llama 3 (8B) - FP16
+wget https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-8b-instruct-fp16.gguf
+```
+
+**[Experimental]** you can download the Q8 and Q4 used in the original implementation of Llama3.java, but for now are going to be dequanted to FP16 for TornadoVM support:
 ```
 # Llama 3.2 (1B) - Q4_0
 curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
-
 # Llama 3.2 (3B) - Q4_0
 curl -L -O https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
-
 # Llama 3 (8B) - Q4_0
 curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf
-
 # Llama 3.2 (1B) - Q8_0
 curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
-
 # Llama 3.1 (8B) - Q8_0
 curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf
-
-# Llama 3 (8B) - Q8_0
-# Optionally download the Q8_0 quantized models
-# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
-# curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
 ```
 
 -----------
@@ -168,18 +172,13 @@ To execute Llama3 models with TornadoVM on GPUs use the `llama-tornado` script w
 Run a model with a text prompt:
 
 ```bash
-./llama-tornado --gpu --verbose-init --opencl --model Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "Explain the benefits of GPU acceleration."
+./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "Explain the benefits of GPU acceleration."
 ```
 
-#### GPU Execution (Q8_0 Model)
+#### GPU Execution (FP16 Model)
 Enable GPU acceleration with Q8_0 quantization:
 ```bash
-llama-tornado --gpu --verbose-init --model Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "tell me a joke"
-```
-#### GPU Execution (Q4_0 Model)
-Run with Q4_0 quantization for lower memory usage:
-```bash
-llama-tornado --gpu --verbose-init --model Llama-3.2-1B-Instruct-Q4_0.gguf --prompt "tell me a joke"
+llama-tornado --gpu --verbose-init --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
 ```
 
 -----------
@@ -188,7 +187,7 @@ llama-tornado --gpu --verbose-init --model Llama-3.2-1B-Instruct-Q4_0.gguf --pr
 
 ### Out of Memory Error
 
-You may encounter an out of memory error like:
+You may encounter an out-of-memory error like:
 ```
 Exception in thread "main" uk.ac.manchester.tornado.api.exceptions.TornadoOutOfMemoryException: Unable to allocate 100663320 bytes of memory.
 To increase the maximum device memory, use -Dtornado.device.memory=<X>GB
@@ -202,10 +201,10 @@ First, check your GPU specifications. If your GPU has high memory capacity, you
 
 ```bash
 # For 3B models, try increasing to 15GB
-./llama-tornado --gpu --model Llama-3.2-3B-Instruct-Q4_0.gguf --prompt "Tell me a joke" --gpu-memory 15GB
+./llama-tornado --gpu --model beehive-llama-3.2-3b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 15GB
 
 # For 8B models, you may need even more (20GB or higher)
-./llama-tornado --gpu --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --prompt "Tell me a joke" --gpu-memory 20GB
+./llama-tornado --gpu --model beehive-llama-3.2-8b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 20GB
 ```
 
 ### GPU Memory Requirements by Model Size
@@ -320,7 +319,7 @@ This flag shows the exact Java command with all JVM flags that are being invoked
 Hence, it makes it simple to replicate or embed the invoked flags in any external tool or codebase.
 
 ```bash
-llama-tornado --gpu --model Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "tell me a joke" --show-command
+llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke" --show-command
 ```
 
 <details>
@@ -363,7 +362,7 @@ llama-tornado --gpu --model Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "tell me a
 --add-modules ALL-SYSTEM,tornado.runtime,tornado.annotation,tornado.drivers.common,tornado.drivers.opencl \
 -cp /home/mikepapadim/repos/gpu-llama3.java/target/gpu-llama3-1.0-SNAPSHOT.jar \
 com.example.LlamaApp \
--m Llama-3.2-1B-Instruct-Q8_0.gguf \
+-m beehive-llama-3.2-1b-instruct-fp16.gguf \
 --temperature 0.1 \
 --top-p 0.95 \
 --seed 1746903566 \
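Taken together, the updated README describes a straightforward FP16 workflow: download one of the beehive-lab FP16 models and pass it to `llama-tornado`. A short end-to-end sketch assembled only from commands added in this diff (the `--gpu-memory 15GB` value follows the README's own suggestion for 3B models):

```bash
# Download the 3B FP16 model (link added in this commit)
wget https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-3b-instruct-fp16.gguf

# Run it on the GPU with a larger device-memory budget, per the README's 3B guidance
./llama-tornado --gpu --model beehive-llama-3.2-3b-instruct-fp16.gguf --prompt "Tell me a joke" --gpu-memory 15GB
```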
