Describe the bug
Following the suggested Mac installation and running with Meta-Llama-3-8B-Instruct-Q4_0.gguf, GPU inference aborts with uk.ac.manchester.tornado.api.exceptions.TornadoOutOfMemoryException (unable to allocate 117440536 bytes). After working around the error with --gpu-memory 96GB, inference still ends up slower than plain Llama3.java (5,10 vs. 7,35 tokens/s).
To Reproduce
Steps to reproduce the behavior:
- Follow the suggested Mac installation
- Use: Meta-Llama-3-8B-Instruct-Q4_0.gguf (the full invocation is reconstructed below)
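The failing invocation is presumably the same command shown under Fix below, just without the --gpu-memory override (a reconstruction, not a line copied from the original run):
./llama-tornado --gpu --verbose-init --opencl --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --prompt "tell me a joke"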
Expected behavior
Model inference runs without errors and is slightly faster than Llama3.java.
Screenshots / console output
TornadoVM GPU execution plan creation: 523,85 ms
Java to GPU JIT compiler warmup: 3019,81 ms
Exception in thread "main" uk.ac.manchester.tornado.api.exceptions.TornadoOutOfMemoryException: Unable to allocate 117440536 bytes of memory.
To increase the maximum device memory, use -Dtornado.device.memory=<SIZE>GB
at tornado.drivers.common@1.1.1-dev/uk.ac.manchester.tornado.drivers.common.TornadoBufferProvider.freeUnusedNativeBufferAndAssignRegion(TornadoBufferProvider.java:184)
at tornado.drivers.common@1.1.1-dev/uk.ac.manchester.tornado.drivers.common.TornadoBufferProvider.getOrAllocateBufferWithSize(TornadoBufferProvider.java:211)
at tornado.drivers.opencl@1.1.1-dev/uk.ac.manchester.tornado.drivers.opencl.mm.OCLMemorySegmentWrapper.allocate(OCLMemorySegmentWrapper.java:184)
at tornado.drivers.opencl@1.1.1-dev/uk.ac.manchester.tornado.drivers.opencl.runtime.OCLTornadoDevice.newDeviceBufferAllocation(OCLTornadoDevice.java:617)
at tornado.drivers.opencl@1.1.1-dev/uk.ac.manchester.tornado.drivers.opencl.runtime.OCLTornadoDevice.allocate(OCLTornadoDevice.java:630)
at tornado.drivers.opencl@1.1.1-dev/uk.ac.manchester.tornado.drivers.opencl.runtime.OCLTornadoDevice.allocateObjects(OCLTornadoDevice.java:593)
at tornado.runtime@1.1.1-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.executeAlloc(TornadoVMInterpreter.java:499)
at tornado.runtime@1.1.1-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:296)
at tornado.runtime@1.1.1-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:1028)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024)
at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
at tornado.runtime@1.1.1-dev/uk.ac.manchester.tornado.runtime.TornadoVM.executeInterpreterSingleThreaded(TornadoVM.java:127)
Desktop (please complete the following information):
- OS: macOS Sequoia 15.5
- Java: OpenJDK Runtime Environment Corretto-21.0.0.35.1 (build 21+35-LTS)
Fix
./llama-tornado --gpu-memory 96GB --gpu --verbose-init --opencl --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --prompt "tell me a joke"
(using --gpu-memory 96GB rather than -Dtornado.device.memory=96GB, which is what the error message suggests)
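To double-check that the property named in the exception is actually set on a given JVM, here is a minimal, hypothetical sanity check (PrintTornadoMemory is not part of this repo; only the tornado.device.memory property name comes from the error message above):
// PrintTornadoMemory.java - hypothetical helper, not part of this project.
// Prints the tornado.device.memory system property as seen by the current JVM,
// so you can confirm the value you passed actually arrived.
public class PrintTornadoMemory {
    public static void main(String[] args) {
        System.out.println("tornado.device.memory = "
                + System.getProperty("tornado.device.memory", "<not set>"));
    }
}
Run it as, for example: java -Dtornado.device.memory=96GB PrintTornadoMemory.java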
For comparison, Llama3.java is slightly faster:
java --enable-preview --source 21 --add-modules jdk.incubator.vector LLama3.java -i --model Meta-Llama-3-8B-Instruct-Q4_0.gguf
Note: LLama3.java uses preview features of Java SE 21.
Note: Recompile with -Xlint:preview for details.
Parse Meta-Llama-3-8B-Instruct-Q4_0.gguf: 417 millis
Load LlaMa model: 576 millis
tell me a joke
Here's one:
Why couldn't the bicycle stand up by itself?
(Wait for it...)
Because it was two-tired!
Hope that made you smile!
7,35 tokens/s (47)
vs.
./llama-tornado --gpu-memory 96GB --gpu --verbose-init --opencl --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --prompt "tell me a joke"
WARNING: Using incubator modules: jdk.incubator.vector
Parse Meta-Llama-3-8B-Instruct-Q4_0.gguf: 403 millis
Loading model weights in TornadoVM format (loading Q4_0 -> F16)
Load LlaMa model: 19720 millis
Starting TornadoVM initialization...
TornadoVM GPU execution plan creation: 370,35 ms
Java to GPU JIT compiler warmup: 1105,70 ms
Transfer read-only weights to GPU: 2001,91 ms
Finished TornadoVM initialization...
Here's one:
Why couldn't the bicycle stand up by itself?
(wait for it...)
Because it was two-tired!
Hope that made you smile!
achieved tok/s: 5,10. Tokens: 46, seconds: 9,02
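Putting the two runs side by side, the gap is roughly 7,35 vs. 5,10 tokens/s; a minimal sketch of the arithmetic, with every figure copied from the logs above (nothing measured independently):
// ThroughputCompare.java - all numbers are taken from the two runs above.
public class ThroughputCompare {
    public static void main(String[] args) {
        double llama3JavaTokPerSec = 7.35;     // Llama3.java run (47 tokens)
        int tornadoTokens = 46;                // TornadoVM run
        double tornadoSeconds = 9.02;
        double tornadoTokPerSec = tornadoTokens / tornadoSeconds;  // ~5.10 tok/s
        System.out.printf("TornadoVM: %.2f tok/s vs Llama3.java: %.2f tok/s%n",
                tornadoTokPerSec, llama3JavaTokPerSec);
    }
}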