-
I just discovered this project after coming back to LLMs after a long time away, and I was amazed by it. If you haven't already tried it, I'd recommend checking out LangChain4J (https://github.com/langchain4j/langchain4j), as they have demos of RAG in Java, though it does require Ollama. This llama3.java implementation doesn't appear to use Ollama at all, which is very impressive. Having RAG would be amazing, but it's understandable if that's something further down the road in development.
-
I implemented a low-rent RAG-lite with Jsoup: detect a URL typed on the command line with an XPath directive at the end, grab the content with Jsoup, apply the XPath, replace the command-line text with the result, and voilà! I also integrated my DB to store dialog and reload context from the command line: https://github.com/neocoretechs/Llama4j
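For anyone curious, here's roughly what that trick looks like. This is a minimal sketch, not the actual Llama4j code, and the ` ::` directive syntax is invented here purely for illustration:

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class UrlRag {

    /** If the prompt line starts with a URL, replace it with the page text. */
    static String expandUrl(String line) throws IOException {
        if (!line.startsWith("http://") && !line.startsWith("https://")) {
            return line; // not a URL, pass the prompt through untouched
        }
        String url = line;
        String xpath = "//body"; // default: whole page text
        int sep = line.indexOf(" ::");
        if (sep >= 0) { // optional XPath directive at the end of the line
            url = line.substring(0, sep);
            xpath = line.substring(sep + 3).trim();
        }
        Document doc = Jsoup.connect(url).get(); // fetch and parse the page
        return doc.selectXpath(xpath).text();    // keep only the selected nodes' text
    }

    public static void main(String[] args) throws IOException {
        // e.g. pull just the paragraphs of a page into the context window
        System.out.println(expandUrl("https://example.com :://p"));
    }
}
```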
-
Why does the Llama model use a GPT-2 tokenizer, but the Gemma model a Llama tokenizer? P.S. I also notice the code seems to support FP16 and BF16, but only the Q4 and Q8 quantizations are recommended? Is that the case, and is it still the case with the new version?
-
There are two tokenizer flavors. I can already run the latest versions of Llama (Meta), Mistral, Gemma (Google), Qwen (Alibaba), Phi (Microsoft), Granite (IBM), SmolLM (Hugging Face), the DeepSeek R1 distills ... I added support for F16 and BF16 just to compare against the quantizations. My original goal was to implement and release all the building blocks needed to consume/run local models using Java, e.g. a fast inference engine, GGUF/Safetensors model format parsers, tokenizers, a tiny tensor library ... but I only managed to complete and release the GGUF library. I also wrote a fast tokenizer library (faster than OpenAI's tiktoken) but couldn't release it ... I got in trouble with my employer over this, so I've completely stopped working on it. I'll keep my weekend projects private for now. Here's a rough draft of the original project goals: ... Here's an old implementation of Mistral.java with the ...
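To answer the tokenizer question above: GGUF files declare their tokenizer flavor in the metadata under the `tokenizer.ggml.model` key, where `"gpt2"` means byte-level BPE (Llama 3, Qwen, Phi lineage) and `"llama"` means SentencePiece-style BPE (Gemma, Mistral, Llama 1/2 lineage), so a loader just dispatches on it. A minimal sketch of that dispatch; the interface and stub implementations are invented here, this isn't llama3.java's actual layout:

```java
import java.util.Map;

interface Tokenizer {
    int[] encode(String text);
}

final class TokenizerFactory {
    static Tokenizer fromGguf(Map<String, Object> metadata) {
        String flavor = (String) metadata.get("tokenizer.ggml.model");
        return switch (flavor) {
            // byte-level BPE over raw UTF-8 bytes (GPT-2 lineage)
            case "gpt2"  -> text -> new int[0]; // stub; a real one merges byte pairs
            // SentencePiece-style BPE over unicode text (Llama lineage)
            case "llama" -> text -> new int[0]; // stub
            default -> throw new IllegalArgumentException("Unknown tokenizer: " + flavor);
        };
    }
}
```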
-
@mukel @neocoretechs regarding GPU support for Llama3.java: we plan to release, as soon as next week, a repo based on Llama3.java that uses TornadoVM to offload the whole transformer architecture for inference on GPUs through Java. For v1.0, all Llama3 models that @mukel provided will be supported for Q4 and Q8. Initial support will include optimized OpenCL and PTX backends for Nvidia GPUs, and OpenCL support for Apple M-series silicon.
-
We are excited to share https://github.com/beehive-lab/GPULlama3.java, which builds on @mukel's Llama3.java and enables GPU acceleration through TornadoVM.
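For a feel of the programming model, here's a minimal sketch (not GPULlama3.java's actual kernels, and API roughly as in TornadoVM 1.x): you write a kernel as plain Java with a `@Parallel` loop, then build a task graph that moves data and runs it on the GPU. Shown for a matrix-vector multiply, the workhorse of transformer inference:

```java
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class MatVecGpu {

    // Plain Java kernel: TornadoVM JIT-compiles the @Parallel loop to OpenCL/PTX.
    public static void matVec(FloatArray w, FloatArray x, FloatArray out, int rows, int cols) {
        for (@Parallel int i = 0; i < rows; i++) {
            float sum = 0f;
            for (int j = 0; j < cols; j++) {
                sum += w.get(i * cols + j) * x.get(j);
            }
            out.set(i, sum);
        }
    }

    public static void main(String[] args) {
        int rows = 4096, cols = 4096;
        FloatArray w = new FloatArray(rows * cols);
        FloatArray x = new FloatArray(cols);
        FloatArray out = new FloatArray(rows);
        w.init(0.01f); // fill with dummy weights
        x.init(1.0f);

        TaskGraph graph = new TaskGraph("layer")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, w, x) // weights stay on-device
                .task("matvec", MatVecGpu::matVec, w, x, out, rows, cols)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, out);   // copy result back each run

        TornadoExecutionPlan plan = new TornadoExecutionPlan(graph.snapshot());
        plan.execute();
        System.out.println(out.get(0));
    }
}
```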
-
Any idea how I can start implementing RAG keeping this as a base? Full Java from scratch, using some vector DB.
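Not a full answer, but maybe a starting point: the retrieval core is small enough to write from scratch, no libraries. A minimal sketch below; the `embed()` stub is hypothetical and stands in for a real embedding model (this repo only does generation, so that part is external), and a real vector DB would replace the linear scan with an ANN index:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MiniVectorStore {
    record Chunk(String text, float[] vector) {}

    private final List<Chunk> chunks = new ArrayList<>();

    void add(String text) {
        chunks.add(new Chunk(text, embed(text)));
    }

    /** Return the k chunks most similar to the query, by cosine similarity. */
    List<String> topK(String query, int k) {
        float[] q = embed(query);
        return chunks.stream()
                .sorted(Comparator.comparingDouble((Chunk c) -> -cosine(q, c.vector())))
                .limit(k)
                .map(Chunk::text)
                .toList();
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
    }

    // Hypothetical placeholder: a toy character-histogram "embedding".
    // Swap in a call to a real embedding model here.
    static float[] embed(String text) {
        float[] v = new float[64];
        for (int i = 0; i < text.length(); i++) v[text.charAt(i) % 64] += 1f;
        return v;
    }
}
```

The strings returned by `topK()` then just get concatenated into the prompt before handing it to the model.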