Releases: ngxson/wllama
2.3.2
News
Important
🚀 This release marks a special event:
Firefox now officially uses wllama as one of the inference engines in its Link Preview feature!
The Link Preview feature is currently available in the Beta and Nightly builds. You can find the upstream code here.
Read more in this blog: https://blog.mozilla.org/en/mozilla/ai/ai-tech/ai-link-previews-firefox/

What's Changed
Full Changelog: 2.3.1...2.3.2
2.3.1
2.3.0
What's Changed
You can now use the `stream: true` option to get an `AsyncIterator`:
```ts
const messages: WllamaChatMessage[] = [
  { role: 'system', content: 'You are helpful.' },
  { role: 'user', content: 'Hi!' },
  { role: 'assistant', content: 'Hello!' },
  { role: 'user', content: 'How are you?' },
];
const stream = await wllama.createChatCompletion(messages, {
  nPredict: 10,
  sampling: {
    temp: 0.0,
  },
  stream: true, // ADD THIS
});
for await (const chunk of stream) {
  console.log(chunk.currentText);
}
```
Additionally, you can also use an `AbortSignal` to stop a generation mid-way, much like how it's used in the `fetch` API. Here is an example:
```ts
const abortController = new AbortController();
const stream = await wllama.createChatCompletion(messages, {
  abortSignal: abortController.signal, // ADD THIS
  stream: true,
});
// call abortController.abort(); to abort it
// note: this can also be called during prompt processing
```
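For completeness, here is a minimal sketch of wiring the abort to a timeout and consuming the stream. The `try/catch` is there because how the abort surfaces to the iterator (ending early vs. rejecting) is an assumption on my part, not documented behavior:

```ts
// Hypothetical: abort automatically after 5 seconds.
const timeout = setTimeout(() => abortController.abort(), 5000);

try {
  for await (const chunk of stream) {
    console.log(chunk.currentText);
  }
} catch (err) {
  // Assumption: an aborted generation may surface as a rejected iteration.
  console.log('generation aborted or failed', err);
} finally {
  clearTimeout(timeout);
}
```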
Gemma 3 support: With the up-to-date llama.cpp source code, you can now use Gemma 3 models!
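As a quick illustration, here is a hedged sketch of loading a Gemma 3 GGUF in the browser. The `CONFIG_PATHS` keys depend on your wllama version and bundler setup (check the project README), and the model URL is a placeholder, not an official recommendation:

```ts
import { Wllama } from '@wllama/wllama';

// Placeholder WASM paths; adjust to where your bundler serves the wllama binaries.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '/esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/esm/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS);

// Placeholder URL: point this at any Gemma 3 GGUF you have access to.
await wllama.loadModelFromUrl(
  'https://example.com/models/gemma-3-1b-it-Q4_K_M.gguf'
);

const answer = await wllama.createChatCompletion(
  [{ role: 'user', content: 'Hello, Gemma!' }],
  { nPredict: 64 }
);
console.log(answer);
```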
- build single-file mjs + minified version by @ngxson in #161
- bump to latest upstream llama.cpp source code by @ngxson in #162
- add support for async generator by @ngxson in #163
- add "stream" option for AsyncIterator by @ngxson in #164
- add test for abortSignal by @ngxson in #165
- bump to latest upstream llama.cpp source code by @ngxson in #166
Full Changelog: 2.2.1...2.3.0
2.2.1
2.2.0
v2.2.0 - x2 speed for Qx_K and Qx_0 quantization
A BIG release has dropped! The biggest changes include:
- x2 speed for Qx_K and Qx_0 quantization 🚀 see this PR: ggml-org/llama.cpp#11453 (it's not merged upstream yet, but I included it inside wllama as a patch) - IQx quants will still be slow, but work on them is already planned
- Switched to a binary protocol for the JS <==> WASM connection. The `json.hpp` dependency is now gone! Calling `wllama.tokenize()` on a long text is now faster than ever! 🎉 (see the quick sketch below)
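To put the tokenizer speed-up in context, here is a minimal sketch of the call in question; the return shape (a list of token IDs) reflects my reading of the API and should be checked against the current typings:

```ts
// Assumes a model has already been loaded (e.g. via loadModelFromUrl).
const longText = 'Lorem ipsum dolor sit amet, '.repeat(1000);

// tokenize() now round-trips through the binary protocol instead of JSON,
// which is what makes long inputs noticeably faster in 2.2.0.
const tokens = await wllama.tokenize(longText);
console.log(`token count: ${tokens.length}`);
```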
Debut at FOSDEM 2025
Last week, I gave a 15-minute talk at FOSDEM 2025 which, for the first time, introduced wllama to the real world!
Watch the talk here: https://fosdem.org/2025/schedule/event/fosdem-2025-5154-wllama-bringing-llama-cpp-to-the-web/
What's Changed
- add benchmark function, used internally by @ngxson in #151
- switch to binary protocol between JS and WASM world (glue.cpp) by @ngxson in #154
- Remove json.hpp dependency by @ngxson in #155
- temporary apply that viral x2 speedup PR by @ngxson in #156
- Fix a bug with kv_remove, release v2.2.0 by @ngxson in #157
Full Changelog: 2.1.3...2.2.0
2.1.4
2.1.3
What's Changed
Try it via the demo app: https://huggingface.co/spaces/ngxson/wllama

Full Changelog: 2.1.2...2.1.3
2.1.2
2.1.1
2.1.0
What's Changed
- added `createChatCompletion` --> #140
Example:
```ts
const messages: WllamaChatMessage[] = [
  { role: 'system', content: 'You are helpful.' },
  { role: 'user', content: 'Hi!' },
  { role: 'assistant', content: 'Hello!' },
  { role: 'user', content: 'How are you?' },
];
const completion = await wllama.createChatCompletion(messages, {
  nPredict: 10,
});
```
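A possible follow-up, assuming the promise resolves to the assistant's reply text (my reading of the API, worth double-checking against the typings), is to feed the reply back in for the next turn:

```ts
// Assumption: createChatCompletion resolves to the assistant's reply text.
messages.push({ role: 'assistant', content: completion });
messages.push({ role: 'user', content: 'Tell me more.' });
const followUp = await wllama.createChatCompletion(messages, { nPredict: 10 });
console.log(followUp);
```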