Summary
A signed vs. unsigned integer overflow in llama.cpp's tokenizer implementation (llama_vocab::tokenize, src/llama-vocab.cpp:3036) results in unintended behavior in the token-copying size comparison, allowing a heap overflow of the llama.cpp inferencing engine via carefully manipulated text input (human messages, prompts, templates) during the tokenization process.

This is dangerous because llama_vocab::tokenize is used everywhere human input is processed (llama_vocab::tokenize -> llama_tokenize -> tokenize_prompt, then generate, ...), meaning every tokenized input is a potential entry point (it affects all user input: prompts, messages, templates). It could be considered less dangerous because most msg.role + content buffers are initialized with std::vector buf(alloc_size) (common/chat.cpp:1831), which has built-in protection against sizes greater than max_size() (what(): cannot create std::vector larger than max_size()).

Nevertheless, during research it was found that this protection is bypassable by exploiting the recently added jinja template support (common_chat_templates_apply_jinja, as it inherits the memory space of tmpl via tmpl.apply).
Details
A single line in llama_vocab::tokenize, llama.cpp's tokenizer implementation, causes this vulnerability. Before we dissect how this heap overflow forms, let's look into how it is used and referenced in the tokenization process.
// src/llama-vocab.cpp:3055-3073
int32_t llama_vocab::tokenize(
                  const char * text,
                       int32_t text_len,
                 llama_token * tokens,
                       int32_t n_tokens_max,
                          bool add_special,
                          bool parse_special) const {
    auto res = tokenize(std::string(text, text_len), add_special, parse_special);
    if (n_tokens_max < (int) res.size()) {
        // LLAMA_LOG_ERROR("%s: too many tokens\n", __func__);
        return -((int) res.size());
    }

    for (size_t i = 0; i < res.size(); i++) {
        tokens[i] = res[i];
    }

    return res.size();
}
tokenize philosophy
llama_vocab::tokenize()
acts as an interface adapter that calls the underlying tokenize
(llama_vocab::impl::tokenize
), which is the lower-part of the tokenization process, where the inner vocab
-ing is involved (e.g. LLAMA_VOCAB_TYPE_*
, determined by tokenizer.ggml.model
), (you will see later why is it design in this specific way). We won't dive into llama_vocab::impl::tokenize
now, since it's implementations don't matters now. (we explains later on why it generates another stack-overflow
)
int32_t llama_tokenize(
    const struct llama_vocab * vocab,
    // ....
    return vocab->tokenize(text, text_len, tokens, n_tokens_max, add_special, parse_special);
}
llama_tokenize thinly wraps vocab->tokenize (the llama_vocab::tokenize interface). It is also the common tokenizer API you'll see a lot in llama.cpp's implementation, used directly in run/run.cpp (./bin/llama-run's implementation) or in common.cpp (./common/common.cpp, which is then used everywhere, e.g. server.cpp (./bin/llama-server), tts.cpp, tokenize.cpp, ...).
std::vector<llama_token> common_tokenize(
    const struct llama_vocab * vocab,
           const std::string & text,
                          bool add_special,
                          bool parse_special) {
    // upper limit for the number of tokens
    int n_tokens = text.length() + 2 * add_special;
    std::vector<llama_token> result(n_tokens);
    n_tokens = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
    if (n_tokens < 0) {
        result.resize(-n_tokens);
        int check = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
        GGML_ASSERT(check == -n_tokens);
    //...
static int tokenize_prompt(const llama_vocab * vocab, const std::string & prompt,
                           std::vector<llama_token> & prompt_tokens, const LlamaData & llama_data) {
    const bool is_first = llama_memory_seq_pos_max(llama_get_memory(llama_data.context.get()), 0) == 0;

    const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
    prompt_tokens.resize(n_prompt_tokens);
    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), is_first,
                       true) < 0) {
        printe("failed to tokenize the prompt\n");
        return -1;
    }
If you look closely at the two implementations, you will see that both callers of llama_tokenize() adhere to a common design for allocation in the tokenization process (a minimal sketch of this contract follows the list):
- Initialize the buffer for llama_token * tokens (result) with a smaller allocation, text.length() + 2 * add_special, or prompt_tokens (std::vector<llama_token> tokens;).
- Call llama_tokenize -> llama_vocab::impl::tokenize a first time to probe the length of the tokens (res), with n_tokens_max set to zero or a smaller size to guarantee no actual copying of the result happens.
- resize() the result vector with the negative length returned from llama_tokenize.
- Call llama_tokenize a second time; this time the output of llama_vocab::impl::tokenize is guaranteed to be saved into llama_token * tokens.
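Below is a minimal sketch of that probe-and-resize contract. The fake_tokenize helper and its one-token-per-byte behavior are hypothetical stand-ins for llama_tokenize, meant only to illustrate the calling convention.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in that mimics the llama_tokenize contract:
// a negative return value means "buffer too small, this is the required size".
static int32_t fake_tokenize(const char * text, int32_t text_len,
                             int32_t * tokens, int32_t n_tokens_max) {
    int32_t n = text_len;              // pretend every byte becomes one token
    if (n_tokens_max < n) {
        return -n;                     // probe path: report the required size
    }
    for (int32_t i = 0; i < n; i++) {
        tokens[i] = (int32_t) (unsigned char) text[i];
    }
    return n;
}

static std::vector<int32_t> tokenize_with_probe(const std::string & text) {
    // 1) probe: pass a null, zero-sized buffer and negate the result to get the needed size
    const int32_t needed = -fake_tokenize(text.data(), (int32_t) text.size(), nullptr, 0);

    // 2) size the real buffer from the probe result
    std::vector<int32_t> tokens(needed);

    // 3) second call: n_tokens_max now equals the required size, so the copy fits
    fake_tokenize(text.data(), (int32_t) text.size(), tokens.data(), (int32_t) tokens.size());
    return tokens;
}

int main() {
    auto toks = tokenize_with_probe("hello");
    return (int) toks.size();          // 5 for this toy tokenizer
}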
This explains why a negative return value is constructed in llama_tokenize: the tokenizer dynamically determines the required output size of the token array. Although this comes at the cost of calling llama_vocab::impl::tokenize twice, it guarantees efficient memory usage. It is also, however, the cause of this heap overflow.

if (n_tokens_max < (int) res.size()) casts tokenize(...).size() (std::vector::size(), a size_t) to (int) to detect the case where the size of the tokenized vector exceeds n_tokens_max (passed in as an argument).

The cast intuitively makes sense: n_tokens_max is int32_t, i.e. signed (as you can see in the signature above), so res.size() is cast to a signed int to avoid the compiler warning about signed/unsigned comparison and to ensure both operands have the same signedness during the comparison.
However, this intuitive operation also opens a path to out-of-bounds memory corruption. In the edge case where res.size() exceeds INT_MAX (2,147,483,647), the cast converts the originally huge size_t res.size() into an extremely large negative integer, which always passes the signed size comparison against n_tokens_max, normally a small integer (as introduced previously, the dynamic size-probing design starts n_tokens_max at zero).

In the follow-up memory operation, the int-cast res.size() is used in its original size_t form again: the negative integer from the size comparison turns back into the huge positive size_t. In a case where res.size() = 2,147,483,647 + 1, this allows (actual_tokens - 2,147,483,648) * sizeof(llama_token) bytes of tokens to be written out of bounds.
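To make the wraparound concrete, here is a small self-contained illustration (independent of llama.cpp) of how the (int) cast defeats the size check while the copy loop still sees the full size_t; the values are hypothetical.

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    size_t  res_size     = 2147483648u;   // hypothetical token count: INT_MAX + 1
    int32_t n_tokens_max = 0;             // the probing call passes 0

    // The vulnerable comparison: on typical two's-complement targets the cast
    // wraps the value to INT_MIN, so the "too many tokens" branch is never taken.
    if (n_tokens_max < (int) res_size) {
        printf("check caught the oversized result\n");
    } else {
        printf("check bypassed: (int) res_size = %d\n", (int) res_size);   // -2147483648
        // The copy loop then iterates with i < res_size, i.e. the full size_t value,
        // writing roughly 2^31 tokens past the destination buffer.
    }
    return 0;
}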
From gdb, we can see that the final copy destination (the tokens buffer) is located on the heap, showing this is a heap overflow; we will explain later why this is interesting and fun (dangerous).
std::vector larger than max_size()?
However, notice that such a huge variable, specifically text in this case, is usually problematic, since the C++ standard library prevents you from creating containers this large. This was a major obstacle during the process of creating a proof-of-concept for this heap overflow, since directly inputting such a lengthy prompt triggers "what(): cannot create std::vector larger than max_size()"; this limitation was bypassed, however.
Researching the exact trigger for this error, it was found that the message comes from std::vector<char> buf(alloc_size), reached as follows:
(tools/run/run.cpp:1179) static int chat_loop -> ret = process_user_message(opt,
(tools/run/run.cpp:1151) process_user_message -> apply_chat_template_with_error_handling(chat_templates.get(),
(tools/run/run.cpp:1082) apply_chat_template_with_error_handling -> apply_chat_template(tmpls, llama_data, append, use_jinja);
(tools/run/run.cpp:931) apply_chat_template -> common_chat_templates_apply(tmpls, inputs);
(common/chat.cpp:1867) common_chat_templates_apply -> common_chat_templates_apply_legacy
(common/chat.cpp:1831) common_chat_templates_apply_legacy -> std::vector<char> buf(alloc_size);
Looking into (common/chat.cpp:1831) common_chat_templates_apply_legacy:
static common_chat_params common_chat_templates_apply_legacy(
    const struct common_chat_templates * tmpls,
    const struct common_chat_templates_inputs & inputs)
{
    // ....
    for (size_t i = 0; i < contents.size(); ++i) {
        const auto & msg = inputs.messages[i];
        const auto & content = contents[i];
        chat.push_back({msg.role.c_str(), content.c_str()});
        alloc_size += (msg.role.size() + content.size()) * 1.25;
    }

    std::vector<char> buf(alloc_size);
The size here is determined by alloc_size += (msg.role.size() + content.size()) * 1.25, the implementation for applying the chat template to each message's role and content. It's a pain here, since the size is multiplied by 1.25 after adding msg.role.size(), making the originally huge content.size() (the message) even bigger.
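For reference, the quoted error can be reproduced in isolation: asking std::vector for more elements than its max_size() throws std::length_error, and on libstdc++ the what() string is the one quoted above. This snippet is independent of chat.cpp and only demonstrates the mechanism.

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    try {
        // one element past the container's limit -> std::length_error
        std::vector<char> buf(std::vector<char>().max_size() + 1);
        (void) buf;
    } catch (const std::length_error & e) {
        // libstdc++: "cannot create std::vector larger than max_size()"
        std::cerr << "what(): " << e.what() << "\n";
    }
    return 0;
}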
However, looking back at (common/chat.cpp:1867) common_chat_templates_apply, where common_chat_templates_apply_legacy is called, we can see another chat-template applier:
common_chat_params common_chat_templates_apply(
    const struct common_chat_templates * tmpls,
    const struct common_chat_templates_inputs & inputs)
{
    GGML_ASSERT(tmpls != nullptr);
    return inputs.use_jinja
        ? common_chat_templates_apply_jinja(tmpls, inputs)
        : common_chat_templates_apply_legacy(tmpls, inputs);
}
jinja is the interpreter that llama.cpp's chat template support is based on. Looking into common_chat_templates_apply_jinja's implementation, we see that it never allocates a manual byte buffer the way the legacy path does:
- It builds a templates_params params structure (all members are default-constructed; nothing is pre-sized).
- Depending on the template in use, it dispatches to one of the common_chat_params_init_* helpers (e.g. common_chat_params_init_llama_3_x, *_generic, ...).
- Inside those helpers the rendered prompt is obtained with
data.prompt = apply(tmpl, tweaked_messages, tools_json, add_generation_prompt, extra_context);
where apply(...) is the small helper a few lines above. That helper calls
auto result = tmpl.apply(tmpl_inputs, tmpl_opts); // minja::chat_template::apply
minja::chat_template::apply directly returns an std::string, so the prompt is produced and stored in a normal C++ string. Memory management is therefore handled automatically by std::string; no explicit size estimation or buffer reallocation is required. This means that using common_chat_templates_apply_jinja lets the originally constructed message pass through without triggering any size error.
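To make the contrast concrete, here is a rough, simplified sketch of the two paths (not the real chat.cpp code; the struct, function names, and rendering are placeholders): the legacy path commits to an up-front byte estimate that std::vector must honor in one allocation, while a jinja-style path simply lets std::string grow as the prompt is rendered.

#include <string>
#include <vector>

struct msg { std::string role, content; };

// legacy-style: estimate a byte budget first, then pre-size a buffer
static std::string render_legacy_style(const std::vector<msg> & msgs) {
    size_t alloc_size = 0;
    for (const auto & m : msgs) {
        alloc_size += (m.role.size() + m.content.size()) * 1.25;  // up-front estimate
    }
    std::vector<char> buf(alloc_size);  // <- this construction is what throws for huge inputs
    // ... the template would be rendered into buf here ...
    return std::string(buf.data(), buf.size());
}

// jinja-style: no pre-sizing, the string grows on demand
static std::string render_jinja_style(const std::vector<msg> & msgs) {
    std::string prompt;
    for (const auto & m : msgs) {
        prompt += m.role;
        prompt += ": ";
        prompt += m.content;
        prompt += "\n";                 // only actual memory limits apply
    }
    return prompt;
}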
The call chain through ./bin/llama-run:
(src/llama-vocab.cpp:3331) int32_t llama_tokenize() -> vocab->tokenize(
(tools/run/run.cpp:944) tokenize_prompt -> const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
- prompt (reversed):
(tools/run/run.cpp:988) static int generate( -> if (tokenize_prompt(vocab, prompt, tokens, llama_data) < 0)
(tools/run/run.cpp:1063) static int generate_response -> if (generate(llama_data, prompt, response))
(tools/run/run.cpp:1151) static int process_user_message( -> if (generate_response(llama_data, prompt, response, stdout_a_terminal)) {
(tools/run/run.cpp:1179) static int chat_loop
Collateral Gift
During the process of creating a PoC for the previously mentioned vulnerability and its bypass vector, something sketchy caught our attention while examining the ASAN logs.

A stack overflow was triggered via the STL allocator (bits/alloc_traits.h) (this is common with ASAN). At first we thought this was the direct proof-of-concept for the overflow discussed above (we didn't realize it was actually a heap overflow back then), but looking into the detailed ASAN logs, we realized it came from regex processing (bits/regex_executor.tcc) via sub_match. Further investigation of the overflowing frame showed that this stack overflow was caused by runaway recursion triggered by unicode_regex_split, which pushed the stack frames up to the limit of the stack region and caused the out-of-bounds access detected by ASAN. Specifically:
llama_vocab::impl::tokenize(
    case LLAMA_VOCAB_TYPE_BPE:
        session.tokenize(text, output) -> void tokenize()
            src/llama-vocab.cpp:484
                const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs);

void tokenize(const std::string & text, std::vector<llama_token> & output) {
    int final_prev_index = -1;
    const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs);
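The same failure class can be reproduced outside llama.cpp: libstdc++'s backtracking regex executor (bits/regex_executor.tcc) recurses roughly once per matched character, so a sufficiently long input exhausts the stack. The pattern and length below are illustrative only; whether and when it crashes depends on the standard library and the stack limit.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string input(1000000, 'a');   // long, repetitive input
    std::regex  pattern("(a|b)+");     // alternation keeps the backtracking executor busy

    // Typically dies with SIGSEGV, or an ASAN "stack-overflow" report when instrumented,
    // long before a match result is produced.
    bool matched = std::regex_match(input, pattern);
    std::cout << matched << "\n";
    return 0;
}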
You can take this from two perspectives. On one hand, it gives us a collateral ReDoS out of the blue; on the other hand, this collateral stack overflow stops us from reaching the final heap overflow.

However, there's always a way to bypass it: this method of word splitting (unicode_regex_split) only happens for LLAMA_VOCAB_TYPE_BPE (Byte-Pair Encoding), the most common vocab_type, used by gpt2-style tokenizers (else if (tokenizer_model == "gpt2") { type = LLAMA_VOCAB_TYPE_BPE;). By switching to a Unigram (T5) architecture in the GGUF metadata (LLAMA_VOCAB_TYPE_UGM), we take the other case in the llama_vocab::impl::tokenize() (get_type()) switch.

Proof-of-Concept
- Compile the latest version of llama.cpp with ASAN:
cmake .. \
    -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g" \
    -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g"
make -j
- Generate a prompt whose tokenized result exceeds INT32_MAX, accounting for the size of the chat template (in the .gguf metadata):
perl -e 'print "<token>" x ((2147483648-<chat-template-size>)/<per_token>), "\n"' >| prompt.txt
- Start a llama.cpp inferencing service (we chose llama-run as the PoC) and redirect the prompt as model input to trigger tokenization:
ASAN_OPTIONS=verbosity=1 \
./bin/llama-run file://<path-to-model> --jinja < ./prompt.txt
- Use a .gguf model whose tokenizer.ggml.model is not gpt2, with a jinja-supported template (e.g.: Retr0REG/mistral-tokenizer-llama).
Impact
- heap overflow (heap-based out-of-bounds write) in the llama.cpp inferencing engine.
- potential remote code execution: the heap is very playful; we are able to overwrite the member pointers of the following chunks (freed or in-use, both dangerous!), so we could:
  - overwrite in-use structure members: e.g. redirect an initialized chunk's interface to bad pointers, hijack execution flow, structure-oriented programming?
    *you can read llama's paradox for my past experience turning a heap overflow in llama.cpp into RCE.
  - overwrite chunk states / freed-chunk pointers: e.g. house-of attacks
- dos: crashing the inferencing server (straightforward)
Impacted Components:
- llama_tokenize() -> llama_vocab::tokenize()
- run.cpp (./bin/llama-run)
- simple.cpp (./bin/llama-simple)