Summary
A signed vs. unsigned integer overflow in llama.cpp's tokenizer implementation (llama_vocab::tokenize, src/llama-vocab.cpp:3036) results in unintended behavior in the token-copying size comparison, allowing a heap overflow of the llama.cpp inferencing engine via carefully manipulated text input (human messages, prompts, templates) during the tokenization process.

This is dangerous because llama_vocab::tokenize is used everywhere human input is processed (llama_vocab::tokenize -> llama_tokenize -> tokenize_prompt, then generate, ...), meaning every tokenized input is a potential entry point (it affects all user input: prompts, messages, templates). It could be considered less dangerous because most msg.role + content buffers are initialized with std::vector buf(alloc_size) (common/chat.cpp:1831), which has built-in protection against sizes greater than max_size() (what(): cannot create std::vector larger than max_size()).

Nevertheless, during research it was found that this protection is bypassable by exploiting the recently added jinja template support (common_chat_templates_apply_jinja, as it inherits the memory space of tmpl via tmpl.apply).
Details
A single line in llama_vocab::tokenize, llama.cpp's tokenizer implementation, causes this vulnerability. Before we dissect how this heap overflow forms, let's look into how it is used and referenced in the tokenization process.
// src/llama-vocab.cpp:3055-3073
int32_t llama_vocab::tokenize(
                  const char * text,
                       int32_t text_len,
                 llama_token * tokens,
                       int32_t n_tokens_max,
                          bool add_special,
                          bool parse_special) const {
    auto res = tokenize(std::string(text, text_len), add_special, parse_special);
    if (n_tokens_max < (int) res.size()) {
        // LLAMA_LOG_ERROR("%s: too many tokens\n", __func__);
        return -((int) res.size());
    }

    for (size_t i = 0; i < res.size(); i++) {
        tokens[i] = res[i];
    }

    return res.size();
}
tokenize philosophy
llama_vocab::tokenize()
acts as an interface adapter that calls the underlying tokenize
(llama_vocab::impl::tokenize
), which is the lower-part of the tokenization process, where the inner vocab
-ing is involved (e.g. LLAMA_VOCAB_TYPE_*
, determined by tokenizer.ggml.model
), (you will see later why is it design in this specific way). We won't dive into llama_vocab::impl::tokenize
now, since it's implementations don't matters now. (we explains later on why it generates another stack-overflow
)
int32_t llama_tokenize(
    const struct llama_vocab * vocab,
    // ....
    return vocab->tokenize(text, text_len, tokens, n_tokens_max, add_special, parse_special);
}
llama_tokenize thinly wraps vocab->tokenize (the llama_vocab::tokenize interface). It is also the common tokenizer API you'll see a lot in llama.cpp's implementation, used directly in run/run.cpp (./bin/llama-run's implementation) or in common.cpp (./common/common.cpp, which is then used everywhere, e.g. server.cpp (./bin/llama-server), tts.cpp, tokenize.cpp, ...).
std::vector<llama_token> common_tokenize(
    const struct llama_vocab * vocab,
           const std::string & text,
                          bool add_special,
                          bool parse_special) {
    // upper limit for the number of tokens
    int n_tokens = text.length() + 2 * add_special;
    std::vector<llama_token> result(n_tokens);
    n_tokens = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
    if (n_tokens < 0) {
        result.resize(-n_tokens);
        int check = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
        GGML_ASSERT(check == -n_tokens);
    //...
static int tokenize_prompt(const llama_vocab * vocab, const std::string & prompt,
                           std::vector<llama_token> & prompt_tokens, const LlamaData & llama_data) {
    const bool is_first = llama_memory_seq_pos_max(llama_get_memory(llama_data.context.get()), 0) == 0;

    const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
    prompt_tokens.resize(n_prompt_tokens);
    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), is_first,
                       true) < 0) {
        printe("failed to tokenize the prompt\n");
        return -1;
    }
If you look closely at the two implementations, you will see that both callers of llama_tokenize() adhere to a common design for allocation in the tokenization process (a minimal sketch of this contract follows the list):
- Initialize the buffer for llama_token * tokens (result) with a smaller allocation, text.length() + 2 * add_special, or prompt_tokens (std::vector<llama_token> tokens;).
- Call llama_tokenize -> llama_vocab::impl::tokenize a first time to probe the length of the tokens (res), with n_tokens_max set to zero or a smaller size to guarantee no actual copying of the result happens.
- resize() the result vector with the negative length returned from llama_tokenize.
- Call llama_tokenize a second time; this time the output of llama_vocab::impl::tokenize is guaranteed to be saved into llama_token * tokens.
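Below is a minimal sketch of that probe-and-resize contract. The fake_tokenize helper and its one-token-per-byte behavior are hypothetical stand-ins for llama_tokenize, meant only to illustrate the calling convention.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in that mimics the llama_tokenize contract:
// a negative return value means "buffer too small, this is the required size".
static int32_t fake_tokenize(const char * text, int32_t text_len,
                             int32_t * tokens, int32_t n_tokens_max) {
    int32_t n = text_len;              // pretend every byte becomes one token
    if (n_tokens_max < n) {
        return -n;                     // probe path: report the required size
    }
    for (int32_t i = 0; i < n; i++) {
        tokens[i] = (int32_t) (unsigned char) text[i];
    }
    return n;
}

static std::vector<int32_t> tokenize_with_probe(const std::string & text) {
    // 1) probe: pass a null, zero-sized buffer and negate the result to get the needed size
    const int32_t needed = -fake_tokenize(text.data(), (int32_t) text.size(), nullptr, 0);

    // 2) size the real buffer from the probe result
    std::vector<int32_t> tokens(needed);

    // 3) second call: n_tokens_max now equals the required size, so the copy fits
    fake_tokenize(text.data(), (int32_t) text.size(), tokens.data(), (int32_t) tokens.size());
    return tokens;
}

int main() {
    auto toks = tokenize_with_probe("hello");
    return (int) toks.size();          // 5 for this toy tokenizer
}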
This explains why a negative return value is constructed in llama_tokenize: the tokenizer dynamically determines the required output size of the token array. Although this comes at the cost of calling llama_vocab::impl::tokenize twice, it guarantees efficient memory usage. It is also, however, the cause of this heap overflow.

if (n_tokens_max < (int) res.size()) casts tokenize(...).size() (std::vector::size(), a size_t) to (int) to detect the case where the size of the tokenized vector exceeds n_tokens_max (passed in as an argument).

The cast intuitively makes sense: n_tokens_max is int32_t, i.e. signed (as you can see in the signature above), so res.size() is cast to a signed int to avoid the compiler warning about signed/unsigned comparison and to ensure both operands have the same signedness during the comparison.
However, this intuitive operation also opens a path to out-of-bounds memory corruption. In the edge case where res.size() exceeds INT_MAX (2,147,483,647), the cast converts the originally huge size_t res.size() into an extremely large negative integer, which always passes the signed size comparison against n_tokens_max, normally a small integer (as introduced previously, the dynamic size-probing design starts n_tokens_max at zero).

In the follow-up memory operation, the int-cast res.size() is used in its original size_t form again: the negative integer from the size comparison turns back into the huge positive size_t. In a case where res.size() = 2,147,483,647 + 1, this allows (actual_tokens - 2,147,483,648) * sizeof(llama_token) bytes of tokens to be written out of bounds.
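To make the wraparound concrete, here is a small self-contained illustration (independent of llama.cpp) of how the (int) cast defeats the size check while the copy loop still sees the full size_t; the values are hypothetical.

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    size_t  res_size     = 2147483648u;   // hypothetical token count: INT_MAX + 1
    int32_t n_tokens_max = 0;             // the probing call passes 0

    // The vulnerable comparison: on typical two's-complement targets the cast
    // wraps the value to INT_MIN, so the "too many tokens" branch is never taken.
    if (n_tokens_max < (int) res_size) {
        printf("check caught the oversized result\n");
    } else {
        printf("check bypassed: (int) res_size = %d\n", (int) res_size);   // -2147483648
        // The copy loop then iterates with i < res_size, i.e. the full size_t value,
        // writing roughly 2^31 tokens past the destination buffer.
    }
    return 0;
}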
From gdb, we can see that the final copy destination (the tokens buffer) is located on the heap, showing this is a heap overflow; we will explain later why this is interesting and fun (dangerous).
std::vector larger than max_size()?
However, notice that such a huge variable, specifically text in this case, is usually problematic, since the C++ standard library prevents you from creating containers this large. This was a major obstacle during the process of creating a proof-of-concept for this heap overflow, since directly inputting such a lengthy prompt triggers "what(): cannot create std::vector larger than max_size()"; this limitation was bypassed, however.
Researching the exact trigger for this error, it was found that the message comes from std::vector<char> buf(alloc_size), reached as follows:
(tools/run/run.cpp:1179) static int chat_loop -> ret = process_user_message(opt,
(tools/run/run.cpp:1151) process_user_message -> apply_chat_template_with_error_handling(chat_templates.get(),
(tools/run/run.cpp:1082) apply_chat_template_with_error_handling -> apply_chat_template(tmpls, llama_data, append, use_jinja);
(tools/run/run.cpp:931) apply_chat_template -> common_chat_templates_apply(tmpls, inputs);
(common/chat.cpp:1867) common_chat_templates_apply -> common_chat_templates_apply_legacy
(common/chat.cpp:1831) common_chat_templates_apply_legacy -> std::vector<char> buf(alloc_size);
Looking into (common/chat.cpp:1831) common_chat_templates_apply_legacy:
static common_chat_params common_chat_templates_apply_legacy(
    const struct common_chat_templates * tmpls,
    const struct common_chat_templates_inputs & inputs)
{
    // ....
    for (size_t i = 0; i < contents.size(); ++i) {
        const auto & msg = inputs.messages[i];
        const auto & content = contents[i];
        chat.push_back({msg.role.c_str(), content.c_str()});
        alloc_size += (msg.role.size() + content.size()) * 1.25;
    }

    std::vector<char> buf(alloc_size);
The size here is determined by alloc_size += (msg.role.size() + content.size()) * 1.25, the implementation for applying the chat template to each message's role and content. It's a pain here, since the size is multiplied by 1.25 after adding msg.role.size(), making the originally huge content.size() (the message) even bigger.
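For reference, the quoted error can be reproduced in isolation: asking std::vector for more elements than its max_size() throws std::length_error, and on libstdc++ the what() string is the one quoted above. This snippet is independent of chat.cpp and only demonstrates the mechanism.

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    try {
        // one element past the container's limit -> std::length_error
        std::vector<char> buf(std::vector<char>().max_size() + 1);
        (void) buf;
    } catch (const std::length_error & e) {
        // libstdc++: "cannot create std::vector larger than max_size()"
        std::cerr << "what(): " << e.what() << "\n";
    }
    return 0;
}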
However, looking back at (common/chat.cpp:1867) common_chat_templates_apply, where common_chat_templates_apply_legacy is called, we can see another chat-template applier:
common_chat_params common_chat_templates_apply(
    const struct common_chat_templates * tmpls,
    const struct common_chat_templates_inputs & inputs)
{
    GGML_ASSERT(tmpls != nullptr);
    return inputs.use_jinja
        ? common_chat_templates_apply_jinja(tmpls, inputs)
        : common_chat_templates_apply_legacy(tmpls, inputs);
}
jinja is the interpreter that llama.cpp's chat template support is based on. Looking into common_chat_templates_apply_jinja's implementation, we see that it never allocates a manual byte buffer the way the legacy path does:
- It builds a templates_params params structure (all members are default-constructed; nothing is pre-sized).
- Depending on the template in use, it dispatches to one of the common_chat_params_init_* helpers (e.g. common_chat_params_init_llama_3_x, *_generic, ...).
- Inside those helpers the rendered prompt is obtained with
data.prompt = apply(tmpl, tweaked_messages, tools_json, add_generation_prompt, extra_context);
where apply(...) is the small helper a few lines above. That helper calls
auto result = tmpl.apply(tmpl_inputs, tmpl_opts); // minja::chat_template::apply
minja::chat_template::apply directly returns an std::string, so the prompt is produced and stored in a normal C++ string. Memory management is therefore handled automatically by std::string; no explicit size estimation or buffer reallocation is required. This means that using common_chat_templates_apply_jinja lets the originally constructed message pass through without triggering any size error.
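To make the contrast concrete, here is a rough, simplified sketch of the two paths (not the real chat.cpp code; the struct, function names, and rendering are placeholders): the legacy path commits to an up-front byte estimate that std::vector must honor in one allocation, while a jinja-style path simply lets std::string grow as the prompt is rendered.

#include <string>
#include <vector>

struct msg { std::string role, content; };

// legacy-style: estimate a byte budget first, then pre-size a buffer
static std::string render_legacy_style(const std::vector<msg> & msgs) {
    size_t alloc_size = 0;
    for (const auto & m : msgs) {
        alloc_size += (m.role.size() + m.content.size()) * 1.25;  // up-front estimate
    }
    std::vector<char> buf(alloc_size);  // <- this construction is what throws for huge inputs
    // ... the template would be rendered into buf here ...
    return std::string(buf.data(), buf.size());
}

// jinja-style: no pre-sizing, the string grows on demand
static std::string render_jinja_style(const std::vector<msg> & msgs) {
    std::string prompt;
    for (const auto & m : msgs) {
        prompt += m.role;
        prompt += ": ";
        prompt += m.content;
        prompt += "\n";                 // only actual memory limits apply
    }
    return prompt;
}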
The call chain through ./bin/llama-run:
(src/llama-vocab.cpp:3331) int32_t llama_tokenize() -> vocab->tokenize(
(tools/run/run.cpp:944) tokenize_prompt -> const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
- prompt (reversed):
(tools/run/run.cpp:988) static int generate( -> if (tokenize_prompt(vocab, prompt, tokens, llama_data) < 0)
(tools/run/run.cpp:1063) static int generate_response -> if (generate(llama_data, prompt, response))
(tools/run/run.cpp:1151) static int process_user_message( -> if (generate_response(llama_data, prompt, response, stdout_a_terminal)) {
(tools/run/run.cpp:1179) static int chat_loop
Collateral Gift
During the process of creating a PoC for the previously mentioned vulnerability and its bypass vector, something sketchy caught our attention while examining the ASAN logs.

A stack overflow was triggered via the STL allocator (bits/alloc_traits.h) (this is common with ASAN). At first we thought this was the direct proof-of-concept for the overflow discussed above (we didn't realize it was actually a heap overflow back then), but looking into the detailed ASAN logs, we realized it came from regex processing (bits/regex_executor.tcc) via sub_match. Further investigation of the overflowing frame showed that this stack overflow was caused by runaway recursion triggered by unicode_regex_split, which pushed the stack frames up to the limit of the stack region and caused the out-of-bounds access detected by ASAN. Specifically:
llama_vocab::impl::tokenize(
    case LLAMA_VOCAB_TYPE_BPE:
        session.tokenize(text, output) -> void tokenize()
            src/llama-vocab.cpp:484
                const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs);

void tokenize(const std::string & text, std::vector<llama_token> & output) {
    int final_prev_index = -1;
    const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs);
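The same failure class can be reproduced outside llama.cpp: libstdc++'s backtracking regex executor (bits/regex_executor.tcc) recurses roughly once per matched character, so a sufficiently long input exhausts the stack. The pattern and length below are illustrative only; whether and when it crashes depends on the standard library and the stack limit.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string input(1000000, 'a');   // long, repetitive input
    std::regex  pattern("(a|b)+");     // alternation keeps the backtracking executor busy

    // Typically dies with SIGSEGV, or an ASAN "stack-overflow" report when instrumented,
    // long before a match result is produced.
    bool matched = std::regex_match(input, pattern);
    std::cout << matched << "\n";
    return 0;
}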
You can take this from two perspectives. On one hand, it gives us a collateral ReDoS out of the blue; on the other hand, this collateral stack overflow stops us from reaching the final heap overflow.

However, there's always a way to bypass it: this method of word splitting (unicode_regex_split) only happens for LLAMA_VOCAB_TYPE_BPE (Byte-Pair Encoding), the most common vocab_type, used by gpt2-style tokenizers (else if (tokenizer_model == "gpt2") { type = LLAMA_VOCAB_TYPE_BPE;). By switching to a Unigram (T5) architecture in the GGUF metadata (LLAMA_VOCAB_TYPE_UGM), we take the other case in the llama_vocab::impl::tokenize() (get_type()) switch.

Proof-of-Concept
- Compile the latest version of llama.cpp with ASAN:
cmake .. \
    -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g" \
    -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g"
make -j
- Generate a prompt whose tokenized result exceeds INT32_MAX, accounting for the size of the chat template (in the .gguf metadata):
perl -e 'print "<token>" x ((2147483648-<chat-template-size>)/<per_token>), "\n"' >| prompt.txt
- Start a llama.cpp inferencing service (we chose llama-run as the PoC) and redirect the prompt as model input to trigger tokenization:
ASAN_OPTIONS=verbosity=1 \
./bin/llama-run file://<path-to-model> --jinja < ./prompt.txt
- Use a .gguf model whose tokenizer.ggml.model is not gpt2, with a jinja-supported template (e.g.: Retr0REG/mistral-tokenizer-llama).
Impact
- heap overflow (heap-based out-of-bounds write) in the llama.cpp inferencing engine.
- potential remote code execution: the heap is very playful; we are able to overwrite the member pointers of the following chunks (freed or in-use, both dangerous!), so we could:
  - overwrite in-use structure members: e.g. redirect an initialized chunk's interface to bad pointers, hijack execution flow, structure-oriented programming?
    *you can read llama's paradox for my past experience turning a heap overflow in llama.cpp into RCE.
  - overwrite chunk states / freed-chunk pointers: e.g. house-of attacks
- dos: crashing the inferencing server (straightforward)
Impacted Components:
- llama_tokenize() -> llama_vocab::tokenize()
- run.cpp (./bin/llama-run)
- simple.cpp (./bin/llama-simple)