how can I get the best inference speed in my situation #9503
FranzKafkaYu asked this question in Q&A · Unanswered
Hello guys, I am working with llama.cpp on my Android device, and every inference begins with the same pattern: the `prompt.prefix` and `prompt.suffix` are both constant and never change; the only thing that changes is the user input. Currently I am using code based on `simple.cpp` from the examples. Two questions here:

1. `llama_decode` costs 1000 ms+ on every call.
2. The `input_prefix` and `input_suffix` are tokenized/decoded repeatedly each time. Is there any way to reuse the output from tokenizing/decoding the `input_prefix` and `input_suffix`?

Hoping you guys can give me some advice, thanks!
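One common approach to avoid re-decoding a constant prefix is KV-cache reuse: tokenize and decode the prefix once right after creating the context, and for every new request remove only the KV entries that came after it, then decode just the user input plus suffix starting at that position. Below is a minimal sketch against the llama.cpp C API; the exact signatures of `llama_tokenize`, `llama_batch_get_one`, and `llama_kv_cache_seq_rm` differ between llama.cpp versions, so treat this as an outline rather than drop-in code.

```cpp
// Sketch: cache the constant prefix once, then reuse its KV entries for each request.
// Assumption: this targets a llama.cpp build where llama_tokenize takes the model,
// llama_batch_get_one takes (tokens, n_tokens, pos0, seq_id), and
// llama_kv_cache_seq_rm(ctx, seq_id, p0, p1) removes KV entries in [p0, p1).
#include "llama.h"

#include <string>
#include <vector>

static std::vector<llama_token> tokenize(const llama_model * model, const std::string & text, bool add_bos) {
    std::vector<llama_token> tokens(text.size() + 16);
    const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                 tokens.data(), (int) tokens.size(),
                                 add_bos, /*parse_special=*/true);
    tokens.resize(n > 0 ? n : 0);
    return tokens;
}

// Decode the constant prefix exactly once, right after creating the context.
// Returns the number of prefix tokens so later calls know where to resume.
static int prime_prefix(llama_context * ctx, const llama_model * model, const std::string & prefix) {
    std::vector<llama_token> toks = tokenize(model, prefix, /*add_bos=*/true);
    llama_batch batch = llama_batch_get_one(toks.data(), (int) toks.size(), /*pos0=*/0, /*seq_id=*/0);
    llama_decode(ctx, batch); // prefix KV entries now live at positions [0, toks.size())
    return (int) toks.size();
}

// For every new request: drop only the KV entries after the cached prefix,
// then decode just "user input + suffix" starting at position n_prefix.
static void run_request(llama_context * ctx, const llama_model * model, int n_prefix,
                        const std::string & user_input, const std::string & suffix) {
    // p1 = -1 means "remove everything from p0 to the end of the sequence"
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/n_prefix, /*p1=*/-1);

    std::vector<llama_token> toks = tokenize(model, user_input + suffix, /*add_bos=*/false);
    llama_batch batch = llama_batch_get_one(toks.data(), (int) toks.size(), /*pos0=*/n_prefix, /*seq_id=*/0);
    llama_decode(ctx, batch);

    // ... sample and generate as in simple.cpp, continuing from position n_prefix + toks.size()
}
```

If the cached prefix should also survive a process restart, the state API (`llama_state_save_file` / `llama_state_load_file`) and the `--prompt-cache` option of the CLI examples cover the same idea, if they are available in your build.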