Why is the feed_prompt process so slow? #439

`llm` is a fantastic library and very easy to use. However, after using it for a few days, I noticed that `feed_prompt` is always very slow. It consumes a lot of CPU and does not use the GPU (the hardware acceleration documentation says that `feed_prompt` currently doesn't use GPU resources). As a result, whenever I add context during a conversation, I have to wait a long time for `feed_prompt` to complete, which hurts the user experience. I tested with [TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main).

Using the same model and prompt, I tested with `llama.cpp`, and its time to first token is much shorter: about 1.5 s of prompt evaluation, versus about 10.7 s in `llm` for the same 155-token prompt (full timings below). I'm not sure how prompt processing differs between `llm` and `llama.cpp`. Judging by the CPU and GPU history, `llama.cpp` appears to be fully utilizing the GPU for inference.

Can you please help me identify what's wrong?

Model:
1.  [TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main)

System:

1. Apple 2020 M1 16GB
2. macOS 13.6.1 (22G313)

llama.cpp command:

```shell
./main -m {{MODEL_PATH}}  -p "[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

[/INST]

[INST] What is the largest animal in the world ? [/INST]
"
```

llama.cpp Result:

```
llama_print_timings:        load time =     473.17 ms
llama_print_timings:      sample time =      49.00 ms /   144 runs   (    0.34 ms per token,  2938.90 tokens per second)
llama_print_timings: prompt eval time =    1460.21 ms /   155 tokens (    9.42 ms per token,   106.15 tokens per second)
llama_print_timings:        eval time =   11099.90 ms /   143 runs   (   77.62 ms per token,    12.88 tokens per second)
llama_print_timings:       total time =   12666.70 ms
```

llm sample code:

```rust
const DEFAULT_PROMPT: &str = r#"[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

[/INST]

[INST] What is the largest animal in the world ? [/INST]
"#;

    let model_path = PathBuf::from(MODEL_FILE);
    let model = llm::load_dynamic(
        Some(llm::ModelArchitecture::Llama),
        &model_path,
        llm::TokenizerSource::Embedded,
        llm::ModelParameters {
            prefer_mmap: true,
            use_gpu: true,
            ..Default::default()
        },
        llm::load_progress_callback_stdout,
    )
    .unwrap();

    let session_config = InferenceSessionConfig {
        n_batch: 512,
        ..Default::default()
    };
    let mut session = model.start_session(session_config);
    let mut rng = rand::thread_rng();
    let mut output_request = llm::OutputRequest::default();
    let sampler = Arc::new(Mutex::new(
        SamplerChain::<u32, f32>::new()
            + SampleTemperature::new(0.2)
            + SampleTopK::new(40, 40)
            + SampleTopP::new(0.95, 40)
            + SampleRandDistrib::new(),
    ));
    let params = llm::InferenceParameters { sampler };
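    // Start the clock just before inference so that the time to first token
    // includes the whole feed_prompt phase.
    let ts = Instant::now();
    let mut first_token_time: Option<f32> = None;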
    let ret = session
        .infer::<Infallible>(
            model.as_ref(),
            &mut rng,
            &llm::InferenceRequest {
                prompt: llm::Prompt::Text(DEFAULT_PROMPT),
                parameters: &params,
                play_back_previous_tokens: false,
                maximum_token_count: Some(1500),
            },
            &mut output_request,
            llm::conversation_inference_callback("[INST]", |t| {
                if first_token_time.is_none() {
                    first_token_time = Some(ts.elapsed().as_secs_f32());
                }
                print_token(t)
            }),
        )
        .unwrap();
    println!("{ret:#?}");
    println!("first time to token: {first_token_time:?}");
    println!("token count {:?}", ret.prompt_tokens + ret.predict_tokens);
    println!(
        "prompt token speed {:?}/s",
        ret.prompt_tokens as f32 / ret.feed_prompt_duration.as_secs_f32()
    );
    println!(
        "predict token speed {:?}/s",
        ret.predict_tokens as f32 / ret.predict_duration.as_secs_f32()
    );
    println!(
        "summary speed {:?}/s",
        (ret.predict_tokens + ret.prompt_tokens) as f32
            / (ret.predict_duration.as_secs_f32() + ret.feed_prompt_duration.as_secs_f32())
    );
```

llm sample code result:

```
InferenceStats {
    feed_prompt_duration: 10.74704s,
    prompt_tokens: 155,
    predict_duration: 28.863045s,
    predict_tokens: 397,
}
first time to token: Some(11.22408)
token count 552
prompt token speed 14.422576/s
predict token speed 13.754613/s
summary speed 13.935845/s
```
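
To make the gap easier to see, here is a small sketch that recomputes the throughput of each phase from the two result blocks above. The helper names `tokens_per_second` and `speedup` are made up for this comparison and are not part of `llm` or `llama.cpp`; the numbers are copied from the timings already posted.

```rust
/// Tokens processed per second for a phase that handled `tokens` tokens
/// in `seconds` seconds. (Hypothetical helper, not an `llm` API.)
fn tokens_per_second(tokens: u32, seconds: f64) -> f64 {
    f64::from(tokens) / seconds
}

/// How many times faster `fast` is than `slow`. (Hypothetical helper.)
fn speedup(fast: f64, slow: f64) -> f64 {
    fast / slow
}

fn main() {
    // Prompt phase: 155 tokens in both runs.
    let llama_cpp_prompt = tokens_per_second(155, 1.46021); // llama.cpp "prompt eval time"
    let llm_prompt = tokens_per_second(155, 10.74704); // llm "feed_prompt_duration"

    // Generation phase: token counts differ between the two runs.
    let llama_cpp_predict = tokens_per_second(143, 11.09990); // llama.cpp "eval time"
    let llm_predict = tokens_per_second(397, 28.863045); // llm "predict_duration"

    println!(
        "prompt:  llama.cpp {:.2} tok/s, llm {:.2} tok/s, llama.cpp is {:.1}x faster",
        llama_cpp_prompt,
        llm_prompt,
        speedup(llama_cpp_prompt, llm_prompt)
    );
    println!(
        "predict: llama.cpp {:.2} tok/s, llm {:.2} tok/s, llm is {:.2}x faster",
        llama_cpp_predict,
        llm_predict,
        speedup(llm_predict, llama_cpp_predict)
    );
}
```

With these figures, prompt processing in `llama.cpp` comes out roughly 7x faster than `feed_prompt` in `llm`, while the generation speed of the two is about the same. So the slowdown seems to be limited to the prompt phase.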

