This repository was archived by the owner on Jun 24, 2024. It is now read-only.

Why is the feed_prompt process so slow? #439

@zackshen

llm is indeed a fantastic library and very easy to use. However, after using it for a few days, I noticed that feed_prompt is always very slow. It consumes a significant amount of CPU and does not use the GPU (the hardware acceleration documentation says that feed_prompt currently doesn't use GPU resources). As a result, if I add some context during a conversation, I have to wait a long time for feed_prompt to complete, which hurts the actual user experience. I used TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin for testing.

Using the same model and prompt, I tested with llama.cpp, and its time to first token is much shorter: prompt evaluation takes about 1.5 s there, versus about 10.7 s for feed_prompt in llm (see the results below). I'm not sure how the feed_prompt process differs between llm and llama.cpp. Judging by the CPU and GPU history, it seems that llama.cpp is fully utilizing the GPU for inference.

Can you please help me identify what's wrong?

Model:

  1. TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin

System:

  1. Apple 2020 M1 16GB
  2. macOS 13.6.1 (22G313)

llama.cpp command:

./main -m {{MODEL_PATH}}  -p "[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

[/INST]

[INST] What is the largest animal in the world ? [/INST]
"

llama.cpp Result:

llama_print_timings:        load time =     473.17 ms
llama_print_timings:      sample time =      49.00 ms /   144 runs   (    0.34 ms per token,  2938.90 tokens per second)
llama_print_timings: prompt eval time =    1460.21 ms /   155 tokens (    9.42 ms per token,   106.15 tokens per second)
llama_print_timings:        eval time =   11099.90 ms /   143 runs   (   77.62 ms per token,    12.88 tokens per second)
llama_print_timings:       total time =   12666.70 ms

llm sample code:

const DEFAULT_PROMPT: &'static str = r#"[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

[/INST]

[INST] What is the largest animal in the world ? [/INST]
"#;

    let model_path = PathBuf::from(MODEL_FILE);
    let model = llm::load_dynamic(
        Some(llm::ModelArchitecture::Llama),
        &model_path,
        llm::TokenizerSource::Embedded,
        llm::ModelParameters {
            prefer_mmap: true,
            use_gpu: true,
            ..Default::default()
        },
        llm::load_progress_callback_stdout,
    )
    .unwrap();

    let session_config = InferenceSessionConfig {
        n_batch: 512,
        ..Default::default()
    };
    let mut session = model.start_session(session_config);
    let mut rng = rand::thread_rng();
    let mut output_request = llm::OutputRequest::default();
    let sampler = Arc::new(Mutex::new(
        SamplerChain::<u32, f32>::new()
            + SampleTemperature::new(0.2)
            + SampleTopK::new(40, 40)
            + SampleTopP::new(0.95, 40)
            + SampleRandDistrib::new(),
    ));
    let params = llm::InferenceParameters { sampler };
    let ts = Instant::now();
    let mut first_token_time: Option<f32> = None;
    let ret = session
        .infer::<Infallible>(
            model.as_ref(),
            &mut rng,
            &llm::InferenceRequest {
                prompt: llm::Prompt::Text(DEFAULT_PROMPT),
                parameters: &params,
                play_back_previous_tokens: false,
                maximum_token_count: Some(1500),
            },
            &mut output_request,
            llm::conversation_inference_callback("[INST]", |t| {
                if first_token_time.is_none() {
                    first_token_time = Some(ts.elapsed().as_secs_f32());
                }
                print_token(t)
            }),
        )
        .unwrap();
    println!("{stats:#?}", stats = ret,);
    println!("first time to token: {first_token_time:?}");
    println!("token count {:?}", ret.prompt_tokens + ret.predict_tokens);
    println!(
        "prompt token speed {:?}/s",
        ret.prompt_tokens as f32 / ret.feed_prompt_duration.as_secs_f32()
    );
    println!(
        "predict token speed {:?}/s",
        ret.predict_tokens as f32 / ret.predict_duration.as_secs_f32()
    );
    println!(
        "summary speed {:?}/s",
        (ret.predict_tokens + ret.prompt_tokens) as f32
            / (ret.predict_duration.as_secs_f32() + ret.feed_prompt_duration.as_secs_f32())
    );

llm sample code result:

InferenceStats {
    feed_prompt_duration: 10.74704s,
    prompt_tokens: 155,
    predict_duration: 28.863045s,
    predict_tokens: 397,
}
first time to token: Some(11.22408)
token count 552
prompt token speed 14.422576/s
predict token speed 13.754613/s
summary speed 13.935845/s
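
To make the gap explicit, the two result listings can be put side by side. The following is only a minimal sketch of the arithmetic: the helper `tokens_per_second` is a hypothetical name introduced for illustration and is not part of llm or llama.cpp, and the token counts and durations are copied from the outputs above.

```rust
/// Throughput in tokens per second for `tokens` processed in `seconds`.
/// Hypothetical helper for this comparison; not an API of llm or llama.cpp.
fn tokens_per_second(tokens: u32, seconds: f64) -> f64 {
    tokens as f64 / seconds
}

fn main() {
    // Prompt processing: 155 tokens in both runs.
    // llama.cpp: prompt eval time = 1460.21 ms; llm: feed_prompt_duration = 10.74704 s.
    let llama_cpp_prompt = tokens_per_second(155, 1.46021);
    let llm_prompt = tokens_per_second(155, 10.74704);

    // Generation: llama.cpp ran 143 evals in 11099.90 ms;
    // llm predicted 397 tokens in 28.863045 s.
    let llama_cpp_predict = tokens_per_second(143, 11.09990);
    let llm_predict = tokens_per_second(397, 28.863045);

    println!("prompt:  llama.cpp {llama_cpp_prompt:.2} tok/s, llm {llm_prompt:.2} tok/s");
    println!("predict: llama.cpp {llama_cpp_predict:.2} tok/s, llm {llm_predict:.2} tok/s");
    println!("prompt slowdown in llm: {:.2}x", llama_cpp_prompt / llm_prompt);
}
```

With these numbers, prompt processing in llm is roughly 7 times slower than in llama.cpp, while the generation speeds are within about 10% of each other, so the difference appears to be confined to the feed_prompt stage.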
