Why is the feed_prompt process so slow? #439
Description
llm is indeed a fantastic library and very easy to use. However, after using it for a few days, I noticed that feed_prompt is always very slow. It consumes significant CPU resources and doesn't use the GPU (the hardware acceleration documentation says that feed_prompt currently doesn't use GPU resources). As a result, if I add some context during a conversation, I have to wait a long time for feed_prompt to complete, which hurts the user experience. I used TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin for testing.
Using the same model and prompt, I tested with llama.cpp, and its time to first token is very short. I'm not sure how the feed_prompt process differs between llm and llama.cpp. Judging by the CPU and GPU history, llama.cpp seems to be fully utilizing the GPU for inference.
Can you please help me identify what's wrong?
Model:
System:
- Apple 2020 M1 16GB
- macOS 13.6.1 (22G313)
llama.cpp command:
./main -m {{MODEL_PATH}} -p "[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
[/INST]
[INST] What is the largest animal in the world ? [/INST]
"
llama.cpp Result:
llama_print_timings: load time = 473.17 ms
llama_print_timings: sample time = 49.00 ms / 144 runs ( 0.34 ms per token, 2938.90 tokens per second)
llama_print_timings: prompt eval time = 1460.21 ms / 155 tokens ( 9.42 ms per token, 106.15 tokens per second)
llama_print_timings: eval time = 11099.90 ms / 143 runs ( 77.62 ms per token, 12.88 tokens per second)
llama_print_timings: total time = 12666.70 ms
llm sample code:
const DEFAULT_PROMPT: &'static str = r#"[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
[/INST]
[INST] What is the largest animal in the world ? [/INST]
"#;
let model_path = PathBuf::from(MODEL_FILE);
let model = llm::load_dynamic(
    Some(llm::ModelArchitecture::Llama),
    &model_path,
    llm::TokenizerSource::Embedded,
    llm::ModelParameters {
        prefer_mmap: true,
        use_gpu: true,
        ..Default::default()
    },
    llm::load_progress_callback_stdout,
)
.unwrap();
let session_config = InferenceSessionConfig {
    n_batch: 512,
    ..Default::default()
};
let mut session = model.start_session(session_config);
let mut rng = rand::thread_rng();
let mut output_request = llm::OutputRequest::default();
let sampler = Arc::new(Mutex::new(
    SamplerChain::<u32, f32>::new()
        + SampleTemperature::new(0.2)
        + SampleTopK::new(40, 40)
        + SampleTopP::new(0.95, 40)
        + SampleRandDistrib::new(),
));
let params = llm::InferenceParameters { sampler };
let ts = Instant::now();
let mut first_token_time: Option<f32> = None;
let ret = session
    .infer::<Infallible>(
        model.as_ref(),
        &mut rng,
        &llm::InferenceRequest {
            prompt: llm::Prompt::Text(DEFAULT_PROMPT),
            parameters: &params,
            play_back_previous_tokens: false,
            maximum_token_count: Some(1500),
        },
        &mut output_request,
        llm::conversation_inference_callback("[INST]", |t| {
            // Record the elapsed time only once, when the first token arrives.
            if first_token_time.is_none() {
                first_token_time = Some(ts.elapsed().as_secs_f32());
            }
            print_token(t)
        }),
    )
    .unwrap();
println!("{stats:#?}", stats = ret,);
println!("first time to token: {first_token_time:?}");
println!("token count {:?}", ret.prompt_tokens + ret.predict_tokens);
println!(
    "prompt token speed {:?}/s",
    ret.prompt_tokens as f32 / ret.feed_prompt_duration.as_secs_f32()
);
println!(
    "predict token speed {:?}/s",
    ret.predict_tokens as f32 / ret.predict_duration.as_secs_f32()
);
println!(
    "summary speed {:?}/s",
    (ret.predict_tokens + ret.prompt_tokens) as f32
        / (ret.predict_duration.as_secs_f32() + ret.feed_prompt_duration.as_secs_f32())
);
llm sample code result:
InferenceStats {
    feed_prompt_duration: 10.74704s,
    prompt_tokens: 155,
    predict_duration: 28.863045s,
    predict_tokens: 397,
}
first time to token: Some(11.22408)
token count 552
prompt token speed 14.422576/s
predict token speed 13.754613/s
summary speed 13.935845/s
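To make the gap concrete, here is a minimal sketch that puts the prompt-processing figures from both runs side by side. It only uses the numbers reported above (155 prompt tokens, 1460.21 ms for llama.cpp, 10.74704 s for llm); the helper name tokens_per_second is my own and is not part of either library.

```rust
use std::time::Duration;

/// Tokens processed per second over the given wall-clock duration.
/// (Hypothetical helper for this comparison, not an llm or llama.cpp API.)
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    tokens as f64 / elapsed.as_secs_f64()
}

fn main() {
    // Prompt-processing figures copied from the two runs reported in this issue.
    let prompt_tokens = 155;
    let llama_cpp = tokens_per_second(prompt_tokens, Duration::from_secs_f64(1.46021));
    let llm = tokens_per_second(prompt_tokens, Duration::from_secs_f64(10.74704));

    println!("llama.cpp prompt speed: {llama_cpp:.2} tokens/s");
    println!("llm prompt speed:       {llm:.2} tokens/s");
    // How many times slower llm is at feeding the same prompt.
    println!("slowdown:               {:.2}x", llama_cpp / llm);
}
```

By this measure, prompt processing in llm is roughly 7x slower than in llama.cpp for the same 155 tokens, while the generation speeds are close (12.88 tokens/s for llama.cpp versus about 13.75 tokens/s for llm). So the difference seems to be confined to feed_prompt.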