
Commit f10e7c8

kevmo314 authored and Neo Zhang committed
common : preallocate sampling token data vector (ggml-org#8363)
Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this change taking this block of code from ~500us/op to ~40us/op. Overall this slightly improves sampling performance, with a more substantial impact on the `examples/lookahead` implementation: I see a ~10% performance boost in lookahead inference.
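
For context, a minimal micro-benchmark sketch of the two fill patterns (not from the repository; the `token_data` struct, the vocab size of 32000, and the iteration count are illustrative assumptions standing in for `llama_token_data` and the real sampling loop):

```cpp
// Hypothetical micro-benchmark: old clear()+emplace_back fill vs.
// new resize()+indexed-assignment fill, on a warm (preallocated) vector.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

struct token_data { // stand-in for llama_token_data
    int32_t id;
    float   logit;
    float   p;
};

int main() {
    const int32_t n_vocab = 32000; // assumed vocab size
    const int     n_iters = 1000;  // iterations to average over
    std::vector<float>      logits(n_vocab, 0.5f); // dummy logits
    std::vector<token_data> cur;
    double sink = 0.0; // consume results so the loops are not optimized away

    auto bench = [&](auto && fill) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n_iters; i++) {
            fill();
            sink += cur[n_vocab - 1].logit;
        }
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count() / double(n_iters);
    };

    // old pattern: empty the vector, then grow it element by element
    const double us_old = bench([&] {
        cur.clear();
        for (int32_t id = 0; id < n_vocab; id++) {
            cur.emplace_back(token_data{id, logits[id], 0.0f});
        }
    });

    // new pattern: size the vector once, then write each slot in place
    const double us_new = bench([&] {
        cur.resize(n_vocab);
        for (int32_t id = 0; id < n_vocab; id++) {
            cur[id] = token_data{id, logits[id], 0.0f};
        }
    });

    printf("emplace_back: %.1f us/op, resize+assign: %.1f us/op (sink=%f)\n",
           us_old, us_new, sink);
    return 0;
}
```

Absolute numbers depend on compiler, flags, and hardware; the point is only the relative gap between the two patterns.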
1 parent 344eeb3

1 file changed: 3 additions, 3 deletions


common/sampling.cpp

Lines changed: 3 additions & 3 deletions
```diff
@@ -378,7 +378,7 @@ static llama_token_data_array llama_sampling_prepare_impl(
     if (ctx_sampling->grammar != NULL && !apply_grammar) {
         GGML_ASSERT(original_logits != NULL);
         // Only make a copy of the original logits if we are not applying grammar checks, not sure if I actually have to do this.
-        *original_logits = {logits, logits + llama_n_vocab(llama_get_model(ctx_main))};
+        *original_logits = {logits, logits + n_vocab};
     }

     // apply params.logit_bias map
@@ -391,10 +391,10 @@ static llama_token_data_array llama_sampling_prepare_impl(
         llama_sample_apply_guidance(ctx_main, logits, logits_guidance, params.cfg_scale);
     }

-    cur.clear();
+    cur.resize(n_vocab);

     for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
-        cur.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
+        cur[token_id] = llama_token_data{token_id, logits[token_id], 0.0f};
     }

     llama_token_data_array cur_p = { cur.data(), cur.size(), false };
```
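
A note on the design choice, hedged since the commit does not spell it out: `resize(n_vocab)` value-initializes any newly created elements, which the loop then overwrites, whereas `reserve(n_vocab)` plus `emplace_back` would skip that initialization but keep `emplace_back`'s per-call size bookkeeping. Assuming `cur` persists across sampling calls, both patterns reuse already-allocated storage after the first call, so the measured win comes mainly from replacing per-element `emplace_back` calls with plain indexed stores.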
