
Why is llama_build_graph() called for every token generation? #4617

Answered by ggerganov
Nick-infinity asked this question in Q&A
  1. Yes
  2. It will
  3. The graphs for generating a sequence can be pre-built once and reused, which would reduce the overhead of building the graph on the fly for each llama_decode() call. However, this is not readily available through the existing API, though it can be achieved by hacking llama.cpp (see the sketch below).
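A minimal sketch of that graph-reuse idea at the plain ggml level, not llama.cpp's actual internals: it builds a tiny compute graph once, then reuses it across steps by refreshing only the input tensor's data. The toy model (`y = W * x`), the sizes, and the step loop are illustrative stand-ins, and it assumes the standalone ggml C API of roughly this era (`ggml_new_graph`, `ggml_graph_compute_with_ctx`); a real hack inside llama.cpp would presumably cache the graph produced by llama_build_graph() and skip rebuilding it in llama_decode().

```c
// Sketch only: pre-build a ggml compute graph once, reuse it every step.
#include "ggml.h"
#include <stdio.h>

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,  // pool for tensors, graph, work buffer
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Toy "model": y = W * x, standing in for a transformer forward pass.
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
    struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);

    // Build the compute graph ONCE, outside the generation loop.
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);

    ggml_set_f32(W, 1.0f);  // fill the "weights"

    // Reuse the same graph for every step: only the input data changes.
    for (int step = 0; step < 3; ++step) {
        ggml_set_f32(x, (float) step);            // new input, same topology
        ggml_graph_compute_with_ctx(ctx, gf, 1);  // recompute without rebuilding
        printf("step %d: y[0] = %.1f\n", step, ggml_get_f32_1d(y, 0));
    }

    ggml_free(ctx);
    return 0;
}
```

The point of the sketch is simply that graph construction is hoisted out of the per-step loop: the node topology is fixed for a given batch shape, so only the tensor data needs to change per token.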

There are many custom optimizations like this that can be applied depending on the specific use case. With time, we will try to support these, but it takes time to arrive at the correct API.
