Why is llama_build_graph() called for every token generation? #4617
-
I am trying to understand the flow for the forward pass / graph compute / inference:

llama_decode() -> llama_decode_internal() -> llama_build_graph() (creates the llm structure) -> llm.build_llama() -> ggml_build_forward_expand() -> ggml_build_forward_impl() -> ggml_visit_parents() -> llm.free()

As far as I can tell, llama_decode() is called for every new token generated during the decoding stage. That means llama_build_graph() runs each time, and all of the graph nodes and the tensor mappings for each node are rebuilt for every new token. Why does this happen on every call instead of building the graph once and reusing it?
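For reference, here is a stripped-down decode loop against the public llama.h API that shows where this happens. The helper name generate is made up for this example, and sampling is reduced to a plain greedy argmax over the logits to keep it short; error handling is omitted:

```c
#include "llama.h"

// Hypothetical helper: generate n tokens greedily, starting from `tok`,
// with `n_past` tokens already in the KV cache.
static void generate(struct llama_context * ctx, llama_token tok, int n_past, int n) {
    const int n_vocab = llama_n_vocab(llama_get_model(ctx));
    for (int i = 0; i < n; ++i) {
        // each call goes llama_decode -> llama_decode_internal -> llama_build_graph,
        // i.e. the compute graph is rebuilt here for every generated token
        llama_decode(ctx, llama_batch_get_one(&tok, 1, n_past, 0));
        n_past += 1;

        // plain greedy sampling: argmax over the logits of the last token
        const float * logits = llama_get_logits_ith(ctx, 0);
        tok = 0;
        for (llama_token v = 1; v < n_vocab; ++v) {
            if (logits[v] > logits[tok]) tok = v;
        }
    }
}
```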
Replies: 3 comments 1 reply
-
Yes. In principle, you could build the graph once and reuse it across calls to llama_decode(). However, this is not readily available through the existing API, though it can be achieved by hacking llama.cpp. There are many custom optimizations like this that can be applied based on the specific use case. With time, we will try to support these, but it takes time to arrive at the correct API.
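A minimal sketch of the idea at the ggml level (this is not the actual llama.cpp hack; the toy model and shapes here are invented for illustration): build the compute graph once, then on each step overwrite the input tensor's data in place and re-run the same graph.

```c
#include <string.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // toy "model": y = W x
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 8);
    struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);

    // preparation: build the graph ONCE
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);

    // execution: reuse the same graph for every "token"
    float input[8];
    for (int t = 0; t < 32; ++t) {
        for (int i = 0; i < 8; ++i) input[i] = (float) (t + i);
        memcpy(x->data, input, ggml_nbytes(x)); // update the input in place
        ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);
        // read the result from y->data here
    }

    ggml_free(ctx);
    return 0;
}
```

Note that in llama.cpp the graph shape depends on the batch (number of tokens, KV cache positions, etc.), so a cached graph could presumably only be reused for identically-shaped batches.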
-
Thanks for replying! I have a few ideas for separating the preparation and execution into different stages to minimize the overhead. I am also trying to understand the architecture of the newly introduced backend interfaces and their implementation, so that I can come up with my own custom backend. Do we have any documentation or UML for it yet?
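For what it's worth, the ggml-backend API already separates these two phases fairly cleanly. A rough sketch, assuming the CPU backend (the toy graph is invented for illustration, and the exact set of functions has been evolving, so check ggml-backend.h for the current interface):

```c
#include "ggml.h"
#include "ggml-backend.h"

int main(void) {
    // --- preparation: choose a backend and build the graph once ---
    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_backend_cpu_set_n_threads(backend, 4);

    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false, // tensor data lives in this context (fine for CPU)
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    ggml_set_f32(a, 1.0f);
    ggml_set_f32(b, 2.0f);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // --- execution: the backend object only runs the prepared graph ---
    ggml_backend_graph_compute(backend, gf);

    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```

A custom backend plugs into the same dispatch: it fills in the function table defined in ggml-backend-impl.h (struct ggml_backend_i) so that ggml_backend_graph_compute() can route to it.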
-
Hi @Nick-infinity, did you try to implement this optimization? Did you notice any improvement in inference time? I'm trying to study the llama.cpp library, but my C/C++ is kind of rusty and it's taking me time to understand how the flow goes...