Why is llama_build_graph() called for every token generation? #4617
-
I am trying to understand the flow for the forward pass / graph compute / inference:

llama_decode() -> llama_decode_internal() -> llama_build_graph() (creates the llm structure) -> llm.build_llama() -> ggml_build_forward_expand() -> ggml_build_forward_impl() -> ggml_visit_parents() -> llm.free()

As far as I can tell, llama_decode() is called for every new token generated during the decoding stage. That means llama_build_graph() runs each time, and all of the graph nodes and the tensor mappings for each node are rebuilt for every new token. Why does this happen on every call instead of building the graph once and reusing it?
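For reference, here is a stripped-down decode loop against the public llama.h API that shows where this happens. The helper name generate is made up for this example, and sampling is reduced to a plain greedy argmax over the logits to keep it short; error handling is omitted:

```c
#include "llama.h"

// Hypothetical helper: generate n tokens greedily, starting from `tok`,
// with `n_past` tokens already in the KV cache.
static void generate(struct llama_context * ctx, llama_token tok, int n_past, int n) {
    const int n_vocab = llama_n_vocab(llama_get_model(ctx));
    for (int i = 0; i < n; ++i) {
        // each call goes llama_decode -> llama_decode_internal -> llama_build_graph,
        // i.e. the compute graph is rebuilt here for every generated token
        llama_decode(ctx, llama_batch_get_one(&tok, 1, n_past, 0));
        n_past += 1;

        // plain greedy sampling: argmax over the logits of the last token
        const float * logits = llama_get_logits_ith(ctx, 0);
        tok = 0;
        for (llama_token v = 1; v < n_vocab; ++v) {
            if (logits[v] > logits[tok]) tok = v;
        }
    }
}
```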
Replies: 3 comments 1 reply
-
Yes. In principle, you could build the graph once and reuse it across calls to llama_decode(). However, this is not readily available through the existing API, though it can be achieved by hacking llama.cpp. There are many custom optimizations like this that can be applied based on the specific use case. With time, we will try to support these, but it takes time to arrive at the correct API.
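A minimal sketch of the idea at the ggml level (this is not the actual llama.cpp hack; the toy model and shapes here are invented for illustration): build the compute graph once, then on each step overwrite the input tensor's data in place and re-run the same graph.

```c
#include <string.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // toy "model": y = W x
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 8);
    struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);

    // preparation: build the graph ONCE
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);

    // execution: reuse the same graph for every "token"
    float input[8];
    for (int t = 0; t < 32; ++t) {
        for (int i = 0; i < 8; ++i) input[i] = (float) (t + i);
        memcpy(x->data, input, ggml_nbytes(x)); // update the input in place
        ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);
        // read the result from y->data here
    }

    ggml_free(ctx);
    return 0;
}
```

Note that in llama.cpp the graph shape depends on the batch (number of tokens, KV cache positions, etc.), so a cached graph could presumably only be reused for identically-shaped batches.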
-
Thanks for replying! I have a few ideas for separating the preparation and execution into different stages to minimize the overhead. I am also trying to understand the architecture of the newly introduced backend interfaces and their implementation, so that I can come up with my own custom backend. Do we have any documentation or UML for it yet?
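For what it's worth, the ggml-backend API already separates these two phases fairly cleanly. A rough sketch, assuming the CPU backend (the toy graph is invented for illustration, and the exact set of functions has been evolving, so check ggml-backend.h for the current interface):

```c
#include "ggml.h"
#include "ggml-backend.h"

int main(void) {
    // --- preparation: choose a backend and build the graph once ---
    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_backend_cpu_set_n_threads(backend, 4);

    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false, // tensor data lives in this context (fine for CPU)
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    ggml_set_f32(a, 1.0f);
    ggml_set_f32(b, 2.0f);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // --- execution: the backend object only runs the prepared graph ---
    ggml_backend_graph_compute(backend, gf);

    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```

A custom backend plugs into the same dispatch: it fills in the function table defined in ggml-backend-impl.h (struct ggml_backend_i) so that ggml_backend_graph_compute() can route to it.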
-
Hi @Nick-infinity, did you try to implement this optimization? Did you notice any improvement in inference time? I'm trying to study the llama.cpp library, but my C/C++ is kind of rusty and it's taking me time to understand how the flow goes...