Adding @agray3's graph caching approach #94
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
@agray3 has PR-8366 open in mainline
llama.cpp
that appears to not meet the high standards of thellama.cpp
maintainers. Me, being more pragmatic and less of a purist, would like to have these changes here as that way one avoids rebuilding the computation graph for every new token, a "feature" inherited fromllama.cpp
that I don't really like.Here is what we get in performance improvement on CUDA (RTX-4080 with a Ryzen-7950X CPU)
And here the performance improvement on Metal (M2-Max 30-core GPU, M2-Max CPU):
The speedup obviously increases with decreasing model size as the time computing the graph becomes relatively shorter compared to the time taken building the graph. The speedup I observe is smaller compared to what @agray3 reports in PR-8366. I guess, it is a matter of how fast the GPU is (where the graph is computed) relative to the CPU (where the graph is built).
GPU performance has not been a focus of this project. Still, how do we do relative to mainline llama.cpp after this PR? Using afd9909a (3942) from today, I get this for the RTX-4080
and this for the M2-Max
@agray3 Would you review the changes? Alternatively, if you prefer, we can close this PR and you can submit a PR yourself so this contribution is correctly associated with your name.