llama : refactor llama_kv_cache, llama_context and llm_build_context #11213
Closed
Commits (95)
f78b396 llama : add struct llama_kv_cache (wip) [no ci] (ggerganov)
e4550fb llama : cont (ggerganov)
4d7bd03 kv_cache : functions -> members (ggerganov)
fef90cb kv_cache : fix (ggerganov)
73a14ec kv_cache : minor (ggerganov)
4cd1b6f context : prepare kv_cache_read/write to be moved to kv_cache (ggerganov)
fd05ab8 kv_cache : move state read/write to llama_kv_cache (ggerganov)
17b363a llama : update llama_kv_self API (ggerganov)
a19f671 context : minor (ggerganov)
ae274f9 llama : fix names [no ci] (ggerganov)
f2524c0 llama : remove references to llama_kv_cache (wip) (ggerganov)
b4ec1d4 cont : move kv_self update to llama_context (ggerganov)
f071349 context : add get_ctx_padding() (ggerganov)
c75ba68 context : move adapter code in the implementation [no ci] (ggerganov)
133ad6a context : initial need_reserve logic (ggerganov)
cb8f209 wip (ggerganov)
99422df context : introduce llama_batch_manager (ggerganov)
a0c500b context : prepare for abstraction (ggerganov)
e665b57 Merge branch 'master' into gg/llama-kv-cache (ggerganov)
9188856 llama : resolve rwkv conflict (ggerganov)
c30e34c Merge branch 'master' into gg/llama-kv-cache (ggerganov)
a40ba49 Merge branch 'master' into gg/llama-kv-cache (ggerganov)
5d3491e Merge branch 'master' into gg/llama-kv-cache (ggerganov)
3e23be7 context : store graph build function callback (ggerganov)
74b0807 Merge branch 'master' into gg/llama-kv-cache (ggerganov)
1eca891 llama : fix rwkv inference (#11618) (MollySophia)
e0d913f llama : clear whitespaces (ggerganov)
0f1c1ca Merge branch 'master' into gg/llama-kv-cache (ggerganov)
b15fede kv-cache : fix defrag condition (ggerganov)
972f91c Merge branch 'master' into gg/llama-kv-cache (ggerganov)
f9971ef llama : dedup reserve code (ggerganov)
879ba82 server : increase context size for the tests (ggerganov)
ef358ee context : add decode/encode (ggerganov)
d1d8d53 bman : remove ubatch member (ggerganov)
2cd8a90 context : make output functions members (ggerganov)
02ef4be context : initial abstraction (ggerganov)
b52b79b context : move encode/decode to llama-context.cpp (ggerganov)
8da7f61 context : improve llama_context encapsulation (ggerganov)
d146a14 context : minor naming fix (ggerganov)
5eae8e5 context : move build_rope_factors to base class (ggerganov)
e633dc1 context : introduce llama_graph_i (ggerganov)
0ab50f1 context : prepare llama_model graph build (ggerganov)
f63aeec llama : models now build their graphs using llama_graph_i (ggerganov)
6ee86e5 graph : restore ubatch in build_cb (ggerganov)
fbe6a07 context : rename to llama_context_kv_self (ggerganov)
3a504d9 llama : introduce llama_io interfaces (ggerganov)
f7c7757 context : abstract state read/write (ggerganov)
e08f38d context : minor cleanup (ggerganov)
107d1e2 context : move output functionality to base class (ggerganov)
ed3cb55 context : abstract input (ggerganov)
131743f context : abstract constructor and init (ggerganov)
d5e8e1a context : remove batch_manager (ggerganov)
8280645 context : move common inputs to base class (ggerganov)
1d801d2 graph : update attn/kv_self names (ggerganov)
f0d3ff2 Merge branch 'master' into gg/llama-kv-cache (ggerganov)
c235903 graph : add llama_graph_result (ggerganov)
172f616 cont : return important tensors (ggerganov)
bc6f187 cont : use returend tensors from the graph build (ggerganov)
befe14f llama : reorder encode/decode in sources (ggerganov)
9e50456 context : minor simplify (ggerganov)
2bffc2d model : pass llama_graph_i as ptr (ggerganov)
f5cedbc kv-cache : prepare for abstraction (ggerganov)
5f11a55 kv-cache : remove llama_kv_cache_i (ggerganov)
e17e4b7 context : add llama_context_recurrent (ggerganov)
2eacb4c graph : simplify attention api (ggerganov)
f95b04a model : fix order kvq -> qkv (ggerganov)
072280e Merge branch 'master' into gg/llama-kv-cache (ggerganov)
b1554be context : add cache-less llama_context (ggerganov)
ad870c4 context : fix causal input for cache-less case (ggerganov)
08011c2 context : add llama_kv_cache_recurrent prototype (ggerganov)
2645a7d context : add save/load for recurrent context (ggerganov)
548c230 graph : remove worst_case from the API (ggerganov)
ebf1bdf context : add logs (ggerganov)
f588a70 context : wrap input tensors in struct (ggerganov)
3753b30 context : fix n_outputs init (ggerganov)
c4c0a4d Merge branch 'master' into gg/llama-kv-cache (ggerganov)
f5e8020 wip enc-dec (ggerganov)
372fa3a cont : enc should work now, next is dec (ggerganov)
6378112 graph : remove the build_kv_... API from llama_graph_i (ggerganov)
0699a44 context : remove redundant virtual, protected -> private (ggerganov)
a5a85a3 context : fix recurrent reserve (ggerganov)
4a1054b context : reuse built_attn_mha (ggerganov)
9cd78f1 context : explicit llama_context_i abstract interface (ggerganov)
be58e30 enc-dec : compose wip (ggerganov)
e5bc5f8 context : enc-dec is now working (ggerganov)
e2b3294 context : fix enc-dec state save/load (ggerganov)
4efe989 context : pass embeddings tensor from encoder to decoder (ggerganov)
952feed context : disable encoder embd tensor for now (ggerganov)
82675a0 Merge branch 'master' into gg/llama-kv-cache (ggerganov)
828effd kv-cache : basic abstraction (ggerganov)
38db8a5 llama : introduce concept of llama_memory (ggerganov)
7f02ee5 context : decouple inputs, llama_graph_i become const (WIP) (ggerganov)
9cab53c cont : migrate the rest of the inputs out of llama_context (ggerganov)
0f7daa9 graph : move non-context related logic to llm_build_context (ggerganov)
624f7bd graph : add comments (ggerganov)
Conversations
With this change we now use the embeddings produced by the encoder (`cross->t_embd`) directly as input for the decoder's cross-attention, without downloading/uploading them to/from RAM. This seems to work correctly, but in debug builds it hits an assert when I explicitly use `-dev none` on Mac:

```sh
./bin/llama-cli \
    -m ../models/google-t5-small/ggml-model-f16.gguf \
    -p 'Translate from English to German: The house is wonderful.' \
    -dev none
```

I think it is related to re-using the tensor from the encoder context, but I am not sure if the assert is correct in this case. @slaren Any ideas?

Edit: btw, it does not hit the assert either without `-dev none` or with `-dev none -fa`.
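For readers following along, here is a minimal sketch of the before/after data flow this comment describes, using the public ggml-backend copy API. Only `cross->t_embd` comes from the PR; `t_embd_enc`, `inp_cross`, and the helper name are illustrative assumptions:

```cpp
#include <cstdint>
#include <vector>

#include "ggml-backend.h"

// before: the encoder output is downloaded to RAM and re-uploaded into a
// separately allocated decoder input tensor (illustrative names)
static void cross_embd_via_host(ggml_tensor * t_embd_enc, ggml_tensor * inp_cross) {
    std::vector<uint8_t> host_buf(ggml_nbytes(t_embd_enc));
    ggml_backend_tensor_get(t_embd_enc, host_buf.data(), 0, host_buf.size()); // device -> RAM
    ggml_backend_tensor_set(inp_cross,  host_buf.data(), 0, host_buf.size()); // RAM -> device
}

// after: the decoder graph references the encoder output tensor directly,
// skipping the round-trip entirely, e.g.
//   ggml_tensor * inp_cross = cross->t_embd;
```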
I am not sure exactly what triggers the assert. Probably the graph didn't change except in that one tensor, which previously was a view and now isn't, and ggml-alloc is not correctly detecting that the graph changed in an incompatible way. However, I don't think this is correct either way: to do this you would need to allocate the tensor in a different buffer/sched. It is not possible to use tensors allocated in the compute buffer in the next graph, since the compute buffer is reset with each graph.
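A minimal sketch of the alternative described here: copy the encoder output into a dedicated backend buffer that outlives the per-graph compute buffer. The names `persist_encoder_output`, `persist_ctx`, and `t_embd_copy` are illustrative assumptions, not code from the PR:

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// keep the encoder output alive across graphs by giving it its own backend
// buffer instead of pointing into the compute buffer, which is reset per graph
static ggml_tensor * persist_encoder_output(ggml_backend_t backend, ggml_tensor * t_embd_enc) {
    ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead(), // metadata for one tensor
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,                   // data goes into a backend buffer
    };
    ggml_context * persist_ctx = ggml_init(params);

    // same type/shape as the encoder output
    ggml_tensor * t_embd_copy = ggml_dup_tensor(persist_ctx, t_embd_enc);

    // dedicated buffer, untouched when the scheduler resets the compute buffer
    ggml_backend_alloc_ctx_tensors(persist_ctx, backend);

    // copy the encoder result out before the next graph reuses the compute buffer
    ggml_backend_tensor_copy(t_embd_enc, t_embd_copy);

    return t_embd_copy; // context/buffer ownership is elided in this sketch
}
```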