Possible to use second larger model to build a relevant token cache? #10099
AncientMystic started this conversation in Ideas
Replies: 0 comments
Would it be possible to use two models in combination, with relevance weights applied to a second model? Let's say:
Base model: personality, writing style, etc., to write the response
+
Second model: cherry-pick relevant tokens based on a weight scale:
0.1 → relevant
0.3 → somewhat relevant
0.5 → distantly relevant
0.7 → somewhat irrelevant
0.9 → irrelevant
1.0 → completely irrelevant
The goal would be to cache tokens from the second model based on their relevance weight: load a 1-13B model alongside a 13-400B+ model, let Llama.cpp cherry-pick the relevant/somewhat relevant tokens from the larger model, load them into a RAM or VRAM cache as a reference, and then use the smaller model to respond. (A rough sketch of the selection step is below.)
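To make the selection step concrete, here is a minimal, purely illustrative Python sketch of the bucketing and filtering described above. Nothing in it is existing Llama.cpp functionality; the (chunk_id, weight) scores, function names, and threshold are hypothetical stand-ins for whatever relevance signal the larger model would actually produce.

```python
# Hypothetical sketch of the proposed relevance cache (not Llama.cpp API).
# The (chunk_id, weight) scores stand in for whatever relevance signal the
# larger model would produce; how to compute them is the open question here.

# Weight buckets from the scale above: lower weight = more relevant.
BUCKETS = [
    (0.1, "relevant"),
    (0.3, "somewhat relevant"),
    (0.5, "distantly relevant"),
    (0.7, "somewhat irrelevant"),
    (0.9, "irrelevant"),
    (1.0, "completely irrelevant"),
]

def bucket(weight: float) -> str:
    """Map a raw relevance weight in [0, 1] to the nearest named category."""
    return min(BUCKETS, key=lambda b: abs(b[0] - weight))[1]

def build_relevance_cache(chunk_scores, threshold: float = 0.3):
    """Keep only chunks rated 'relevant' or 'somewhat relevant' (weight <= threshold);
    these are what would be loaded into the RAM/VRAM reference cache."""
    return [chunk_id for chunk_id, weight in chunk_scores if weight <= threshold]

# Toy example: pretend the large model scored five knowledge chunks for a coding task.
scores = [
    ("python_stdlib_docs", 0.12),
    ("user_codebase", 0.28),
    ("spanish_corpus", 0.71),
    ("news_2019", 0.94),
    ("celebrity_trivia", 0.99),
]
print(build_relevance_cache(scores))          # ['python_stdlib_docs', 'user_codebase']
print({cid: bucket(w) for cid, w in scores})  # each chunk mapped to its bucket
```

The smaller model would then generate its response with that cached subset available as reference material, however that conditioning ends up being implemented.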
Possibly the extremely large second model could even use layer-wise inference, loading it layer by layer from disk to the GPU for processing (see the toy sketch below).
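As a rough illustration of the layer-by-layer idea (again just a toy, not anything Llama.cpp does today), the loop below only ever holds one layer's weights in memory at a time; in the real case each layer would be streamed from disk to VRAM rather than generated on the fly.

```python
# Toy illustration of layer-wise streaming: run a deep stack of layers while
# only ever holding one layer's weights in memory at a time.  Randomly
# generated NumPy matrices stand in for weights read from disk (an assumption).
import numpy as np

def stream_layers(n_layers: int, dim: int, seed: int = 0):
    """Yield one layer's weight matrix at a time, simulating a disk -> GPU load."""
    rng = np.random.default_rng(seed)
    for _ in range(n_layers):
        yield rng.standard_normal((dim, dim)) * 0.01

def layerwise_forward(x, n_layers: int = 8, dim: int = 64):
    """Push activations through the stack, discarding each layer after use."""
    for w in stream_layers(n_layers, dim):
        x = np.tanh(x @ w)  # process with this layer, then let w be freed
    return x

if __name__ == "__main__":
    out = layerwise_forward(np.ones(64))
    print(out.shape)  # (64,) -- peak weight memory was a single 64x64 matrix
```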
The initial load time to build the relevance cache would be a little high, but subsequent responses would be faster, and the efficiency/performance gain versus trying to load an extremely large model outright seems like it would be a monumental difference.
I keep thinking about the fact that these models are trained on so much data that is irrelevant to any given use case, and about how we could somehow prune that irrelevant data on the fly to make better use of VRAM and processing cycles, instead of wasting them on everything that isn't remotely relevant to the current task: other natural languages, programming languages, data about completely unrelated subjects (wiki articles, news, celebrities, etc.). That is where this idea came from.
It seems like it could at least be a nice way to give smaller models a huge intelligence boost at the expense of longer initial load times.
(While I know models are not quite like databases, I still feel like current LLMs waste a lot of RAM/VRAM/processing cycles rehashing irrelevant data to produce results.)