Possible to use second larger model to build a relevant token cache? #10099
AncientMystic started this conversation in Ideas
Replies: 0 comments
Would it be possible to use two models in combination, with relevance weights applied to a second model? Let's say:
Base model: personality, writing style, etc., to write the response
+
Second model: cherry-pick relevant tokens based on a weight scale:
0.1 → relevant
0.3 → somewhat relevant
0.5 → distantly relevant
0.7 → somewhat irrelevant
0.9 → irrelevant
1.0 → completely irrelevant
The goal would be to cache tokens from the second model based on their relevance weight: load a 1-13B model alongside a 13-400B+ model, let Llama.cpp cherry-pick the relevant/somewhat relevant tokens from the larger model, load them into a RAM or VRAM cache as a reference, and then use the smaller model to respond. (A rough sketch of the selection step is below.)
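To make the selection step concrete, here is a minimal, purely illustrative Python sketch of the bucketing and filtering described above. Nothing in it is existing Llama.cpp functionality; the (chunk_id, weight) scores, function names, and threshold are hypothetical stand-ins for whatever relevance signal the larger model would actually produce.

```python
# Hypothetical sketch of the proposed relevance cache (not Llama.cpp API).
# The (chunk_id, weight) scores stand in for whatever relevance signal the
# larger model would produce; how to compute them is the open question here.

# Weight buckets from the scale above: lower weight = more relevant.
BUCKETS = [
    (0.1, "relevant"),
    (0.3, "somewhat relevant"),
    (0.5, "distantly relevant"),
    (0.7, "somewhat irrelevant"),
    (0.9, "irrelevant"),
    (1.0, "completely irrelevant"),
]

def bucket(weight: float) -> str:
    """Map a raw relevance weight in [0, 1] to the nearest named category."""
    return min(BUCKETS, key=lambda b: abs(b[0] - weight))[1]

def build_relevance_cache(chunk_scores, threshold: float = 0.3):
    """Keep only chunks rated 'relevant' or 'somewhat relevant' (weight <= threshold);
    these are what would be loaded into the RAM/VRAM reference cache."""
    return [chunk_id for chunk_id, weight in chunk_scores if weight <= threshold]

# Toy example: pretend the large model scored five knowledge chunks for a coding task.
scores = [
    ("python_stdlib_docs", 0.12),
    ("user_codebase", 0.28),
    ("spanish_corpus", 0.71),
    ("news_2019", 0.94),
    ("celebrity_trivia", 0.99),
]
print(build_relevance_cache(scores))          # ['python_stdlib_docs', 'user_codebase']
print({cid: bucket(w) for cid, w in scores})  # each chunk mapped to its bucket
```

The smaller model would then generate its response with that cached subset available as reference material, however that conditioning ends up being implemented.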
Possibly the extremely large second model could even use layer-wise inference, loading it layer by layer from disk to the GPU for processing (see the toy sketch below).
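As a rough illustration of the layer-by-layer idea (again just a toy, not anything Llama.cpp does today), the loop below only ever holds one layer's weights in memory at a time; in the real case each layer would be streamed from disk to VRAM rather than generated on the fly.

```python
# Toy illustration of layer-wise streaming: run a deep stack of layers while
# only ever holding one layer's weights in memory at a time.  Randomly
# generated NumPy matrices stand in for weights read from disk (an assumption).
import numpy as np

def stream_layers(n_layers: int, dim: int, seed: int = 0):
    """Yield one layer's weight matrix at a time, simulating a disk -> GPU load."""
    rng = np.random.default_rng(seed)
    for _ in range(n_layers):
        yield rng.standard_normal((dim, dim)) * 0.01

def layerwise_forward(x, n_layers: int = 8, dim: int = 64):
    """Push activations through the stack, discarding each layer after use."""
    for w in stream_layers(n_layers, dim):
        x = np.tanh(x @ w)  # process with this layer, then let w be freed
    return x

if __name__ == "__main__":
    out = layerwise_forward(np.ones(64))
    print(out.shape)  # (64,) -- peak weight memory was a single 64x64 matrix
```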
The initial load time to build the relevance cache would be a little high, but subsequent responses would be faster, and the efficiency/performance gain versus trying to load an extremely large model outright seems like it would be a monumental difference.
I keep thinking about the fact that these models are trained on so much data that is irrelevant to any given use case, and about how we could somehow prune that irrelevant data on the fly to make better use of VRAM and processing cycles, instead of wasting them on everything that isn't remotely relevant to the current task: other natural languages, programming languages, data about completely unrelated subjects (wiki articles, news, celebrities, etc.). That is where this idea came from.
It seems like it could at least be a nice way to give smaller models a huge intelligence boost at the expense of longer initial load times.
(While I know models are not quite like databases, I still feel like current LLMs waste a lot of RAM/VRAM/processing cycles rehashing irrelevant data to produce results.)