Fix for #617 updated token calculation on parsenode #618


Conversation

tm-robinson
Contributor

This is a fix for #617, which changes token counting from splitting the string on space characters to using tiktoken.

This is a precursor to fixing #543.

@VinciGit00
Collaborator

Is it valid for all the models or just for OpenAI?

@VinciGit00
Collaborator

Hi, using Ollama we have the following error
[Screenshot of the Ollama error, taken 2024-09-02 at 11:40]

@tm-robinson
Contributor Author

@VinciGit00 ah, I only tested with OpenAI. I imagine tiktoken may not be aware of the encodings for Ollama models. I will have a look at this later to see if I can make the llm_model parameter optional for cases where tiktoken does not support the model's encoding.

@f-aguzzi
Member

f-aguzzi commented Sep 2, 2024

This is a simpler error: the llm_model key was not added to the ParseNode of the SmartScraperGraph. I'm testing whether it works now.

@LorenzoPaleari
Contributor

LorenzoPaleari commented Sep 2, 2024

From the documentation it seems that tiktoken works only with OpenAI models. What will happen when a model not recognised by tiktoken is used?

Would it be safer to put a condition on the usage of this token counter in ParseNode?

All custom graphs are also missing this new configuration key and are therefore failing.

@f-aguzzi
Member

f-aguzzi commented Sep 2, 2024

We were working on something similar in #554 / #556; we might try to merge it into this PR if @VinciGit00 is on board with the idea. His plan was to make four different tokenizers: three for the most commonly used model families (the GPTs, the Mistrals, and the LLaMAs, according to telemetry data) and one generic but less accurate tokenizer for all the others.

We might reuse the code from @tm-robinson for OpenAI, patch it with Vinci's Mistral tokenizer from #556, and add the missing LLaMA tokenizer.

@tm-robinson
Contributor Author

I've added some exception handling to the token counting code so that if the model is not specified (e.g. in the case of Ollama), or the model name is not supported by tiktoken's encoding_for_model function, the name of an OpenAI model (gpt-4o-mini) is used instead.

This means the token count for models that use a different encoding from OpenAI's will be incorrect, but it will still be closer to the correct count than before.

@DiTo97
Collaborator

DiTo97 commented Sep 3, 2024

I've added some exception handling to the token counting code so that if the model is not specified (e.g. in the case of Ollama), or the model name is not supported by tiktoken's encoding_for_model function, the name of an OpenAI model (gpt-4o-mini) is used instead.

This means the token count for models that use a different encoding from OpenAI's will be incorrect, but it will still be closer to the correct count than before.

I'm not sure that's the desired behavior. Any time we auto-default to something there should be a very good reason, and gpt-4o-mini's tokenizer is much more compact and efficient than other open-source variants, so it will likely produce overly optimistic token count estimates.

I think having four separate token count estimators (GPTs, Mistrals, LLaMAs, and a generic one) is the better approach and what we should target.
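A dispatch along those lines could be sketched as follows (all names here are hypothetical; the per-family estimators are stubbed with a rough characters-per-token heuristic, whereas the real ones would wrap the GPT, Mistral, and LLaMA tokenizers):

```python
from typing import Callable, Dict


def _generic_estimate(text: str) -> int:
    # Generic, low-accuracy heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)


# Hypothetical registry mapping a model-family prefix to its estimator.
# All three are stubbed with the generic heuristic in this sketch.
ESTIMATORS: Dict[str, Callable[[str], int]] = {
    "gpt": _generic_estimate,      # would wrap tiktoken
    "mistral": _generic_estimate,  # would wrap the Mistral tokenizer
    "llama": _generic_estimate,    # would wrap the LLaMA tokenizer
}


def estimate_tokens(text: str, model_name: str) -> int:
    """Pick a family-specific estimator by model-name prefix,
    defaulting to the generic heuristic for unknown families."""
    family = model_name.lower()
    for prefix, estimator in ESTIMATORS.items():
        if family.startswith(prefix):
            return estimator(text)
    return _generic_estimate(text)
```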

@VinciGit00
Collaborator

Hi, we would like you to get this branch working: https://github.com/ScrapeGraphAI/Scrapegraph-ai/tree/refactoring-tokenization

@tm-robinson
Contributor Author

@VinciGit00 I'm working on this. I've mostly made the changes needed for OpenAI and Mistral and just need to test them. Once done, I'll put them into this branch and then look at Ollama separately.

@VinciGit00
Collaborator

Oh perfect thank you

@tm-robinson
Contributor Author

Assuming #643 looks good, I think this one can be closed and I can continue working in that branch.

@VinciGit00
Collaborator

Ok thx

@VinciGit00 VinciGit00 closed this Sep 8, 2024