Fix for #617 updated token calculation on parsenode #618


Conversation

tm-robinson
Contributor

This is a fix for #617, which changes token counting from splitting the string on space characters to using tiktoken.

This is a precursor to fixing #543.

@VinciGit00
Collaborator

Is it valid for all the models or just for OpenAI?

@VinciGit00
Collaborator

Hi, using Ollama we have the following error
[Screenshot of the Ollama error, taken 2024-09-02 at 11:40]

@tm-robinson
Contributor Author

@VinciGit00 ah, I only tested with OpenAI. I imagine tiktoken may not be aware of the encodings for Ollama models. I will have a look at this later to see if I can make the llm_model parameter optional for cases where tiktoken does not support the model's encoding.

@f-aguzzi
Member

f-aguzzi commented Sep 2, 2024

This is a simpler error: the llm_model key was not added to the ParseNode of the SmartScraperGraph. I'm testing whether it works now.

@LorenzoPaleari
Contributor

LorenzoPaleari commented Sep 2, 2024

From the documentation it seems that tiktoken works only with OpenAI models. What will happen when a model not recognised by tiktoken is used?

Would it be safer to put a condition on the usage of this token counter in ParseNode?

All custom graphs are also missing this new configuration key and are therefore failing.

@f-aguzzi
Member

f-aguzzi commented Sep 2, 2024

We were working on something similar in #554 / #556; we might try to merge it into this PR if @VinciGit00 is on board with the idea. His plan was to make four different tokenizers: three for the most commonly used model families (the GPTs, the Mistrals, and the LLaMAs, according to telemetry data) and one generic but less accurate tokenizer for all the others.

We might reuse the code from @tm-robinson for OpenAI, patch it with Vinci's Mistral tokenizer from #556, and add the missing LLaMA tokenizer.

@tm-robinson
Contributor Author

I've added some exception handling to the token counting code so that if the model is not specified (e.g. in the case of Ollama), or the model name is not supported by tiktoken's encoding_for_model function, the name of an OpenAI model (gpt-4o-mini) is used instead.

This means the token count for models that use a different encoding from OpenAI's will be incorrect, but it will still be closer to the correct count than before.

@DiTo97
Collaborator

DiTo97 commented Sep 3, 2024

I've added some exception handling to the token counting code so that if the model is not specified (e.g. in the case of Ollama), or the model name is not supported by tiktoken's encoding_for_model function, the name of an OpenAI model (gpt-4o-mini) is used instead.

This means the token count for models that use a different encoding from OpenAI's will be incorrect, but it will still be closer to the correct count than before.

I'm not sure that's the desired behavior. Any time we auto-default to something there should be a very good reason, and gpt-4o-mini's tokenizer is much more compact and efficient than other open-source variants, so it will likely produce overly optimistic token count estimates.

I think having four separate token count estimators (GPTs, Mistrals, LLaMAs, and a generic one) is the better approach and what we should target.
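A dispatch along those lines could be sketched as follows (all names here are hypothetical; the per-family estimators are stubbed with a rough characters-per-token heuristic, whereas the real ones would wrap the GPT, Mistral, and LLaMA tokenizers):

```python
from typing import Callable, Dict


def _generic_estimate(text: str) -> int:
    # Generic, low-accuracy heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)


# Hypothetical registry mapping a model-family prefix to its estimator.
# All three are stubbed with the generic heuristic in this sketch.
ESTIMATORS: Dict[str, Callable[[str], int]] = {
    "gpt": _generic_estimate,      # would wrap tiktoken
    "mistral": _generic_estimate,  # would wrap the Mistral tokenizer
    "llama": _generic_estimate,    # would wrap the LLaMA tokenizer
}


def estimate_tokens(text: str, model_name: str) -> int:
    """Pick a family-specific estimator by model-name prefix,
    defaulting to the generic heuristic for unknown families."""
    family = model_name.lower()
    for prefix, estimator in ESTIMATORS.items():
        if family.startswith(prefix):
            return estimator(text)
    return _generic_estimate(text)
```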

@VinciGit00
Collaborator

Hi, we would like you to get this branch working: https://github.com/ScrapeGraphAI/Scrapegraph-ai/tree/refactoring-tokenization

@tm-robinson
Contributor Author

@VinciGit00 I'm working on this. I've mostly made the changes needed for OpenAI and Mistral and just need to test them. Once done, I'll put them into this branch and then look at Ollama separately.

@VinciGit00
Collaborator

Oh perfect thank you

@tm-robinson
Contributor Author

Assuming #643 looks good, I think this one can be closed and I can continue working in that branch.

@VinciGit00
Collaborator

Ok thx

@VinciGit00 VinciGit00 closed this Sep 8, 2024