Fix for #617 updated token calculation on parsenode #618
Conversation
Is it valid for all the models or just for OpenAI?
@VinciGit00 ah, I only tested with OpenAI. I would imagine tiktoken may not be aware of the encodings for Ollama models. I will have a look at this later to see if I can make the llm_model parameter optional for cases where tiktoken may not support the encoding.
This is a simpler error: the llm_model key was not added into the node configuration.
From the documentation it seems that tiktoken works only on OpenAI models. It may be safer to put a condition on the usage of this token counter in ParseNode. All custom graphs are also missing this new configuration key and are thus failing.
We were making something similar in #554 / #556 - we might try to merge it into this PR if @VinciGit00 is on board with the idea. His plan was to make four different tokenizers: three for the most commonly used model families (the GPTs, the Mistrals and the LLaMAs, according to telemetry data) and one, generic but not very accurate, for all the others. We might reuse the code from @tm-robinson for OpenAI, patched with Vinci's Mistral tokenizer from #556, and add the missing LLaMA tokenizer.
…coding, and fix smart_scraper_graph to pass model
…into 617-fix-token-counting-in-parsenode
I've added some exception handling to the token counting code so that if the model is not specified (e.g. in the case of Ollama), or the model name is not recognised by tiktoken, it falls back to a default encoding. This means the token count for models that use a different encoding to OpenAI will be incorrect, but it will still be closer to the correct count than previously.
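A rough sketch of that fallback behaviour (hypothetical helper, not the actual code from this PR; the `cl100k_base` default and the last-resort whitespace count are illustrative assumptions):

```python
from typing import Optional


def num_tokens(text: str, llm_model: Optional[str] = None) -> int:
    """Count tokens, falling back to a default encoding when the
    model is missing or unknown to tiktoken (e.g. Ollama models)."""
    try:
        import tiktoken
    except ImportError:
        # tiktoken unavailable: crude whitespace count as a last resort
        return len(text.split())
    try:
        if llm_model is None:
            raise KeyError  # no model given, use the default encoding
        enc = tiktoken.encoding_for_model(llm_model)
    except KeyError:
        # model name not recognised by tiktoken: default encoding
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))
```

The count from the default encoding will be wrong for non-OpenAI models, but it no longer raises for them.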
I'm not sure that's the desired behavior; any time we auto-default to something there should be a very good reason, and gpt-4o-mini's tokenizer is much more compact and efficient than other open-source variants, so it will likely generate overly optimistic token count estimates. I think having four separate token count estimators (GPTs, Mistrals, LLaMAs and a generic one) is the better approach and what we should target.
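A minimal sketch of that per-family dispatch (all names hypothetical; the family estimators here are stand-ins that would wrap the real GPT/Mistral/LLaMA tokenizers, and the ~4-characters-per-token heuristic is only a rough generic approximation):

```python
from typing import Callable, Dict


def _generic_estimate(text: str) -> int:
    # crude generic fallback: roughly 4 characters per token on average
    return max(1, len(text) // 4)


# In a real implementation each entry would call the appropriate
# tokenizer; here they all share the generic heuristic for brevity.
FAMILY_ESTIMATORS: Dict[str, Callable[[str], int]] = {
    "gpt": _generic_estimate,      # would use tiktoken
    "mistral": _generic_estimate,  # would use Mistral's tokenizer
    "llama": _generic_estimate,    # would use LLaMA's tokenizer
}


def estimate_tokens(text: str, model_name: str) -> int:
    """Route to a family-specific estimator, else the generic one."""
    for family, estimator in FAMILY_ESTIMATORS.items():
        if family in model_name.lower():
            return estimator(text)
    return _generic_estimate(text)
```

The point of the design is that an unknown model degrades to a conservative generic estimate rather than silently borrowing gpt-4o-mini's compact encoding.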
Hi, we would like you to make this branch work: https://github.com/ScrapeGraphAI/Scrapegraph-ai/tree/refactoring-tokenization
@VinciGit00 I'm working on this, I've mostly made the changes needed for OpenAI and Mistral, just need to test them. Once done I'll put them into this branch and then look at Ollama separately. |
Oh perfect thank you |
Assuming #643 looks good then I think this one can be closed and I can continue working in that branch. |
Ok thx |
This is a fix for #617 which changes the token counting from splitting the string on space characters to using tiktoken. This is a precursor to fixing #543.
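To illustrate the difference: splitting on spaces counts whole whitespace-delimited words, while a BPE encoding like tiktoken's splits punctuation and subwords into separate tokens, so it usually yields a higher, more realistic count (sketch only; exact counts depend on the encoding, and the tiktoken branch is guarded in case it is not installed):

```python
text = "Scraping, parsing, and token-counting!"

# old approach: whitespace split
naive_count = len(text.split())  # 4 "tokens"

# new approach: BPE token count via tiktoken
try:
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    bpe_count = len(enc.encode(text))
except ImportError:
    bpe_count = None  # tiktoken not available in this environment
```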