LocalLLM TokenIterator is taking a good chunk of time during my chats. Am I doing something wrong? #87
What Stanford Spezi module is your challenge related to?
Spezi

Description

Reproduction
Each call to session.generate() has a 3-4 second delay between streams.

Expected behavior
I'm looking for a way to get the generator to start writing sooner without that delay.

Additional context
No response
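For reference, this is roughly the call pattern in question. A minimal sketch only: it assumes `session` is a SpeziLLMLocal `LLMLocalSession` whose `generate()` yields the response as an async stream of `String` tokens, and it omits how the prompt is added to the session's context.

```swift
import SpeziLLM
import SpeziLLMLocal

/// Minimal sketch of the streaming call in question (prompt/context handling omitted).
func collectResponse(from session: LLMLocalSession) async throws -> String {
    var response = ""
    // Assumption: generate() yields the response token-by-token as an AsyncThrowingStream<String, Error>.
    for try await token in try await session.generate() {
        // The 3-4 second gap shows up before the first iteration of this loop.
        response.append(token)
    }
    return response
}
```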
Replies: 1 comment 1 reply
Once the model is initialized, the 3-4 seconds you mentioned are most likely the time the LLM needs to process the input before it begins generating output tokens. In other words, a longer input leads to a longer delay before the first output token appears. There's not much we can do about that; we're simply working with constrained resources on the local device. @LeonNissen Not sure if there's more to it (I doubt it), but might you have any additional insights?
Hi @bryan1anderson,
The initial call may take longer (since the model needs to be loaded into memory), but subsequent runs should be faster. However, as @philippzagar correctly pointed out, factors like context window size, model type, and available resources can still impact response times.
In a future version of SpeziLLM, we plan to expose performance metrics such as generation speed (tokens per second) and time to first token, which could help you pinpoint where the time is going.
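In the meantime, you can get a rough measurement yourself by timing the stream on the client side. A sketch, again assuming `generate()` yields the response as an async stream of `String` chunks (this is not a SpeziLLM API, just manual instrumentation):

```swift
import Foundation
import SpeziLLM
import SpeziLLMLocal

/// Rough client-side timing around the token stream (illustrative only).
func generateWithTiming(using session: LLMLocalSession) async throws -> String {
    let clock = ContinuousClock()
    let start = clock.now
    var firstToken: Duration?   // time to first token ≈ (first-call model loading) + prompt processing
    var chunkCount = 0
    var response = ""

    for try await token in try await session.generate() {
        if firstToken == nil {
            firstToken = start.duration(to: clock.now)
        }
        chunkCount += 1
        response.append(token)
    }

    let total = start.duration(to: clock.now)
    print("Time to first token: \(firstToken ?? .zero), total: \(total), streamed chunks: \(chunkCount)")
    return response
}
```

If the time to first token dominates while the per-chunk rate afterwards is fine, that confirms the delay is prompt processing (plus model loading on the first call) rather than slow generation.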
As a workaround, you might consider using a smaller model or reducing the context window.
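For example, something along these lines. Note that the exact initializer and parameter labels differ between SpeziLLM versions, so treat the names below as placeholders and check the `LLMLocalSchema` documentation for the version you're using:

```swift
import Foundation
import SpeziLLMLocal

// Workaround sketch: a smaller (more heavily quantized) model plus a reduced context window,
// so there are fewer prompt tokens to process before the first output token appears.
// NOTE: parameter labels are placeholders; consult your SpeziLLM version's documentation.
let schema = LLMLocalSchema(
    modelPath: URL.documentsDirectory.appending(path: "llm.gguf"),  // path to a smaller model file (illustrative)
    parameters: .init(maxOutputLength: 512),                        // cap the generated output length
    contextParameters: .init(contextWindowSize: 1_024)              // reduced context window
)
```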