Hi @bryan1anderson,

The initial call may take longer (since the model needs to be loaded into memory), but subsequent runs should be faster. However, as @philippzagar correctly pointed out, factors like context window size, model type, and available resources can still impact response times.

In a future version of SpeziLLM, we plan to expose performance metrics such as tokens per second (generation throughput) and time to first token, which could help you identify the issue here.

As a workaround, you might consider using a smaller model or reducing the context window.
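To give a rough idea of what that could look like with SpeziLLMLocal, here is a minimal sketch. Note that the parameter names used below (`maxOutputLength`, `contextWindowSize`, and the model identifier) are assumptions for illustration and may differ depending on your SpeziLLM version, so please check the current `LLMLocalSchema` API.

```swift
import SpeziLLM
import SpeziLLMLocal

// Illustrative only: the parameter names below are assumptions and may differ
// in your SpeziLLM version – check the current LLMLocalSchema API.
let schema = LLMLocalSchema(
    model: .llama3_8B_4bit,                              // assumption: a smaller, quantized model variant
    parameters: .init(maxOutputLength: 256),             // assumption: cap the number of generated tokens
    contextParameters: .init(contextWindowSize: 1_024)   // assumption: reduced context window
)

// The schema is then handed to the LLMRunner configured in your app,
// e.g. `let session: LLMLocalSession = runner(with: schema)`.
```

A smaller quantized model loads faster and a reduced context window lowers per-token latency, which should make the difference between the first and subsequent calls less noticeable.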
