LocalLLM TokenIterator is taking a good chunk of time during my chats. Am I doing something wrong? #87
What Stanford Spezi module is your challenge related to?
Spezi

Description

Reproduction
Each call to session.generate() has a 3-4 second delay between streams.

Expected behavior
I'm looking for a way to get the generator to start writing sooner without that delay.

Additional context
No response
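For reference, this is roughly the call pattern in question. A minimal sketch only: it assumes `session` is a SpeziLLMLocal `LLMLocalSession` whose `generate()` yields the response as an async stream of `String` tokens, and it omits how the prompt is added to the session's context.

```swift
import SpeziLLM
import SpeziLLMLocal

/// Minimal sketch of the streaming call in question (prompt/context handling omitted).
func collectResponse(from session: LLMLocalSession) async throws -> String {
    var response = ""
    // Assumption: generate() yields the response token-by-token as an AsyncThrowingStream<String, Error>.
    for try await token in try await session.generate() {
        // The 3-4 second gap shows up before the first iteration of this loop.
        response.append(token)
    }
    return response
}
```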
Replies: 1 comment 1 reply
Once the model is initialized, the 3-4 seconds you mentioned are most likely the time the LLM needs to process the input before it begins generating output tokens. In other words, a longer input leads to a longer delay before the first output token appears. There's not much we can do about that; we're simply working with constrained resources on the local device. @LeonNissen Not sure if there's more to it (I doubt it), but might you have any additional insights?
Hi @bryan1anderson,
The initial call may take longer (since the model needs to be loaded into memory), but subsequent runs should be faster. However, as @philippzagar correctly pointed out, factors like context window size, model type, and available resources can still impact response times.
In a future version of SpeziLLM, we plan to expose performance metrics such as generation speed (tokens per second) and time to first token, which could help you pinpoint where the time is going.
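In the meantime, you can get a rough measurement yourself by timing the stream on the client side. A sketch, again assuming `generate()` yields the response as an async stream of `String` chunks (this is not a SpeziLLM API, just manual instrumentation):

```swift
import Foundation
import SpeziLLM
import SpeziLLMLocal

/// Rough client-side timing around the token stream (illustrative only).
func generateWithTiming(using session: LLMLocalSession) async throws -> String {
    let clock = ContinuousClock()
    let start = clock.now
    var firstToken: Duration?   // time to first token ≈ (first-call model loading) + prompt processing
    var chunkCount = 0
    var response = ""

    for try await token in try await session.generate() {
        if firstToken == nil {
            firstToken = start.duration(to: clock.now)
        }
        chunkCount += 1
        response.append(token)
    }

    let total = start.duration(to: clock.now)
    print("Time to first token: \(firstToken ?? .zero), total: \(total), streamed chunks: \(chunkCount)")
    return response
}
```

If the time to first token dominates while the per-chunk rate afterwards is fine, that confirms the delay is prompt processing (plus model loading on the first call) rather than slow generation.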
As a workaround, you might consider using a smaller model or reducing the context window.
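For example, something along these lines. Note that the exact initializer and parameter labels differ between SpeziLLM versions, so treat the names below as placeholders and check the `LLMLocalSchema` documentation for the version you're using:

```swift
import Foundation
import SpeziLLMLocal

// Workaround sketch: a smaller (more heavily quantized) model plus a reduced context window,
// so there are fewer prompt tokens to process before the first output token appears.
// NOTE: parameter labels are placeholders; consult your SpeziLLM version's documentation.
let schema = LLMLocalSchema(
    modelPath: URL.documentsDirectory.appending(path: "llm.gguf"),  // path to a smaller model file (illustrative)
    parameters: .init(maxOutputLength: 512),                        // cap the generated output length
    contextParameters: .init(contextWindowSize: 1_024)              // reduced context window
)
```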