
Conversation

marcvernet31

Issue #, if available: #3106

Description of changes:
Displays detailed token performance metrics (tokens/sec, TTFT, latency) when using verbose flags.

  • performance.rs - New module calculating token metrics from request metadata
  • mod.rs - Integration logic and styled display output to stderr
  • parser.rs - Enhanced metadata support for performance tracking

In this implementation the metrics are displayed at every verbose level; it's open for discussion whether they should only appear at a specific level, or whether a dedicated flag (--metrics, for example) would be better.
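For reference, here is a minimal sketch of the kind of timing arithmetic behind the numbers shown below (TTFT, tokens/sec). The struct and field names are illustrative, not the actual performance.rs API:

```rust
use std::time::Instant;

/// Illustrative only: rough shape of per-request timing data and the
/// derived metrics shown in the example output. Not the real module API.
struct RequestTiming {
    request_start: Instant,
    first_token_time: Option<Instant>,
    end_time: Instant,
    completion_tokens: usize,
}

impl RequestTiming {
    /// Time to first token (TTFT) in milliseconds, if any token arrived.
    fn ttft_ms(&self) -> Option<u128> {
        self.first_token_time
            .map(|t| t.duration_since(self.request_start).as_millis())
    }

    /// Tokens per second over the generation window (first token to end of stream).
    fn tokens_per_sec(&self) -> f64 {
        let Some(first) = self.first_token_time else {
            return 0.0;
        };
        let generation_secs = self.end_time.duration_since(first).as_secs_f64();
        if generation_secs > 0.0 {
            self.completion_tokens as f64 / generation_secs
        } else {
            0.0
        }
    }
}
```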

Example output:

> hello

> Hello! I'm Amazon Q, an AI assistant built by AWS. I'm here to help you with AWS services, development tasks, 
infrastructure management, and more. 

What can I help you with today?

┌─ Performance Metrics ────────────────────────────────────────┐
│ Performance Metrics:
│   Tokens/sec: 0.0
│   TTFT: 2766ms
│   Total duration: 2.8s
│   Generation time: 0ms
│   Avg inter-token latency: 0.0ms
│   Total tokens: 10 (40 prompt + 0 completion)
│   Prompt processing: 2766ms

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

}

/// Get the current verbose level from CLI args or environment variables
fn get_verbose_level(&self) -> u8 {
Contributor

Not a fan of doing this type of argument parsing. I would rather we support a new slash command like /debug that can be extended more easily; for example, here we could have /debug metrics print out the last request's metrics.

Is it necessary to do this printing on every request, though? If so, maybe we could have a flag that gets toggled on/off through /debug metrics toggle.
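A rough sketch of what that could look like, assuming a simple dispatch for the slash command (all names here are hypothetical, not existing CLI code):

```rust
/// Hypothetical sketch of a /debug slash command with a metrics subcommand,
/// as suggested above. None of these names exist in the codebase.
enum DebugCommand {
    /// /debug metrics: print the metrics of the last request.
    Metrics,
    /// /debug metrics toggle: flip per-request metrics printing on/off.
    MetricsToggle,
}

fn parse_debug_command(input: &str) -> Option<DebugCommand> {
    let mut parts = input.trim().split_whitespace();
    match (parts.next()?, parts.next(), parts.next()) {
        ("/debug", Some("metrics"), Some("toggle")) => Some(DebugCommand::MetricsToggle),
        ("/debug", Some("metrics"), None) => Some(DebugCommand::Metrics),
        _ => None,
    }
}

fn handle_debug_command(
    cmd: DebugCommand,
    show_every_request: &mut bool,
    last_metrics: Option<&str>,
) {
    match cmd {
        DebugCommand::Metrics => match last_metrics {
            Some(m) => eprintln!("{m}"),
            None => eprintln!("No request metrics recorded yet"),
        },
        DebugCommand::MetricsToggle => {
            *show_every_request = !*show_every_request;
            eprintln!(
                "Per-request metrics {}",
                if *show_every_request { "enabled" } else { "disabled" }
            );
        }
    }
}
```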

}

/// Track tokens for the given content and record timing information
fn track_tokens(&mut self, content: &str) {
Contributor

This has to be pretty inaccurate: the token estimate alone isn't accurate, and if we're applying it to tiny content chunks, the actual error is going to be huge.

We should stick to counting chars instead, and then estimate tokens from that.
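A sketch of that suggestion (names are illustrative, not the PR's actual track_tokens code): count chars exactly per chunk and only convert to an estimated token count when reporting, so per-chunk rounding error doesn't accumulate.

```rust
/// Illustrative sketch of char-based tracking with a single late conversion
/// to an estimated token count. Not the PR's actual implementation.
struct StreamStats {
    chars_received: usize,
}

impl StreamStats {
    /// Counting chars per chunk is cheap and exact, even for tiny chunks.
    fn track_chunk(&mut self, content: &str) {
        self.chars_received += content.chars().count();
    }

    /// Convert to tokens only at reporting time. The ~4 chars/token ratio is a
    /// common rule of thumb for English text, not a property of any tokenizer.
    fn estimated_tokens(&self) -> usize {
        (self.chars_received as f64 / 4.0).round() as usize
    }
}
```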

/// Time when first token was received
first_token_time: Option<Instant>,
/// Incremental token counts with timestamps (duration from request start in ms, token count)
token_timestamps: Vec<(u64, usize)>,
Contributor

How is this different from time_between_chunks?

/// Total tokens generated in the response so far
total_tokens: usize,
/// Time when first token was received
first_token_time: Option<Instant>,
Contributor

How is this different from time_to_first_chunk?


/// Token count metrics for performance calculation
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TokenMetrics {
Contributor

Don't define TokenMetrics like this since it's all based on inaccurate estimates rather than concrete token usage.

Why can't this be derived from RequestMetadata? It appears to duplicate the same data, except with a token estimate in place of concrete usage.
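A hedged sketch of what deriving the displayed numbers from RequestMetadata might look like; the field names below (time_to_first_chunk, time_between_chunks, response_size) mirror the ones mentioned in this review and are assumptions about the real struct, not its actual definition:

```rust
use std::time::Duration;

// Assumed shape only; the real RequestMetadata may differ.
struct RequestMetadata {
    time_to_first_chunk: Option<Duration>,
    time_between_chunks: Vec<Duration>,
    response_size: usize, // hypothetical: total chars in the response
}

/// Display-only view derived from RequestMetadata at print time,
/// rather than a parallel TokenMetrics struct tracked during streaming.
struct DisplayMetrics {
    ttft_ms: Option<u128>,
    avg_inter_chunk_latency_ms: f64,
    estimated_tokens: usize,
}

impl DisplayMetrics {
    fn from_metadata(md: &RequestMetadata) -> Self {
        let avg_ms = if md.time_between_chunks.is_empty() {
            0.0
        } else {
            md.time_between_chunks
                .iter()
                .map(|d| d.as_secs_f64() * 1000.0)
                .sum::<f64>()
                / md.time_between_chunks.len() as f64
        };
        Self {
            ttft_ms: md.time_to_first_chunk.map(|d| d.as_millis()),
            avg_inter_chunk_latency_ms: avg_ms,
            // Chars-to-tokens estimate happens only here, at display time.
            estimated_tokens: (md.response_size as f64 / 4.0).round() as usize,
        }
    }
}
```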

@brandonskiser
Contributor

Hey, thanks! Can you also run cargo +nightly fmt?
