initial attempt to capture perf experiment stats #114

yangm2 · 2025-06-16T20:47:17Z

I've added three metadata fields to each entry in the chatlog.jsonl:

model
model_reasoning_effort
chunk_times

The first two are tags for splitting or pivoting the data in later analysis (e.g. o3 vs o4, medium vs high effort). More tags can be added in the future.

In addition, a high-level summary of the chunk_times data is printed to the console after each response:

total number of chunks
time to first chunk (thinking time?)
average chunk time after the first chunk
total time

Additional aggregations can be added now or later.

Example of what shows up in the chatlog.jsonl (formatted for readability):

{
   "messages": [{
         "role": "user",
         "content": "r u free?"}, {
            "role": "assistant", 
            "content": "Yes, I’m available. What do you need help with regarding your Oregon eviction situation?"}], 
   "metadata": {
      "session_id": "c0aab2f4-cc0d-4cf1-852b-4bdb149571c6", 
      "model": "o3", 
      "model_reasoning_effort": "medium", 
      "chunk_times": [
         "2025-06-16T20:26:14.733+00:00",
         "2025-06-16T20:26:18.355+00:00",
         "2025-06-16T20:26:18.356+00:00",
         "2025-06-16T20:26:18.357+00:00",
         "2025-06-16T20:26:18.393+00:00",
         "2025-06-16T20:26:18.394+00:00",
         "2025-06-16T20:26:18.396+00:00",
         "2025-06-16T20:26:18.461+00:00",
         "2025-06-16T20:26:18.463+00:00",
         "2025-06-16T20:26:18.465+00:00",
         "2025-06-16T20:26:18.508+00:00",
         "2025-06-16T20:26:18.508+00:00",
         "2025-06-16T20:26:18.509+00:00",
         "2025-06-16T20:26:18.520+00:00",
         "2025-06-16T20:26:18.522+00:00",
         "2025-06-16T20:26:18.528+00:00",
         "2025-06-16T20:26:18.551+00:00",
         "2025-06-16T20:26:18.553+00:00",
         "2025-06-16T20:26:18.555+00:00"], 
      "ts": "2025-06-16T20:26:18.872+00:00"}}

And here's an example of what shows up on the console:

18 chunks
  3.623 first chunk time (seconds)
  0.029 average chunk time after first chunk (seconds)
  4.139 total seconds
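
A minimal sketch of how a summary like this might be derived from chunk_times. It assumes the first timestamp marks when the request was sent and each later one marks a chunk arriving, which gives 18 chunks for the 19 timestamps in the example. The function name summarize_chunk_times, the optional end_ts argument, and the definition of the average as the mean gap between consecutive chunks are all assumptions rather than the PR's actual code, so the average in particular may not match the console value.

```python
from datetime import datetime


def summarize_chunk_times(chunk_times, end_ts=None):
    """Summarize a list of ISO-8601 timestamps.

    Assumption: chunk_times[0] is the request start and every later
    entry is the arrival time of one streamed chunk.
    """
    stamps = [datetime.fromisoformat(t) for t in chunk_times]
    if len(stamps) < 2:
        raise ValueError("need a start timestamp and at least one chunk")

    start, first, last = stamps[0], stamps[1], stamps[-1]
    # Fall back to the last chunk if no separate end timestamp is given.
    end = datetime.fromisoformat(end_ts) if end_ts else last
    n_chunks = len(stamps) - 1

    # Mean gap between consecutive chunks after the first one (assumed definition).
    if n_chunks > 1:
        avg_gap = (last - first).total_seconds() / (n_chunks - 1)
    else:
        avg_gap = 0.0

    return {
        "chunks": n_chunks,
        "first_chunk_s": (first - start).total_seconds(),
        "avg_gap_s": avg_gap,
        "total_s": (end - start).total_seconds(),
    }
```

Passing the entry's "ts" value as end_ts makes the total cover the time up to when the entry was logged, not just the time up to the last chunk.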

A separate post-processing script that reads many jsonl chat logs, aggregates them, and plots the results can be written in the future. Before writing it, I would need a better picture of how folks plan to collect the jsonl logs and how they want to analyze the data.
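
As a rough sketch of what such post-processing could look like, the function below groups log entries by the model and model_reasoning_effort tags and averages the time to first chunk per group. The function name first_chunk_latency_by_tag is hypothetical, and it assumes one JSON object per line and that the first chunk_times entry marks the request start; plotting is left out.

```python
import json
from collections import defaultdict
from datetime import datetime
from statistics import mean


def first_chunk_latency_by_tag(lines):
    """Mean time to first chunk (seconds), keyed by (model, reasoning effort).

    Assumptions: each non-empty line is one JSON log entry, and
    chunk_times[0] is the request start, chunk_times[1] the first chunk.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        meta = json.loads(line).get("metadata", {})
        times = meta.get("chunk_times") or []
        if len(times) < 2:
            continue  # entry has no usable timing data
        start = datetime.fromisoformat(times[0])
        first = datetime.fromisoformat(times[1])
        key = (meta.get("model"), meta.get("model_reasoning_effort"))
        groups[key].append((first - start).total_seconds())
    return {key: mean(values) for key, values in groups.items()}


# Example usage against a log file:
# with open("chatlog.jsonl", encoding="utf-8") as f:
#     print(first_chunk_latency_by_tag(f))
```

Because the function accepts any iterable of lines, several log files could be chained together before being passed in.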

I'm looking for feedback on whether this is:

capturing the data we are interested in
putting this in a usable format/location

yangm2 self-assigned this Jun 16, 2025

yangm2 added the backend Bot implementation and other backend concerns label Jun 16, 2025

yangm2 marked this pull request as ready for review June 17, 2025 03:09

yangm2 requested review from wittejm and apkostka June 17, 2025 17:33

initial attempt to capture perf experiment stats

21f77f8

yangm2 force-pushed the telemetry branch from 183c6b8 to 21f77f8 Compare June 17, 2025 18:31

KentShikama approved these changes Jun 17, 2025
