Commit f421d2a

Merge pull request #9 from NicholasGoh/feat/monitoring-observability

2 parents 94a1228 + f0b471b

6 files changed: +188 −0
---
slug: monitoring-and-observability
title: Monitoring and Observability
authors: [nicholas]
tags: [llm-monitoring, llm-observability]
---

import ReactPlayer from 'react-player'
## Demo

Check out the following interactive dashboards, [Grafana](https://nicholas-goh.com/grafana) and [Langfuse](https://nicholas-goh.com/langfuse), before I dive into the blog!

Username and password:

- `demo@demo.com`
- `D3m@123456`

### Grafana

<ReactPlayer playing controls url='/vid/monitoring-and-observability/grafana.mp4' />

<!-- truncate -->

### Langfuse

<ReactPlayer playing controls url='/vid/monitoring-and-observability/langfuse.mp4' />
## Introduction

In this blog, I dive deeper into the tools I found particularly useful while developing a [complex agentic system](/blog/customer-service-automation). Previously, I only touched on this topic briefly, sharing static snapshots of the technologies involved due to limitations in showcasing public-facing interactive dashboards. This blog offers solutions to that challenge.
## Monitoring: Enhancing Cost Tracking with Latency Metrics

### Native Monitoring with OpenAI: Token Usage and Cost

OpenAI provides a built-in dashboard for monitoring token usage, which offers the following benefits:

- **Minimal setup** — simply provide an API key.
- **Filterable analytics** — view usage by model and date.
- **Clear breakdowns** — number of requests, prompt and completion tokens, and cost per model.

#### Token Usage Dashboard

![OpenAI Tokens](./openai-tokens.png)

#### Cost Usage Dashboard

![OpenAI Costs](./openai-costs.png)
While the built-in monitoring is great for tracking usage and cost, it doesn’t surface latency metrics for individual requests — something I’ve found increasingly important to capture elsewhere.

:::tip[Latency Tracking in Context]

It probably makes more sense to handle latency tracking within the development and production environment, since that naturally includes not just model inference time but also network overhead, retries, and any local delays. This gives a more realistic picture of end-to-end performance as experienced by users.

:::
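To make that concrete, here is a minimal stdlib-only sketch of app-side latency tracking around a streamed completion. `fake_stream` is a stand-in for a real streaming client call, and the metric names are my own, not from the post's code.

```python
# Minimal sketch of app-side latency tracking for a streamed completion.
# `fake_stream` stands in for a real streaming API call; names are illustrative.
import time

def fake_stream():
    """Placeholder for a streaming LLM client call."""
    for tok in ["Hello", ",", " world"]:
        yield tok

def timed_stream(stream, metrics):
    """Yield tokens while recording time-to-first-token and total latency."""
    start = time.perf_counter()
    for i, tok in enumerate(stream):
        if i == 0:
            metrics["first_token_s"] = time.perf_counter() - start
        yield tok
    metrics["total_s"] = time.perf_counter() - start

metrics = {}
tokens = list(timed_stream(fake_stream(), metrics))
# When wrapping a real client, `metrics` reflects end-to-end latency as
# experienced by the caller, including network overhead and retries.
```

Because the timing wraps the whole iteration, it naturally captures everything between the request leaving the app and the last token arriving.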
This lack of latency visibility becomes a limitation in more **complex agentic systems**, where understanding bottlenecks across chains of reasoning or worker nodes is key. For example:

- Is the delay in the supervisor node?
- Is a database tool or tool-use step slowing things down?
- Am I spending time waiting on slow responses from specific models?

I’m not planning to switch cloud LLM providers, but I want to stay flexible. Relying solely on OpenAI’s dashboards introduces a kind of **vendor lock-in** in monitoring visibility and granularity.
### Migrating to Grafana: Adding Latency and Flexibility

Grafana's [monitoring repository](https://github.com/grafana/grafana-openai-monitoring) provides an out-of-the-box way to monitor usage and latency metrics. However, it only supports Grafana Cloud, which runs into the same problem: I still could not expose a fully interactive, public-facing dashboard.
:::note[Public Dashboard Limitations]

Although externally shared dashboards are possible, they are [limited](https://grafana.com/docs/grafana/latest/dashboards/share-dashboards-panels/shared-dashboards/#limitations). As such, I self-hosted the Grafana stack as follows:

<details>

<summary>Grafana Stack</summary>

```mermaid
graph TD
  A[API]
  A --> B
  A --> C

  subgraph Gather Metrics
    B[Pushgateway]
    C[Loki]
    D[Prometheus]
    D --> B
  end

  subgraph Visualize Metrics
    E[Grafana]
    E --> C
    E --> D
  end

  subgraph LEGEND
    L1[Docker Container]
  end
```

</details>

:::
#### Adapting for Streaming Completions

Grafana’s example setup does not support streaming completions natively. I made the following changes to accommodate that:

##### Challenges with Prometheus

| Issue | Description |
|------------------|----------------------------------------------------------------------------|
| Short-lived jobs | Prometheus is designed to scrape metrics from long-lived jobs via `/metrics` endpoints. |
| Incompatibility | Streaming completions are short-lived and not easily integrated with the Prometheus Python client. |
##### Solutions Implemented

- Pushgateway Integration
  - Enables support for short-lived jobs.
  - Each completion (after the full stream ends) pushes usage metrics to Pushgateway.
  - Prometheus scrapes metrics from Pushgateway instead of directly from the short-lived job.
- Streaming Behavior
  - Metrics are not pushed per token, but only once per full completion.
  - This reduces metric noise and keeps the tracking efficient.
- Loki for Completion Logs
  - Completion events are logged into Loki.
  - This provides visibility into individual requests, helpful for debugging and tracing.
- Grafana Dashboards
  - Visualizes both usage metrics (from Prometheus) and event logs (from Loki).
  - Enables monitoring of latency, request volume, and real-time logs in one interface.
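The Pushgateway flow can be sketched with the Prometheus Python client. Metric names, labels, the job name, and the gateway address below are illustrative assumptions, not the post's actual code.

```python
# Sketch: record usage once per finished stream, then push to Pushgateway.
# Assumes a Pushgateway at localhost:9091; metric/label names are illustrative.
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway

registry = CollectorRegistry()
PROMPT_TOKENS = Counter(
    "llm_prompt_tokens_total", "Prompt tokens used", ["model"], registry=registry
)
COMPLETION_TOKENS = Counter(
    "llm_completion_tokens_total", "Completion tokens used", ["model"], registry=registry
)
LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end completion latency", ["model"], registry=registry
)

def record_completion(model, prompt_tokens, completion_tokens, latency_s):
    """Called once after the full stream ends -- never per token."""
    PROMPT_TOKENS.labels(model=model).inc(prompt_tokens)
    COMPLETION_TOKENS.labels(model=model).inc(completion_tokens)
    LATENCY.labels(model=model).observe(latency_s)

def push_metrics(gateway="localhost:9091"):
    """Push the whole registry; Prometheus then scrapes Pushgateway."""
    push_to_gateway(gateway, job="llm_completions", registry=registry)
```

Pushing a single batch per completion, rather than per token, is what keeps the metric volume manageable for streamed responses.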
See below for the same demo video as [above](#demo).

#### Grafana Demo

<ReactPlayer playing controls url='/vid/monitoring-and-observability/grafana.mp4' />

<br />

The Loki logs demoed at the end of the video provide a concise overview of input, output, and the project environment. However, I found that I needed more observability into what happens between input and output. Specifically, I should be able to see the internal routing: how the supervisor receives the prompt, delegates it to workers, and how the workers solve it using tools if needed.
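For reference, a completion event like the ones shown can be shipped to Loki's push API along these lines. The endpoint default, label names, and payload fields here are my own assumptions, not the post's code.

```python
# Sketch of logging a completion event to Loki's /loki/api/v1/push endpoint.
# Assumes Loki at localhost:3100; labels and log fields are illustrative.
import json
import time
import urllib.request

def build_loki_payload(labels, line):
    """One log line in the shape Loki's push API expects:
    nanosecond timestamp (as a string) paired with the log line."""
    return {
        "streams": [
            {
                "stream": labels,
                "values": [[str(time.time_ns()), line]],
            }
        ]
    }

def push_completion_log(prompt, completion, env="dev"):
    payload = build_loki_payload(
        {"app": "llm-api", "env": env},
        json.dumps({"prompt": prompt, "completion": completion}),
    )
    req = urllib.request.Request(
        "http://localhost:3100/loki/api/v1/push",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Grafana then queries these streams by label, which is how the per-request logs in the video are surfaced next to the Prometheus metrics.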
## Tracing: LLM Observability

### Langsmith: Dynamic Tracing, Static Public Sharing

I previously used Langsmith due to its minimal setup, which only requires an API key.

The native dashboard provides valuable features, including:

- Tracing each LLM call.
- Maintaining a node hierarchy, making it clear what each supervisor or worker receives as input and output.
- Displaying the latency and cost of each node.

These features significantly aided my development and debugging process by:

- Helping me pinpoint where prompt engineering issues occurred.
- Identifying potential optimizations for nodes and prompts to reduce processing time.
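To illustrate what a node hierarchy means here, the following is a stdlib-only sketch (not the Langsmith SDK): each span records its parent and latency, which is exactly the tree structure a tracing dashboard renders.

```python
# Toy hierarchical tracer: nested spans record parent, name, and latency.
# Purely illustrative -- real tracing SDKs do this (and much more) for you.
import contextvars
import time
from contextlib import contextmanager

_current_span = contextvars.ContextVar("current_span", default=None)
SPANS = []  # flat log of finished spans, children before parents

@contextmanager
def span(name):
    parent = _current_span.get()
    record = {"name": name, "parent": parent["name"] if parent else None}
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["latency_s"] = time.perf_counter() - start
        _current_span.reset(token)
        SPANS.append(record)

# A supervisor delegating to a worker shows up as nested spans:
with span("supervisor"):
    with span("worker"):
        time.sleep(0.01)  # stand-in for a tool call or model request
```

Reading `SPANS` back gives the input/output attribution per node that makes supervisor-to-worker debugging tractable.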
#### Langsmith Demo

<ReactPlayer playing controls url='/vid/monitoring-and-observability/langsmith.mp4' />

<br />

As previously mentioned, Langsmith does not offer a public-facing interactive dashboard. In earlier blog posts, I shared static snapshots of traces as a workaround. Below, I explore one solution for exposing a public-facing interactive dashboard to enhance observability.
### Langfuse: Dynamic Tracing with Public Dashboard

**Langfuse offers many features similar to Langsmith, with several additional enhancements:**

**Interactive flow diagram**:
- Visualizes the execution flow between nodes, making it easier to understand complex call chains at a glance.

**Clickable nodes**:
- Each node in the diagram is interactive: clicking one navigates to its position in the node hierarchy.

**Detailed node insights**:
- Upon selecting a node, Langfuse provides detailed information such as:
  - Inputs and outputs
  - Execution latency and associated cost

Furthermore, I can expose a public-facing interactive dashboard via a demo account.

#### Langfuse Demo

<ReactPlayer playing controls url='/vid/monitoring-and-observability/langfuse.mp4' />