What Metrics Should We Trace, Monitor, and Analyze with eBPF? #110

lorebrada · 2025-06-26T08:19:41Z

lorebrada
Jun 26, 2025
Maintainer

Hey team 👋 How are you??
@LorenzoTettamanti
@PranavVerma-droid
@siddh34

As we continue building out the observability layer of CortexFlow and its eBPF-based system, I wanted to open this discussion to brainstorm what kinds of metrics, traces, and events we should capture, monitor, and surface, both at runtime and for longer-term analysis. I personally think we're in a great position to go beyond traditional metrics.

As you all know, eBPF allows us to introspect kernel-level and application-level behavior with very low overhead, so we can be creative and ambitious with what we collect.

I’ve also been thinking about reimagining a monitoring system that goes far beyond Prometheus in terms of efficiency and data persistence, but I’ll open a dedicated discussion on that later today.

Categories of Metrics to Consider

1. Network Metrics

Per-process/namespace packet in/out (bytes, packets)
RTT / latency distributions per connection
TCP retransmissions, dropped packets
DNS resolution latency and errors
Socket lifecycle traces (open, bind, close)
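
For the RTT / latency distribution item, one common way to keep the data small is to count samples in power-of-two buckets instead of storing every value. Here is a minimal user-space sketch of that bucketing, assuming latencies arrive as integer microseconds; the function names are hypothetical and not part of CortexFlow.

```python
from collections import Counter


def log2_bucket(value_us: int) -> int:
    """Return the index of the power-of-two bucket that holds value_us.

    Bucket 0 holds 0 and 1; bucket k (k >= 1) holds [2**k, 2**(k + 1) - 1].
    """
    if value_us < 0:
        raise ValueError("latency cannot be negative")
    return max(value_us.bit_length() - 1, 0)


def latency_histogram(samples_us):
    """Count how many samples fall into each log2 bucket."""
    return Counter(log2_bucket(sample) for sample in samples_us)


def bucket_range(index: int):
    """Inclusive (low, high) bounds of a bucket, in the unit of the samples."""
    low = 0 if index == 0 else 1 << index
    return low, (1 << (index + 1)) - 1


if __name__ == "__main__":
    histogram = latency_histogram([1, 2, 3, 900, 1000])
    for index in sorted(histogram):
        low, high = bucket_range(index)
        print(f"{low:>6} .. {high:<6} us : {histogram[index]}")
```

The trade-off is resolution: a bucket only says that a sample fell somewhere between its bounds, which is usually enough to spot tail latency but not to compute exact percentiles.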

2. System Metrics

Syscall frequency and latency per service
File I/O (read/write ops, bytes per file or device)
CPU cycles per process/container
Context switch frequency and duration
Page faults, memory access patterns
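
For syscall frequency and latency, the usual pattern is to record a timestamp when a thread enters a syscall and compute the difference when it exits. The sketch below replays that pairing logic in user space on a list of already-collected events; the event tuple layout is an assumption made for illustration, not an existing CortexFlow format.

```python
from collections import defaultdict


def syscall_latencies(events):
    """Pair enter/exit events per thread and group latencies by syscall.

    Each event is a (timestamp_ns, tid, syscall, phase) tuple where phase is
    "enter" or "exit". A thread runs one syscall at a time, so the thread id
    is enough to key the table of pending start timestamps.
    """
    pending = {}  # tid -> (syscall, start timestamp)
    latencies = defaultdict(list)
    for timestamp_ns, tid, syscall, phase in events:
        if phase == "enter":
            pending[tid] = (syscall, timestamp_ns)
        elif phase == "exit":
            start = pending.pop(tid, None)
            # Skip exits whose enter was never seen (tracing began mid-call).
            if start is not None and start[0] == syscall:
                latencies[syscall].append(timestamp_ns - start[1])
    return dict(latencies)


if __name__ == "__main__":
    sample = [
        (100, 1, "read", "enter"),
        (120, 2, "write", "enter"),
        (150, 1, "read", "exit"),
        (200, 2, "write", "exit"),
    ]
    print(syscall_latencies(sample))
```

The length of each list gives the frequency, and the values give the latency samples for that syscall.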

3. Security & Policy Events

Unauthorized syscall attempts (seccomp and capability violations)
Unexpected privilege escalations
Process execution chains (execve tree, container escapes)
Network ACL or policy violation attempts
Kernel hooks for audit logging

4. Service Mesh & Control Plane Metrics

Sidecar-less routing tracepoints (per service latency, hops)
Flow-level tracing of service-to-service communication
Control plane heartbeat and config drift detection
TLS handshake timings (especially for mTLS-enabled flows)

5. Custom Application Hooks

USDT probes from user-space apps
Custom tags for business logic metrics (e.g. queue wait time)
Instrumented events injected into trace pipelines

Goals for Metric Collection

Low-overhead observability (keep overhead under 1%)
Real-time and historical aggregation (e.g. via Prometheus or OTEL)
Multi-layer correlation (e.g. syscall latency ↔ service latency)
Platform-agnostic design (support Kubernetes, Docker, Nomad, etc.)

Questions for the Team

What metrics do we need from day one for debugging, security, and SLOs?
Which metrics will be most valuable for edge vs cloud workloads?
Should we create a standard CortexFlow metrics spec file format?
What kind of dashboards, alerts, or analysis tools should we build on top?

Do not hesitate to drop your thoughts, metrics wishlists, or existing tools you like below!
We’ll consolidate this discussion into a design proposal once we’ve got enough input.

Looking forward to your ideas
— @lorebrada

LorenzoTettamanti · 2025-07-14T17:15:01Z

LorenzoTettamanti
Jul 14, 2025
Maintainer

Hi everyone @lorebrada @siddh34, hope you guys had a great day.

Finally, we can move on with the development process and introduce a great new set of monitoring functions. Some months ago I opened issue #78 and, as I told @siddh34 a couple of days ago, it will be turned into a "Permanent" issue to organize and introduce the new development processes.

What metrics can we collect?

eBPF gives us the power to access and collect kernel insights, especially around latency and network packet processing. I think we can start by collecting those metrics (packet volume, accepted packets, dropped packets, etc.) from all the containers inside the cluster. For this first step, the technical difficulties are mainly about extracting the data from the kernel, because of the large number of nested kernel structures, especially in the net_device and sk_buff structs. I expect more difficulties once we start aggregating all the data in a dashboard. I was thinking about using protocol buffers to serialize and deserialize data at high speed, and Kafka for the streaming process.

What do you think about that?
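
To make the serialization idea concrete, here is a minimal sketch of an encode/decode round trip for one packet-counter record. It uses Python's `struct` module as a stand-in for protocol buffers, only to show the shape of a compact binary record; the field names and layout are assumptions, not an agreed CortexFlow schema.

```python
import struct

# Hypothetical fixed layout: network namespace inode (u32), followed by the
# total, accepted and dropped packet counters (u64 each). Little-endian with
# no padding, so every record is exactly 28 bytes.
RECORD = struct.Struct("<IQQQ")


def encode(netns_inode, packets, accepted, dropped):
    """Pack one counter record into its fixed-size binary form."""
    return RECORD.pack(netns_inode, packets, accepted, dropped)


def decode(payload):
    """Unpack a binary record produced by encode() back into a dict."""
    netns_inode, packets, accepted, dropped = RECORD.unpack(payload)
    return {
        "netns_inode": netns_inode,
        "packets": packets,
        "accepted": accepted,
        "dropped": dropped,
    }


if __name__ == "__main__":
    payload = encode(4026531840, 1000, 990, 10)
    print(len(payload), decode(payload))
```

A real protobuf schema would add field tags and variable-length integers, which makes it easier to evolve the record later at the cost of a generated-code dependency.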

1 reply

siddh34 Jul 14, 2025
Collaborator

Sounds good

siddh34 · 2025-07-14T18:17:56Z

siddh34
Jul 14, 2025
Collaborator

Sorry for the late reply @LorenzoTettamanti. I meant to reply yesterday but wasn't able to.

On the question of which metrics will be most valuable for edge vs cloud workloads:

I would suggest that network reliability/latency events and security events will be the most important.

For alerts, we can integrate with Slack or maybe an MS Teams bot.

For custom dashboards, we can even try sending logs to a platform such as New Relic.

5 replies

LorenzoTettamanti Jul 14, 2025
Maintainer

Yes @siddh34, we can start with all the networking events such as latency, throughput, dropped packets, and bandwidth usage, ideally for both ingress and egress.

Also, I think we need to pay extra attention to the context, because all the data we extract can be hard to interpret without proper cluster context. Ideally, I think we should use an identifier (a generated hash, a PID, a UPID, etc.) that helps us aggregate in user space all the data coming from kernel space.
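
As a starting point for the identifier idea, here is a minimal sketch of a generated hash. It assumes we can obtain the network namespace inode, the PID, and the process start time for each event; since the kernel reuses PIDs, the start time is mixed in so that two processes that happen to get the same PID end up with different identifiers. The function name and the choice of fields are hypothetical.

```python
import hashlib


def workload_id(netns_inode: int, pid: int, start_time_ticks: int) -> str:
    """Derive a short, stable identifier for one process in one namespace.

    The same three inputs always produce the same 16-character hex string,
    so kernel-side events and user-space metadata can be joined on it.
    """
    raw = f"{netns_inode}:{pid}:{start_time_ticks}".encode()
    return hashlib.blake2b(raw, digest_size=8).hexdigest()


if __name__ == "__main__":
    print(workload_id(4026531840, 1234, 567890))
```

An 8-byte digest keeps the key small enough to attach to every event; a longer digest would lower the collision risk if the number of tracked processes grows very large.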

I'm trying to figure out what possible bottlenecks we might encounter. Does anyone have any ideas, insights, or an example that we can discuss?

Networking aside, do you think it's possible to also collect GPU usage metrics? Nowadays, a lot of work is done by GPUs, so having the ability to also track GPU events would be amazing imo.

I also agree with the idea of creating an alert manager linked with Slack or MS Teams

siddh34 Jul 15, 2025
Collaborator

For now I think the biggest bottleneck is the unnecessary data we receive from the kernel, which makes everything slow.

A good idea is to aggregate the data and decide what's worth printing and what isn't.
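
A minimal sketch of that aggregate-then-filter step, assuming events arrive as (network namespace inode, dropped packet count) pairs; the event shape and the threshold value are assumptions chosen for illustration.

```python
from collections import Counter


def aggregate_drops(events, min_drops=10):
    """Sum dropped packets per namespace and keep only the noisy ones.

    events is an iterable of (netns_inode, dropped_packets) pairs.
    Namespaces whose total stays below min_drops are left out of the result.
    """
    totals = Counter()
    for netns_inode, dropped in events:
        totals[netns_inode] += dropped
    return {ns: count for ns, count in totals.items() if count >= min_drops}


if __name__ == "__main__":
    sample = [(4026531840, 4), (4026532001, 1), (4026531840, 9)]
    print(aggregate_drops(sample))
```

The same summing can also be done inside the kernel with a BPF hash map keyed by namespace, so that only the totals cross into user space instead of one event per packet.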

I need to research GPU metrics, although I only have an integrated Intel GPU. I was reading this yesterday: https://eunomia.dev/tutorials/47-cuda-events/

Sure, I will let you know @LorenzoTettamanti

LorenzoTettamanti Jul 15, 2025
Maintainer

Sure @siddh34, that's a great idea! Initially, we can also collect the network namespace and use that for preliminary data separation.
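
For reference, on Linux the network namespace of a process can be read from user space through the `/proc/<pid>/ns/net` symlink, whose target looks like `net:[4026531840]`. A minimal sketch of extracting the inode number from it; the helper names are hypothetical.

```python
import os
import re

_NS_LINK = re.compile(r"^net:\[(\d+)\]$")


def parse_netns_link(target: str) -> int:
    """Extract the inode number from a link target such as 'net:[4026531840]'."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a network namespace link: {target!r}")
    return int(match.group(1))


def netns_inode_of(pid) -> int:
    """Return the network namespace inode of a process.

    Linux only; reading another process's namespace links may require
    elevated privileges.
    """
    return parse_netns_link(os.readlink(f"/proc/{pid}/ns/net"))


if __name__ == "__main__":
    print(netns_inode_of("self"))
```

Processes that share a network namespace report the same inode, which is what makes it usable as a first grouping key.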

I'm reviewing the article and it seems very useful. Starting this Friday, I’ll be able to work full-time on implementing a similar solution.
I've already started working on issue #105, but I need a bit more time to come up with a great solution

I’ll keep you all posted here if I come up with any great insights.

LorenzoTettamanti Jul 18, 2025
Maintainer

Hi @siddh34 , I've opened a couple of new core issues (#117 and #119).

I've already pushed a bunch of commits related to issue #117 to the feature/ebpf-core branch; these commits have already been tested with Chaos Mesh. If you're free, we can start implementing all the other metrics together 🚀🚀.

siddh34 Jul 18, 2025
Collaborator

Sure @LorenzoTettamanti. I will take a look and start with it! 🚀🚀
