Replies: 2 comments 6 replies
-
Hi everyone @lorebrada @siddh34, hope you had a great day. We can finally move on with the development process and introduce a new set of monitoring functions. Some months ago I opened issue #78 and, as I said to @siddh34 a couple of days ago, it will be moved to a "Permanent" issue to organize and introduce the new development processes.

What metrics can we collect? eBPF gives us the power to access and collect kernel insights, especially when it comes to latency and network packet processing. I think we can start by collecting packet-level metrics (packet volume, accepted packets, dropped packets, etc.) from all the internal cluster containers. For this first step, the technical difficulties are mainly related to extracting the data from the kernel, because of the large number of nested kernel structures, especially in the net_device and sk_buff structs. I expect more difficulties once we start aggregating all the data in a dashboard.

I was thinking about using protocol buffers to serialize and deserialize data at high speed, and Kafka for the streaming process. What do you think about that?
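To make the first step more concrete, here is a rough user-space sketch of how per-container packet counters could be aggregated once the events have been pulled out of the kernel. It is only an illustration: `PacketEvent`, `PacketStats`, and `aggregate` are hypothetical names, not part of CortexFlow, and the event fields are an assumption about what an eBPF probe might export.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PacketEvent:
    """One packet observation, as a probe might report it (assumed layout)."""
    container_id: str  # hypothetical identifier for the source container
    size_bytes: int
    dropped: bool


@dataclass
class PacketStats:
    """Running counters for a single container."""
    packets: int = 0
    total_bytes: int = 0
    accepted: int = 0
    dropped: int = 0

    def drop_rate(self) -> float:
        # Fraction of observed packets that were dropped; 0.0 if none seen.
        return self.dropped / self.packets if self.packets else 0.0


def aggregate(events):
    """Fold a stream of packet events into per-container counters."""
    stats = defaultdict(PacketStats)
    for ev in events:
        s = stats[ev.container_id]
        s.packets += 1
        s.total_bytes += ev.size_bytes
        if ev.dropped:
            s.dropped += 1
        else:
            s.accepted += 1
    return dict(stats)
```

Counters like these are small and flat, so they should map cleanly onto a protocol buffers message before being streamed, but that mapping is left open here.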
-
Sorry for the late reply @LorenzoTettamanti, I was going to reply yesterday but wasn't able to post here.

On the question of which metrics will be most valuable for edge vs. cloud workloads: I would suggest that network reliability/latency events and security events will be the most important.

For alerts, we could integrate with Slack or maybe an MS Teams bot.

For a custom dashboard, we could even try sending logs to a platform such as New Relic.
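As a rough idea of what the Slack side could look like, the sketch below turns a packet drop rate into the JSON body that a Slack incoming webhook accepts (a JSON object with a `text` field). The function name `build_slack_alert`, the 5% default threshold, and the message wording are all assumptions for illustration; actually sending the payload is left out.

```python
import json


def build_slack_alert(container_id, drop_rate, threshold=0.05):
    """Return a Slack webhook JSON body, or None if no alert is needed.

    The threshold default is an arbitrary placeholder, not a recommendation.
    """
    if drop_rate <= threshold:
        return None
    text = (
        f"High packet drop rate on {container_id}: "
        f"{drop_rate:.1%} (threshold {threshold:.1%})"
    )
    # Slack incoming webhooks accept a JSON object with a "text" field.
    return json.dumps({"text": text})
```

The returned string would be sent as the body of an HTTP POST to the webhook URL configured for the target channel.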
-
Hey team 👋 How are you??
@LorenzoTettamanti
@PranavVerma-droid
@siddh34
As we continue building out the observability layer of CortexFlow and its eBPF-based system, I wanted to open this discussion to brainstorm what kinds of metrics, traces, and events we should capture, monitor, and surface, both at runtime and for longer-term analysis. I personally think we're in a great position to go beyond traditional metrics.
As you all know, eBPF allows us to introspect kernel-level and application-level behavior with very low overhead, so we can be creative and ambitious with what we collect.
I’ve also been thinking about reimagining a monitoring system that goes far beyond Prometheus in terms of efficiency and data persistence, but I’ll open a dedicated discussion on that later today.
Categories of Metrics to Consider
1. Network Metrics
2. System Metrics
3. Security & Policy Events
4. Service Mesh & Control Plane Metrics
5. Custom Application Hooks
Goals for Metric Collection
Questions for the Team
metrics spec
file format?
Do not hesitate to drop your thoughts, metrics wishlists, or existing tools you like below!
We’ll consolidate this discussion into a design proposal once we’ve got enough input.
Looking forward to your ideas
— @lorebrada