Split ebpf metrics and energy estimation into two exporters #1088

rootfs · 2023-11-28T00:47:46Z

rootfs
Nov 28, 2023
Maintainer

Currently Kepler exports both eBPF metrics and energy estimation. The energy estimation is based on an ML model that uses the eBPF metrics as features.

This monolithic, sync architecture (i.e. all metrics are produced synchronously) had a number of advantages for deployment and upgrades in the early days. Now I think we are seeing scenarios that favor a decoupled, async architecture that produces eBPF metrics and energy metrics asynchronously. This architecture consists of two exporters: one for eBPF metrics and one for energy estimation. The eBPF metrics exporter exports only eBPF metrics, while the energy estimation exporter reads the eBPF metrics, applies ML models, and then exports the energy estimates.

The async architecture has the following advantages. Most notably, the energy estimation exporter can predict energy from offline eBPF metrics, whereas the sync exporter has to export both sets of metrics synchronously.

For CPUs that don't have a pretrained model, the sync architecture cannot give the best estimate, while the async exporter can wait until a model is available before predicting the energy usage.
The async architecture can support error analysis and A/B tests of different models, and it can support models that are not written in Go (as in kepler-estimator).
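
To make the "wait until the model is available" point concrete, here is a minimal sketch of an estimator that buffers eBPF samples and only produces estimates once a model has been loaded. Everything in it (`Sample`, `AsyncEstimator`, the feature names, the callable model) is hypothetical and for illustration only; it is not Kepler's code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical model type: maps a feature dict to an energy estimate.
Model = Callable[[Dict[str, float]], float]


@dataclass
class Sample:
    timestamp: float
    features: Dict[str, float]  # e.g. {"cpu_cycles": 10.0}; names are made up


class AsyncEstimator:
    """Buffers eBPF samples and estimates energy once a model exists."""

    def __init__(self) -> None:
        self.backlog: List[Sample] = []
        self.model: Optional[Model] = None

    def ingest(self, sample: Sample) -> None:
        # Samples may come from a live collector or from an offline dump.
        self.backlog.append(sample)

    def set_model(self, model: Model) -> None:
        self.model = model

    def drain(self) -> List[Tuple[float, float]]:
        # Without a model, keep the backlog and emit nothing yet.
        if self.model is None:
            return []
        estimates = [(s.timestamp, self.model(s.features)) for s in self.backlog]
        self.backlog.clear()
        return estimates
```

The same backlog could be replayed through two different models to compare them, which is one way to read the A/B testing point above.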

Comments are welcome.

@sustainable-computing-io/maintainer

marceloamaral · 2023-11-28T05:52:29Z

marceloamaral
Nov 28, 2023
Maintainer

@rootfs and @sunya-ch, we can decouple the metric collection and the estimation part within the Kepler exporter. This will involve dedicating one container to collecting all metrics and another to receiving these metrics and performing power estimation using power models.

To illustrate this architecture, see the diagram below:

We should keep both metric collection and estimation per node for scalability reasons. For example, when estimating process power, if each node has 1k processes in a cluster of 500 nodes, the overhead will be very high for a centralized estimator.
Also for performance and scalability reasons, each estimator should not get metrics from Prometheus; otherwise Prometheus will incur too much overhead collecting metrics. Instead, each estimator should talk to the metric collector on its own node.
We need to collect all metrics together, that is, the eBPF and the node energy metrics, to keep them synchronized. Collecting metrics in different containers would result in different polling intervals, making it necessary to exchange synchronization information or drop some data.
We will remove the Golang power estimation code and retain only the Python code.
The estimation container is designed to compute process power or node power (in cases where VMs lack real-time power metrics). As the default configuration, we will distribute the node power (collected or estimated) across processes using the ratio approach.
Currently, there is no use case for online power model training. Model training is conducted exclusively in bare-metal scenarios, where we opt for the ratio power model due to its simplicity and high accuracy.
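
As a rough illustration of the ratio approach mentioned above, the sketch below splits a node power reading across processes in proportion to a per-process usage counter. The function name, the choice of usage metric, and the handling of the all-zero case are assumptions made for this example, not Kepler's actual implementation.

```python
from typing import Dict


def distribute_node_power(
    node_power_watts: float, usage_by_pid: Dict[int, float]
) -> Dict[int, float]:
    """Split node power across processes by their share of total usage.

    usage_by_pid holds any per-process resource counter (for example CPU
    time in the last interval); which counter to use is an assumption here.
    """
    total = sum(usage_by_pid.values())
    if total == 0:
        # Assumption for this sketch: with no measured activity,
        # attribute nothing to any process.
        return {pid: 0.0 for pid in usage_by_pid}
    return {
        pid: node_power_watts * usage / total
        for pid, usage in usage_by_pid.items()
    }
```

For example, with 100 W of node power and two processes whose usage counters are 30 and 10, the processes are attributed 75 W and 25 W respectively.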

0 replies

sunya-ch · 2023-11-28T09:37:45Z

sunya-ch
Nov 28, 2023
Maintainer

As a supplementary comment on the gRPC proto: we can think of it as an alternative way to expose the metrics collected by the Kepler metric collector (in addition to the Prometheus format served at the /metrics endpoint).
It lets us query only the specific group of metrics we need, instead of fetching all metrics and parsing them on the client side.
The proto results will also be easier for other applications to consume, not only the estimator.

A query to the Prometheus metric server can similarly filter the response with a specific query set. However, as @marceloamaral mentioned above, connecting directly to the Kepler collector reduces the query load on the Prometheus metric server.
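
The idea of querying only a specific group of metrics can be sketched without any gRPC machinery. The in-process class below is purely illustrative: `MetricCollector`, `record`, `query`, and the group names are hypothetical and do not correspond to an existing Kepler proto or API.

```python
from typing import Dict, Iterable


class MetricCollector:
    """Toy stand-in for a collector that serves metrics by group."""

    def __init__(self) -> None:
        self._groups: Dict[str, Dict[str, float]] = {}

    def record(self, group: str, name: str, value: float) -> None:
        self._groups.setdefault(group, {})[name] = value

    def query(self, groups: Iterable[str]) -> Dict[str, Dict[str, float]]:
        # Return only the requested groups, so the client does not have to
        # fetch and parse every metric the collector knows about.
        return {g: dict(self._groups[g]) for g in groups if g in self._groups}
```

A real service would put a typed request and response around `query`, but the filtering step on the collector side is the part this comment is about.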

0 replies

rootfs · 2023-11-28T13:15:52Z

rootfs
Nov 28, 2023
Maintainer Author

@marceloamaral @sunya-ch building on top of the gRPC idea: the Kepler eBPF collector could use a telemetry library, and the Kepler power estimator would act as the metrics collector and export the metrics to Prometheus, wdyt?

2 replies

marceloamaral Nov 30, 2023
Maintainer

There's always a trade-off when considering distributed versus centralized architecture.

As with SmartWatts, a centralized architecture may encounter scalability issues. For instance, with 2k processes per node in a cluster of 500 nodes, a centralized deployment would need to process 1,000k requests.

Therefore, I believe it would be more advantageous to implement the energy estimator as a Daemonset.
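
The arithmetic behind this comparison is small enough to write out. The helper names below are made up for the example; the numbers are the ones quoted above.

```python
def centralized_load(processes_per_node: int, nodes: int) -> int:
    # A single central estimator handles every process in the cluster.
    return processes_per_node * nodes


def per_node_load(processes_per_node: int) -> int:
    # A per-node estimator (one pod per node) handles only local processes.
    return processes_per_node


central = centralized_load(2_000, 500)
local = per_node_load(2_000)
print(f"centralized: {central} process estimates, per node: {local}")
```

With 2k processes per node and 500 nodes, the centralized estimator handles 1,000,000 process estimates, while each per-node estimator handles 2,000.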

marceloamaral Nov 30, 2023
Maintainer

Also, we should collect the eBPF and node power metrics (RAPL) in the same Daemonset to avoid synchronization problems when collecting the metrics.

But all energy estimation (not collection) that requires an ML algorithm can be separated into a different Daemonset.
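
One way to picture why collecting both sources in the same place avoids synchronization problems: a single collection step reads both and stamps them with one timestamp. The reader callables below are placeholders supplied by the caller; this sketch does not touch real eBPF maps or RAPL counters.

```python
import time
from typing import Any, Callable, Dict


def collect_once(
    read_ebpf: Callable[[], Dict[str, float]],
    read_node_energy: Callable[[], float],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Read both sources in one step and tag them with a shared timestamp."""
    timestamp = clock()
    return {
        "timestamp": timestamp,
        "ebpf": read_ebpf(),
        "node_energy_joules": read_node_energy(),
    }
```

If the two readers ran in separate containers with their own polling loops, each would carry its own timestamp and the consumer would have to align or drop samples.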

SamYuan1990 · 2024-01-01T14:20:01Z

SamYuan1990
Jan 1, 2024
Maintainer

Does it mean we need to rewrite/move some features from Kepler to the Kepler model server?

1 reply

marceloamaral Jan 15, 2024
Maintainer

@SamYuan1990, it needs to be different because only one model server operates in the cluster.

Pooling all process estimations from every machine into the model server can create a bottleneck and scalability issues.

Instead, it's better to have one daemon for metric collection and another for estimating power in each node, then aggregate the results for containers and VMs and export to Prometheus as we do now.
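
The final aggregation step described here (per-process estimates rolled up to containers before export) might look like the following sketch. The function name, the pid-to-container mapping, and the "system" fallback bucket are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict


def aggregate_by_container(
    process_power: Dict[int, float], container_of: Dict[int, str]
) -> Dict[str, float]:
    """Sum per-process power estimates into per-container totals."""
    totals: Dict[str, float] = defaultdict(float)
    for pid, watts in process_power.items():
        # Processes without a known container go to a catch-all bucket
        # (the bucket name is an assumption for this example).
        totals[container_of.get(pid, "system")] += watts
    return dict(totals)
```

Because this runs on each node, only the aggregated container and VM values need to leave the node for Prometheus.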
