Hello everyone,
As we consider enhancing the observability layer of CortexFlow with eBPF-driven insights, I’d like to start a discussion around why and how we might build a better Prometheus-like monitoring system, tailored for our requirements.
Current Limitations of Prometheus
1. High-cardinality & label explosion
Labels like user_id, query_id or request_id can lead to millions of unique time series, which drastically degrades performance. For example, Prometheus scraping pg_stat_statements with queryid labels can produce thousands of series—one user noted ~5000 per instance ([betterstack.com][1], [github.com][2]).
Teams have reported OOMs, query slowdowns, and storage pressure caused by these cardinality explosions ([betterstack.com][3]).
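To make the series arithmetic concrete, here is a minimal sketch using the Python prometheus_client library (the metric and label names are made up for illustration): every distinct combination of label values becomes its own time series, so a single counter carrying a per-query identifier turns into one series per identifier.

```python
# Minimal sketch: how an unbounded label turns one metric into many series.
# Requires: pip install prometheus_client
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()

# Hypothetical metric: a query counter labelled with a per-query identifier.
queries = Counter(
    "app_queries_total",
    "Queries observed, labelled by query id (deliberately high-cardinality).",
    ["queryid"],
    registry=registry,
)

# Simulate 5000 distinct query ids, roughly what a single scraped Postgres
# instance exposing pg_stat_statements can produce.
for i in range(5000):
    queries.labels(queryid=f"q{i}").inc()

# Each distinct label value is exported as a separate series.
exposition = generate_latest(registry).decode()
series = [line for line in exposition.splitlines()
          if line.startswith("app_queries_total{")]
print(f"unique series for one counter: {len(series)}")  # -> 5000
```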
2. Resource and performance strain
In IoT scenarios with millions of unique IDs, Prometheus struggles despite powerful hardware — issuing warnings like "Storage has entered rushed mode" and eventually skipping scrapes or rules ([groups.google.com][4]).
Query latency can surge drastically when dealing with hundreds of thousands to millions of series (e.g., querying 10M series can take 15 minutes) ([read.srepath.com][5]).
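Whatever we end up building, we will want to measure cardinality before and after. As a starting point, the sketch below asks a running Prometheus server for its head-block statistics via the /api/v1/status/tsdb endpoint; the endpoint exists in current Prometheus releases, but treat the exact field names and the localhost URL as assumptions to verify against the version in use.

```python
# Sketch: report head-block cardinality from a running Prometheus server.
# Assumes a reachable server at PROM_URL; requires the 'requests' package.
import requests

PROM_URL = "http://localhost:9090"  # placeholder address

def head_cardinality(base_url: str, top: int = 10) -> None:
    resp = requests.get(f"{base_url}/api/v1/status/tsdb", timeout=10)
    resp.raise_for_status()
    data = resp.json()["data"]

    # Total number of in-memory (head) series currently tracked.
    print("head series:", data["headStats"]["numSeries"])

    # Metric names contributing the most series: the usual suspects for
    # label explosion (per-request ids, per-query ids, ...).
    print(f"top {top} metrics by series count:")
    for entry in data["seriesCountByMetricName"][:top]:
        print(f"  {entry['name']}: {entry['value']}")

if __name__ == "__main__":
    head_cardinality(PROM_URL)
```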
3. Storage and retention inflexibility
Prometheus’ TSDB is optimized for recent data, and struggles with long-term retention. Users often resort to managing multiple instances or layered systems (e.g., Thanos, Mimir) to compensate ([reddit.com][6]).
Remote file systems like NFS/EFS aren't well supported; performance can degrade and data corruption becomes a risk ([reddit.com][7]).
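A related design question is whether downsampling should be a first-class feature of the store rather than a bolt-on. Purely as a conceptual sketch (not how Prometheus, Thanos, or Mimir actually implement it), the snippet below rolls raw samples into fixed-size buckets that keep only count/sum/min/max, which is roughly the shape of data a warm or cold tier would retain.

```python
# Conceptual sketch: roll raw samples up into coarser buckets for a
# warm/cold retention tier. Bucket size and aggregates are illustrative.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Rollup:
    start: int       # bucket start (unix seconds)
    count: int
    total: float
    minimum: float
    maximum: float

def downsample(samples: Iterable[tuple[int, float]],
               bucket_seconds: int = 300) -> list[Rollup]:
    """Aggregate (timestamp, value) pairs into fixed-size buckets, keeping
    just enough state (count/sum/min/max) to answer avg/min/max queries
    without the raw points."""
    buckets: dict[int, Rollup] = {}
    for ts, value in samples:
        start = ts - (ts % bucket_seconds)
        b = buckets.get(start)
        if b is None:
            buckets[start] = Rollup(start, 1, value, value, value)
        else:
            b.count += 1
            b.total += value
            b.minimum = min(b.minimum, value)
            b.maximum = max(b.maximum, value)
    return [buckets[k] for k in sorted(buckets)]

# Example: one hour of 15-second scrapes collapses from 240 raw points
# to 12 five-minute rollups.
raw = [(i * 15, float(i % 7)) for i in range(240)]
print(len(downsample(raw)))  # -> 12
```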
Some Core Questions for CortexFlow
- How can we design for high-cardinality use cases?
  - Should we automatically sample or down-aggregate metrics with large cardinality?
  - Can we separate trace-like identifiers from aggregated metrics? (See the first sketch after this list.)
- What retention models do we need?
  - Do we need multi-tier retention (e.g., hot, warm, cold) within the same system?
  - How long must raw data vs. aggregated metrics be stored?
- How tight should the integration with eBPF be?
  - Can we embed tag enrichment or event correlation at capture time to avoid post-processing? (See the second sketch after this list.)
  - Should metrics flow directly to CortexFlow's data paths (e.g., Kafka → TSDB) instead of a Prometheus scrape?
- What query/alerting model do we want?
  - Do we want to develop a PromQL-compatible engine, or something more tailored to eBPF insights?
  - How critical is interactive querying vs. long-term aggregation?
- Can we maintain low overhead and reliability?
  - Can we keep collection overhead under 1%?
  - How do we ensure consistency, especially in distributed or ephemeral environments?
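On the cardinality questions above, one pattern worth prototyping is to keep the stored series low-cardinality and attach only a small, bounded sample of trace-like identifiers to each one, similar in spirit to OpenMetrics exemplars. The sketch below is hypothetical (class and field names are invented) and uses reservoir sampling to cap how many identifiers are kept per series.

```python
# Sketch: aggregate by low-cardinality labels only, while retaining a
# bounded sample of high-cardinality identifiers (request/query/trace ids)
# for drill-down, instead of encoding them as labels.
import random
from collections import defaultdict

class ExemplarCounter:
    """Counter keyed by a small label tuple; each key keeps at most
    `max_exemplars` identifiers via reservoir sampling."""

    def __init__(self, max_exemplars: int = 5):
        self.max_exemplars = max_exemplars
        self.counts: dict[tuple, float] = defaultdict(float)
        self.seen: dict[tuple, int] = defaultdict(int)
        self.exemplars: dict[tuple, list[str]] = defaultdict(list)

    def inc(self, labels: tuple, exemplar_id: str, amount: float = 1.0):
        self.counts[labels] += amount
        self.seen[labels] += 1
        bucket = self.exemplars[labels]
        if len(bucket) < self.max_exemplars:
            bucket.append(exemplar_id)
        else:
            # Reservoir sampling: keep a uniform sample of the ids seen so far.
            j = random.randrange(self.seen[labels])
            if j < self.max_exemplars:
                bucket[j] = exemplar_id

# Usage: the series key stays small ("endpoint", "status"), while request
# ids survive only as a handful of exemplars per series.
c = ExemplarCounter()
for i in range(10_000):
    c.inc(("/checkout", "200"), exemplar_id=f"req-{i}")
print(len(c.counts))                        # -> 1 series
print(c.counts[("/checkout", "200")])       # -> 10000.0
print(c.exemplars[("/checkout", "200")])    # -> at most 5 exemplar ids
```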
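And on capture-time enrichment: the sketch below imagines an agent-side step that maps raw eBPF-style events (faked here as plain dicts keyed by cgroup id) to workload metadata and pre-aggregates them before anything is shipped to Kafka or a TSDB. Everything here, from the CGROUP_INDEX lookup table to the event shapes, is a hypothetical placeholder rather than an existing CortexFlow API.

```python
# Hypothetical sketch: enrich raw eBPF-style events with workload metadata
# at capture time, then pre-aggregate before shipping (e.g., to Kafka/TSDB).
from collections import defaultdict

# Stand-in for a cgroup-id -> pod metadata index maintained by the agent;
# a real agent would populate this from the container runtime / Kubernetes API.
CGROUP_INDEX = {
    1234: {"namespace": "payments", "pod": "checkout-7d9f", "node": "edge-01"},
    5678: {"namespace": "search", "pod": "indexer-0", "node": "edge-02"},
}

def enrich(event: dict) -> dict | None:
    """Attach namespace/pod/node tags to a raw event; drop events that
    cannot be attributed instead of emitting unlabeled series."""
    meta = CGROUP_INDEX.get(event["cgroup_id"])
    return None if meta is None else {**event, **meta}

def aggregate(events: list[dict]) -> dict[tuple, int]:
    """Pre-aggregate enriched events into low-cardinality counters
    keyed by (namespace, pod, event type)."""
    counters: dict[tuple, int] = defaultdict(int)
    for raw in events:
        e = enrich(raw)
        if e is not None:
            counters[(e["namespace"], e["pod"], e["type"])] += 1
    return counters

# Example raw events, as a ring-buffer consumer might decode them.
raw_events = [
    {"cgroup_id": 1234, "type": "tcp_retransmit"},
    {"cgroup_id": 1234, "type": "tcp_retransmit"},
    {"cgroup_id": 5678, "type": "tcp_connect"},
    {"cgroup_id": 9999, "type": "tcp_connect"},  # unknown cgroup: dropped
]
print(dict(aggregate(raw_events)))
# -> {('payments', 'checkout-7d9f', 'tcp_retransmit'): 2,
#     ('search', 'indexer-0', 'tcp_connect'): 1}
```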
I'd love to hear your thoughts:
- Have you encountered cardinality or storage pains with Prometheus at scale?
- What must-have features or retention policies would make a system a better fit for eBPF-driven workloads?
- Are there existing tools or architectures (e.g., Cortex, Thanos, Mimir, VictoriaMetrics) we should learn from?
Let’s discuss, iterate, and form a design proposal with clear goals for efficiency, scale, and developer experience.
Final Take
P.S.: This discussion is not meant to be an exhaustive list of Prometheus' limitations, nor a final judgment. Prometheus has been a foundational tool in modern observability, and much of what we know today builds on its success.
That said, our unique context with CortexFlow, eBPF-native metrics, and hybrid edge-cloud deployments presents an opportunity to rethink what a next-generation monitoring system could look like. This thread is meant to raise questions, explore pain points, and collect perspectives from anyone who’s felt the limits of existing tools.
Feel free to challenge, extend, or reshape the points above. The more edge cases, horror stories, or "aha!" insights we gather here, the better the eventual design will be.
— @lorebrada