How to tune Mimir for an expensive tenant #11201
-
Also a follow-up question, which is a better signal to look at for how long a query would take?
-
@pracucci Tagging since you're so helpful and knowledgeable :)
-
Regarding the query, I suggest the following course of investigation:
-
Hello Mimir community,

Recently, we have been trying to tackle query performance issues with a particular metric, `inbound_latency`, on a particular tenant, `hat`. Here are the details of our situation:
- Query: `histogram_count(sum by (http_status) (delta(inbound_latency{}[2m])))`. A time window of 6 hours takes about 10-15 sec to complete, and a time window of 12 hours takes 20-30 sec to complete. Anything longer just times out in Grafana.
- `inbound_latency` has 14 labels associated with it. Its cardinality is on the order of 10^13, so in the trillions. The way I'm defining cardinality here is going through each label and multiplying the number of unique values all together.
- `/prometheus/api/v1/cardinality/active_series` on this metric returns 135671, which is the largest value of all of our metrics.
- `ingester` has 52 containers, each with 8 CPU and 14 GB memory.
- `store-gateway` has 10 containers, each with 4 CPU and 4 GB memory.

Problem
The issue I'm running into is that our InfluxDB equivalent for this data (same tags for the labels we have in Mimir) can query 30 days into the past, whereas we can barely hit a single day in Mimir. One thing we tried was to increase the shard size for `ingester` and `store-gateway`, but that seemingly had no effect on query performance. This was the change we attempted:

The thinking was that, maybe originally, having just 3 instances each of `ingester` and `store-gateway` was not sufficient to serve our most expensive tenant, so we should increase the number of instances serving these requests. But because we observed no changes, I'm wondering if we are going about this incorrectly.

How can we tune our Mimir deployment to better serve this particular tenant? Please let me know if I can provide any more details about the scenario.