How to tune Mimir for an expensive tenant #11201

ericlee123 · 2025-04-12T05:22:16Z

Hello Mimir community,

Recently, we have been trying to tackle query performance issues with a particular metric, inbound_latency, on a particular tenant, hat.

Here are the details of our situation:

- The query we are running is: histogram_count(sum by (http_status) (delta(inbound_latency{}[2m]))). A time window of 6 hours takes about 10-15 sec to complete, and a time window of 12 hours takes 20-30 sec. Anything longer times out in Grafana.
- inbound_latency has 14 labels. Its cardinality is on the order of 10^13, so in the trillions. I'm defining cardinality here as the product of the number of unique values of each label.
  - There are 4 labels with very high numbers of unique values: 740, 442, 140, and 130.
  - Calling /prometheus/api/v1/cardinality/active_series on this metric returns 135671, which is the largest value of all of our metrics.
- Our current Mimir deployment is as follows (mentioning only the relevant parts):
  - ingester has 52 containers, each with 8 CPU and 14 GB memory
  - store-gateway has 10 containers, each with 4 CPU and 4 GB memory
    - These allocations are equal to or greater than the values recommended in the capacity planning doc.
  - For configuration parameters:
    - All forms of caching are enabled.

# this excerpt doesn't include everything, just what I thought is relevant to this discussion
ingester:
  ring:
    replication_factor: 3
    zone_awareness_enabled: false

store_gateway:
  sharding_ring:
    zone_awareness_enabled: false

limits:
  ingestion_rate: 100000
  ingestion_burst_size: 2000000
  max_label_names_per_series: 60
  ingestion_tenant_shard_size: 3
  max_global_series_per_user: 0
  native_histograms_ingestion_enabled: true
  out_of_order_time_window: 30d
  results_cache_ttl: 1w
  results_cache_ttl_for_out_of_order_time_window: 5m
  cardinality_analysis_enabled: true
  store_gateway_tenant_shard_size: 3
  compactor_blocks_retention_period: 3y
  compactor_split_and_merge_shards: 2 # use even number
  compactor_split_groups: 6 # use even number

Problem

The issue I'm running into is that our InfluxDB equivalent for this data (same tags for the labels we have in Mimir) can query 30 days into the past, whereas we can barely hit a single day in Mimir. One thing we tried was to increase the shard size for ingester and store-gateway, but that seemingly had no effect on query performance. This was the change we attempted:

overrides:
  hat:
    ingestion_tenant_shard_size: 9
    store_gateway_tenant_shard_size: 9

The thinking was that 3 instances each of ingester and store-gateway might not be enough to serve our most expensive tenant, so we should increase the number of instances serving its requests. But because we observed no change, I'm wondering if we are going about this incorrectly.

How can we tune our Mimir deployment to better serve this particular tenant? Please let me know if I can provide any more details about the scenario.

ericlee123 (Author) · 2025-04-12T05:24:39Z

Also, a follow-up question: which is a better signal for how long a query will take?

- My definition of cardinality, where I multiplied together the number of unique values for each label
- The number of active series returned by this endpoint: /prometheus/api/v1/cardinality/active_series

1 reply

pracucci Apr 15, 2025
Maintainer

> Also, a follow-up question: which is a better signal for how long a query will take?
>
> - My definition of cardinality, where I multiplied together the number of unique values for each label
> - The number of active series returned by this endpoint: /prometheus/api/v1/cardinality/active_series

The right definition of cardinality is the number of series returned by the /prometheus/api/v1/cardinality/active_series endpoint. Your actual cardinality is about 135K series, which is relatively small for Mimir.
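
To illustrate the gap between the two definitions, here is a minimal Python sketch. The label names and values are made up for illustration and are not taken from the tenant above. Multiplying the unique values per label only gives a theoretical upper bound; the series count is the number of distinct label combinations that actually occur.

```python
from math import prod

# Hypothetical data: each dict is the label set of one series.
series = [
    {"http_status": "200", "region": "us", "host": "a"},
    {"http_status": "200", "region": "us", "host": "b"},
    {"http_status": "500", "region": "eu", "host": "c"},
    {"http_status": "200", "region": "eu", "host": "c"},
]


def upper_bound(series):
    """Product of the number of unique values per label (theoretical maximum)."""
    values = {}
    for labels in series:
        for name, value in labels.items():
            values.setdefault(name, set()).add(value)
    return prod(len(v) for v in values.values())


def actual_series(series):
    """Number of distinct label combinations that actually exist."""
    return len({tuple(sorted(labels.items())) for labels in series})


bound = upper_bound(series)      # 2 statuses * 2 regions * 3 hosts
actual = actual_series(series)   # only the combinations that were observed
print(bound, actual)
```

Because label values tend to be correlated (a given host usually lives in one region, for example), the product can overstate the real series count by many orders of magnitude.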

ericlee123 (Author) · 2025-04-14T19:33:58Z

@pracucci Tagging since you're so helpful and knowledgeable :)

0 replies

pracucci (Maintainer) · 2025-04-15T07:51:48Z

The query histogram_count(sum by (http_status) (delta(inbound_latency{}[2m]))) doesn't look like a complex query. It's also a shardable query (assuming you have configured query sharding in Mimir and you have enough capacity on the read path).

I suggest the following course of investigation:

- Focus on the query over the "last 12h" as a first step. The last 12h are queried only from ingesters, so that excludes store-gateway and object storage. It makes it easier to reason about.
- If you're tracing Mimir, check where the time is spent
- Ensure no Mimir component is overloaded / at capacity (especially in terms of CPU)
- Look at CPU utilization across all components when the query runs to find out where the CPU bottleneck is
- Ensure all your caches are properly sized and are not evicting entries at a high rate (I assume you configured memcached)
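
To make the first step repeatable outside Grafana, the query can be timed directly against the Prometheus-compatible range query API. The following is a minimal sketch, not a definitive tool: the base URL is a placeholder, the 60s step is an arbitrary choice, and it assumes the API is served under the same /prometheus prefix as the cardinality endpoint mentioned earlier, with the tenant passed in the X-Scope-OrgID header.

```python
import time
import urllib.parse
import urllib.request

QUERY = "histogram_count(sum by (http_status) (delta(inbound_latency{}[2m])))"


def build_request(base_url, tenant, hours, step="60s", now=None):
    """Build a query_range request covering the last `hours` hours."""
    now = time.time() if now is None else now
    params = urllib.parse.urlencode({
        "query": QUERY,
        "start": f"{now - hours * 3600:.0f}",
        "end": f"{now:.0f}",
        "step": step,  # assumed step; Grafana picks its own based on panel width
    })
    url = f"{base_url}/prometheus/api/v1/query_range?{params}"
    return urllib.request.Request(url, headers={"X-Scope-OrgID": tenant})


def time_query(base_url, tenant, hours):
    """Return the wall-clock seconds taken to fetch the full response."""
    req = build_request(base_url, tenant, hours)
    started = time.monotonic()
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()
    return time.monotonic() - started
```

Calling time_query("http://mimir.example:8080", "hat", hours) for hours in 1, 3, 6, and 12 gives a rough latency curve to compare against traces and CPU graphs. Since the results cache is enabled in the configuration above, repeated runs over the same range may return faster than the first one.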

1 reply

ericlee123 Apr 16, 2025
Author

> Focus on the query over the "last 12h" as a first step. The last 12h are queried only from ingesters, so that excludes store-gateway and object storage. It makes it easier to reason about.

This statement lines up with my understanding of our configuration: we do not set query_ingesters_within, so it should be at its default value of 13h. However, I am seeing clear evidence that store-gateway memory spikes as a result of my query. I see 3 instances (which matches our shard size) jump to around 40-50% memory utilization, which seems like a fine number to me. But why is there any activity in the store-gateway at all?
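
One possible explanation, sketched here under assumptions rather than confirmed for this deployment: whether store-gateways are consulted is governed by a separate querier setting, query_store_after, which is assumed below to be 12h (believed to be the default; check the configuration reference for your version). A 12h window plus the 2m lookback of delta(...[2m]) reaches slightly past that boundary. The function and its parameters are hypothetical and only model this reasoning.

```python
from datetime import timedelta


def backends_for_range(window, lookback,
                       query_ingesters_within=timedelta(hours=13),
                       query_store_after=timedelta(hours=12)):  # assumed default
    """Return which backends a query ending "now" may touch.

    window:   how far back the selected time range starts.
    lookback: extra history the range selector needs, e.g. 2m for [2m].
    """
    newest = timedelta(0)        # the query ends at "now"
    oldest = window + lookback   # age of the oldest sample the query needs
    backends = []
    if newest < query_ingesters_within:
        backends.append("ingester")
    if oldest > query_store_after:
        backends.append("store-gateway")
    return backends


print(backends_for_range(timedelta(hours=12), timedelta(minutes=2)))
print(backends_for_range(timedelta(hours=6), timedelta(minutes=2)))
```

Under these assumptions the 12h query touches both backends while the 6h query touches only ingesters. Grafana may also round the start time down to a step boundary, which pushes the oldest sample back a little further.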

As for the ingester, I'm not seeing any significant spikes in resource utilization.

Finally, I am seeing big memory spikes for the querier. About 8 hosts spike in memory from a very consistent baseline of 5% to anywhere from 25-75%. We have 16 queriers, each with 4 CPU and 12 GB memory.

> Ensure no Mimir component is overloaded / at capacity (especially in terms of CPU)

I looked through query-frontend, query-scheduler, querier, ingester, and store-gateway and I don't see any significant CPU spikes, just the memory spikes I mentioned above.

Given all this, do you see any red flags with our deployment?

(I will be working on getting tracing enabled to have better visibility in our deployment.)
