Possible to separate out ruler read path from everything else #5980
-
We've recently migrated to Mimir, and so far everything has been great with one exception: the ruler. When we have the ruler enabled in its default configuration, it seems to completely saturate the store-gateways with its requests. Any queries that go beyond 12h and hit the store-gateways rather than the ingesters take forever or time out completely. Based on our p99 latency it does seem to focus its queries on 3-4 of the gateways, but that's enough to slow down the read path completely.

We did try enabling the ruler in external mode (by adding the query_frontend service to the ruler) to take advantage of the read-path acceleration, but this then seemed to completely overwhelm the query path to the point where no read requests were getting through. We scaled up the querier, query-frontend and scheduler, but this didn't seem to help. Queries would never complete without an error (no "too many inflight queries" errors or anything like that; it was all i/o timeout errors or "context canceled" messages reading from memcached).

We're currently running 6 store-gateways (following the large.yaml example) with fast-IO disks. So my main question is the one in the title: is it possible to separate out the ruler read path from everything else?
Our config below from the diff page:

activity_tracker:
  filepath: /active-query-tracker/activity.log
alertmanager:
  data_dir: /data
  external_url: /alertmanager
  fallback_config_file: /configs/alertmanager_fallback_config.yaml
  sharding_ring:
    zone_awareness_enabled: true
alertmanager_storage:
  backend: s3
  s3:
    bucket_name: pl-pop-mimir-alertmanager
    endpoint: s3.us-east-1.amazonaws.com
    region: us-east-1
    sse:
      type: SSE-S3
blocks_storage:
  backend: s3
  bucket_store:
    block_sync_concurrency: 25
    chunks_cache:
      backend: redis
      memcached:
        connect_timeout: 10s
        timeout: 10s
      redis:
        endpoint: mimir-chunks-cache.ng.0001.use1.cache.amazonaws.com:6379 # (this was previously in-cluster memcached but we experienced better performance using redis for some reason)
        max_async_buffer_size: 30000
        max_async_concurrency: 200
        max_get_multi_batch_size: 200
        max_get_multi_concurrency: 200
        max_item_size: 0
        min_idle_connections: 25
        read_timeout: 1m0s
        write_timeout: 30s
    index_cache:
      backend: redis
      memcached:
        connect_timeout: 10s
        timeout: 20s
      redis:
        endpoint: mimir-index-cache.ng.0001.use1.cache.amazonaws.com:6379 # (this was previously in-cluster memcached but we experienced better performance using redis for some reason)
        max_item_size: 0
        read_timeout: 20s
        write_timeout: 10s
    index_header:
      eager_loading_startup_enabled: true
      sparse_persistence_enabled: true
    index_header_lazy_loading_idle_timeout: 6h0m0s
    meta_sync_concurrency: 25
    metadata_cache:
      backend: redis
      memcached:
        connect_timeout: 10s
        timeout: 10s
      redis:
        endpoint: mimir-metadata-cache.ng.0001.use1.cache.amazonaws.com:6379 # (this was previously in-cluster memcached but we experienced better performance using redis for some reason)
        max_item_size: 0
        read_timeout: 20s
        write_timeout: 10s
    sync_dir: /data/tsdb-sync
  s3:
    bucket_name: pl-pop-mimir-blocks
    endpoint: s3.us-east-1.amazonaws.com
    region: us-east-1
    sse:
      type: SSE-S3
  tsdb:
    dir: /data/tsdb
    head_compaction_interval: 15m0s
    wal_replay_concurrency: 3
compactor:
  compaction_concurrency: 2
  compaction_interval: 15m0s
  data_dir: /data
  deletion_delay: 2h0m0s
  max_closing_blocks_concurrency: 2
  max_opening_blocks_concurrency: 4
  no_blocks_file_cleanup_enabled: true
  sharding_ring:
    wait_stability_min_duration: 1m0s
  symbols_flushers_concurrency: 4
frontend:
  cache_results: true
  cache_unaligned_requests: true
  max_outstanding_per_tenant: 1000
  parallelize_shardable_queries: true
  results_cache:
    backend: redis
    redis:
      endpoint: mimir-results-cache.ng.0001.use1.cache.amazonaws.com:6379 # (this was previously in-cluster memcached but we experienced better performance using redis for some reason)
      max_item_size: 0
      read_timeout: 20s
      write_timeout: 10s
  scheduler_address: mimir-query-scheduler-headless.mimir.svc:9095
frontend_worker:
  grpc_client_config:
    max_send_msg_size: 419430400
  scheduler_address: mimir-query-scheduler-headless.mimir.svc:9095
ingester:
  ring:
    instance_availability_zone: 1b
    num_tokens: 512
    tokens_file_path: /data/tokens
    unregister_on_shutdown: false
    zone_awareness_enabled: true
ingester_client:
  grpc_client_config:
    grpc_compression: gzip
limits:
  cardinality_analysis_enabled: true
  compactor_blocks_retention_period: 1y
  compactor_split_and_merge_shards: 2
  ingestion_burst_size: 50000000
  ingestion_rate: 5e+07
  max_fetched_chunks_per_query: 0
  max_global_exemplars_per_user: 100000
  max_global_series_per_user: 0
  max_label_names_per_series: 100
  max_query_parallelism: 30
  max_total_query_length: 500d
  native_histograms_ingestion_enabled: true
  out_of_order_time_window: 10m
  results_cache_ttl: 6w
  results_cache_ttl_for_cardinality_query: 6w
  results_cache_ttl_for_labels_query: 6w
  ruler_max_rule_groups_per_tenant: 0
  ruler_max_rules_per_rule_group: 0
memberlist:
  compression_enabled: false
  join_members:
    - dns+mimir-gossip-ring.mimir.svc.cluster.local:7946
  left_ingesters_timeout: 1m0s
querier:
  max_concurrent: 16
  streaming_chunks_per_ingester_series_buffer_size: 512
  streaming_chunks_per_store_gateway_series_buffer_size: 512
query_scheduler:
  max_outstanding_requests_per_tenant: 1000
ruler:
  alertmanager_url: dnssrvnoa+http://_http-metrics._tcp.mimir-alertmanager-headless.mimir.svc.cluster.local/alertmanager
  query_frontend:
    grpc_client_config:
      grpc_compression: gzip
  rule_path: /data
ruler_storage:
  backend: s3
  s3:
    bucket_name: pl-pop-mimir-ruler
    endpoint: s3.us-east-1.amazonaws.com
    region: us-east-1
    sse:
      type: SSE-S3
runtime_config:
  file: /var/mimir/runtime.yaml
server:
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_connection_age: 2m0s
  grpc_server_max_connection_age_grace: 5m0s
  grpc_server_max_connection_idle: 1m0s
store_gateway:
  sharding_ring:
    kvstore:
      prefix: multi-zone/
    tokens_file_path: /data/tokens
    unregister_on_shutdown: false
    wait_stability_min_duration: 1m0s
    zone_awareness_enabled: true
target: ingester
usage_stats:
  installation_mode: helm
-
Have you tried looking into the "Mimir / Queries" and "Mimir / Reads resources" dashboards? Those might give some clues about why queries are slow and whether it has anything to do with compute resources on the store-gateways.
This article has more details, but it should be as easy as pointing the ruler to the address of the query-frontend via the -ruler.query-frontend.address flag (query_frontend.address under the ruler block in YAML).
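For reference, a minimal sketch of what that looks like in the YAML config, assuming the query-frontend is reachable at the in-cluster Service address created by the helm chart and serves gRPC on port 9095 (both are assumptions, adjust them to your deployment):

ruler:
  query_frontend:
    # assumed Service name and gRPC port; use whatever your query-frontend exposes
    address: dns:///mimir-query-frontend.mimir.svc:9095
    grpc_client_config:
      grpc_compression: gzip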
When using the external ruler mode, the query-frontend should be logging all the queries that you run. You can search for
That's a good option. If the latency is coming from the store-gateways, then a dedicated ruler query path that still reuses the store-gateways from the regular query path may not make much of a difference. You can do this, but the helm chart does not support it out of the box: you'd have to create separate store-gateway, query-frontend, query-scheduler, and querier Deployments/StatefulSets. For the most part they can be identical to the existing ones in the helm chart. The only differences will have to be
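To make that concrete, here's a rough sketch of the kind of per-component overrides the duplicated Deployments/StatefulSets could carry. This is not something the helm chart generates for you, and the Service names and ring prefix below are made up for the example:

# Dedicated query-frontend and queriers: point them at a dedicated query-scheduler.
frontend:
  scheduler_address: mimir-ruler-query-scheduler-headless.mimir.svc:9095  # hypothetical Service name
frontend_worker:
  scheduler_address: mimir-ruler-query-scheduler-headless.mimir.svc:9095  # hypothetical Service name

# Dedicated store-gateways, and the queriers that should use them: join a separate
# sharding ring so they don't mix with the main read path's store-gateways.
store_gateway:
  sharding_ring:
    kvstore:
      prefix: ruler-read-path/  # example prefix, distinct from the main ring's multi-zone/

# Ruler: send rule evaluations to the dedicated query-frontend.
ruler:
  query_frontend:
    address: dns:///mimir-ruler-query-frontend.mimir.svc:9095  # hypothetical Service name

With something along these lines the ruler's evaluations flow through their own frontend, scheduler, queriers and store-gateways, so heavy rule queries stop competing with interactive queries for the shared store-gateways.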