Possible to separate out ruler read path from everything else #5980
-
We've recently migrated to Mimir, and so far everything has been great with one exception: the ruler. When we have the ruler enabled in its default configuration, it seems to completely saturate the store-gateways with its requests. Any queries that go beyond 12h and hit the store-gateways rather than the ingesters take forever or time out completely. Based on our p99 latency it does seem to focus its queries on 3-4 of the gateways, but that's enough to slow down the read path completely.

We did try enabling the ruler in external mode (by adding the query_frontend service to the ruler) to take advantage of the read-path acceleration, but this then seemed to completely overwhelm the query path to the point where no read requests were getting through. We scaled up the querier, query-frontend and scheduler, but this didn't seem to help. Queries would never complete without an error (no "too many inflight queries" errors or anything like that; it was all i/o timeout errors or "context canceled" messages reading from memcached).

We're currently running 6 store-gateways (following the large.yaml example) with fast-IO disks. So my main question is the one in the title: is it possible to separate out the ruler read path from everything else?
Our config below from the diff page:

activity_tracker:
  filepath: /active-query-tracker/activity.log
alertmanager:
  data_dir: /data
  external_url: /alertmanager
  fallback_config_file: /configs/alertmanager_fallback_config.yaml
  sharding_ring:
    zone_awareness_enabled: true
alertmanager_storage:
  backend: s3
  s3:
    bucket_name: pl-pop-mimir-alertmanager
    endpoint: s3.us-east-1.amazonaws.com
    region: us-east-1
    sse:
      type: SSE-S3
blocks_storage:
  backend: s3
  bucket_store:
    block_sync_concurrency: 25
    chunks_cache:
      backend: redis
      memcached:
        connect_timeout: 10s
        timeout: 10s
      redis:
        endpoint: mimir-chunks-cache.ng.0001.use1.cache.amazonaws.com:6379 # (this was previously in-cluster memcached but we experienced better performance using redis for some reason)
        max_async_buffer_size: 30000
        max_async_concurrency: 200
        max_get_multi_batch_size: 200
        max_get_multi_concurrency: 200
        max_item_size: 0
        min_idle_connections: 25
        read_timeout: 1m0s
        write_timeout: 30s
    index_cache:
      backend: redis
      memcached:
        connect_timeout: 10s
        timeout: 20s
      redis:
        endpoint: mimir-index-cache.ng.0001.use1.cache.amazonaws.com:6379 # (this was previously in-cluster memcached but we experienced better performance using redis for some reason)
        max_item_size: 0
        read_timeout: 20s
        write_timeout: 10s
    index_header:
      eager_loading_startup_enabled: true
      sparse_persistence_enabled: true
    index_header_lazy_loading_idle_timeout: 6h0m0s
    meta_sync_concurrency: 25
    metadata_cache:
      backend: redis
      memcached:
        connect_timeout: 10s
        timeout: 10s
      redis:
        endpoint: mimir-metadata-cache.ng.0001.use1.cache.amazonaws.com:6379 # (this was previously in-cluster memcached but we experienced better performance using redis for some reason)
        max_item_size: 0
        read_timeout: 20s
        write_timeout: 10s
    sync_dir: /data/tsdb-sync
  s3:
    bucket_name: pl-pop-mimir-blocks
    endpoint: s3.us-east-1.amazonaws.com
    region: us-east-1
    sse:
      type: SSE-S3
  tsdb:
    dir: /data/tsdb
    head_compaction_interval: 15m0s
    wal_replay_concurrency: 3
compactor:
  compaction_concurrency: 2
  compaction_interval: 15m0s
  data_dir: /data
  deletion_delay: 2h0m0s
  max_closing_blocks_concurrency: 2
  max_opening_blocks_concurrency: 4
  no_blocks_file_cleanup_enabled: true
  sharding_ring:
    wait_stability_min_duration: 1m0s
  symbols_flushers_concurrency: 4
frontend:
  cache_results: true
  cache_unaligned_requests: true
  max_outstanding_per_tenant: 1000
  parallelize_shardable_queries: true
  results_cache:
    backend: redis
    redis:
      endpoint: mimir-results-cache.ng.0001.use1.cache.amazonaws.com:6379 # (this was previously in-cluster memcached but we experienced better performance using redis for some reason)
      max_item_size: 0
      read_timeout: 20s
      write_timeout: 10s
  scheduler_address: mimir-query-scheduler-headless.mimir.svc:9095
frontend_worker:
  grpc_client_config:
    max_send_msg_size: 419430400
  scheduler_address: mimir-query-scheduler-headless.mimir.svc:9095
ingester:
  ring:
    instance_availability_zone: 1b
    num_tokens: 512
    tokens_file_path: /data/tokens
    unregister_on_shutdown: false
    zone_awareness_enabled: true
ingester_client:
  grpc_client_config:
    grpc_compression: gzip
limits:
  cardinality_analysis_enabled: true
  compactor_blocks_retention_period: 1y
  compactor_split_and_merge_shards: 2
  ingestion_burst_size: 50000000
  ingestion_rate: 5e+07
  max_fetched_chunks_per_query: 0
  max_global_exemplars_per_user: 100000
  max_global_series_per_user: 0
  max_label_names_per_series: 100
  max_query_parallelism: 30
  max_total_query_length: 500d
  native_histograms_ingestion_enabled: true
  out_of_order_time_window: 10m
  results_cache_ttl: 6w
  results_cache_ttl_for_cardinality_query: 6w
  results_cache_ttl_for_labels_query: 6w
  ruler_max_rule_groups_per_tenant: 0
  ruler_max_rules_per_rule_group: 0
memberlist:
  compression_enabled: false
  join_members:
    - dns+mimir-gossip-ring.mimir.svc.cluster.local:7946
  left_ingesters_timeout: 1m0s
querier:
  max_concurrent: 16
  streaming_chunks_per_ingester_series_buffer_size: 512
  streaming_chunks_per_store_gateway_series_buffer_size: 512
query_scheduler:
  max_outstanding_requests_per_tenant: 1000
ruler:
  alertmanager_url: dnssrvnoa+http://_http-metrics._tcp.mimir-alertmanager-headless.mimir.svc.cluster.local/alertmanager
  query_frontend:
    grpc_client_config:
      grpc_compression: gzip
  rule_path: /data
ruler_storage:
  backend: s3
  s3:
    bucket_name: pl-pop-mimir-ruler
    endpoint: s3.us-east-1.amazonaws.com
    region: us-east-1
    sse:
      type: SSE-S3
runtime_config:
  file: /var/mimir/runtime.yaml
server:
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_connection_age: 2m0s
  grpc_server_max_connection_age_grace: 5m0s
  grpc_server_max_connection_idle: 1m0s
store_gateway:
  sharding_ring:
    kvstore:
      prefix: multi-zone/
    tokens_file_path: /data/tokens
    unregister_on_shutdown: false
    wait_stability_min_duration: 1m0s
    zone_awareness_enabled: true
target: ingester
usage_stats:
  installation_mode: helm
-
Have you tried looking into the "Mimir / Queries" and "Mimir / Reads resources" dashboards? Those might give some clues about why queries are slow and whether it has anything to do with compute resources on the store-gateways.
This article has more details, but it should be as easy as pointing the ruler to the address of the query-frontend via the -ruler.query-frontend.address flag (query_frontend.address under the ruler block in YAML).
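For reference, a minimal sketch of what that looks like in the YAML config, assuming the query-frontend is reachable at the in-cluster Service address created by the helm chart and serves gRPC on port 9095 (both are assumptions, adjust them to your deployment):

ruler:
  query_frontend:
    # assumed Service name and gRPC port; use whatever your query-frontend exposes
    address: dns:///mimir-query-frontend.mimir.svc:9095
    grpc_client_config:
      grpc_compression: gzip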
When using the external ruler mode, the query-frontend should be logging all the queries that you run. You can search for
That's a good option. If the latency is coming from the store-gateways, then a dedicated ruler query path that still reuses the store-gateways from the regular query path may not make much of a difference. You can do this, but the helm chart does not support it out of the box: you'd have to create separate store-gateway, query-frontend, query-scheduler, and querier Deployments/StatefulSets. For the most part they can be identical to the existing ones in the helm chart. The only differences will have to be
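To make that concrete, here's a rough sketch of the kind of per-component overrides the duplicated Deployments/StatefulSets could carry. This is not something the helm chart generates for you, and the Service names and ring prefix below are made up for the example:

# Dedicated query-frontend and queriers: point them at a dedicated query-scheduler.
frontend:
  scheduler_address: mimir-ruler-query-scheduler-headless.mimir.svc:9095  # hypothetical Service name
frontend_worker:
  scheduler_address: mimir-ruler-query-scheduler-headless.mimir.svc:9095  # hypothetical Service name

# Dedicated store-gateways, and the queriers that should use them: join a separate
# sharding ring so they don't mix with the main read path's store-gateways.
store_gateway:
  sharding_ring:
    kvstore:
      prefix: ruler-read-path/  # example prefix, distinct from the main ring's multi-zone/

# Ruler: send rule evaluations to the dedicated query-frontend.
ruler:
  query_frontend:
    address: dns:///mimir-ruler-query-frontend.mimir.svc:9095  # hypothetical Service name

With something along these lines the ruler's evaluations flow through their own frontend, scheduler, queriers and store-gateways, so heavy rule queries stop competing with interactive queries for the shared store-gateways.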