Mimir Distributors Timing Out with 500 Errors, Despite Scaling #7517
Replies: 1 comment 2 replies
-
Unrelated to the overall issue: looking at the screenshots you've posted, it seems an ingester was continuously restarting (I'm guessing based on the number of in-memory series going from 0 to 1.5M multiple times) during the periods when the distributors were seeing timeouts. What do the logs for that pod show?
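A quick way to check the restart theory from the command line (namespace and pod names below are hypothetical; adjust the selector to however your mimir-distributed release labels its ingesters):

```shell
# Restart counts reveal a crash-looping ingester at a glance.
kubectl get pods -n mimir -l app.kubernetes.io/component=ingester

# Logs from the *previous* container instance usually show why it died
# (panic, failed readiness, WAL replay problems, etc.).
kubectl logs -n mimir mimir-ingester-2 --previous

# Kubernetes records OOMKills in the pod status even when the logs are clean.
kubectl describe pod -n mimir mimir-ingester-2 | grep -A5 "Last State"
```

An ingester that is OOMKilled and replaying its WAL would also explain the in-memory series counter repeatedly climbing from 0 back to 1.5M.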
-
We are running the mimir-distributed chart. Once the rate of ingested samples/sec surpasses a certain level, the number of 500 errors from the distributor components increases, no matter how much we scale the number of distributors and ingesters.
Specifically, these 500s appear to come from ContextDeadlineExceeded errors raised in the distributor component when it attempts to connect to the ingesters.
All the 500 errors seem to be the same:
There doesn't seem to be a resource shortage; the ingesters and distributors are not hitting their limits.
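Worth noting: "not hitting limits" doesn't rule out CPU throttling, since the CFS scheduler enforces quota per 100ms period and averaged usage graphs can hide it. A couple of hypothetical checks (the metric names are the standard cAdvisor ones; the namespace and pod regex are assumptions):

```shell
# Instantaneous usage vs. requests/limits.
kubectl top pods -n mimir --sort-by=cpu

# If cAdvisor metrics are scraped, this PromQL shows the fraction of
# scheduling periods in which ingester containers were throttled:
#   sum by (pod) (
#       rate(container_cpu_cfs_throttled_periods_total{pod=~".*ingester.*"}[5m])
#     /
#       rate(container_cpu_cfs_periods_total{pod=~".*ingester.*"}[5m])
#   )
```

A sustained throttling ratio on ingesters would make distributor pushes exceed their deadline even though no pod ever reports hitting its limit.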

Here are some things we've tried to no effect:
Anyone have any tips on reducing the # of 5xx errors?
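One knob that maps directly to ContextDeadlineExceeded on the write path is the distributor's per-push deadline to the ingesters. A sketch of a values fragment for the mimir-distributed chart (assuming you pass extra Mimir config through `mimir.structuredConfig`; the `5s` value is illustrative, not a recommendation):

```yaml
# Hypothetical fragment; merge into your existing values file.
mimir:
  structuredConfig:
    distributor:
      # Deadline for each push from a distributor to an ingester.
      # Raising it can absorb brief ingester slowness, but it masks,
      # rather than fixes, an ingester that is genuinely unhealthy.
      remote_timeout: 5s
```

If the timeouts correlate with one restarting ingester rather than with overall load, fixing that pod is the real cure; a longer deadline would only delay the 500s.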
The config we're using in the mimir-distributed Helm chart: