Mimir Distributors Timing Out with 500 Errors, Despite Scaling #7517
Replies: 1 comment 2 replies
-
Unrelated to the overall issue: looking at the screenshots you've posted, it seems an ingester was continuously restarting (I'm guessing based on the number of in-memory series going from 0 to 1.5M multiple times) during the periods when the distributors were seeing timeouts. What do the logs for that pod show?
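A quick way to check the restart theory from the command line (namespace and pod names below are hypothetical; adjust the selector to however your mimir-distributed release labels its ingesters):

```shell
# Restart counts reveal a crash-looping ingester at a glance.
kubectl get pods -n mimir -l app.kubernetes.io/component=ingester

# Logs from the *previous* container instance usually show why it died
# (panic, failed readiness, WAL replay problems, etc.).
kubectl logs -n mimir mimir-ingester-2 --previous

# Kubernetes records OOMKills in the pod status even when the logs are clean.
kubectl describe pod -n mimir mimir-ingester-2 | grep -A5 "Last State"
```

An ingester that is OOMKilled and replaying its WAL would also explain the in-memory series counter repeatedly climbing from 0 back to 1.5M.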
-
We are running the mimir-distributed chart. Once the rate of ingested samples/sec surpasses a certain level, the number of 500 errors from the distributor components increases, no matter how much we scale the number of distributors and ingesters.
Specifically, these 500s appear to come from ContextDeadlineExceeded errors raised in the distributor component when it attempts to connect to the ingesters.
All the 500 errors seem to be the same:
There doesn't seem to be a resource shortage; the ingesters and distributors are not hitting their limits.
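Worth noting: "not hitting limits" doesn't rule out CPU throttling, since the CFS scheduler enforces quota per 100ms period and averaged usage graphs can hide it. A couple of hypothetical checks (the metric names are the standard cAdvisor ones; the namespace and pod regex are assumptions):

```shell
# Instantaneous usage vs. requests/limits.
kubectl top pods -n mimir --sort-by=cpu

# If cAdvisor metrics are scraped, this PromQL shows the fraction of
# scheduling periods in which ingester containers were throttled:
#   sum by (pod) (
#       rate(container_cpu_cfs_throttled_periods_total{pod=~".*ingester.*"}[5m])
#     /
#       rate(container_cpu_cfs_periods_total{pod=~".*ingester.*"}[5m])
#   )
```

A sustained throttling ratio on ingesters would make distributor pushes exceed their deadline even though no pod ever reports hitting its limit.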

Here are some things we've tried to no effect:
Anyone have any tips on reducing the # of 5xx errors?
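One knob that maps directly to ContextDeadlineExceeded on the write path is the distributor's per-push deadline to the ingesters. A sketch of a values fragment for the mimir-distributed chart (assuming you pass extra Mimir config through `mimir.structuredConfig`; the `5s` value is illustrative, not a recommendation):

```yaml
# Hypothetical fragment; merge into your existing values file.
mimir:
  structuredConfig:
    distributor:
      # Deadline for each push from a distributor to an ingester.
      # Raising it can absorb brief ingester slowness, but it masks,
      # rather than fixes, an ingester that is genuinely unhealthy.
      remote_timeout: 5s
```

If the timeouts correlate with one restarting ingester rather than with overall load, fixing that pod is the real cure; a longer deadline would only delay the 500s.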
The config we're using in the mimir-distributed Helm chart: