"target not available" error when losing one of three thanos receive replicas #5108

dwilliams782 · 2022-01-28T17:24:54Z

Hi all,

I am attempting to run Thanos Receive in Kubernetes as a StatefulSet. Our use case is very small, as Prometheus does almost all of our heavy lifting. However, we evaluate some rules using Loki Ruler and use remote write to send the results to Prometheus. Because our Prometheus setup is HA, there is potential for data loss if we lose an instance, so we want to use Thanos Receive to provide an HA solution for remote write metrics.

I have the following config:

args:
  - receive
  - --tsdb.path=/var/thanos/receive
  - --tsdb.retention=1d
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --receive.replication-factor=2
  - --label=replica="$(NAME)"
  - --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
  - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
  - --remote-write.address=0.0.0.0:10903
  - --objstore.config-file=/config/thanos.yaml

And the hashring config:

hashrings.json: '[{"endpoints": ["thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901", "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901", "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"], "hashring": "default", "tenants": [ ]}]'

We don't need three replicas, but based on the discussion in #3194, three replicas with a replication factor of 2 seem to be the minimum required to have metrics replicated on two instances of Receive while remaining resilient to instance failure. Is that correct?
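
To make the layout I have in mind concrete, here is a rough sketch of how a series might be assigned to two of the three endpoints. It assumes replica n of a series lives at index (hash + n) modulo the number of endpoints in the hashring. The function name `replica_endpoints`, the series key format, and the SHA-256 hash are placeholders for illustration, not what Thanos actually uses.

```python
import hashlib

# Endpoints copied from the hashrings.json above.
ENDPOINTS = [
    "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
    "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
    "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901",
]


def replica_endpoints(series_key, endpoints, replication_factor):
    """Return the endpoints that would hold copies of one series.

    Assumption: replica n is placed at index (hash + n) % len(endpoints).
    SHA-256 is a stand-in hash, chosen only so the sketch is deterministic.
    """
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    return [
        endpoints[(h + n) % len(endpoints)]
        for n in range(replication_factor)
    ]


if __name__ == "__main__":
    # Hypothetical series key: tenant plus label set.
    for endpoint in replica_endpoints('default/{__name__="up"}', ENDPOINTS, 2):
        print(endpoint)
```

Under that assumption, each series always maps to the same two neighbouring endpoints for as long as the hashring file lists the same three members, whether or not the pods behind them are running.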

I can see data being received from two of the three instances (thanos-receive-1 and thanos-receive-2) as expected.

To test that we are resilient to a pod failing, I deleted thanos-receive-1, expecting that requests would start being routed to thanos-receive-0. Instead, thanos-receive-2 started logging the following error multiple times per second:

2022-01-28 16:11:05 | level=error ts=2022-01-28T16:11:05.21198752Z caller=handler.go:366 component=receive component=receive-handler err="replicate write request for endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: quorum not reached: backing off forward request for endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: target not available" msg="internal server error"

This continued until I restored thanos-receive-1, so thanos-receive-0 never received the replicated writes.

Have I misunderstood a concept here, or got some configuration wrong?

GiedriusS · 2022-02-03T08:28:44Z

Maintainer

Perhaps the problem is that you are pointing each node to a Service, which is a load balancer in front of pods in Kubernetes? You could try creating a headless Service and then using dns+ to get the IPs of all the pods.

1 reply

dwilliams782 Feb 3, 2022
Author

Hi, thanks for taking the time to respond. Could you confirm whether you are referring to the hashring config? The thanos-receive Service is indeed headless (otherwise thanos-receive-0.thanos-receive.x.y wouldn't resolve), but I haven't seen it mentioned anywhere that the hashring supports service discovery.
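
For anyone landing here with the same "quorum not reached" error, one possible reading is sketched below. It assumes Receive only acknowledges a write once floor(replication_factor / 2) + 1 replicas have accepted it. That is an assumption about Thanos internals worth checking against the version you run, and the function names are made up for illustration.

```python
def write_quorum(replication_factor):
    """Successful replica writes needed per request.

    Assumption: quorum is floor(replication_factor / 2) + 1.
    """
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replica writes that may fail while the quorum is still met."""
    return replication_factor - write_quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(
            f"replication_factor={rf} "
            f"quorum={write_quorum(rf)} "
            f"tolerated_failures={tolerated_failures(rf)}"
        )
```

If that assumption holds, a replication factor of 2 needs both copies to succeed, so any series whose replicas include the deleted pod would fail in the way the log shows. A replication factor of 3 across the three instances would tolerate one failed copy.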
