-
The different code path is taken because partition 83 is fetching from a follower, and in this case OFFSET_OUT_OF_RANGE is not treated as an auto.offset.reset error but just as a migration back to the leader. Now, given that the broker is healthy, auto.offset.reset set to BEGINNING should start fetching from the beginning, i.e., there should be no hang.
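For clarity, this is roughly how that policy is set on a librdkafka consumer (a minimal sketch; the broker address and group id are placeholders). auto.offset.reset is a topic-level property, but it can be set on the global config, where it applies as the default topic configuration:

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Placeholder broker/group values for illustration only. */
        rd_kafka_conf_set(conf, "bootstrap.servers", "broker1:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "example-group",
                          errstr, sizeof(errstr));

        /* With this policy, an OFFSET_OUT_OF_RANGE while fetching from the
         * leader should reset the fetch position to the start of the
         * partition rather than leave the consumer stuck. */
        if (rd_kafka_conf_set(conf, "auto.offset.reset", "beginning",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
                fprintf(stderr, "config error: %s\n", errstr);
                return 1;
        }

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "consumer creation failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_destroy(rk);
        return 0;
}
```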
-
👋🏻 Hello! I am investigating a weird behavior with librdkafka 1.8.2 where some partitions get stuck forever (accumulating lag and not processing anything) after broker failures.
My consumers are using fetch-from-replica, and after a broker failure most partitions reverted to reading from the leader with the following log:
However, at least one of them went through a different codepath:
These code paths are defined in https://github.com/edenhill/librdkafka/blob/2d78e928d8c0d798f341b1843c97eb6dcdecefc3/src/rdkafka_broker.c#L4279-L4308
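My reading of that range, paraphrased as a self-contained sketch (the names are mine, not librdkafka's internal identifiers), is that the follower/leader distinction decides whether an OFFSET_OUT_OF_RANGE triggers the reset policy at all:

```c
#include <stdbool.h>
#include <stdio.h>

/* Paraphrase of the branching I believe happens on an OFFSET_OUT_OF_RANGE
 * fetch error; identifiers are illustrative only. */
typedef enum {
        ACTION_MIGRATE_TO_LEADER,   /* go back to fetching from the leader */
        ACTION_APPLY_OFFSET_RESET   /* apply the auto.offset.reset policy  */
} fetch_action_t;

static fetch_action_t
handle_offset_out_of_range(bool fetching_from_follower) {
        if (fetching_from_follower) {
                /* The follower may simply be behind, so migrate the
                 * partition back to the leader instead of resetting. */
                return ACTION_MIGRATE_TO_LEADER;
        }
        /* Fetching from the leader: honor auto.offset.reset (e.g. BEGINNING). */
        return ACTION_APPLY_OFFSET_RESET;
}

int main(void) {
        printf("follower fetch -> %s\n",
               handle_offset_out_of_range(true) == ACTION_MIGRATE_TO_LEADER
                       ? "migrate back to leader" : "apply offset reset");
        printf("leader fetch   -> %s\n",
               handle_offset_out_of_range(false) == ACTION_APPLY_OFFSET_RESET
                       ? "apply offset reset" : "migrate back to leader");
        return 0;
}
```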
One thing that strikes me as odd is that for the partitions where everything went fine, the logs were emitted from a thread named after the broker's IP address, whereas the partition that got stuck logged from a thread named "main".
No errors bubbled up to the consumer, but this partition got stuck and stopped doing anything for 10+ minutes, until the service was restarted.
It looks to me like this is an issue in librdkafka (because the consumer wasn't notified of anything and the partition just stopped consuming), but I don't have enough information to open an issue or propose a fix yet. Does anyone have any tips on what to look into next?
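For completeness, this is roughly how the consumer is configured and polled (a simplified sketch, not the actual service code; brokers, group, rack and topic are placeholders): fetch-from-replica is enabled via client.rack (the brokers also need a rack-aware replica.selector.class for it to take effect), and every message returned by the poll loop is checked for a partition-level error. In the incident above, that error path never fired.

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Placeholder values for illustration only. */
        rd_kafka_conf_set(conf, "bootstrap.servers", "broker1:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "example-group",
                          errstr, sizeof(errstr));

        /* Fetch-from-replica (KIP-392): advertise the consumer's rack so the
         * broker's replica selector can point it at a nearby follower. */
        rd_kafka_conf_set(conf, "client.rack", "us-east-1a",
                          errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "consumer creation failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *subs =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(subs, "example-topic",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, subs);
        rd_kafka_topic_partition_list_destroy(subs);

        for (;;) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
                if (!msg)
                        continue;
                /* Partition-level errors are expected to surface here as
                 * messages with a non-zero err field. */
                if (msg->err)
                        fprintf(stderr, "partition %d error: %s\n",
                                msg->partition, rd_kafka_message_errstr(msg));
                rd_kafka_message_destroy(msg);
        }
        /* not reached */
        return 0;
}
```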