-
The different code path is taken because partition 83 is fetching from a follower, and in this case OFFSET_OUT_OF_RANGE is not treated as an auto.offset.reset error but just as a migration back to the leader. Now, given that the broker is healthy, auto.offset.reset set to BEGINNING should start fetching from the beginning, i.e., there should be no hang.
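For clarity, this is roughly how that policy is set on a librdkafka consumer (a minimal sketch; the broker address and group id are placeholders). auto.offset.reset is a topic-level property, but it can be set on the global config, where it applies as the default topic configuration:

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Placeholder broker/group values for illustration only. */
        rd_kafka_conf_set(conf, "bootstrap.servers", "broker1:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "example-group",
                          errstr, sizeof(errstr));

        /* With this policy, an OFFSET_OUT_OF_RANGE while fetching from the
         * leader should reset the fetch position to the start of the
         * partition rather than leave the consumer stuck. */
        if (rd_kafka_conf_set(conf, "auto.offset.reset", "beginning",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
                fprintf(stderr, "config error: %s\n", errstr);
                return 1;
        }

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "consumer creation failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_destroy(rk);
        return 0;
}
```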
-
👋🏻 Hello! I am investigating a weird behavior with librdkafka 1.8.2 where some partitions get stuck forever (accumulating lag and not processing anything) after broker failures.
My consumers are using fetch-from-replica, and after a broker failure most partitions reverted to reading from the leader with the following log:
However, at least one of them went through a different codepath:
These code paths are defined in https://github.com/edenhill/librdkafka/blob/2d78e928d8c0d798f341b1843c97eb6dcdecefc3/src/rdkafka_broker.c#L4279-L4308
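My reading of that range, paraphrased as a self-contained sketch (the names are mine, not librdkafka's internal identifiers), is that the follower/leader distinction decides whether an OFFSET_OUT_OF_RANGE triggers the reset policy at all:

```c
#include <stdbool.h>
#include <stdio.h>

/* Paraphrase of the branching I believe happens on an OFFSET_OUT_OF_RANGE
 * fetch error; identifiers are illustrative only. */
typedef enum {
        ACTION_MIGRATE_TO_LEADER,   /* go back to fetching from the leader */
        ACTION_APPLY_OFFSET_RESET   /* apply the auto.offset.reset policy  */
} fetch_action_t;

static fetch_action_t
handle_offset_out_of_range(bool fetching_from_follower) {
        if (fetching_from_follower) {
                /* The follower may simply be behind, so migrate the
                 * partition back to the leader instead of resetting. */
                return ACTION_MIGRATE_TO_LEADER;
        }
        /* Fetching from the leader: honor auto.offset.reset (e.g. BEGINNING). */
        return ACTION_APPLY_OFFSET_RESET;
}

int main(void) {
        printf("follower fetch -> %s\n",
               handle_offset_out_of_range(true) == ACTION_MIGRATE_TO_LEADER
                       ? "migrate back to leader" : "apply offset reset");
        printf("leader fetch   -> %s\n",
               handle_offset_out_of_range(false) == ACTION_APPLY_OFFSET_RESET
                       ? "apply offset reset" : "migrate back to leader");
        return 0;
}
```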
One thing that strikes me as odd is that for the partitions where everything went fine, the logs were emitted from a thread named after the broker's IP address, whereas the partition that got stuck logged from a thread named "main".
No errors bubbled up to the consumer, but this partition got stuck and stopped doing anything for 10+ minutes, until the service was restarted.
It looks to me like this is an issue in librdkafka (because the consumer wasn't notified of anything and the partition just stopped consuming), but I don't have enough information to open an issue or propose a fix yet. Does anyone have any tips on what to look into next?
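For completeness, this is roughly how the consumer is configured and polled (a simplified sketch, not the actual service code; brokers, group, rack and topic are placeholders): fetch-from-replica is enabled via client.rack (the brokers also need a rack-aware replica.selector.class for it to take effect), and every message returned by the poll loop is checked for a partition-level error. In the incident above, that error path never fired.

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Placeholder values for illustration only. */
        rd_kafka_conf_set(conf, "bootstrap.servers", "broker1:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "example-group",
                          errstr, sizeof(errstr));

        /* Fetch-from-replica (KIP-392): advertise the consumer's rack so the
         * broker's replica selector can point it at a nearby follower. */
        rd_kafka_conf_set(conf, "client.rack", "us-east-1a",
                          errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "consumer creation failed: %s\n", errstr);
                return 1;
        }
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *subs =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(subs, "example-topic",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, subs);
        rd_kafka_topic_partition_list_destroy(subs);

        for (;;) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
                if (!msg)
                        continue;
                /* Partition-level errors are expected to surface here as
                 * messages with a non-zero err field. */
                if (msg->err)
                        fprintf(stderr, "partition %d error: %s\n",
                                msg->partition, rd_kafka_message_errstr(msg));
                rd_kafka_message_destroy(msg);
        }
        /* not reached */
        return 0;
}
```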