Metadata requests for a topic with many partitions #3916
Unanswered — travisdowns asked this question in Q&A (0 replies)
Consider a scenario where 50k distinct consumer processes using librdkafka consume from a single topic with 50k partitions via a consumer group. Since there is a 1:1 ratio of consumers to partitions, each consumer is ultimately only consuming from ~1 partition (with some small variation as consumers die or rejoin the group).
However, empirically, the metadata requests ask about all partitions in the topic, so that's 50k partitions * 50k clients = 2.5 billion partitions' worth of metadata sent every refresh interval (ignoring entirely any additional metadata refreshes triggered by events such as a topic leader change). If each partition takes ~100 bytes (a reasonable value, empirically, with 3 replicas per partition) that's 250 GB of traffic every 300s (by default) or ~6.7 Gbps of constant load just from the periodic metadata refreshes.
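A quick back-of-the-envelope check of those numbers (the 300s figure corresponds to librdkafka's default metadata refresh interval; the 100 bytes/partition is the empirical estimate above):

```python
# Sanity-check the metadata traffic estimate from the scenario above.
consumers = 50_000
partitions = 50_000
bytes_per_partition = 100      # empirical estimate, 3 replicas per partition
refresh_interval_s = 300       # default refresh interval, in seconds

total_bytes = consumers * partitions * bytes_per_partition
gbps = total_bytes * 8 / refresh_interval_s / 1e9

print(total_bytes)             # 250_000_000_000 -> 250 GB per refresh interval
print(round(gbps, 1))          # 6.7 Gbps sustained
```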
Is there any way around this? The metadata API does not seem to admit any feature to ask about only a subset of the partitions in a topic: you may provide a list of topics, but not a list of partitions within those topics. So clients will retrieve metadata for all 50k partitions even though each cares about only one.
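To illustrate what a per-partition filter would buy: if each consumer could request metadata for only the ~1 partition it owns, the aggregate traffic would shrink in proportion to the partition count. A minimal sketch of that comparison, reusing the assumptions above:

```python
# Hypothetical comparison: metadata traffic with vs. without a
# per-partition filter in the Metadata API (which does not exist today).
consumers = 50_000
partitions = 50_000
bytes_per_partition = 100      # same empirical estimate as above

# Today: every client receives metadata for every partition in the topic.
unfiltered = consumers * partitions * bytes_per_partition   # 250 GB/interval

# Hypothetical: each client asks about only the ~1 partition it consumes.
filtered = consumers * 1 * bytes_per_partition              # ~5 MB/interval

print(unfiltered // filtered)  # 50_000x reduction
```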
The easy answer is "don't do that [50k partitions in one topic]", but those of us building infrastructure aren't always in a position to choose what users do.