In UCX, even with flow control (FC) enabled, why does the client still continuously experience RNR errors? #10580
Asked by super-train in Q&A (unanswered)
Hello everyone,
I am using UCX version 1.12, and I have a question regarding flow control (FC) that has been bothering me for a long time. I would greatly appreciate any help you can provide; thank you very much!
In the UCX 1.12 code, the sender starts with an initial fc_wnd and two thresholds derived from it: a soft threshold (0.5 * the initial fc_wnd) and a hard threshold (0.2 * the initial fc_wnd). Each time the sender sends a message, it checks the current value of fc_wnd. When fc_wnd has dropped to the soft or hard threshold, the sender marks the outgoing message's am_id with the corresponding FC soft-request or hard-request tag. When the receiver gets the AM message, it checks whether the am_id carries an FC tag. For a soft request, it only sets a flag on the endpoint to record that it owes the sender a grant; the next time the receiver has a message to send back to that sender, the am_id of that message carries the tag that grants the enlarged fc_wnd. For a hard request, the receiver immediately prepares a dedicated grant message and places it at the front of the pending queue, waiting to be sent to the sender.
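To make the sender side of this description concrete, here is a minimal sketch of the threshold check as I understand it. This is a toy model, not UCX source: the type and function names (`fc_sender_t`, `fc_sender_next_tag`, and so on) are hypothetical, and I am assuming that a grant simply resets the window to its initial size.

```c
#include <stdint.h>

/* Hypothetical tag values; these are illustrative names, not UCX identifiers. */
typedef enum {
    FC_TAG_NONE,     /* ordinary message, no FC request attached */
    FC_TAG_SOFT_REQ, /* window reached the soft threshold */
    FC_TAG_HARD_REQ  /* window reached the hard threshold */
} fc_tag_t;

typedef struct {
    int16_t wnd_size;    /* initial window size */
    int16_t wnd;         /* remaining credits */
    int16_t soft_thresh; /* 0.5 * wnd_size, as described in the post */
    int16_t hard_thresh; /* 0.2 * wnd_size, as described in the post */
} fc_sender_t;

static void fc_sender_init(fc_sender_t *fc, int16_t wnd_size)
{
    fc->wnd_size    = wnd_size;
    fc->wnd         = wnd_size;
    fc->soft_thresh = (int16_t)(wnd_size / 2);
    fc->hard_thresh = (int16_t)(wnd_size / 5);
}

/* Pick the FC tag for the next outgoing message and consume one credit.
 * Returns 0 on success, -1 if the window is exhausted and the sender
 * has to wait for a grant. */
static int fc_sender_next_tag(fc_sender_t *fc, fc_tag_t *tag)
{
    if (fc->wnd <= 0) {
        return -1;
    }

    if (fc->wnd == fc->hard_thresh) {
        *tag = FC_TAG_HARD_REQ;
    } else if (fc->wnd == fc->soft_thresh) {
        *tag = FC_TAG_SOFT_REQ;
    } else {
        *tag = FC_TAG_NONE;
    }

    fc->wnd--;
    return 0;
}

/* Assumption: a grant from the receiver restores the full window. */
static void fc_sender_on_grant(fc_sender_t *fc)
{
    fc->wnd = fc->wnd_size;
}
```

Note that nothing in this model depends on how many receive buffers the receiver actually has left, which is exactly what my question below is about.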
My question is: why does the receiver directly send a response to the sender with permission to expand fc_wnd when it receives a request to enlarge the fc_wnd, without checking if its own receive queue resources are sufficient? This is something I find very difficult to understand.
For example, if the receiver has only 10 free entries left in its receive queue but immediately tells the sender to enlarge fc_wnd, the sender starts sending a large number of messages right away, which eventually exhausts the receiver's receive queue. The sender then hits RNR errors. This is not just a hypothetical situation; it is the problem I am currently facing. Multiple senders are sending messages to a single receiver (essentially multiple clients sending to the same server), and the clients continuously generate RNR errors.
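The arithmetic behind this can be shown with a small worst-case model. It is only a sketch under my own assumptions: every sender holds a full window, all of them send their whole window in one burst, and the receiver does not repost any buffers during the burst. The function names are hypothetical and do not come from UCX.

```c
/* Count messages that find no free receive buffer (each one would
 * trigger an RNR NAK) when every sender spends its full window in a
 * single burst and the receiver reposts nothing in the meantime. */
static int count_rnr_in_burst(int num_senders, int fc_wnd, int rx_free_bufs)
{
    int rnr = 0;
    int s, m;

    for (s = 0; s < num_senders; s++) {
        for (m = 0; m < fc_wnd; m++) {
            if (rx_free_bufs > 0) {
                rx_free_bufs--; /* message lands in a posted buffer */
            } else {
                rnr++;          /* no buffer available: receiver not ready */
            }
        }
    }
    return rnr;
}

/* Largest per-sender window that cannot overrun the receive queue
 * in this worst-case model. */
static int max_safe_wnd(int num_senders, int rx_free_bufs)
{
    return (num_senders > 0) ? rx_free_bufs / num_senders : 0;
}
```

With 4 senders, a window of 8, and 10 free receive buffers, `count_rnr_in_burst(4, 8, 10)` reports 22 messages without a buffer, while `max_safe_wnd(4, 10)` gives a window of 2, for which the count drops to 0. In this model the per-sender window only protects the receiver if the sum of all windows fits in the receive queue.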
What is the significance and purpose of fc_wnd in this context, and how can I resolve the issue I've described above?
I eagerly await any answers from everyone. Thank you once again!