-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix: Solve the problem of expired BinlogSync request causing master-slave synchronization to get stuck #3041
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
WalkthroughThe pull request modifies the control flow in the Changes
Sequence Diagram(s)sequenceDiagram
participant Worker as PikaReplBgWorker
participant Logger as Logger
Worker->>Worker: HandleBGWorkerWriteBinlog()
alt Session ID mismatch detected
Worker->>Logger: Log session mismatch and warning
note right of Worker: No state change to kTryConnect performed
end
Suggested labels
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/pika_repl_bgworker.cc (1)
129-130
: Disable Reconnection on Expired TaskBy commenting out the
slave_db->SetReplState(ReplState::kTryConnect);
line when a session ID mismatch occurs, this change now properly ignores expired BinlogSync tasks. This aligns with the PR objective to avoid triggering extra reconnection attempts that lead to synchronization deadlocks. Ensure that the ignored task scenario is monitored (if needed) so that persistent issues do not go unnoticed.
…lave synchronization to get stuck
辛苦您看一下下面这个PR中的第二个部分: 请确认一下,在这个PR合入之后,您提到的问题是否还存在,辛苦了! 如果是最新代码出现的这个问题,可以考虑是不是从节点的写阻塞还没解除,处于WaitReply状态(发出了trysync请求,还没对trysync的response进行处理),这种情况等下去应该是能恢复的。 |
Thank you for your hard work, take a look at the second part of the following PR: Please confirm whether the problem you mentioned still exists after this PR is merged. Thank you for your hard work! |
我的是3.3.6版本,可能你修复的版本跟我的版本有点偏差,产生的问题似乎有点不一样,不过触发的原因是类似的,也是连续两次trysync引发的问题,请问一下当时为什么不直接注释而是采用了同步的方式进行处理,直接丢弃不匹配的会有什么问题吗,感觉这样修复也更简单。之前为什么采用的是异步处理,现在改成同步后性能会不会有什么问题 |
Mine is version 3.3.6. Maybe the version you fixed is a bit different from mine. The problems that appear to be a little different. However, the triggering reason is similar, and it is also a problem caused by trysync twice in a row. I would like to ask why I didn’t comment directly but used the synchronization method to process it. Is there any problem that will be discarded directly from the mismatch? I feel that the repair is easier. Why did it use asynchronous processing before? Is there any problem with performance after changing it to synchronization? |
1.不采用注释的方式进行处理主要是无法确定之前这个代码到底处理了什么case,害怕注释掉之后出新的问题,将trysync从异步改成同步的话,至少带来的影响,以及潜在的副作用是能搞清楚的。
(另外在当前这种改异步处理的方案下,还有一种极端情况是,从节点阻塞超过40s(两个超时周期)的情况下,后面主从还能否恢复连接,这点经过验证是可以的。) |
(In addition, under the current asynchronous processing solution, there is another extreme case, if the slave node blocks for more than 40s (two timeout cycles), whether the master and slave can resume the connection after the later are verified.) |
感觉我这种方式风险小一点,毕竟这个问题本身是小概率事件,不过我也不确定会不会引发其他问题,先跑一段时间看看 |
I feel that my approach is less risky. After all, this problem itself is a low probability event, but I am not sure if it will cause other problems. Let's go for a while to see it. |
是的,可以先跑跑看 |
Yes, you can run and watch first |
Solve the problem of expired BinlogSync request causing master-slave synchronization to get stuck
当前有一个问题是,在主从同步的时候,如果从节点处理速度比较慢,很容易触发主从同步的20s超时逻辑,主节点会主动断开与从节点的连接。下一次重连时,从节点可能会存在上一次会话没有处理完的BGWorkerWriteBinlog任务,在处理到这种过期任务时,从节点会重置状态机到trysync状态,触发第三次重连。第三次重连并没有断开第二次重连与master建立的连接,因而主节点收到从节点重新发起的tryscyn后并不会更新这次重连的sessionid,主节点的同步窗口不会重置,还在等待第二次重连发送出去的binlogsync请求的binlongsyncack响应,因而产生了死锁,主从同步卡住不动,必须从节点重启才可以恢复。
本次修复简单忽略了这种过期的请求,从而避免了第三次的重连,确保了主从同步正常。
Summary by CodeRabbit