
Conversation


@morningli morningli commented Mar 10, 2025

Solve the problem of expired BinlogSync request causing master-slave synchronization to get stuck

There is currently a problem: during master-slave synchronization, if the slave processes data slowly, the 20s sync timeout is easily triggered and the master proactively closes its connection to the slave. On the next reconnect, the slave may still hold unfinished BGWorkerWriteBinlog tasks from the previous session. When it reaches such an expired task, the slave resets its state machine to the TrySync state (ReplState::kTryConnect), triggering a third reconnect. The third reconnect does not close the connection that the second reconnect established with the master, so when the master receives the slave's new TrySync it does not update the session id for this reconnect and does not reset its synchronization window; it keeps waiting for BinlogSyncAck responses to the BinlogSync requests sent over the second connection. This produces a deadlock: master-slave synchronization stalls and can only be recovered by restarting the slave.
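Sketching the failure sequence described above (a reconstruction from this description, not from code; message names simplified):

sequenceDiagram
    participant Slave
    participant Master
    Slave->>Master: TrySync (connection 1)
    Master->>Slave: BinlogSync stream
    Note over Slave: slow consumer, 20s timeout fires
    Master->>Slave: close connection 1
    Slave->>Master: TrySync (connection 2, new session id)
    Note over Slave: stale BGWorkerWriteBinlog task from connection 1
    Slave->>Slave: session id mismatch, reset to TrySync state
    Slave->>Master: TrySync (connection 3, connection 2 still open)
    Note over Master: session id and sync window not updated
    Note over Master,Slave: master waits for BinlogSyncAck on connection 2: deadlock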

This fix simply ignores such expired requests, which avoids the third reconnect and keeps master-slave synchronization working.
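A minimal sketch of the patched branch in HandleBGWorkerWriteBinlog (src/pika_repl_bgworker.cc); only the SetReplState(ReplState::kTryConnect) call and the retained warning log are confirmed by this PR, the surrounding names are illustrative:

    // An expired BGWorkerWriteBinlog task carries the session id of a previous
    // connection, so it no longer matches the slave's current master session id.
    if (task_session_id != slave_db->MasterSessionId()) {
      LOG(WARNING) << "BinlogSync session id mismatch, ignoring expired task";
      // Before this fix the worker forced yet another reconnect here, which
      // could race with the already re-established connection and deadlock:
      // slave_db->SetReplState(ReplState::kTryConnect);
      return;  // now: drop the expired task and keep the current session
    }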

Summary by CodeRabbit

  • Bug Fixes
    • Improved replication error handling by ignoring expired tasks when a session ID mismatch occurs, preventing unnecessary state transitions while retaining clear logging for consistent monitoring.


coderabbitai bot commented Mar 10, 2025

Walkthrough

The pull request modifies the control flow in the PikaReplBgWorker class. In the HandleBGWorkerWriteBinlog method, the line that changes the replication state of the slave database to ReplState::kTryConnect has been commented out. This means that on a session ID mismatch, the system logs the issue but no longer attempts the state change.

Changes

  • src/pika_repl_bgworker.cc: Commented out the state change to ReplState::kTryConnect on session ID mismatch; logging remains unchanged.

Sequence Diagram(s)

sequenceDiagram
    participant Worker as PikaReplBgWorker
    participant Logger as Logger
    Worker->>Worker: HandleBGWorkerWriteBinlog()
    alt Session ID mismatch detected
        Worker->>Logger: Log session mismatch and warning
        note right of Worker: No state change to kTryConnect performed
    end

Suggested labels

☢️ Bug

Suggested reviewers

  • wangshao1
  • chejinge
  • AlexStocks

Poem

I’m a rabbit with code, so nimble and spry,
Skipping past state changes that once made me sigh.
Session mismatches log their tune in the night,
Yet I hop on in code, keeping things light.
With each bug left behind, my heart feels just right!
🐇




@github-actions github-actions bot added the 🧹 Updates (This will not be worked on) and Invalid PR Title labels on Mar 10, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/pika_repl_bgworker.cc (1)

129-130: Disable Reconnection on Expired Task

By commenting out the slave_db->SetReplState(ReplState::kTryConnect); line when a session ID mismatch occurs, this change now properly ignores expired BinlogSync tasks. This aligns with the PR objective to avoid triggering extra reconnection attempts that lead to synchronization deadlocks. Ensure that the ignored task scenario is monitored (if needed) so that persistent issues do not go unnoticed.
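In diff form, the change under discussion amounts to roughly this (surrounding context simplified):

    - slave_db->SetReplState(ReplState::kTryConnect);
    + // slave_db->SetReplState(ReplState::kTryConnect);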

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 590afe9 and 990f2b4.

📒 Files selected for processing (1)
  • src/pika_repl_bgworker.cc (1 hunks)

@morningli morningli changed the title from "Update pika_repl_bgworker.cc" to "fix: Solve the problem of expired BinlogSync request causing master-slave synchronization to get stuck" on Mar 10, 2025
@cheniujh
Collaborator

cheniujh commented Mar 14, 2025

  1. Does this problem occur in 3.5.4 and earlier versions? Can it be reproduced reliably?

  2. If the slave transitions to tryconnect and sends a TrySync request while there are still expired tasks in its binlog queue, that also triggers other problems (I call it the "double TrySync" problem). I have fixed related master-slave issues (released only in 3.5.5; earlier versions do not carry the fix). The main change: the slave processes the TrySync response on the binlog thread, i.e. synchronously. The slave handles the response only after it is unblocked and all preceding binlogs are digested, and only then does its session id advance (see the sketch after this comment). So as I understand it, the line you commented out is now very hard to reach (of course there may be other paths into this branch, which is why I did not simply comment it out in my fix).

Please take a look at the second part of this PR:
#2638

Please confirm whether the problem you described still exists after that PR is merged. Thank you!

If you hit this problem on the latest code, consider whether the slave's write block has simply not been released yet and it is sitting in the WaitReply state (the TrySync request was sent but its response has not been processed); in that case it should recover if you wait.
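An illustrative sketch of the ordering idea behind that fix (not the actual code of #2638; the Schedule helper and handler names below are hypothetical):

    // Route the TrySync response through the same FIFO worker that drains
    // binlog tasks, instead of handling it inline on the repl client thread.
    binlog_worker->Schedule([resp, slave_db]() {
      // Every BGWorkerWriteBinlog task queued earlier, including stale ones
      // from the previous session, has been consumed or discarded by the time
      // this runs, so advancing the session id here cannot collide with them.
      HandleTrySyncResponse(slave_db, resp);
    });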


@morningli
Author

I am on version 3.3.6, so the version you fixed may differ somewhat from mine; the resulting problems look a little different, but the trigger is similar: both are caused by two consecutive TrySync requests. May I ask why, at the time, you chose synchronous processing instead of simply commenting the line out? Would directly discarding mismatched tasks cause any problem? That kind of fix feels simpler. Also, why was asynchronous processing used originally, and could switching to synchronous processing cause any performance problems?


@cheniujh
Collaborator

cheniujh commented Mar 21, 2025

> I am on version 3.3.6, so the version you fixed may differ somewhat from mine; the resulting problems look a little different, but the trigger is similar: both are caused by two consecutive TrySync requests. May I ask why, at the time, you chose synchronous processing instead of simply commenting the line out? Would directly discarding mismatched tasks cause any problem? That kind of fix feels simpler. Also, why was asynchronous processing used originally, and could switching to synchronous processing cause any performance problems?

1. We did not fix it by commenting the line out mainly because we could not determine exactly which cases this code used to handle, and were afraid that commenting it out would cause new problems. Changing TrySync from asynchronous to synchronous, at the very least, has an impact and potential side effects that we can fully understand.

2. After changing to synchronous processing, the effect is: when the slave stalls and reconnects after a timeout, the TrySync response is processed with a delay, so the slave stays in the WaitReply state and can only complete the connection once it is unblocked (having consumed or discarded the expired binlog). That is, master-slave reconnection is delayed (master_link_status stays down longer). There is no material impact on sync performance: even if you process the TrySync response asynchronously and let master and slave reconnect sooner, the slave has no consumption capacity until its RocksDB is unblocked, so performance is not significantly different from your approach.
There is another related PR you can look at: fix: add a user-friendly repl metric "repl_connect_status" in the resp of info command #2656

(Also, under the current synchronous-processing approach there is one extreme case: if the slave stays blocked for more than 40s, i.e. two timeout cycles, can master and slave still re-establish the connection afterwards? This has been verified to work.)


@morningli
Author

I feel my approach is a bit less risky; after all, this problem itself is a low-probability event. But I am not sure whether it will cause other issues either, so I will run it for a while first and see.


@cheniujh
Collaborator

> I feel my approach is a bit less risky; after all, this problem itself is a low-probability event. But I am not sure whether it will cause other issues either, so I will run it for a while first and see.

Yes, run it for a while first and see.

