@asafpamzn

Summary

I'm implementing a COW-based live migration feature for CRIU that uses userfaultfd write-protection to track memory modifications while the process continues running. The goal is to combine it with the existing lazy-pages support so that a process can be duplicated to a remote instance with less downtime than the traditional dump modes.

Overview

  1. Write-protecting all writable memory pages using userfaultfd
  2. Resuming the process immediately after protection
  3. Capturing page contents on write faults before they're modified
  4. Transferring pages to destination while process continues running
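
For readers less familiar with the kernel interface, steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration and not the code from this branch: the helper names `uffd_open_wp` and `uffd_write_protect` are hypothetical, and the sketch protects memory of the calling process, whereas CRIU would have to create the userfaultfd inside the target task (for example from the parasite). It assumes kernel headers that define `UFFD_FEATURE_PAGEFAULT_FLAG_WP`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open a userfaultfd and ask for write-protect
 * fault reporting. Returns the fd, or -1 with errno set. */
static int uffd_open_wp(void)
{
    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd < 0)
        return -1;

    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
    };
    if (ioctl(fd, UFFDIO_API, &api) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Hypothetical helper: register [addr, addr + len) in write-protect mode
 * and then arm the protection. Returns 0 on success, -1 with errno set. */
static int uffd_write_protect(int fd, void *addr, size_t len)
{
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_WP,
    };
    if (ioctl(fd, UFFDIO_REGISTER, &reg) < 0)
        return -1;

    struct uffdio_writeprotect wp = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    return ioctl(fd, UFFDIO_WRITEPROTECT, &wp);
}
```

Once a range is armed this way, the task can be resumed; the first write to each protected page is then reported on the userfaultfd instead of completing immediately.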

High-level flow

  1. Instead of dumping the entire memory, mark the VMAs as write-protected.
    See https://github.com/asafpamzn/criu/blob/criu-cow/criu/cr-dump.c#L1720

A new parasite does the job:
https://github.com/asafpamzn/criu/blob/criu-dev/criu/cow-dump.c#L197C1-L198C1
https://github.com/asafpamzn/criu/blob/a59a151c1e2fb6edfe899ab940698c5a412f75b1/criu/pie/parasite.c#L963

Question: I want to dump small VMAs as usual and write-protect only the large ones. How can I do that? I don't fully understand how to combine the two kinds of VMAs, since they are all written to the same page image file.
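
To make the question concrete, here is a rough sketch of the split I have in mind. Everything in it is hypothetical: `struct vma_range`, `classify_vma` and the threshold are illustrative names and not CRIU's real VMA types, and dumping non-writable VMAs eagerly is only a simplifying assumption. It does not address how both kinds end up in the same page image file, which is the open part of the question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy for one VMA. */
enum dump_policy {
    DUMP_EAGER,         /* copy the pages during the normal dump */
    DUMP_WRITE_PROTECT  /* arm uffd write protection, transfer later */
};

/* Hypothetical, simplified VMA description (not CRIU's struct vma_area). */
struct vma_range {
    uint64_t start;
    uint64_t end;       /* exclusive */
    bool writable;
};

/* Large writable VMAs are write-protected; everything else is dumped
 * eagerly. The threshold is given in bytes. */
static enum dump_policy classify_vma(const struct vma_range *v,
                                     uint64_t threshold)
{
    uint64_t len = v->end - v->start;

    if (v->writable && len >= threshold)
        return DUMP_WRITE_PROTECT;
    return DUMP_EAGER;
}

/* Count how many VMAs in the array would be write-protected. */
static size_t count_write_protected(const struct vma_range *vmas, size_t n,
                                    uint64_t threshold)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (classify_vma(&vmas[i], threshold) == DUMP_WRITE_PROTECT)
            count++;
    return count;
}
```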

  2. Next, a new thread receives the page faults and transfers the process memory.
    https://github.com/asafpamzn/criu/blob/criu-cow/criu/cr-dump.c#L1728
    https://github.com/asafpamzn/criu/blob/a59a151c1e2fb6edfe899ab940698c5a412f75b1/criu/cow-dump.c#L423
    https://github.com/asafpamzn/criu/blob/a59a151c1e2fb6edfe899ab940698c5a412f75b1/criu/cow-dump.c#L444

  3. Wake the source process.
    https://github.com/asafpamzn/criu/blob/a59a151c1e2fb6edfe899ab940698c5a412f75b1/criu/cow-dump.c#L414
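
The per-fault work of the handler thread can be sketched as below. This is an assumption-laden illustration and not the code behind the links: `wp_fault_page` and `uffd_unprotect` are hypothetical names. The intended order is to decode the fault, save the old page contents, and only then drop the protection, which also wakes the faulting thread because `UFFDIO_WRITEPROTECT_MODE_DONTWAKE` is not set.

```c
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical helper: return true if msg reports a write to a
 * write-protected page and store the page-aligned address in *page.
 * page_size must be a power of two. */
static bool wp_fault_page(const struct uffd_msg *msg, uint64_t page_size,
                          uint64_t *page)
{
    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return false;
    if (!(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
        return false;

    *page = msg->arg.pagefault.address & ~(page_size - 1);
    return true;
}

/* Hypothetical helper: clear write protection on [page, page + len).
 * Call this only after the old contents of the page have been saved. */
static int uffd_unprotect(int uffd, uint64_t page, uint64_t len)
{
    struct uffdio_writeprotect wp = {
        .range = { .start = page, .len = len },
        .mode = 0, /* no MODE_WP: unprotect; no DONTWAKE: wake the writer */
    };
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```

A handler loop would read `struct uffd_msg` records from the userfaultfd, pass each one to `wp_fault_page`, copy the page out, and then call `uffd_unprotect`.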

I'm in the early stages of learning the code, so I would be happy to get some guidance and advice.
Please let me know if this makes sense. I'm most concerned about how to combine the memory areas, since I want to write-protect only large VMAs.

@rst0git
Member

rst0git commented Nov 8, 2025

@asafpamzn There are too many patches in this pull request and it would be difficult for someone to comment on the changes.

The following document provides more information on how to contribute to CRIU:
https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md

> I'm implementing a COW-based live migration feature for CRIU that uses userfaultfd write-protection to track memory modifications while the process continues running. The goal is to combine it with the lazy support in order to be able to duplicate a process to remote instance while minimizing downtime compared to traditional dump modes.

I believe Mike Rapoport (@rppt) might be able to provide some advice about the idea.

@asafpamzn
Author

Thanks @rst0git,

Since it is a big change I would like to get advice about the general direction before starting to implement. I can provide a design doc if it works better. What is the best path going forward?
Should I consult with @rppt?

@rst0git
Member

rst0git commented Nov 8, 2025

> Since it is a big change I would like to get advice about the general direction before starting to implement. I can provide a design doc if it works better. What is the best path going forward?

Creating a GitHub issue with more information about the use-case and why this functionality is important will help us to understand the proposed design.

> Should I consult with @rppt?

There are multiple people in the community who can provide feedback. Mike is an MM maintainer for the Linux kernel and contributed many of the patches that enable post-copy migration with userfaultfd.

@avagin
Member

avagin commented Nov 9, 2025

> Since it is a big change I would like to get advice about the general direction before starting to implement. I can provide a design doc if it works better. What is the best path going forward?

Let's start with a design doc.
