Skip to content

Conversation

@fdavid-amd
Copy link
Contributor

This patch set, in combination with the accompanying kernel patch set, is the
second draft of work to support amdgpu dmabuf IPC. Since the first post, the
patches have been split to make them more readable and made
backwards-compatible.

@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

@avagin avagin added no-auto-close Don't auto-close as a stale issue and removed stale-pr labels May 19, 2025
@avagin
Copy link
Member

avagin commented May 19, 2025

@fdavid-amd how is it going? Let me know if you need help with this pr.

@fdavid-amd fdavid-amd force-pushed the dmabuf-post branch 4 times, most recently from b057779 to 456978a Compare June 18, 2025 18:12
amdgpu represents allocated device memory as a memory mapping
of the device file. This is a non-standard VMA that must
be handled by the plugin, not the normal VMA code.

Ignore all VMAs on device files.

Signed-off-by: David Francis <David.Francis@amd.com>
amdgpu dmabuf CRIU requires the ability of the amdgpu plugin
to retry.

Change files_ext.c to read a response of 1 from a plugin restore
function to mean retry.

Signed-off-by: David Francis <David.Francis@amd.com>
@fdavid-amd fdavid-amd force-pushed the dmabuf-post branch 3 times, most recently from 45f796c to 878a313 Compare July 8, 2025 14:05
@fdavid-amd
Copy link
Contributor Author

@avagin
Sorry for leaving this untouched for so long, I was revising the kernel side of the changes. Could you take a look at this again?

@fdavid-amd
Copy link
Contributor Author

Some of the build failures I don't understand - they might be an issue with this patch set or they might be an issue with the CI. There's a lot of 'Error: ipv6: address already assigned.', which could possibly be a result of the way that I'm allocating the plugin socket but seems unlikely.
The linter errors I do understand - they're flagging real grammar and syntax errors in the amdgpu-drm.h and drm.h files - but the point of those files is that they're exact copies of the kernel header files of the same names, so I don't want to change them.

@fdavid-amd
Copy link
Contributor Author

@avagin sorry for the repeat ping, but there is something of a rush on my part to get kernel patch https://lists.freedesktop.org/archives/dri-devel/2025-July/514922.html merged in the next week or so (it adds the CHANGE_HANDLE ioctl). As part of merging that patch, I need to demonstrate that it has written and reasonably stable userspace code that uses it (which would be these CRIU patches). It would be a big help to me if you could say if the general approach taken by these patches is acceptable or unacceptable.
Thanks,
David

@avagin
Copy link
Member

avagin commented Oct 21, 2025

@fdavid-amd sorry for the delay. Honestly, I am not a fan of the current approach. I don't understand why the same logic cannot be implemented in the plugin code. It could be something like this: https://gist.github.com/avagin/4484274ec19ac10a8966b5b2800533c0.

@fdavid-amd
Copy link
Contributor Author

I see. Updated with that approach

@avagin
Copy link
Member

avagin commented Oct 29, 2025

LGTM

@fdavid-amd
Copy link
Contributor Author

What are the next steps to merging this?

@rst0git
Copy link
Member

rst0git commented Oct 30, 2025

@fdavid-amd I've rebased the patches in this PR on the criu-dev branch to run the CI tests.

@avagin
Copy link
Member

avagin commented Oct 31, 2025

@fdavid-amd I used clang-format to reformat C files and fixed one issue that was a root cause of CI failures. I pushed all changes in https://github.com/checkpoint-restore/criu/tree/wip/amdgpu. Let know if I messed up something.

@fdavid-amd
Copy link
Contributor Author

Oh wait
Those pr_info lines beginning "TWI:" are debugs, I didn't mean to push those
Sorry

@fdavid-amd
Copy link
Contributor Author

Now fixed

@rst0git
Copy link
Member

rst0git commented Oct 31, 2025

@fdavid-amd Would it be possible to provide more information on how we can test these patches?


CR_PLUGIN_HOOK__POST_FORKING = 12,

CR_PLUGIN_HOOK__RESTORE_INIT = 13,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to update plugins/amdgpu/README.md with information about the new plugin callbacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That README doesn't seem to contain information about plugin callbacks; I think the appropriate place for such documentation might be the callback functions themselves or criu-plugin.h
In any case I hope it is clear what RESTORE_INIT does

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*This new plugin is enabled by the new hook `__UPDATE_VMA_MAP` in our RFC patch

*This new plugin is enabled by the new hook `__RESUME_DEVICES_LATE` in our RFC

*This optimization technique is enabled by the `__POST_FORKING` hook.*

Copy link
Member

@rst0git rst0git Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this documentation is to help new users to understand how CRIU and the AMD GPU Plugin work. Most users don't read the comments in header files like criu-plugin.h.

amdgpu libraries that use dmabuf fd to share GPU memory between
processes close the dmabuf fds immediately after using them.
However, it is possible that checkpoint of a process catches one
of the dmabuf fds open. In that case, the amdgpu plugin needs
to handle it.

The checkpoint of the dmabuf fd does require the device file
it was exported from to have already been dumped

To identify which device this dmabuf fd was exprted from, attempt
to import it on each device, then record the dmabuf handle
it imports as. This handle can be used to restore it.

Signed-off-by: David Francis <David.Francis@amd.com>
@avagin
Copy link
Member

avagin commented Oct 31, 2025

@fdavid-amd I used clang-format to reformat C files and fixed one issue that was a root cause of CI failures. I pushed all changes in https://github.com/checkpoint-restore/criu/tree/wip/amdgpu. Let know if I messed up something.

@fdavid-amd pls do all changes on top of the wip/amdgpu branch. I spent my time to fix ci tests.

@avagin avagin changed the base branch from criu-dev to wip/amdgpu October 31, 2025 15:41
@rst0git
Copy link
Member

rst0git commented Nov 1, 2025

@fdavid-amd I've applied the patches here with @avagin's changes from the wip/amdgpu branch in the following pull request: #2810

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 3, 2025
Add new ioctl DRM_IOCTL_AMDGPU_GEM_LIST_HANDLES.

This ioctl returns a list of bos with their handles, sizes,
and flags and domains.

This ioctl is meant to be used during CRIU checkpoint and
provide information needed to reconstruct the bos
in CRIU restore.

Userspace for this and the next change can be found at
checkpoint-restore/criu#2613

Signed-off-by: David Francis <David.Francis@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 3, 2025
Add new GEM_OP_IOCTL option GET_MAPPING_INFO, which
returns a list of mappings associated with a given bo, along with
their positions and offsets.

Userspace for this and the previous change can be found at
checkpoint-restore/criu#2613

Signed-off-by: David Francis <David.Francis@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Nov 9, 2025
Add new ioctl DRM_IOCTL_AMDGPU_GEM_LIST_HANDLES.

This ioctl returns a list of bos with their handles, sizes,
and flags and domains.

This ioctl is meant to be used during CRIU checkpoint and
provide information needed to reconstruct the bos
in CRIU restore.

Userspace for this and the next change can be found at
checkpoint-restore/criu#2613

Signed-off-by: David Francis <David.Francis@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
lutzbichler pushed a commit to lutzbichler/drm-kmod that referenced this pull request Nov 9, 2025
Add new GEM_OP_IOCTL option GET_MAPPING_INFO, which
returns a list of mappings associated with a given bo, along with
their positions and offsets.

Userspace for this and the previous change can be found at:
checkpoint-restore/criu#2613

Signed-off-by: David Francis <David.Francis@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-auto-close Don't auto-close as a stale issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants