Description
Problem Description
During restore, CRIU hangs indefinitely at the first wait4 in the compel_stop_on_syscall function. The process and thread states at that point:
TID PPID S COMMAND CMD
105409 105403 t payload /payload
105436 105403 t m_1 /payload
105437 105403 t m_2 /payload
105438 105403 S m_3 /payload
105427 105409 t payload /payload
105429 105409 t b_1 /payload
105430 105409 t b_2 /payload
105431 105409 t b_3 /payload
105428 105409 t payload /payload
105432 105409 t a_1 /payload
105433 105409 t a_2 /payload
105434 105409 t a_3 /payload
Some tasks (it can be any kind of task) are not interrupted by ptrace and remain in state “S”. The CRIU restore log:
(00.079924) 105434 was trapped
(00.079966) 105433 was trapped
(00.079998) 105434 was trapped
(00.080005) 105434 (native) is going to execute the syscall 139, required is 139
(00.080034) 105434 was stopped
(00.080039) 105433 was trapped
(00.080044) 105433 (native) is going to execute the syscall 139, required is 139
(00.080084) 105433 was stopped
(00.080109) 105432 was trapped
(00.080141) 105431 was trapped
(00.080170) 105432 was trapped
(00.080177) 105432 (native) is going to execute the syscall 139, required is 139
(00.080214) 105432 was stopped
(00.080221) 105431 was trapped
(00.080225) 105431 (native) is going to execute the syscall 139, required is 139
(00.080250) 105431 was stopped
(00.080255) 105428 was trapped
(00.080277) 105430 was trapped
(00.080310) 105428 was trapped
(00.080314) 105428 (native) is going to execute the syscall 139, required is 139
(00.080377) 105428 was stopped
(00.080384) 105430 was trapped
(00.080389) 105430 (native) is going to execute the syscall 139, required is 139
(00.080420) 105430 was stopped
(00.080425) 105429 was trapped
(00.080457) 105427 was trapped
(00.080490) 105429 was trapped
(00.080495) 105429 (native) is going to execute the syscall 139, required is 139
(00.080528) 105429 was stopped
(00.080535) 105427 was trapped
(00.080539) 105427 (native) is going to execute the syscall 139, required is 139
(00.080566) 105427 was stopped
(00.080572) 105437 was trapped
(00.080597) 105436 was trapped
(00.080629) 105437 was trapped
(00.080633) 105437 (native) is going to execute the syscall 139, required is 139
(00.080658) 105437 was stopped
(00.080665) 105436 was trapped
(00.080670) 105436 (native) is going to execute the syscall 139, required is 139
(00.080698) 105436 was stopped
(00.080703) 105409 was trapped
(00.080754) 105409 was trapped
(00.080759) 105409 (native) is going to execute the syscall 139, required is 139
(00.080788) 105409 was stopped
When I forcibly sent SIGTRAP to the task in the “S” state (the thread “m_3”, TID 105438), it passed the first wait4 but then stopped at the next wait4, after the is_required_syscall check. That check does not pass because the task is not entering the rt_sigreturn syscall; it is entering some other syscall instead:
(177.49258) 105438 was trapped
(177.49267) 105438 was trapped
(177.49268) 105438 (native) is going to execute the syscall 98, required is 139
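To make the two wait points easier to follow, here is a minimal sketch of the decision taken after each wait4 in a stop-on-syscall loop. It is not CRIU's code: the enum, the function name classify_stop, and the masking of bit 0x80 (which a tracer sees only if it enabled PTRACE_O_TRACESYSGOOD) are assumptions made for illustration. For reference, in the generic Linux syscall table used by aarch64, 139 is rt_sigreturn and 98 is futex, so if this log comes from the aarch64 host, the forced trap caught the thread on a futex call.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical outcome labels for one iteration of a stop-on-syscall loop. */
enum stop_verdict {
    VERDICT_NOT_TRAPPED,   /* task exited, or stopped for another reason */
    VERDICT_WRONG_SYSCALL, /* trapped, but on a different syscall */
    VERDICT_REQUIRED       /* trapped on the required syscall */
};

/*
 * Classify a wait4() status together with the syscall number read from the
 * tracee's registers. With PTRACE_O_TRACESYSGOOD, syscall stops are reported
 * as SIGTRAP | 0x80, so that bit is masked off before the comparison.
 */
enum stop_verdict classify_stop(int status, long nr, long required)
{
    if (!WIFSTOPPED(status) || (WSTOPSIG(status) & ~0x80) != SIGTRAP)
        return VERDICT_NOT_TRAPPED;
    return nr == required ? VERDICT_REQUIRED : VERDICT_WRONG_SYSCALL;
}
```

With the values from the log above, classify_stop(status, 98, 139) yields VERDICT_WRONG_SYSCALL for a trapped task, which corresponds to the tracer going back to wait4 and blocking again.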
When I restore on the x86_64 machine, the process states at compel_stop_on_syscall look like this:
TID PPID S COMMAND CMD
7290 7283 t payload /payload
7317 7283 t m_1 /payload
7318 7283 t m_2 /payload
7319 7283 t m_3 /payload
7308 7290 t payload /payload
7313 7290 t b_1 /payload
7314 7290 t b_2 /payload
7315 7290 t b_3 /payload
7309 7290 t payload /payload
7310 7290 t a_1 /payload
7311 7290 t a_2 /payload
7312 7290 t a_3 /payload
All tasks stop as expected, there is no endless wait, and the restore succeeds.
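The state column in the listings above can also be checked programmatically while the restore is hanging. The sketch below reads the state letter from /proc/<pid>/task/<tid>/stat; the helper names stat_state and thread_state are hypothetical and are not part of CRIU or of the test application. A thread under ptrace stop reports “t”, while one that was never interrupted reports “S”.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the state letter from one /proc/<pid>/task/<tid>/stat line.
 * The comm field is wrapped in parentheses and may itself contain spaces
 * or ')', so the state is located after the LAST ')'.
 * Returns '\0' on malformed input.
 */
char stat_state(const char *line)
{
    const char *p = strrchr(line, ')');

    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '\0';
    return p[2];
}

/* Read the state of one thread; returns '\0' if the stat file is unreadable. */
char thread_state(int pid, int tid)
{
    char path[64], line[512];
    char state = '\0';
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", pid, tid);
    f = fopen(path, "r");
    if (f == NULL)
        return '\0';
    if (fgets(line, sizeof(line), f) != NULL)
        state = stat_state(line);
    fclose(f);
    return state;
}
```

Calling thread_state for every TID of the restored tree and reporting any result other than 't' would point directly at the thread that ptrace did not stop.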
Environment
| OS | Arch | Kernel | CRIU | Crun | Podman |
|---|---|---|---|---|---|
| Ubuntu 22.04 | x86_64 | 5.15.0-25-generic | 4.1.1 | 1.4.3 | 3.3.4 |
| Ubuntu 22.04 | aarch64 | 5.15.0-25-generic | 4.1.1 | 1.4.3 | 3.3.4 |
| Fedora 41 | x86_64 | 6.13.10-200.fc41 | 4.1.1 | 1.17 | 5.2.5 |
| Fedora 41 | aarch64 | 6.11.4-301.fc41 | 4.1.1 | 1.17 | 5.2.5 |
CRIU settings in /etc/criu/default.conf:
shell-job
skip-in-flight
tcp-close
file-locks
The test application (payload) runs inside a Podman container.
The container images match the host OS versions. Each image was rebuilt only to add the test application binary and set it as the container entrypoint.
The checkpoint and restore commands:
podman container checkpoint --keep --tcp-established --export payload-test.tar.gz payload-test
podman container restore --keep --tcp-established --import payload-test.tar.gz
The test application source is here: payload2.c
The application is written in C and built with the command: gcc -o payload payload2.c -lpthread
Test application concept
The application creates two child processes using fork. Each process, including the parent, creates three threads.
Parent process thread names: m_1, m_2, m_3;
Child process A thread names: a_1, a_2, a_3;
Child process B thread names: b_1, b_2, b_3;
The first threads (m_1, a_1, b_1) use epoll to poll the state of the process's Unix sockets. When epoll fires, the name of the triggering thread is printed to the console.
Threads m_2, a_3, b_2 send data (write system call);
Threads m_3, a_2, b_3 receive data (read system call);
The data transmitted through sockets is the names of the sending threads.
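The exchange described above can be sketched in a reduced form. This is not payload2.c: it runs in a single process with no fork, uses one sender thread, and lets the caller play both the epoll and the receiver roles. The names sender_arg, sender_main, and exchange_once are hypothetical, and the use of prctl(PR_SET_NAME) for thread naming and the 5-second timeout are assumptions.

```c
#include <pthread.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <unistd.h>

struct sender_arg {
    int fd;            /* socket end the sender writes to */
    const char *name;  /* thread name, which is also the payload */
};

/* Sender thread: name itself, then write that name into the socket. */
static void *sender_main(void *p)
{
    struct sender_arg *arg = p;
    ssize_t r;

    prctl(PR_SET_NAME, arg->name);  /* visible in the ps COMMAND column */
    r = write(arg->fd, arg->name, strlen(arg->name));
    (void)r;
    return NULL;
}

/*
 * One round of the exchange: a sender thread writes its name, the caller
 * waits on epoll until the socket is readable, then reads the data.
 * Returns the number of bytes received, or -1 on error or timeout.
 */
int exchange_once(const char *sender_name, char *out, size_t outlen)
{
    int sv[2], ep = -1;
    ssize_t n = -1;
    pthread_t tid;
    struct sender_arg arg;
    struct epoll_event ev = { .events = EPOLLIN }, fired;

    if (outlen == 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    arg.fd = sv[0];
    arg.name = sender_name;
    ev.data.fd = sv[1];

    ep = epoll_create1(0);
    if (ep < 0 || epoll_ctl(ep, EPOLL_CTL_ADD, sv[1], &ev) < 0)
        goto out;
    if (pthread_create(&tid, NULL, sender_main, &arg) != 0)
        goto out;

    /* Wait up to 5 s for the sender's data, then read it. */
    if (epoll_wait(ep, &fired, 1, 5000) == 1) {
        n = read(fired.data.fd, out, outlen - 1);
        if (n >= 0)
            out[n] = '\0';
    }
    pthread_join(tid, NULL);
out:
    if (ep >= 0)
        close(ep);
    close(sv[0]);
    close(sv[1]);
    return (int)n;
}
```

For example, exchange_once("m_2", buf, sizeof(buf)) leaves the sender's name in buf. On toolchains where pthreads is a separate library, the sketch needs the same -lpthread flag as the original build command.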