
CRIU hangs at the final steps of restore on aarch64 #2720

@juranvrsk


Problem Description
During restore, CRIU waits forever at the first wait4 call in the compel_stop_on_syscall function. The states of the processes and threads at that point:

    TID    PPID S COMMAND         CMD                                                                                                                                                                                                                                       
 105409  105403 t payload         /payload                                                                                                                                                                                                                                  
 105436  105403 t m_1             /payload                                                                                                                                                                                                                                  
 105437  105403 t m_2             /payload                                                                                                                                                                                                                                  
 105438  105403 S m_3             /payload                                                                                                                                                                                                                                  
 105427  105409 t payload         /payload                                                                                                                                                                                                                                  
 105429  105409 t b_1             /payload                                                                                                                                                                                                                                  
 105430  105409 t b_2             /payload                                                                                                                                                                                                                                  
 105431  105409 t b_3             /payload                                                                                                                                                                                                                                  
 105428  105409 t payload         /payload                                                                                                                                                                                                                                  
 105432  105409 t a_1             /payload                                                                                                                                                                                                                                  
 105433  105409 t a_2             /payload                                                                                                                                                                                                                                  
 105434  105409 t a_3             /payload 

Some tasks (it can be any kind of task) are not interrupted by ptrace and stay in state "S". The CRIU restore log:

(00.079924) 105434 was trapped                                                                                                                                                                                                                                              
(00.079966) 105433 was trapped                                                                                                                                                                                                                                              
(00.079998) 105434 was trapped                                                                                                                                                                                                                                              
(00.080005) 105434 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080034) 105434 was stopped                                                                                                                                                                                                                                              
(00.080039) 105433 was trapped                                                                                                                                                                                                                                              
(00.080044) 105433 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080084) 105433 was stopped                                                                                                                                                                                                                                              
(00.080109) 105432 was trapped                                                                                                                                                                                                                                              
(00.080141) 105431 was trapped                                                                                                                                                                                                                                              
(00.080170) 105432 was trapped                                                                                                                                                                                                                                              
(00.080177) 105432 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080214) 105432 was stopped                                                                                                                                                                                                                                              
(00.080221) 105431 was trapped                                                                                                                                                                                                                                              
(00.080225) 105431 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080250) 105431 was stopped                                                                                                                                                                                                                                              
(00.080255) 105428 was trapped                                                                                                                                                                                                                                              
(00.080277) 105430 was trapped                                                                                                                                                                                                                                              
(00.080310) 105428 was trapped                                                                                                                                                                                                                                              
(00.080314) 105428 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080377) 105428 was stopped                                                                                                                                                                                                                                              
(00.080384) 105430 was trapped                                                                                                                                                                                                                                              
(00.080389) 105430 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080420) 105430 was stopped                                                                                                                                                                                                                                              
(00.080425) 105429 was trapped                                                                                                                                                                                                                                              
(00.080457) 105427 was trapped                                                                                                                                                                                                                                              
(00.080490) 105429 was trapped                                                                                                                                                                                                                                              
(00.080495) 105429 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080528) 105429 was stopped                                                                                                                                                                                                                                              
(00.080535) 105427 was trapped                                                                                                                                                                                                                                              
(00.080539) 105427 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080566) 105427 was stopped                                                                                                                                                                                                                                              
(00.080572) 105437 was trapped                                                                                                                                                                                                                                              
(00.080597) 105436 was trapped                                                                                                                                                                                                                                              
(00.080629) 105437 was trapped                                                                                                                                                                                                                                              
(00.080633) 105437 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080658) 105437 was stopped                                                                                                                                                                                                                                              
(00.080665) 105436 was trapped                                                                                                                                                                                                                                              
(00.080670) 105436 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080698) 105436 was stopped                                                                                                                                                                                                                                              
(00.080703) 105409 was trapped                                                                                                                                                                                                                                              
(00.080754) 105409 was trapped                                                                                                                                                                                                                                              
(00.080759) 105409 (native) is going to execute the syscall 139, required is 139                                                                                                                                                                                            
(00.080788) 105409 was stopped 
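The log has no line for 105438, the thread still shown in state "S". One way to confirm which task never entered tracing stop ("t") is to read the state field from /proc directly. The following is a minimal sketch, not CRIU code: the helper names state_from_stat_line and task_state are hypothetical, and it assumes the usual /proc/<pid>/task/<tid>/stat layout of "pid (comm) state ...".

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: extract the one-letter state from a
 * /proc/<pid>/task/<tid>/stat line such as "105438 (m_3) S 105403 ...".
 * The comm field may itself contain spaces or parentheses, so the
 * state is located relative to the LAST ')' in the line.
 * Returns '?' if the line does not have the expected shape.
 */
static char state_from_stat_line(const char *line)
{
    const char *p = strrchr(line, ')');

    if (!p || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}

/*
 * Hypothetical helper: read the state of one thread.
 * pid is the thread group id, tid is the thread id.
 * Returns '?' if the stat file cannot be read.
 */
static char task_state(pid_t pid, pid_t tid)
{
    char path[64];
    char buf[512];
    char *ok;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", (int)pid, (int)tid);
    f = fopen(path, "r");
    if (!f)
        return '?';
    ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    return ok ? state_from_stat_line(buf) : '?';
}
```

With the IDs from the listing above, task_state(105409, 105438) would be expected to return 'S' while the other threads return 't'.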

When I forcibly sent SIGTRAP to the task in the "S" state (the thread "m_3", TID 105438), it passed the first wait4 and then stopped at the next wait4, which follows the is_required_syscall check. That check does not pass, because the task does not reach the rt_sigreturn syscall (it executes some other syscall instead):

(177.49258) 105438 was trapped                                                                                                                                                                                                                                              
(177.49267) 105438 was trapped                                                                                                                                                                                                                                              
(177.49268) 105438 (native) is going to execute the syscall 98, required is 139 
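For reference when reading these lines: aarch64 uses the generic Linux syscall table, where 139 is rt_sigreturn and 98 is futex. The sketch below is only a hand-copied lookup for a few numbers. The table and the helper name aarch64_syscall_name are illustrative assumptions and are not part of CRIU.

```c
#include <stddef.h>

/* One entry of the illustrative lookup table. */
struct sys_name {
    long nr;
    const char *name;
};

/*
 * A few entries hand-copied from the generic Linux syscall table
 * used by aarch64. Deliberately incomplete; for log decoding only.
 */
static const struct sys_name aarch64_sys_names[] = {
    { 63,  "read" },
    { 64,  "write" },
    { 98,  "futex" },
    { 139, "rt_sigreturn" },
};

/*
 * Hypothetical helper: map an aarch64 syscall number to its name.
 * Returns "unknown" for numbers that are not in the table above.
 */
static const char *aarch64_syscall_name(long nr)
{
    size_t i;
    size_t n = sizeof(aarch64_sys_names) / sizeof(aarch64_sys_names[0]);

    for (i = 0; i < n; i++) {
        if (aarch64_sys_names[i].nr == nr)
            return aarch64_sys_names[i].name;
    }
    return "unknown";
}
```

Read this way, the last log line says that thread 105438 was about to enter futex while the restorer was waiting for rt_sigreturn.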

When I restore on the x86_64 machine, the process states at compel_stop_on_syscall look like this:

    TID    PPID S COMMAND         CMD                                                                                                                                                                                                                                       
   7290    7283 t payload         /payload                                                                                                                                                                                                                                  
   7317    7283 t m_1             /payload                                                                                                                                                                                                                                  
   7318    7283 t m_2             /payload                                                                                                                                                                                                                                  
   7319    7283 t m_3             /payload                                                                                                                                                                                                                                  
   7308    7290 t payload         /payload                                                                                                                                                                                                                                  
   7313    7290 t b_1             /payload                                                                                                                                                                                                                                  
   7314    7290 t b_2             /payload                                                                                                                                                                                                                                  
   7315    7290 t b_3             /payload                                                                                                                                                                                                                                  
   7309    7290 t payload         /payload                                                                                                                                                                                                                                  
   7310    7290 t a_1             /payload
   7311    7290 t a_2             /payload                                                                                                                                                                                                                                  
   7312    7290 t a_3             /payload 

All tasks are stopped correctly, there is no endless wait, and the restore succeeds.

Environment

OS            Arch     Kernel             CRIU   Crun   Podman
Ubuntu 22.04  x86_64   5.15.0-25-generic  4.1.1  1.4.3  3.3.4
Ubuntu 22.04  aarch64  5.15.0-25-generic  4.1.1  1.4.3  3.3.4
Fedora 41     x86_64   6.13.10-200.fc41   4.1.1  1.17   5.2.5
Fedora 41     aarch64  6.11.4-301.fc41    4.1.1  1.17   5.2.5

CRIU settings in /etc/criu/default.conf:

shell-job
skip-in-flight
tcp-close
file-locks

The test application (payload) runs in a Podman container.
The images used to create the containers match the OS versions listed above. Each image was only rebuilt to add the test application binary and set it as the container entrypoint.
The checkpoint and restore commands:

podman container checkpoint --keep --tcp-established --export payload-test.tar.gz payload-test
podman container restore --keep --tcp-established --import payload-test.tar.gz

The test application source is here: payload2.c
The application is written in C and built with the command: gcc -o payload payload2.c -lpthread

Test application design
The application creates two child processes using fork. Each process, including the parent, creates three threads.
Parent process thread names: m_1, m_2, m_3;
Child process A thread names: a_1, a_2, a_3;
Child process B thread names: b_1, b_2, b_3;
The first threads (m_1, a_1, b_1) work with epoll, polling the state of the process's Unix sockets. When epoll triggers, the name of the triggering thread is printed to the console.
Threads m_2, a_3, b_2 send data (write system call);
Threads m_3, a_2, b_3 receive data (read system call);
The data transmitted through the sockets is the name of the sending thread.
