Reporting SMP malfunction in ARM64 environment #93545

@dylan-hong-nc

Describe the bug

Hello, I would like to report a critical issue with SMP on ARM64.
This issue came out of the following Q&A:
Title: Problems SMP an new SoC(ARM64: Cortex-A53) based board. #92278

The problematic part is the _isr_wrapper function in arch/arm64/core/isr_wrapper.S.

Image

On line 81 of the figure, the upper bound of the allowed IRQ numbers is loaded into the x1 register and compared with x0.
Assuming 'CONFIG_NUM_IRQS' is 256, the value loaded into x1 is 255.

The x0 register holds the value read from the GICC_IAR register. GICC_IAR returns the CPUID in bits [12:10] and the InterruptID in bits [9:0]. For an SGI, the CPUID field identifies the CPU that requested the interrupt.
For example, if the CPUID of the acknowledged interrupt is 0 and the InterruptID is 2, the value returned in x0 is 2.
In this case, cmp x0, x1 finds x0 < x1, so the branch to 'spurious_continue' is not taken and the ISR handler is called.

However, if the CPUID of the acknowledged interrupt is 1 and the InterruptID is 2, the value returned in x0 is 1026, i.e. (1 << 10) | 2.
In this case, cmp x0, x1 finds x0 > x1, so the ISR handler is not called and execution branches to 'spurious_continue'.

As a result, the IPI (SGI) handler is called only when the CPUID field is 0 (CPU0), which causes a scheduling deadlock.
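
The arithmetic above can be checked with a small C sketch. It only models the comparison described in this report and is not the code from isr_wrapper.S; `NUM_IRQS` is a hypothetical stand-in for 'CONFIG_NUM_IRQS' with the value 256 assumed above, and the helper names `iar_encode` and `raw_check_dispatches` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GICv2 GICC_IAR layout as described above: CPUID in [12:10], InterruptID in [9:0]. */
#define IAR_INTID_MASK  0x3ffu
#define IAR_CPUID_SHIFT 10
#define IAR_CPUID_MASK  0x7u

/* Hypothetical stand-in for CONFIG_NUM_IRQS, using the value assumed in the text. */
#define NUM_IRQS 256u

/* Build the value GICC_IAR would return for a given CPUID and InterruptID. */
static inline uint32_t iar_encode(uint32_t cpuid, uint32_t intid)
{
	assert(cpuid <= IAR_CPUID_MASK && intid <= IAR_INTID_MASK);
	return (cpuid << IAR_CPUID_SHIFT) | intid;
}

/* Models the check described above: the whole register value is compared
 * with NUM_IRQS - 1, and anything higher is treated as spurious. */
static inline bool raw_check_dispatches(uint32_t iar)
{
	return iar <= NUM_IRQS - 1u;
}
```

With these helpers, `iar_encode(0, 2)` gives 2 and passes the check, while `iar_encode(1, 2)` gives 1026 and is rejected even though the InterruptID is the same.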

I modified the code as follows so that x0 is backed up and only the InterruptID is compared with x1.
x0 is backed up because the value written to 'GICC_EOIR' in the arm_gic_eoi function, called on line 105, must include the CPUID.

Image
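
For readers without access to the image, the idea can be sketched in C as follows. This is an illustrative model under the same assumptions as the text (GICv2 field layout, 'CONFIG_NUM_IRQS' of 256), not the actual assembly patch; `struct irq_decision`, `decide`, and `NUM_IRQS` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GICv2 GICC_IAR: InterruptID occupies bits [9:0]. */
#define IAR_INTID_MASK 0x3ffu

/* Hypothetical stand-in for CONFIG_NUM_IRQS, using the value assumed in the text. */
#define NUM_IRQS 256u

struct irq_decision {
	bool dispatch;      /* true if the ISR for intid should be called */
	uint32_t intid;     /* InterruptID only, used for the bounds check */
	uint32_t eoi_value; /* unmodified IAR value, to be written to GICC_EOIR */
};

/* Bounds-check only the InterruptID, but keep the full IAR value
 * (including the CPUID bits) for the end-of-interrupt write. */
static inline struct irq_decision decide(uint32_t iar)
{
	struct irq_decision d;

	d.intid = iar & IAR_INTID_MASK;
	d.dispatch = d.intid <= NUM_IRQS - 1u;
	d.eoi_value = iar;

	/* Invariant: masking the saved value yields the checked InterruptID. */
	assert((d.eoi_value & IAR_INTID_MASK) == d.intid);
	return d;
}
```

In this model, an IAR value of 1026 (CPUID 1, InterruptID 2) is dispatched as interrupt 2, and 1026 is still the value kept for the GICC_EOIR write.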

With this change and the 'TICKET_SPINLOCKS' option enabled, I verified that the SMP TESTSUITE runs normally, as shown in 'Relevant log output'.

Please review and apply it.
(It was tested only in a GICv2 environment, so it still needs to be confirmed that it works correctly with GICv3.)

Thank you.

※Note
I have also attached my Kconfig and devicetree files.
zephyr.zip

Regression

  • This is a regression.

Steps to reproduce

No response

Relevant log output

NOTICE:  DRAM_FREQ: 1866 [MHz]
NOTICE:  EXT OSC     : 24 MHz
NOTICE:  DDR DENSITY : 2048 MB
NOTICE:  BL31: v2.10.4(release):v2.7.0-3028-g9fa250ea9-dirty
NOTICE:  BL31: Built : 17:22:41, Jul 22 2025
NOTICE:  APACHE5 SIP SVC Version 0.8
NOTICE:  verify header done - rsd: 2, sclk: 67000000, pol(H)-pha(H), quad
NOTICE:  leave nc_loader_read_and_verify_header...


U-Boot 2025.04-00012-g893965a34251 (Jul 18 2025 - 12:02:05 +0900)

DRAM:  2 GiB
Core:  39 devices, 16 uclasses, devicetree: fit
MMC:   mmc@34000000: 0, mmc@35000000: 1
Loading Environment from SPIFlash... default rx sample delay: 2
SF: Detected mt25ql512a with page size 256 Bytes, erase size 4 KiB, total 64 MiB
OK
In:    serial
Out:   serial
Err:   serial
Net:   eth0: gmac0@43020000
Hit any key to stop autoboot:  0
PHY Reset
MEGACHIPS MAAE1003S ID OK : 0x1a d414
Using gmac0@43020000 device
TFTP from server 192.168.13.31; our IP address is 192.168.13.35
Filename 'dylan7h/zephyr/zephyr.bin'.
Load address: 0x80000000
Loading: #######################
         3.8 MiB/s
done
Bytes transferred = 115408 (1c2d0 hex)
## Starting application at 0x80000000 ...
*** Booting Zephyr OS build v4.2.0-60-gf3f03ce13f25 ***
Secondary CPU core 1 (MPID:0x1) is up
Secondary CPU core 2 (MPID:0x2) is up
Secondary CPU core 3 (MPID:0x3) is up
Running TESTSUITE smp
===================================================================
START - test_coop_resched_threads
 PASS - test_coop_resched_threads in 0.051 seconds
===================================================================
START - test_coop_switch_in_abort
 PASS - test_coop_switch_in_abort in 0.201 seconds
===================================================================
START - test_cpu_id_threads
 PASS - test_cpu_id_threads in 1.001 seconds
===================================================================
START - test_fatal_on_smp
[00:00:01.276,000] <err> os: ELR_ELn: 0x00000000800014e8
[00:00:01.277,000] <err> os: ESR_ELn: 0x0000000056000002
[00:00:01.278,000] <err> os:   EC:  0x15 (Unknown)
[00:00:01.279,000] <err> os:   IL:  0x1
[00:00:01.280,000] <err> os:   ISS: 0x2
[00:00:01.281,000] <err> os: TPIDRRO: 0x010000008002cf98
[00:00:01.282,000] <err> os: x0:  0x0000000000000000  x1:  0x0000000000000000
[00:00:01.284,000] <err> os: x2:  0x0000000000000000  x3:  0x0000000000000040
[00:00:01.285,000] <err> os: x4:  0x0000000000000001  x5:  0x0000000000000000
[00:00:01.287,000] <err> os: x6:  0x0000000000000000  x7:  0x0000000000000000
[00:00:01.288,000] <err> os: x8:  0x0000000000000003  x9:  0x0000000000000000
[00:00:01.290,000] <err> os: x10: 0x0000000080440990  x11: 0x0000000080003750
[00:00:01.292,000] <err> os: x12: 0x00000000804409e8  x13: 0x0000000000000000
[00:00:01.293,000] <err> os: x14: 0x0000000000000000  x15: 0xffffffffffffffff
[00:00:01.295,000] <err> os: x16: 0x0000000000000000  x17: 0x0000000000000000
[00:00:01.296,000] <err> os: x18: 0x0000000000000000  lr:  0x0000000080003748
[00:00:01.298,000] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:01.299,000] <err> os: Current thread: 0x8002ad80 (unknown)
[00:00:01.526,000] <err> os: ELR_ELn: 0x00000000800014e8
[00:00:01.527,000] <err> os: ESR_ELn: 0x0000000056000002
[00:00:01.528,000] <err> os:   EC:  0x15 (Unknown)
[00:00:01.529,000] <err> os:   IL:  0x1
[00:00:01.530,000] <err> os:   ISS: 0x2
[00:00:01.531,000] <err> os: TPIDRRO: 0x010000008002cfd8
[00:00:01.532,000] <err> os: x0:  0x0000000000000000  x1:  0x0000000000000000
[00:00:01.534,000] <err> os: x2:  0x0000000000000000  x3:  0x000000000000032b
[00:00:01.535,000] <err> os: x4:  0x0000000080018b69  x5:  0x0000000080017b3a
[00:00:01.537,000] <err> os: x6:  0x0000000000000000  x7:  0x0000000000000002
[00:00:01.538,000] <err> os: x8:  0x0000000000000003  x9:  0x0000000000000000
[00:00:01.540,000] <err> os: x10: 0x0000000080441590  x11: 0x0000000080003750
[00:00:01.541,000] <err> os: x12: 0x00000000804415e8  x13: 0x000000008002d000
[00:00:01.543,000] <err> os: x14: 0x00000000804414e0  x15: 0x0000000000000000
[00:00:01.545,000] <err> os: x16: 0x0000000000000001  x17: 0x00000000000000bb
[00:00:01.546,000] <err> os: x18: 0x000000008002b0e0  lr:  0x000000008000254c
[00:00:01.548,000] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 1
[00:00:01.549,000] <err> os: Current thread: 0x8002b0e0 (test_fatal_on_smp)
 PASS - test_fatal_on_smp in 0.276 seconds
===================================================================
START - test_get_cpu
 PASS - test_get_cpu in 0.051 seconds
===================================================================
START - test_inc_concurrency
type 0: cnt 60000, spend 3 ms
type 1: cnt 60000, spend 112 ms
type 2: cnt 60000, spend 113 ms
 PASS - test_inc_concurrency in 0.228 seconds
===================================================================
START - test_preempt_resched_threads
 PASS - test_preempt_resched_threads in 0.512 seconds
===================================================================
START - test_sleep_threads
 PASS - test_sleep_threads in 1.001 seconds
===================================================================
START - test_smp_coop_threads
 PASS - test_smp_coop_threads in 0.613 seconds
===================================================================
START - test_smp_ipi
cpu num=4 PASS - test_smp_ipi in 0.301 seconds
===================================================================
START - test_smp_release_global_lock
 PASS - test_smp_release_global_lock in 0.021 seconds
===================================================================
START - test_smp_switch_torture
 PASS - test_smp_switch_torture in 2.116 seconds
===================================================================
START - test_wakeup_threads
 PASS - test_wakeup_threads in 0.201 seconds
===================================================================
START - test_workq_on_smp
 PASS - test_workq_on_smp in 0.051 seconds
===================================================================
START - test_yield_threads
 PASS - test_yield_threads in 0.301 seconds
===================================================================
TESTSUITE smp succeeded

------ TESTSUITE SUMMARY START ------

SUITE PASS - 100.00% [smp]: pass = 15, fail = 0, skip = 0, total = 15 duration = 6.925 seconds
 - PASS - [smp.test_coop_resched_threads] duration = 0.051 seconds
 - PASS - [smp.test_coop_switch_in_abort] duration = 0.201 seconds
 - PASS - [smp.test_cpu_id_threads] duration = 1.001 seconds
 - PASS - [smp.test_fatal_on_smp] duration = 0.276 seconds
 - PASS - [smp.test_get_cpu] duration = 0.051 seconds
 - PASS - [smp.test_inc_concurrency] duration = 0.228 seconds
 - PASS - [smp.test_preempt_resched_threads] duration = 0.512 seconds
 - PASS - [smp.test_sleep_threads] duration = 1.001 seconds
 - PASS - [smp.test_smp_coop_threads] duration = 0.613 seconds
 - PASS - [smp.test_smp_ipi] duration = 0.301 seconds
 - PASS - [smp.test_smp_release_global_lock] duration = 0.021 seconds
 - PASS - [smp.test_smp_switch_torture] duration = 2.116 seconds
 - PASS - [smp.test_wakeup_threads] duration = 0.201 seconds
 - PASS - [smp.test_workq_on_smp] duration = 0.051 seconds
 - PASS - [smp.test_yield_threads] duration = 0.301 seconds

------ TESTSUITE SUMMARY END ------

===================================================================
PROJECT EXECUTION SUCCESSFUL

Impact

Not sure

Environment

  • HOST : ubuntu-20.04(x86_64)
  • Zephyr:
    • sdk-version: v0.17.0 and v0.17.2
    • version: v4.1.0 and v4.2.0
    • note: Zephyr RTOS runs in the non-secure world and brings up the secondary cores with PSCI via TF-A (BL31).
      For reference, the same TF-A is also used with Linux and with commercial RTOS software.
  • SoC:
    • vendor: Nextchip
    • model: nvs2900(apache5)
    • core: cortex-a53

Additional Context

No response
