A number of users on the forum are reporting kernel panics on baremetal installs, or lockups of guests on virtualisation platforms, when running the DS3615xs image specifically. The crash is typically triggered by running docker in general, or certain docker images, but high IO load may also have caused it in some circumstances. A common factor is the use of databases (notably influxdb, mariadb, mysql and elasticsearch), although nginx and jdownloader2 have also been reported.
This has been observed on baremetal HP Gen7 and Gen8 servers, proxmox and ESXi, with a variety of Xeon CPUs (E3-1265L V2, E3-1270 V2, E3-1241 V3, E3-1220L V2 and E3-1265L V4) as well as Celeron and AMD CPUs.
Most users are on DSM 7.0.1-RC1, but I also observed this behaviour on DSM 6.2.4.
(edit: also confirmed to affect 7.0 beta and the final 7.0.1 release, i.e. not the release candidate)
Conversely, a number of users with DS918+ images have reported no issues running docker or the known problematic images (on DS3615xs, influxdb causes a 100% reproducible crash in my case).
On my baremetal HP Gen8 running 6.2.4 I get the following console output before a reboot:
[ 191.452302] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[ 191.487637] CPU: 3 PID: 19775 Comm: containerd-shim Tainted: PF O 3.10.105 #25556
[ 191.528112] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[ 191.562597] ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109ac8d58
[ 191.599118] ffff880109ac8cf0 0000000000000000 0000000000000003 000000000000002c
[ 191.634943] 0000000000000003 ffffffff80000001 0000000000000010 ffff880103817c00
[ 191.670604] Call Trace:
[ 191.682506] <NMI> [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[ 191.710494] [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[ 191.735108] [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[ 191.768203] [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[ 191.799789] [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[ 191.834349] [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[ 191.865505] [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[ 191.897683] [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[ 191.922372] [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[ 191.950899] <<EOE>>
[ 191.961095] Rebooting in 3 seconds..
Others on baremetal see the same panic when using docker. Virtualisation platform users instead see 100% CPU usage on their xpenology guest, which becomes unresponsive and has to be restarted. The majority of kernel panics cite containerd-shim as being at fault, but occasionally the panic lists a process running inside a docker container (notably influxdb in my case).
The panic is notably similar to an issue logged for RHEL a number of years ago, which Red Hat note was fixed in a subsequent kernel release:
https://access.redhat.com/solutions/1354963
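For anyone collecting console output to compare which process each panic cites, a minimal Python sketch along these lines may help. It only assumes the `CPU: ... PID: ... Comm: ...` header format shown in the log above; the function name `panic_process` and the regular expression are my own illustration, not part of any existing tool, and the pattern assumes the process name contains no spaces.

```python
import re

# Matches the "CPU: <n> PID: <n> Comm: <name>" header line of a panic report.
# Assumption: the process name (Comm) contains no whitespace.
PANIC_PROC = re.compile(
    r"CPU:\s*(?P<cpu>\d+)\s+PID:\s*(?P<pid>\d+)\s+Comm:\s*(?P<comm>\S+)"
)


def panic_process(log_text):
    """Return (cpu, pid, comm) from the first panic header found, or None."""
    match = PANIC_PROC.search(log_text)
    if match is None:
        return None
    return int(match.group("cpu")), int(match.group("pid")), match.group("comm")


# Two lines copied from the console output in this report.
sample = (
    "[ 191.452302] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3\n"
    "[ 191.487637] CPU: 3 PID: 19775 Comm: containerd-shim Tainted: PF O 3.10.105 #25556\n"
)

print(panic_process(sample))
```

Running this over several captured logs and tallying the `comm` value would show how often containerd-shim is cited compared with a process inside a container.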