Open
Describe the bug
Today I upgraded my server to kernel 6.11.0. Now the intel_gpu_plugin pod crashes on startup:
I0620 19:14:07.738018 1 gpu_plugin.go:799] GPU device plugin started with none preferred allocation policy
I0620 19:14:07.738781 1 gpu_plugin.go:518] GPU (i915/xe) resource share count = 1
I0620 19:14:07.746264 1 gpu_plugin.go:540] GPU scan update: 0->1 'i915' resources found
I0620 19:14:07.746274 1 gpu_plugin.go:540] GPU scan update: 0->1 'i915_monitoring' resources found
I0620 19:14:08.748712 1 server.go:285] Start server for i915 at: /var/lib/kubelet/device-plugins/gpu.intel.com-i915.sock
I0620 19:14:08.748928 1 server.go:285] Start server for i915_monitoring at: /var/lib/kubelet/device-plugins/gpu.intel.com-i915_monitoring.sock
I0620 19:14:08.935243 1 server.go:303] Device plugin for i915_monitoring registered
E0620 19:14:08.935267 1 manager.go:146] Failed to serve gpu.intel.com/i915_monitoring: too many open files
Failed to create watcher for /var/lib/kubelet/device-plugins/gpu.intel.com-i915_monitoring.sock
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.watchFile
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:325
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).setupAndServe
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:307
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).Serve
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:225
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).handleUpdate.func1
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:144
runtime.goexit
runtime/asm_amd64.s:1700
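A possible diagnostic, offered as a sketch rather than a confirmed root cause: if the watcher that fails here is inotify-based (an assumption; the log only says "Failed to create watcher"), then "too many open files" (EMFILE) can mean either that the per-user inotify instance limit or the per-process file descriptor limit was reached. The script below reads the inotify sysctl limits on the node and counts the inotify instances currently open in the processes it is allowed to inspect. The function names are hypothetical helpers, not part of the device plugin.

```python
import os
from pathlib import Path

# Standard Linux inotify sysctl files under /proc/sys/fs/inotify.
INOTIFY_KEYS = ("max_user_instances", "max_user_watches", "max_queued_events")


def read_inotify_limits(sysctl_dir="/proc/sys/fs/inotify"):
    """Return the inotify limits as ints; None where a file is missing or unreadable."""
    limits = {}
    for key in INOTIFY_KEYS:
        try:
            limits[key] = int(Path(sysctl_dir, key).read_text().strip())
        except (OSError, ValueError):
            limits[key] = None
    return limits


def count_inotify_instances(proc_root="/proc"):
    """Map PID -> number of open inotify descriptors, for processes visible to the caller."""
    counts = {}
    try:
        entries = list(Path(proc_root).iterdir())
    except OSError:
        return counts
    for entry in entries:
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            fds = list((entry / "fd").iterdir())
        except OSError:
            continue  # process exited or access was denied
        n = 0
        for fd in fds:
            try:
                # An inotify descriptor shows up as a link to "anon_inode:inotify".
                if os.readlink(fd) == "anon_inode:inotify":
                    n += 1
            except OSError:
                continue
        if n:
            counts[int(entry.name)] = n
    return counts


if __name__ == "__main__":
    print("limits:", read_inotify_limits())
    usage = count_inotify_instances()
    print("inotify instances in use (visible processes):", sum(usage.values()))
    # Show the ten processes holding the most inotify instances.
    for pid, n in sorted(usage.items(), key=lambda kv: -kv[1])[:10]:
        print(f"  pid {pid}: {n}")
```

Run as root on the affected node so that all processes are visible. If the total is at or near `max_user_instances`, comparing the same numbers under kernel 6.8.0-60 would show whether the upgrade changed the limit or the usage.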
To Reproduce
- have a working system on kernel 6.8.0-60
- upgrade kernel to 6.11.0
System (please complete the following information):
- OS version: Ubuntu 24.04.2 LTS
- Kernel version: 6.11.0-26-generic (HWE kernel)
- Device plugins version: 0.32.1
- Hardware info: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz w/ Intel Corporation Comet Lake UHD Graphics (rev 04)
Additional context
i915-related kernel log:
2025-06-20T09:21:04.929852+00:00 nuc1 kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
2025-06-20T09:21:04.929853+00:00 nuc1 kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=io+mem
2025-06-20T09:21:04.929857+00:00 nuc1 kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
2025-06-20T09:21:04.929858+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] [ENCODER:94:DDI B/PHY B] is disabled/in DSI mode with an ungated DDI clock, gate it
2025-06-20T09:21:04.929859+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
2025-06-20T09:21:04.929860+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] [ENCODER:107:DDI C/PHY C] is disabled/in DSI mode with an ungated DDI clock, gate it
2025-06-20T09:21:04.929861+00:00 nuc1 kernel: [drm] Initialized i915 1.6.0 20230929 for 0000:00:02.0 on minor 0
2025-06-20T09:21:04.929866+00:00 nuc1 kernel: i915 display info: display version: 9
2025-06-20T09:21:04.929867+00:00 nuc1 kernel: i915 display info: cursor_needs_physical: no
2025-06-20T09:21:04.929868+00:00 nuc1 kernel: i915 display info: has_cdclk_crawl: no
2025-06-20T09:21:04.929868+00:00 nuc1 kernel: i915 display info: has_cdclk_squash: no
2025-06-20T09:21:04.929869+00:00 nuc1 kernel: i915 display info: has_ddi: yes
2025-06-20T09:21:04.929869+00:00 nuc1 kernel: i915 display info: has_dp_mst: yes
2025-06-20T09:21:04.929873+00:00 nuc1 kernel: i915 display info: has_dsb: no
2025-06-20T09:21:04.929874+00:00 nuc1 kernel: i915 display info: has_fpga_dbg: yes
2025-06-20T09:21:04.929874+00:00 nuc1 kernel: i915 display info: has_gmch: no
2025-06-20T09:21:04.929875+00:00 nuc1 kernel: i915 display info: has_hotplug: yes
2025-06-20T09:21:04.929875+00:00 nuc1 kernel: i915 display info: has_hti: no
2025-06-20T09:21:04.929876+00:00 nuc1 kernel: i915 display info: has_ipc: yes
2025-06-20T09:21:04.929880+00:00 nuc1 kernel: i915 display info: has_overlay: no
2025-06-20T09:21:04.929881+00:00 nuc1 kernel: i915 display info: has_psr: yes
2025-06-20T09:21:04.929881+00:00 nuc1 kernel: i915 display info: has_psr_hw_tracking: yes
2025-06-20T09:21:04.929881+00:00 nuc1 kernel: i915 display info: overlay_needs_physical: no
2025-06-20T09:21:04.929882+00:00 nuc1 kernel: i915 display info: supports_tv: no
2025-06-20T09:21:04.929883+00:00 nuc1 kernel: i915 display info: has_hdcp: yes
2025-06-20T09:21:04.929887+00:00 nuc1 kernel: i915 display info: has_dmc: yes
2025-06-20T09:21:04.929888+00:00 nuc1 kernel: i915 display info: has_dsc: no
2025-06-20T09:21:04.929888+00:00 nuc1 kernel: i915 display info: rawclk rate: 24000 kHz
2025-06-20T09:21:04.929892+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T09:21:04.929893+00:00 nuc1 kernel: snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
2025-06-20T09:21:04.929906+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T09:21:04.929906+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T09:28:37.973545+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 4 times, consider switching to WQ_UNBOUND
2025-06-20T09:30:10.134450+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 8 times, consider switching to WQ_UNBOUND
2025-06-20T09:33:14.457483+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 16 times, consider switching to WQ_UNBOUND
2025-06-20T09:39:23.096440+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 32 times, consider switching to WQ_UNBOUND
2025-06-20T09:51:39.862434+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 64 times, consider switching to WQ_UNBOUND
2025-06-20T10:16:12.886390+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 128 times, consider switching to WQ_UNBOUND
2025-06-20T11:07:13.109457+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 256 times, consider switching to WQ_UNBOUND
2025-06-20T12:45:26.743434+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 512 times, consider switching to WQ_UNBOUND
2025-06-20T14:50:04.435332+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Found COMETLAKE/ULT (device ID 9bca) display version 9.00 stepping N/A
2025-06-20T14:50:04.435336+00:00 nuc1 kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
2025-06-20T14:50:04.435336+00:00 nuc1 kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=io+mem
2025-06-20T14:50:04.435337+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
2025-06-20T14:50:04.435339+00:00 nuc1 kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
2025-06-20T14:50:04.435339+00:00 nuc1 kernel: [drm] Initialized i915 1.6.0 for 0000:00:02.0 on minor 0
2025-06-20T14:50:04.435351+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T14:50:04.435352+00:00 nuc1 kernel: snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
2025-06-20T14:50:04.435358+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T14:53:02.508879+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 4 times, consider switching to WQ_UNBOUND
2025-06-20T14:53:25.546924+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 5 times, consider switching to WQ_UNBOUND
2025-06-20T14:54:10.605907+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 7 times, consider switching to WQ_UNBOUND
2025-06-20T14:55:41.740937+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 11 times, consider switching to WQ_UNBOUND
2025-06-20T15:05:13.642903+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 19 times, consider switching to WQ_UNBOUND
2025-06-20T15:11:20.747894+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 35 times, consider switching to WQ_UNBOUND
2025-06-20T15:23:34.445898+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 67 times, consider switching to WQ_UNBOUND
2025-06-20T15:48:07.978925+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 131 times, consider switching to WQ_UNBOUND
2025-06-20T16:36:50.475924+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 259 times, consider switching to WQ_UNBOUND
2025-06-20T18:15:09.741901+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 515 times, consider switching to WQ_UNBOUND