intel_gpu_plugin fails to start: Failed to serve gpu.intel.com/i915: too many open files #2075

@clambin

Describe the bug
Today I upgraded my server to kernel 6.11.0. Now the intel_gpu_plugin pod crashes on startup:

I0620 19:14:07.738018       1 gpu_plugin.go:799] GPU device plugin started with none preferred allocation policy
I0620 19:14:07.738781       1 gpu_plugin.go:518] GPU (i915/xe) resource share count = 1
I0620 19:14:07.746264       1 gpu_plugin.go:540] GPU scan update: 0->1 'i915' resources found
I0620 19:14:07.746274       1 gpu_plugin.go:540] GPU scan update: 0->1 'i915_monitoring' resources found
I0620 19:14:08.748712       1 server.go:285] Start server for i915 at: /var/lib/kubelet/device-plugins/gpu.intel.com-i915.sock
I0620 19:14:08.748928       1 server.go:285] Start server for i915_monitoring at: /var/lib/kubelet/device-plugins/gpu.intel.com-i915_monitoring.sock
I0620 19:14:08.935243       1 server.go:303] Device plugin for i915_monitoring registered
E0620 19:14:08.935267       1 manager.go:146] Failed to serve gpu.intel.com/i915_monitoring: too many open files
Failed to create watcher for /var/lib/kubelet/device-plugins/gpu.intel.com-i915_monitoring.sock
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.watchFile
        github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:325
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).setupAndServe
        github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:307
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).Serve
        github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:225
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).handleUpdate.func1
        github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:144
runtime.goexit
        runtime/asm_amd64.s:1700
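
The trace above fails inside watchFile while creating a file watcher. As a diagnostic aid only, and assuming the watcher is backed by Linux inotify (this report does not confirm that), note that inotify_init returns EMFILE ("too many open files") when the per-user inotify instance limit or the per-process file descriptor limit is reached. The sketch below prints the two node-wide inotify limits from their standard procfs location; the helper names parseLimit and readLimit are hypothetical and not part of the plugin.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// inotifyDir is the standard Linux procfs location of the inotify limits.
const inotifyDir = "/proc/sys/fs/inotify"

// parseLimit converts the textual content of a procfs limit file to an int.
// (Hypothetical helper, introduced only for this sketch.)
func parseLimit(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// readLimit reads one limit file, for example max_user_instances.
// (Hypothetical helper, introduced only for this sketch.)
func readLimit(name string) (int, error) {
	data, err := os.ReadFile(filepath.Join(inotifyDir, name))
	if err != nil {
		return 0, err
	}
	return parseLimit(string(data))
}

func main() {
	// Assumption: the failure is related to inotify limits; the report
	// itself does not establish the root cause.
	for _, name := range []string{"max_user_instances", "max_user_watches"} {
		v, err := readLimit(name)
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", name, err)
			continue
		}
		fmt.Printf("fs.inotify.%s = %d\n", name, v)
	}
}
```

Running this on the node under both the 6.8.0-60 and 6.11.0 kernels would show whether the limits differ between the two boots. If they are identical, the limit values themselves are probably not what changed.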

To Reproduce

  • Start with a working system on kernel 6.8.0-60.
  • Upgrade the kernel to 6.11.0 and let the intel_gpu_plugin pod start.

System (please complete the following information):

  • OS version: Ubuntu 24.04.2 LTS
  • Kernel version: 6.11.0-26-generic (HWE kernel)
  • Device plugins version: 0.32.1
  • Hardware info: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz w/ Intel Corporation Comet Lake UHD Graphics (rev 04)

Additional context
i915-related kernel log:

2025-06-20T09:21:04.929852+00:00 nuc1 kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
2025-06-20T09:21:04.929853+00:00 nuc1 kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=io+mem
2025-06-20T09:21:04.929857+00:00 nuc1 kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
2025-06-20T09:21:04.929858+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] [ENCODER:94:DDI B/PHY B] is disabled/in DSI mode with an ungated DDI clock, gate it
2025-06-20T09:21:04.929859+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
2025-06-20T09:21:04.929860+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] [ENCODER:107:DDI C/PHY C] is disabled/in DSI mode with an ungated DDI clock, gate it
2025-06-20T09:21:04.929861+00:00 nuc1 kernel: [drm] Initialized i915 1.6.0 20230929 for 0000:00:02.0 on minor 0
2025-06-20T09:21:04.929866+00:00 nuc1 kernel: i915 display info: display version: 9
2025-06-20T09:21:04.929867+00:00 nuc1 kernel: i915 display info: cursor_needs_physical: no
2025-06-20T09:21:04.929868+00:00 nuc1 kernel: i915 display info: has_cdclk_crawl: no
2025-06-20T09:21:04.929868+00:00 nuc1 kernel: i915 display info: has_cdclk_squash: no
2025-06-20T09:21:04.929869+00:00 nuc1 kernel: i915 display info: has_ddi: yes
2025-06-20T09:21:04.929869+00:00 nuc1 kernel: i915 display info: has_dp_mst: yes
2025-06-20T09:21:04.929873+00:00 nuc1 kernel: i915 display info: has_dsb: no
2025-06-20T09:21:04.929874+00:00 nuc1 kernel: i915 display info: has_fpga_dbg: yes
2025-06-20T09:21:04.929874+00:00 nuc1 kernel: i915 display info: has_gmch: no
2025-06-20T09:21:04.929875+00:00 nuc1 kernel: i915 display info: has_hotplug: yes
2025-06-20T09:21:04.929875+00:00 nuc1 kernel: i915 display info: has_hti: no
2025-06-20T09:21:04.929876+00:00 nuc1 kernel: i915 display info: has_ipc: yes
2025-06-20T09:21:04.929880+00:00 nuc1 kernel: i915 display info: has_overlay: no
2025-06-20T09:21:04.929881+00:00 nuc1 kernel: i915 display info: has_psr: yes
2025-06-20T09:21:04.929881+00:00 nuc1 kernel: i915 display info: has_psr_hw_tracking: yes
2025-06-20T09:21:04.929881+00:00 nuc1 kernel: i915 display info: overlay_needs_physical: no
2025-06-20T09:21:04.929882+00:00 nuc1 kernel: i915 display info: supports_tv: no
2025-06-20T09:21:04.929883+00:00 nuc1 kernel: i915 display info: has_hdcp: yes
2025-06-20T09:21:04.929887+00:00 nuc1 kernel: i915 display info: has_dmc: yes
2025-06-20T09:21:04.929888+00:00 nuc1 kernel: i915 display info: has_dsc: no
2025-06-20T09:21:04.929888+00:00 nuc1 kernel: i915 display info: rawclk rate: 24000 kHz
2025-06-20T09:21:04.929892+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T09:21:04.929893+00:00 nuc1 kernel: snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
2025-06-20T09:21:04.929906+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T09:21:04.929906+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T09:28:37.973545+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 4 times, consider switching to WQ_UNBOUND
2025-06-20T09:30:10.134450+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 8 times, consider switching to WQ_UNBOUND
2025-06-20T09:33:14.457483+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 16 times, consider switching to WQ_UNBOUND
2025-06-20T09:39:23.096440+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 32 times, consider switching to WQ_UNBOUND
2025-06-20T09:51:39.862434+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 64 times, consider switching to WQ_UNBOUND
2025-06-20T10:16:12.886390+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 128 times, consider switching to WQ_UNBOUND
2025-06-20T11:07:13.109457+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 256 times, consider switching to WQ_UNBOUND
2025-06-20T12:45:26.743434+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 512 times, consider switching to WQ_UNBOUND
2025-06-20T14:50:04.435332+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Found COMETLAKE/ULT (device ID 9bca) display version 9.00 stepping N/A
2025-06-20T14:50:04.435336+00:00 nuc1 kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
2025-06-20T14:50:04.435336+00:00 nuc1 kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=io+mem
2025-06-20T14:50:04.435337+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
2025-06-20T14:50:04.435339+00:00 nuc1 kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
2025-06-20T14:50:04.435339+00:00 nuc1 kernel: [drm] Initialized i915 1.6.0 for 0000:00:02.0 on minor 0
2025-06-20T14:50:04.435351+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T14:50:04.435352+00:00 nuc1 kernel: snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
2025-06-20T14:50:04.435358+00:00 nuc1 kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
2025-06-20T14:53:02.508879+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 4 times, consider switching to WQ_UNBOUND
2025-06-20T14:53:25.546924+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 5 times, consider switching to WQ_UNBOUND
2025-06-20T14:54:10.605907+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 7 times, consider switching to WQ_UNBOUND
2025-06-20T14:55:41.740937+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 11 times, consider switching to WQ_UNBOUND
2025-06-20T15:05:13.642903+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 19 times, consider switching to WQ_UNBOUND
2025-06-20T15:11:20.747894+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 35 times, consider switching to WQ_UNBOUND
2025-06-20T15:23:34.445898+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 67 times, consider switching to WQ_UNBOUND
2025-06-20T15:48:07.978925+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 131 times, consider switching to WQ_UNBOUND
2025-06-20T16:36:50.475924+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 259 times, consider switching to WQ_UNBOUND
2025-06-20T18:15:09.741901+00:00 nuc1 kernel: workqueue: i915_hpd_poll_init_work [i915] hogged CPU for >13333us 515 times, consider switching to WQ_UNBOUND
