
[BUG] PID out of range due to API change of NVIDIA R525 driver #181

@xieshuaix

Description

Required prerequisites

  • I have read the documentation at https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker to confirm this hasn't already been reported (and commented there if it has).
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.5.3

Operating system and version

Ubuntu 20.04.5 LTS (Focal Fossa)

NVIDIA driver version

525.125.06

NVIDIA-SMI

Fri Aug 22 12:46:15 2025       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:AD:00.0 Off |                    0 |
| N/A   60C    P0   265W / 400W |  78988MiB / 81920MiB |     87%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   38C    P0   170W / 400W |  41012MiB / 81920MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   40C    P0   179W / 400W |  42182MiB / 81920MiB |     86%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
| N/A   45C    P0   171W / 400W |  41012MiB / 81920MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Python environment

3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] linux
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py @ file:///home/conda/feedstock_root/build_artifacts/nvidia-ml-py_1746576379096/work
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.9.86
nvidia-nvtx-cu12==12.1.105
nvitop @ file:///home/conda/feedstock_root/build_artifacts/nvitop_1755346934447/work
onnxruntime-gpu==1.19.0

Problem description

In my case, this bug occurs when I use supervisord to launch a system-level service that runs GPU code and then run nvitop. Killing the processes launched by supervisord makes the problem go away.
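
If the root cause is the R525 struct-layout change suggested by the issue title, the mechanism would be a size mismatch between the per-process record the driver fills in and the one the client library reads back. The ctypes sketch below uses hypothetical field layouts (not the real NVML headers) to show how a single extra field shifts every subsequent array element, so a later pid field is read from unrelated bytes:

import ctypes

class ProcInfoOld(ctypes.Structure):
    # Hypothetical 'old' per-process record layout.
    _fields_ = [('pid', ctypes.c_uint),
                ('usedGpuMemory', ctypes.c_ulonglong)]

class ProcInfoNew(ctypes.Structure):
    # Hypothetical 'new' layout with one extra trailing field.
    _fields_ = [('pid', ctypes.c_uint),
                ('usedGpuMemory', ctypes.c_ulonglong),
                ('extra', ctypes.c_uint)]

# The "driver" fills a buffer with two new-layout records.
new_arr = (ProcInfoNew * 2)()
new_arr[0].pid, new_arr[0].extra = 1234, 2529165312
new_arr[1].pid = 5678
buf = bytes(new_arr)

# The "client" reinterprets the same bytes with the old element size:
# element 1 starts 8 bytes too early, so its pid field lands on the
# extra field of element 0.
old_arr = (ProcInfoOld * 2).from_buffer_copy(buf)
print(old_arr[0].pid)  # 1234: the first element still lines up
print(old_arr[1].pid)  # 2529165312: the same kind of bogus PID as in the traceback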

Steps to Reproduce

In my environment, this can be reliably reproduced by using supervisord to launch a script that runs GPU code.
I am not sure whether it can be reproduced on other platforms.
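
To triage this independently of the nvitop TUI, a minimal probe against the NVML bindings (assuming nvidia-ml-py is importable as pynvml) can dump the raw PIDs the driver reports; on an affected setup it should print values far above any real PID:

import pynvml

pynvml.nvmlInit()
try:
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            # On an affected host, proc.pid can be a bogus value such as 2529165312.
            print(f'GPU {index}: pid={proc.pid}, usedGpuMemory={proc.usedGpuMemory}')
finally:
    pynvml.nvmlShutdown()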

Traceback

Traceback (most recent call last):
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/psutil/__init__.py", line 327, in _init
    _psplatform.cext.check_pid_range(pid)
OverflowError: signed integer is greater than maximum

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniforge3/envs/xs_stepfun/bin/nvitop", line 10, in <module>
    sys.exit(main())
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/cli.py", line 382, in main
    tui.print()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/tui.py", line 235, in print
    self.main_screen.print()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/__init__.py", line 191, in print
    print_width = min(panel.print_width() for panel in self.container)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/__init__.py", line 191, in <genexpr>
    print_width = min(panel.print_width() for panel in self.container)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 658, in print_width
    self.ensure_snapshots()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 356, in ensure_snapshots
    self.snapshots = self.take_snapshots()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/caching.py", line 220, in wrapped
    result = func(*args, **kwargs)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 360, in take_snapshots
    snapshots = GpuProcess.take_snapshots(self.processes, failsafe=True)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 409, in processes
    return list(
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 410, in <genexpr>
    itertools.chain.from_iterable(device.processes().values() for device in self.devices),  # type: ignore[misc]
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/device.py", line 2271, in processes
    proc = processes[p.pid] = self.GPU_PROCESS_CLASS(
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/library/process.py", line 29, in __new__
    instance = super().__new__(cls, *args, **kwargs)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/process.py", line 486, in __new__
    instance._host = HostProcess(pid)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/process.py", line 213, in __new__
    host.Process._init(instance, pid, True)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/psutil/__init__.py", line 330, in _init
    raise NoSuchProcess(pid, msg=msg) from err
psutil.NoSuchProcess: process PID out of range (pid=2529165312)
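
The PID in the last frame cannot belong to a real process: on 64-bit Linux the kernel caps pid_max at 2**22 (4194304), and 2529165312 is far beyond that. A quick check on the affected host confirms it:

with open('/proc/sys/kernel/pid_max') as f:
    pid_max = int(f.read())
print(pid_max)               # at most 4194304 on 64-bit Linux
print(2529165312 > pid_max)  # True: the reported PID cannot exist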

Logs

[DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.

Expected behavior

The exception should be handled gracefully, and nvitop should keep running, ignoring the processes that cause the exception.
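
One possible shape for that fallback (a sketch only, not nvitop's actual code, with a hypothetical helper name): resolve each host process defensively and drop entries whose PID cannot be resolved, instead of letting the exception propagate through the snapshot path.

import psutil

def host_process_or_none(pid):
    # Hypothetical helper: resolve a host process, or return None for
    # bogus or vanished PIDs. As the traceback shows, psutil converts the
    # OverflowError from its PID range check into NoSuchProcess.
    try:
        return psutil.Process(pid)
    except psutil.NoSuchProcess:
        return None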

Additional context

I am using nvitop in JupyterLab inside a Docker container.

Labels

  • api: Something related to the core APIs
  • bug: Something isn't working
  • pynvml: Something related to the `nvidia-ml-py` package
  • upstream: Something upstream related
