Required prerequisites
- I have read the documentation https://nvitop.readthedocs.io.
- I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
- I have tried the latest version of nvitop in a new isolated virtual environment.
What version of nvitop are you using?
1.5.3
Operating system and version
Ubuntu 20.04.5 LTS (Focal Fossa)
NVIDIA driver version
525.125.06
NVIDIA-SMI
Fri Aug 22 12:46:15 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:AD:00.0 Off |                    0 |
| N/A   60C    P0   265W / 400W |  78988MiB / 81920MiB |     87%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   38C    P0   170W / 400W |  41012MiB / 81920MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   40C    P0   179W / 400W |  42182MiB / 81920MiB |     86%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
| N/A   45C    P0   171W / 400W |  41012MiB / 81920MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Python environment
3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] linux
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py @ file:///home/conda/feedstock_root/build_artifacts/nvidia-ml-py_1746576379096/work
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.9.86
nvidia-nvtx-cu12==12.1.105
nvitop @ file:///home/conda/feedstock_root/build_artifacts/nvitop_1755346934447/work
onnxruntime-gpu==1.19.0
Problem description
In my case, this bug occurs when I use supervisord to launch a system-level service that runs some GPU code and then run nvitop. Killing the processes launched with supervisord resolves the problem.
Steps to Reproduce
In my environment, this can be reliably reproduced by using supervisord to launch a script that runs GPU code.
I am not sure whether it can be reproduced on other platforms.
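For context, the psutil failure at the bottom of the traceback below can be reproduced on its own, without supervisord or a GPU. The PID that NVML reports for the offending process (2529165312) is larger than a signed 32-bit integer, which appears to be why psutil's check_pid_range raises OverflowError. A minimal sketch of my own, just to isolate the psutil side of the error:

import psutil

# PID taken from the traceback; it exceeds 2**31 - 1 and is well above any
# valid Linux PID (pid_max cannot exceed 2**22 = 4194304 on 64-bit kernels).
BOGUS_PID = 2529165312

try:
    psutil.Process(BOGUS_PID)
except (psutil.NoSuchProcess, OverflowError) as ex:
    # Recent psutil wraps the OverflowError from check_pid_range into
    # NoSuchProcess("process PID out of range (pid=2529165312)");
    # older releases may raise OverflowError directly.
    print(ex)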
Traceback
Traceback (most recent call last):
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/psutil/__init__.py", line 327, in _init
    _psplatform.cext.check_pid_range(pid)
OverflowError: signed integer is greater than maximum

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniforge3/envs/xs_stepfun/bin/nvitop", line 10, in <module>
    sys.exit(main())
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/cli.py", line 382, in main
    tui.print()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/tui.py", line 235, in print
    self.main_screen.print()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/__init__.py", line 191, in print
    print_width = min(panel.print_width() for panel in self.container)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/__init__.py", line 191, in <genexpr>
    print_width = min(panel.print_width() for panel in self.container)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 658, in print_width
    self.ensure_snapshots()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 356, in ensure_snapshots
    self.snapshots = self.take_snapshots()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/caching.py", line 220, in wrapped
    result = func(*args, **kwargs)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 360, in take_snapshots
    snapshots = GpuProcess.take_snapshots(self.processes, failsafe=True)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 409, in processes
    return list(
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 410, in <genexpr>
    itertools.chain.from_iterable(device.processes().values() for device in self.devices), # type: ignore[misc]
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/device.py", line 2271, in processes
    proc = processes[p.pid] = self.GPU_PROCESS_CLASS(
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/library/process.py", line 29, in __new__
    instance = super().__new__(cls, *args, **kwargs)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/process.py", line 486, in __new__
    instance._host = HostProcess(pid)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/process.py", line 213, in __new__
    host.Process._init(instance, pid, True)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/psutil/__init__.py", line 330, in _init
    raise NoSuchProcess(pid, msg=msg) from err
psutil.NoSuchProcess: process PID out of range (pid=2529165312)
Logs
[DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::_nvmLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::_nvmLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.
Expected behavior
The exception should be handled gracefully, and nvitop should keep running while ignoring the processes that cause the exception.
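As a rough illustration of what I mean, using only public nvitop/psutil APIs (not a proposed patch): enumeration via the Python API could skip a device whose process table cannot be built, since the exception is raised inside Device.processes() itself.

import psutil
from nvitop import Device

# Sketch only: tolerate the bogus PID instead of crashing the whole program.
# The exception surfaces from Device.processes() (see the traceback above),
# so on the caller side it can only be skipped per device, not per process.
for device in Device.all():
    try:
        processes = device.processes()
    except psutil.NoSuchProcess as ex:
        print(f'{device}: skipped ({ex})')
        continue
    print(f'{device}: {len(processes)} GPU process(es)')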
Additional context
I am using nvitop in JupyterLab inside a Docker container.