[BUG] Unexpected behavior with load_state_from_peers in the Optimizer (state_averager)

When one peer calls load_state_from_peers and the serving peer cannot fulfill the request before the timeout, two unexpected behaviors occur:

1. The timeout fires as expected, but the original request may still be fulfilled later, after a new request has already been launched. Expected behavior: once the timeout is reached, the request is canceled.
2. When multiple timeouts occur, the host machine serving the state accumulates torch artifacts of the states in shared memory instead of clearing them once the timeout is reached. This can cause shared memory usage to build up.
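
One way to check for the build-up described in item 2 on the serving host is to measure shared memory between timeouts. The sketch below is only a diagnostic under stated assumptions: the `/dev/shm` mount point and the `torch_` file-name prefix are assumptions about how the segments appear on this Linux host, and the helper names `shm_segments` and `shm_used_bytes` are hypothetical, not part of hivemind or PyTorch. Depending on the torch sharing strategy, segments may not be visible as named files, in which case the total usage of the mount is the more reliable number.

```python
import shutil
from pathlib import Path


def shm_segments(directory="/dev/shm", prefix="torch_"):
    """Return (file_count, total_bytes) for files in `directory` whose name starts with `prefix`.

    Both defaults are assumptions about where torch places named segments.
    """
    count, total = 0, 0
    root = Path(directory)
    if not root.is_dir():
        return count, total
    for entry in root.iterdir():
        if not entry.name.startswith(prefix):
            continue
        try:
            total += entry.stat().st_size
            count += 1
        except OSError:
            # A segment can be unlinked between listing and stat; skip it.
            continue
    return count, total


def shm_used_bytes(directory="/dev/shm"):
    """Return the bytes currently used on the filesystem that holds `directory`."""
    return shutil.disk_usage(directory).used


if __name__ == "__main__":
    if Path("/dev/shm").is_dir():
        count, total = shm_segments()
        print(f"{count} named segments, {total / 2**20:.1f} MiB")
        print(f"{shm_used_bytes() / 2**20:.1f} MiB used on /dev/shm")
```

Logging these values on the serving peer after each timed-out request should show whether usage keeps growing or returns to its baseline.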

**Environment**
Please list:
* python version: 3.11.11
* hivemind.__version__: '1.2.0.dev0', 02184953d95f6182e42318a05c0876c4190dca89
* Please copy and paste the output from pytorch [environment collection script]
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.0-34-cloud-amd64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               8
On-line CPU(s) list:                  0-7
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   4
Socket(s):                            1
Stepping:                             7
BogoMIPS:                             4400.38
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            128 KiB (4 instances)
L1i cache:                            128 KiB (4 instances)
L2 cache:                             4 MiB (4 instances)
L3 cache:                             38.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-7
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.2.3
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.6.0
[pip3] torch-optimizer==0.1.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] triton==3.2.0
[conda] numpy                     2.2.3                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torch-optimizer           0.1.0                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi

**Extra information**

An example of behavior 1 can be seen in the following log, with two active peers and one peer joining. Here the timeout is set to 1.5 seconds and the actual download takes ~3 seconds or more (3.133s and 5.100s in this log).
Note: Similar behavior was observed at large scale with the standard timeouts.

INFO:hivemind.averaging.averager:Downloading parameters from peer QmWejdf6Gc6iQaYypja7QwxpkT6ekuBdGBPqAvLaXtEkGf
ERROR:hivemind.optim.optimizer:Failed to load state from peers: , retrying ...
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/optimizer.py", line 696, in load_state_from_peers
self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
loaded_state = super().load_state_from_peers(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/averaging/averager.py", line 682, in load_state_from_peers
return future.result(timeout=timeout) if wait else future
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/utils/mpfuture.py", line 254, in result
return super().result(timeout)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 458, in result
raise TimeoutError()
TimeoutError
INFO:hivemind.averaging.averager:Downloading parameters from peer QmQHg48wueYkWxQn9hnthkD2oFmLfnyG8ekJXWT94BYKug
ERROR:hivemind.optim.optimizer:Failed to load state from peers: , retrying ...
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/optimizer.py", line 696, in load_state_from_peers
self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
loaded_state = super().load_state_from_peers(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/averaging/averager.py", line 682, in load_state_from_peers
return future.result(timeout=timeout) if wait else future
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/utils/mpfuture.py", line 254, in result
return super().result(timeout)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 458, in result
raise TimeoutError()
TimeoutError
INFO:hivemind.averaging.averager:Finished downloading state in 3.133s from QmWejdf6Gc6iQaYypja7QwxpkT6ekuBdGBPqAvLaXtEkGf
INFO:hivemind.averaging.averager:Downloading parameters from peer QmWejdf6Gc6iQaYypja7QwxpkT6ekuBdGBPqAvLaXtEkGf
ERROR:hivemind.optim.optimizer:Failed to load state from peers: , retrying ...
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/optimizer.py", line 696, in load_state_from_peers
self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
loaded_state = super().load_state_from_peers(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/averaging/averager.py", line 682, in load_state_from_peers
return future.result(timeout=timeout) if wait else future
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/utils/mpfuture.py", line 254, in result
return super().result(timeout)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 458, in result
raise TimeoutError()
TimeoutError
INFO:hivemind.averaging.averager:Downloading parameters from peer QmQHg48wueYkWxQn9hnthkD2oFmLfnyG8ekJXWT94BYKug
ERROR:hivemind.optim.optimizer:Failed to load state from peers: , retrying ...
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/optimizer.py", line 696, in load_state_from_peers
self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
loaded_state = super().load_state_from_peers(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/averaging/averager.py", line 682, in load_state_from_peers
return future.result(timeout=timeout) if wait else future
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/utils/mpfuture.py", line 254, in result
return super().result(timeout)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 458, in result
raise TimeoutError()
TimeoutError
INFO:hivemind.averaging.averager:Downloading parameters from peer QmWejdf6Gc6iQaYypja7QwxpkT6ekuBdGBPqAvLaXtEkGf
INFO:hivemind.averaging.averager:Finished downloading state in 5.100s from QmQHg48wueYkWxQn9hnthkD2oFmLfnyG8ekJXWT94BYKug
ERROR:hivemind.optim.optimizer:Failed to load state from peers: , retrying ...
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/optimizer.py", line 696, in load_state_from_peers
self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
loaded_state = super().load_state_from_peers(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/averaging/averager.py", line 682, in load_state_from_peers
return future.result(timeout=timeout) if wait else future
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/hivemind/utils/mpfuture.py", line 254, in result
return super().result(timeout)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 458, in result
raise TimeoutError()
TimeoutError
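
The pattern in this log, where "Finished downloading state" appears after a TimeoutError was already raised, matches how a timeout on `future.result()` generally works: it only stops the caller from waiting and does not stop the work itself. The following is a stdlib-only sketch of that difference and of the expected cancel-on-timeout behavior. It is not hivemind's implementation; `slow_download`, `slow_download_async` and the durations are hypothetical stand-ins for the state download.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_download(log, duration=0.3):
    """Stand-in for a state download that outlasts the caller's timeout."""
    time.sleep(duration)
    log.append("finished")
    return "state"


def result_with_timeout_only(timeout=0.1):
    """Observed behavior: the caller times out, the download still completes."""
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_download, log)
        try:
            future.result(timeout=timeout)
        except FutureTimeout:
            log.append("timeout")
        # Leaving the with-block waits for the worker, which was never cancelled.
    return log


async def slow_download_async(log, duration=0.3):
    """Same stand-in, but cancellable, with a place to release resources."""
    try:
        await asyncio.sleep(duration)
        log.append("finished")
        return "state"
    except asyncio.CancelledError:
        log.append("cancelled")  # buffers would be released here
        raise


async def result_with_cancellation(timeout=0.1):
    """Expected behavior: reaching the timeout cancels the pending request."""
    log = []
    try:
        await asyncio.wait_for(slow_download_async(log), timeout=timeout)
    except asyncio.TimeoutError:
        log.append("timeout")
    return log


if __name__ == "__main__":
    print(result_with_timeout_only())               # ['timeout', 'finished']
    print(asyncio.run(result_with_cancellation()))  # ['cancelled', 'timeout']
```

In the first function the retry loop would start a new request while the old one is still running, which is what the log shows. In the second, `asyncio.wait_for` cancels the task before raising, so nothing is left running or holding memory.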

[BUG] Unexpected behavior with load_state_from_peers in the Optimizer (state_averager) #653