Description
Our Lighthouse instance grapples with a critical performance downturn when executing a batch job to update multiple validators in the local development environment. This degradation not only hinders Lighthouse's efficiency in fulfilling its validator duties but also, in numerous instances, triggers interventions from the OOM killer.
Version
- Rust Version: 1.69.0
- Production Version: v4.4.1
- Dev Version: Latest unstable (commit: 051c3e84)
Present Behaviour
Lighthouse is hampered by significant performance degradation when executing a batch job to update multiple validators in the local development environment. The symptoms include:
- A noticeable slowdown in performance during the batch job execution.
- Frequent interventions from the OOM killer, leading to the termination of the Lighthouse process.
- Modest but consistent growth in memory consumption.
- Errors in the logs, suggesting that the logger's buffer is overflowing:
Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 7
Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 5
Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 7
Attempts to capture a CPU profile pointed towards slog, indicating potential performance bottlenecks related to logging.
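For readers unfamiliar with the "channel overflow" error, here is a minimal sketch of the mechanism, using only the standard library. It is not slog-async's actual implementation: `simulate_burst` and the capacity of 128 are hypothetical, chosen for illustration. The producer pushes records into a bounded channel without blocking, and whatever does not fit while the consumer is behind is counted as dropped.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Sends a burst of `burst` log records into a bounded channel of capacity
/// `chan_size` while nothing is draining it, as happens when the logging
/// thread cannot keep up. Returns `(queued, dropped)`.
///
/// Hypothetical helper for illustration only; not part of slog-async.
pub fn simulate_burst(chan_size: usize, burst: usize) -> (usize, usize) {
    let (tx, rx) = sync_channel::<String>(chan_size);
    let mut dropped = 0;

    for i in 0..burst {
        // `try_send` never blocks: a full channel means the record is lost.
        match tx.try_send(format!("record {}", i)) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => dropped += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }

    // Close the sending side so the receiver's iterator terminates,
    // then count the records that were actually buffered.
    drop(tx);
    let queued = rx.iter().count();
    (queued, dropped)
}
```

With an assumed capacity of 128, a burst of 135 records leaves 128 queued and 7 dropped. In this model a larger buffer drops fewer records but holds more of them in memory at once.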
Expected Behaviour
Lighthouse should seamlessly update multiple validators without succumbing to notable performance degradation. The application's performance should remain optimal, and interventions from the OOM killer should be eliminated.
Steps to resolve
Efforts to address the issue involved:
- CPU Profiling: Attempted CPU profiling, which showed that a significant portion of time is spent in slog-async, with logging consuming 1 second out of a 10-second profile. This observation suggested potential issues with slog-async and correlated with the errors in the logs, indicating a logger buffer overflow.
- Memory Profiling: Tried to capture a heap profile with heaptrack, with no luck.
- Memory Monitoring: Monitored Lighthouse's memory consumption using the top command, noting modest but consistent growth.
- Logger Optimisation: Increased the logger buffer size and removed logger calls completely at the problematic endpoint, yet still encountered persistent OOM killer interventions.
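To put numbers on the memory growth rather than reading it off top by eye, something like the following could sample the process's resident set size. This is a rough sketch assuming Linux, where `/proc/<pid>/status` exposes a `VmRSS` line. The function names are hypothetical and none of this is part of Lighthouse.

```rust
use std::fs;

/// Extracts the `VmRSS` value (in kB, as reported by the kernel) from the
/// text of a `/proc/<pid>/status` file. Returns `None` if the line is absent
/// or malformed.
pub fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|num| num.parse::<u64>().ok())
}

/// Reads the current resident set size of process `pid` (Linux only).
pub fn read_rss_kb(pid: u32) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{}/status", pid)).ok()?;
    parse_vm_rss_kb(&status)
}

/// Average growth per sampling interval across a series of RSS samples.
/// Returns `None` when fewer than two samples are available.
pub fn growth_per_sample(samples: &[u64]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let first = samples[0] as f64;
    let last = samples[samples.len() - 1] as f64;
    Some((last - first) / (samples.len() - 1) as f64)
}
```

Calling `read_rss_kb` at a fixed interval during the batch job and passing the collected values to `growth_per_sample` would give an average growth rate to compare across runs, for example with and without the logger changes.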