Performance Degradation and OOM Events During API Calls to Update Validators #4936

@screwyprof

Description

Our Lighthouse instance suffers severe performance degradation when we run a batch job that updates multiple validators through the API in the local development environment. The degradation hinders Lighthouse in fulfilling its validator duties and, in many cases, ends with the OOM killer terminating the process.

Version

  • Rust Version: 1.69.0
  • Production Version: v4.4.1
  • Dev Version: Latest unstable (commit: 051c3e84)

Present Behaviour

Running the batch job to update multiple validators in the local development environment significantly degrades Lighthouse's performance. The symptoms include:

  • A noticeable slowdown in performance during the batch job execution.
  • Frequent interventions from the OOM killer, leading to the termination of the Lighthouse process.
  • Modest but consistent growth in memory consumption.
  • Errors in the logs indicating that the logger's channel is overflowing:
Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 7
Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 5
Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 7

A CPU profile pointed towards slog, suggesting that logging may be a performance bottleneck.
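To illustrate what the "channel overflow" errors mean, here is a minimal sketch of a bounded channel that drops messages when the producer outpaces the consumer. It uses only the standard library and is not slog-async's actual implementation; the function name `flood` and the capacity of 128 are assumptions chosen for illustration.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Try to enqueue `total` messages into a bounded channel of size `capacity`
/// while nothing is draining it. Returns (queued, dropped).
/// Hypothetical helper for illustration only.
fn flood(capacity: usize, total: usize) -> (usize, usize) {
    let (tx, rx) = sync_channel::<String>(capacity);
    let mut dropped = 0;
    for i in 0..total {
        match tx.try_send(format!("log line {i}")) {
            Ok(()) => {}
            // The buffer is full: the message is discarded and counted,
            // similar in spirit to a "dropped messages" report.
            Err(TrySendError::Full(_)) => dropped += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    // Close the sending side so the receiver's iterator terminates.
    drop(tx);
    let queued = rx.iter().count();
    (queued, dropped)
}

fn main() {
    // Assumed capacity of 128; 135 messages arrive before any are consumed.
    let (queued, dropped) = flood(128, 135);
    println!("queued: {queued}, dropped: {dropped}");
}
```

In this sketch the drops are harmless to memory: a bounded channel caps how much the logger can buffer, which is consistent with the OOM events persisting after logging was removed from the endpoint.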

Expected Behaviour

Lighthouse should update multiple validators without notable performance degradation and without being terminated by the OOM killer.

Steps to resolve

Efforts to address the issue involved:

  • CPU Profiling: A CPU profile showed a significant share of time spent in slog-async, with logging consuming 1 second of a 10-second profile. This suggested a problem with slog-async and correlated with the errors in the logs indicating a logger channel overflow.
  • Memory Profiling: Tried to capture a heap profile with heaptrack, without success.
  • Memory Monitoring: Monitored Lighthouse's memory consumption with the top command and observed modest but consistent growth.
  • Logger Optimisation: Increased logger buffer size and removed logger calls completely at the problematic endpoint, yet still encountered persistent OOM killer interventions.
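The memory monitoring step above could also be scripted so that growth is recorded over time rather than read off top. The following is a rough sketch, assuming a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line; the helper names `parse_vm_rss_kib` and `read_rss_kib` are hypothetical and not part of Lighthouse.

```rust
use std::{fs, process, thread, time::Duration};

/// Extract the resident set size in kiB from the text of /proc/<pid>/status.
/// Hypothetical helper for illustration only.
fn parse_vm_rss_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|n| n.parse().ok())
}

/// Read the current RSS of a process. Returns None on non-Linux systems
/// or if the process does not exist.
fn read_rss_kib(pid: u32) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    parse_vm_rss_kib(&status)
}

fn main() {
    // Pass the Lighthouse PID as the first argument; defaults to this process.
    let pid: u32 = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or_else(process::id);

    // Take a few samples one second apart; extend the loop for longer runs.
    for sample in 0..3 {
        match read_rss_kib(pid) {
            Some(kib) => println!("sample {sample}: pid {pid} RSS = {kib} kiB"),
            None => println!("sample {sample}: RSS unavailable for pid {pid}"),
        }
        thread::sleep(Duration::from_secs(1));
    }
}
```

Logging the samples while the batch job runs would show whether the growth correlates with the API calls or continues after they finish.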
