
Conversation

@gyuho commented Sep 23, 2025

A manual `triggerComponent`/`triggerComponentCheck` request against the disk component blocked GPUd's control-plane session loop. While the disk health probe sat in a long retry cycle, no responses were written back to the control plane, so GPUd appeared healthy locally but the control plane stayed stuck on stale health data.

**Issue**
- Control-plane machine object continued to show `nfs` health at `2025-09-18T05:48:16Z`, even though GPUd's local disk state had fresh timestamps (for example `2025-09-18T12:07:16Z`).
- GPUd logs repeated:
  - `session reader: error decoding response: EOF`
  - `session writer: error making request ... context canceled`
  - `drained stale messages from reader channel` during keep-alive resets.

**Root Cause**
1. `Session.serve` processes every control-plane message in a single goroutine. Before the fix it handled `triggerComponent` synchronously, so it waited for `comp.Check()` to finish before writing any response.
2. `components/disk/component.go` retries `disk.GetPartitions` up to **five** times per tracked mount, each wrapped in a **one-minute** timeout, so a flaky `/mnt/nfs-share` can block `Check()` for roughly five minutes per mount (see the retry sketch after this list).
3. While `Check()` blocks, the writer goroutine cannot drain the 20-item `s.writer` channel. The control plane eventually cancels the streaming HTTP request, triggering the `session reader`/`session writer` error messages and the `drained stale messages` warning when the keep-alive loop tears everything down.
4. The disk component's background ticker keeps updating `lastCheckResult`, so GPUd's local state reflects the latest degraded result. However, the control plane never receives that update because the synchronous response never flushes.
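To make the arithmetic in step 2 concrete, here is a minimal, self-contained sketch of that retry shape (`probeMount` and `getPartitionsWithRetry` are hypothetical names, not GPUd's actual API): five attempts per mount, each bounded by a one-minute `context` timeout, add up to roughly five minutes before `Check()` can return.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeMount stands in for one disk.GetPartitions probe of a single mount.
// A hung NFS mount blocks until the per-attempt context expires.
func probeMount(ctx context.Context, mount string) error {
	select {
	case <-time.After(2 * time.Minute): // simulate a hung /mnt/nfs-share
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// getPartitionsWithRetry mirrors the retry shape described above: up to five
// attempts per tracked mount, each bounded by a one-minute timeout, so a
// flaky mount can hold the caller for roughly five minutes.
func getPartitionsWithRetry(ctx context.Context, mount string) error {
	var lastErr error
	for attempt := 1; attempt <= 5; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, time.Minute)
		lastErr = probeMount(attemptCtx, mount)
		cancel()
		if lastErr == nil {
			return nil
		}
		if !errors.Is(lastErr, context.DeadlineExceeded) {
			return lastErr // a real error, not a timeout: stop retrying
		}
	}
	return fmt.Errorf("mount %s unhealthy after 5 attempts: %w", mount, lastErr)
}

func main() {
	// Worst case: five one-minute timeouts back to back (~5 minutes).
	err := getPartitionsWithRetry(context.Background(), "/mnt/nfs-share")
	fmt.Println(err)
}
```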

**One** manual `triggerComponent` aimed at the disk component is sufficient to deadlock the session loop. The request is enqueued, `Session.serve` synchronously calls the long-running `disk.Check()`, and the writer never sends a response. When the control plane cancels the stuck request, GPUd restarts the session and the control plane still holds the last good health snapshot.
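Concretely, the pre-fix serve loop reduces to the shape below, a minimal sketch with illustrative type and field names rather than the actual `pkg/session` code: one goroutine both dispatches requests and produces responses, so a single slow `Check()` stalls everything behind it.

```go
package sessionsketch

// Illustrative stand-ins for the session types (not GPUd's actual code).
type Request struct {
	ReqID     string
	Method    string
	Component string
}

type Response struct {
	ReqID  string
	Result string
}

type Component interface {
	Check() string // may block for minutes on a hung mount
}

type Session struct {
	reader     chan Request  // control-plane messages in
	writer     chan Response // responses out (buffered, e.g. 20 items)
	components map[string]Component
}

// serve shows the pre-fix shape: the check runs inline, so nothing else is
// read from s.reader, and nothing reaches s.writer, until Check() returns.
// The control plane eventually cancels the stream and the session restarts.
func (s *Session) serve() {
	for req := range s.reader {
		switch req.Method {
		case "triggerComponent":
			result := s.components[req.Component].Check() // blocks the whole loop
			s.writer <- Response{ReqID: req.ReqID, Result: result}
		}
	}
}
```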

**Fix**
- `triggerComponent` will be processed asynchronously in a background goroutine (sketched below).
- The response will be written back to the control plane once complete, with the same `ReqID`.
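Continuing the sketch above (same illustrative types), the fix amounts to moving the check off the serve goroutine and correlating the eventual response by `ReqID`:

```go
// Post-fix shape: serve returns to the reader immediately; the check runs in
// a background goroutine and its response carries the original ReqID so the
// control plane can match it to the request that triggered it.
func (s *Session) serveAsync() {
	for req := range s.reader {
		switch req.Method {
		case "triggerComponent":
			go func(req Request) {
				result := s.components[req.Component].Check() // may take minutes
				s.writer <- Response{ReqID: req.ReqID, Result: result}
			}(req)
			// Keep-alives and other control-plane messages keep flowing
			// while the check runs.
		}
	}
}
```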

cf. #1078

@gyuho gyuho added this to the v0.8.0 milestone Sep 23, 2025
@gyuho gyuho self-assigned this Sep 23, 2025
codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 81.14286% with 132 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.42%. Comparing base (3a92838) to head (7050935).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/session/logout.go | 0.00% | 26 Missing ⚠️ |
| pkg/session/package_status.go | 0.00% | 25 Missing ⚠️ |
| pkg/session/session_process_request.go | 82.40% | 17 Missing and 5 partials ⚠️ |
| pkg/session/update.go | 38.23% | 18 Missing and 3 partials ⚠️ |
| pkg/session/deregister.go | 63.33% | 9 Missing and 2 partials ⚠️ |
| pkg/session/health_states.go | 88.04% | 9 Missing and 2 partials ⚠️ |
| pkg/session/session_serve.go | 81.35% | 10 Missing and 1 partial ⚠️ |
| pkg/session/bootstrap.go | 86.95% | 2 Missing and 1 partial ⚠️ |
| pkg/session/inject_fault.go | 91.66% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1082      +/-   ##
==========================================
+ Coverage   67.12%   67.42%   +0.29%     
==========================================
  Files         315      328      +13     
  Lines       26164    26260      +96     
==========================================
+ Hits        17563    17706     +143     
+ Misses       7717     7668      -49     
- Partials      884      886       +2     

@gyuho gyuho merged commit 1b82a1c into main Sep 23, 2025
12 of 13 checks passed
@gyuho gyuho deleted the LEP-2083-2 branch September 23, 2025 15:33