@gyuho gyuho commented Sep 18, 2025

A manual `triggerComponent`/`triggerComponentCheck` request against the disk component blocked GPUd's control-plane session loop. While the disk health probe sat in a long retry cycle, no responses were written back to the control plane, so GPUd appeared healthy locally but the control plane stayed stuck on stale health data.

**Issue**
- Control-plane machine object continued to show `nfs` health at `2025-09-18T05:48:16Z`, even though GPUd's local disk state had fresh timestamps (for example `2025-09-18T12:07:16Z`).
- GPUd logs repeated:
  - `session reader: error decoding response: EOF`
  - `session writer: error making request ... context canceled`
  - `drained stale messages from reader channel` during keep-alive resets.

**Root Cause**
1. `Session.serve` processes every control-plane message in a single goroutine. Before the fix it handled `triggerComponent` synchronously, so it waited for `comp.Check()` to finish before writing any response.
2. `components/disk/component.go` retries `disk.GetPartitions` up to **five** times per tracked mount, each wrapped in a **one-minute** timeout. A flaky `/mnt/nfs-share` can therefore block `Check()` for roughly five minutes per mount (the retry pattern is sketched right after this list).
3. While `Check()` blocks, the writer goroutine cannot drain the 20-item `s.writer` channel. The control plane eventually cancels the streaming HTTP request, triggering the `session reader`/`session writer` error messages and the `drained stale messages` warning when the keep-alive loop tears everything down.
4. The disk component's background ticker keeps updating `lastCheckResult`, so GPUd's local state reflects the latest degraded result. However, the control plane never receives that update because the synchronous response never flushes.
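
The five-minute figure in item 2 is just the retry arithmetic: five attempts, each capped at one minute, against a mount that never answers. A minimal sketch of that shape (hypothetical names; `getPartitions` stands in for `disk.GetPartitions`, and the constants mirror the description above rather than the actual code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// getPartitions stands in for disk.GetPartitions; on a hung NFS mount it
// simply blocks until the per-attempt context expires.
func getPartitions(ctx context.Context, mount string) error {
	<-ctx.Done() // simulate an unresponsive /mnt/nfs-share
	return ctx.Err()
}

// checkMount retries up to five times, each attempt capped at one minute,
// so a mount that never responds holds the caller for roughly five minutes.
func checkMount(ctx context.Context, mount string) error {
	const attempts = 5
	const perAttempt = time.Minute

	var lastErr error
	for i := 0; i < attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		lastErr = getPartitions(attemptCtx, mount)
		cancel()
		if lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	start := time.Now()
	err := checkMount(context.Background(), "/mnt/nfs-share")
	fmt.Printf("check finished after %s: %v\n", time.Since(start).Round(time.Second), err)
}
```

With several tracked mounts behind the same unresponsive NFS server, the blocked time multiplies per mount.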

**One** manual `triggerComponent` aimed at the disk component is sufficient to deadlock the session loop. The request is enqueued, `Session.serve` synchronously calls the long-running `disk.Check()`, and the writer never sends a response. When the control plane cancels the stuck request, GPUd restarts the session and the control plane still holds the last good health snapshot.
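
For illustration, a stripped-down model of the pre-fix behavior (placeholder types and channel shapes, not the real `Session.serve`): a single goroutine reads requests and runs the component check inline, so nothing reaches the writer channel until the check returns.

```go
package main

import (
	"fmt"
	"time"
)

type request struct{ kind, component string }
type response struct{ component string }

// serve mimics the pre-fix loop: one goroutine handles every message and
// calls the component check synchronously before writing a response.
func serve(reader <-chan request, writer chan<- response, check func(string)) {
	for req := range reader {
		if req.kind == "triggerComponent" {
			check(req.component) // blocks for minutes while the disk check retries
		}
		writer <- response{component: req.component} // not reached until check returns
	}
}

func main() {
	reader := make(chan request)
	writer := make(chan response, 20) // mirrors the 20-item s.writer buffer

	slowDiskCheck := func(string) {
		time.Sleep(5 * time.Minute) // stand-in for 5 x 1-minute GetPartitions retries
	}

	go serve(reader, writer, slowDiskCheck)
	reader <- request{kind: "triggerComponent", component: "disk"}

	select {
	case resp := <-writer:
		fmt.Println("got response for", resp.component)
	case <-time.After(10 * time.Second):
		fmt.Println("no response: the serve goroutine is parked inside the disk check")
	}
}
```

Running this prints the timeout branch: the acknowledgement never arrives because the only goroutine that could write it is stuck inside the disk check.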

**Fix**
- `triggerComponent` / `triggerComponentCheck` now capture the target component names, enqueue an immediate acknowledgement (with empty `States` placeholders) back to the control plane, and launch the expensive `comp.Check()` work in a separate goroutine. This keeps the session writer responsive even when disk checks spend minutes retrying `disk.GetPartitions`.
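
A minimal sketch of that asynchronous shape, reusing the placeholder types from the sketch above (the real handler also carries the request ID and the empty `States` payload):

```go
package main

import (
	"fmt"
	"time"
)

type request struct{ kind, component string }
type response struct{ component string }

// handleTrigger enqueues an immediate acknowledgement and moves the expensive
// check off the serve goroutine, so the session writer stays responsive.
func handleTrigger(req request, writer chan<- response, check func(string)) {
	writer <- response{component: req.component} // placeholder ack (empty States in the real code)
	go check(req.component)                      // comp.Check() equivalent runs in the background
}

func serve(reader <-chan request, writer chan<- response, check func(string)) {
	for req := range reader {
		if req.kind == "triggerComponent" {
			handleTrigger(req, writer, check)
			continue
		}
		writer <- response{component: req.component}
	}
}

func main() {
	reader := make(chan request)
	writer := make(chan response, 20)

	slowDiskCheck := func(string) { time.Sleep(5 * time.Minute) }

	go serve(reader, writer, slowDiskCheck)
	reader <- request{kind: "triggerComponent", component: "disk"}

	// The acknowledgement arrives immediately even though the check is still running.
	fmt.Println("ack for", (<-writer).component)
}
```

The serve goroutine only enqueues the acknowledgement and returns to reading; the slow check no longer decides when the next control-plane message gets written.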

@gyuho gyuho added this to the v0.8.0 milestone Sep 18, 2025
@gyuho gyuho self-assigned this Sep 18, 2025
@gyuho gyuho force-pushed the LEP-2083 branch 2 times, most recently from 60c0cfe to b442da6 on September 18, 2025 14:36

codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 69.44444% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.29%. Comparing base (70316cf) to head (84e2f4b).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| pkg/session/serve.go | 69.44% | 8 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1078      +/-   ##
==========================================
+ Coverage   67.27%   67.29%   +0.02%     
==========================================
  Files         316      316              
  Lines       26312    26342      +30     
==========================================
+ Hits        17702    17728      +26     
- Misses       7726     7729       +3     
- Partials      884      885       +1     

@gyuho gyuho added the `wip - do not merge` label Sep 23, 2025
@gyuho gyuho closed this Sep 23, 2025
@gyuho gyuho deleted the LEP-2083 branch September 23, 2025 13:41
gyuho added a commit that referenced this pull request Sep 23, 2025
…oid blocking the main serve loop (#1082)

A manual `triggerComponent`/`triggerComponentCheck` request against the
disk component blocked GPUd's control-plane session loop. While the disk
health probe sat in a long retry cycle, no responses were written back
to the control plane, so GPUd appeared healthy locally but the control
plane stayed stuck on stale health data.

**Issue**
- Control-plane machine object continued to show `nfs` health at
`2025-09-18T05:48:16Z`, even though GPUd's local disk state had fresh
timestamps (for example `2025-09-18T12:07:16Z`).
- GPUd logs repeated:
  - `session reader: error decoding response: EOF`
  - `session writer: error making request ... context canceled`
- `drained stale messages from reader channel` during keep-alive resets.

**Root Cause**
1. `Session.serve` processes every control-plane message in a single
goroutine. Before the fix it handled `triggerComponent` synchronously,
so it waited for `comp.Check()` to finish before writing any response.
2. `components/disk/component.go` retries `disk.GetPartitions` up to
**five** times per tracked mount, each wrapped in a **one-minute**
timeout. A flaky `/mnt/nfs-share` can therefore block `Check()` for
roughly five minutes per mount.
3. While `Check()` blocks, the writer goroutine cannot drain the 20-item
`s.writer` channel. The control plane eventually cancels the streaming
HTTP request, triggering the `session reader`/`session writer` error
messages and the `drained stale messages` warning when the keep-alive
loop tears everything down.
4. The disk component's background ticker keeps updating
`lastCheckResult`, so GPUd's local state reflects the latest degraded
result. However, the control plane never receives that update because
the synchronous response never flushes.

**One** manual `triggerComponent` aimed at the disk component is
sufficient to deadlock the session loop. The request is enqueued,
`Session.serve` synchronously calls the long-running `disk.Check()`, and
the writer never sends a response. When the control plane cancels the
stuck request, GPUd restarts the session and the control plane still
holds the last good health snapshot.

**Fix**
- `triggerComponent` will be processed asynchronously in a background
goroutine.
- The response will be written back to the control plane once complete,
with the same ReqID.

c.f., #1078

---------

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>