@gyuho gyuho commented Sep 18, 2025

A manual `triggerComponent`/`triggerComponentCheck` request against the disk component blocked GPUd's control-plane session loop. While the disk health probe sat in a long retry cycle, no responses were written back to the control plane, so GPUd appeared healthy locally but the control plane stayed stuck on stale health data.

**Issue**
- Control-plane machine object continued to show `nfs` health at `2025-09-18T05:48:16Z`, even though GPUd's local disk state had fresh timestamps (for example `2025-09-18T12:07:16Z`).
- GPUd logs repeated:
  - `session reader: error decoding response: EOF`
  - `session writer: error making request ... context canceled`
  - `drained stale messages from reader channel` during keep-alive resets.

**Root Cause**
1. `Session.serve` processes every control-plane message in a single goroutine. Before the fix it handled `triggerComponent` synchronously, so it waited for `comp.Check()` to finish before writing any response.
2. `components/disk/component.go` retries `disk.GetPartitions` up to **five** times per tracked mount, each wrapped in a **one-minute** timeout. A flaky `/mnt/nfs-share` can therefore block `Check()` for roughly five minutes per mount (the retry pattern is sketched right after this list).
3. While `Check()` blocks, the writer goroutine cannot drain the 20-item `s.writer` channel. The control plane eventually cancels the streaming HTTP request, triggering the `session reader`/`session writer` error messages and the `drained stale messages` warning when the keep-alive loop tears everything down.
4. The disk component's background ticker keeps updating `lastCheckResult`, so GPUd's local state reflects the latest degraded result. However, the control plane never receives that update because the synchronous response never flushes.
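
The five-minute figure in item 2 is just the retry arithmetic: five attempts, each capped at one minute, against a mount that never answers. A minimal sketch of that shape (hypothetical names; `getPartitions` stands in for `disk.GetPartitions`, and the constants mirror the description above rather than the actual code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// getPartitions stands in for disk.GetPartitions; on a hung NFS mount it
// simply blocks until the per-attempt context expires.
func getPartitions(ctx context.Context, mount string) error {
	<-ctx.Done() // simulate an unresponsive /mnt/nfs-share
	return ctx.Err()
}

// checkMount retries up to five times, each attempt capped at one minute,
// so a mount that never responds holds the caller for roughly five minutes.
func checkMount(ctx context.Context, mount string) error {
	const attempts = 5
	const perAttempt = time.Minute

	var lastErr error
	for i := 0; i < attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		lastErr = getPartitions(attemptCtx, mount)
		cancel()
		if lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	start := time.Now()
	err := checkMount(context.Background(), "/mnt/nfs-share")
	fmt.Printf("check finished after %s: %v\n", time.Since(start).Round(time.Second), err)
}
```

With several tracked mounts behind the same unresponsive NFS server, the blocked time multiplies per mount.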

**One** manual `triggerComponent` aimed at the disk component is sufficient to deadlock the session loop. The request is enqueued, `Session.serve` synchronously calls the long-running `disk.Check()`, and the writer never sends a response. When the control plane cancels the stuck request, GPUd restarts the session and the control plane still holds the last good health snapshot.
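
For illustration, a stripped-down model of the pre-fix behavior (placeholder types and channel shapes, not the real `Session.serve`): a single goroutine reads requests and runs the component check inline, so nothing reaches the writer channel until the check returns.

```go
package main

import (
	"fmt"
	"time"
)

type request struct{ kind, component string }
type response struct{ component string }

// serve mimics the pre-fix loop: one goroutine handles every message and
// calls the component check synchronously before writing a response.
func serve(reader <-chan request, writer chan<- response, check func(string)) {
	for req := range reader {
		if req.kind == "triggerComponent" {
			check(req.component) // blocks for minutes while the disk check retries
		}
		writer <- response{component: req.component} // not reached until check returns
	}
}

func main() {
	reader := make(chan request)
	writer := make(chan response, 20) // mirrors the 20-item s.writer buffer

	slowDiskCheck := func(string) {
		time.Sleep(5 * time.Minute) // stand-in for 5 x 1-minute GetPartitions retries
	}

	go serve(reader, writer, slowDiskCheck)
	reader <- request{kind: "triggerComponent", component: "disk"}

	select {
	case resp := <-writer:
		fmt.Println("got response for", resp.component)
	case <-time.After(10 * time.Second):
		fmt.Println("no response: the serve goroutine is parked inside the disk check")
	}
}
```

Running this prints the timeout branch: the acknowledgement never arrives because the only goroutine that could write it is stuck inside the disk check.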

**Fix**
- `triggerComponent` / `triggerComponentCheck` now capture the target component names, enqueue an immediate acknowledgement (with empty `States` placeholders) back to the control plane, and launch the expensive `comp.Check()` work in a separate goroutine. This keeps the session writer responsive even when disk checks spend minutes retrying `disk.GetPartitions`.
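
A minimal sketch of that asynchronous shape, reusing the placeholder types from the sketch above (the real handler also carries the request ID and the empty `States` payload):

```go
package main

import (
	"fmt"
	"time"
)

type request struct{ kind, component string }
type response struct{ component string }

// handleTrigger enqueues an immediate acknowledgement and moves the expensive
// check off the serve goroutine, so the session writer stays responsive.
func handleTrigger(req request, writer chan<- response, check func(string)) {
	writer <- response{component: req.component} // placeholder ack (empty States in the real code)
	go check(req.component)                      // comp.Check() equivalent runs in the background
}

func serve(reader <-chan request, writer chan<- response, check func(string)) {
	for req := range reader {
		if req.kind == "triggerComponent" {
			handleTrigger(req, writer, check)
			continue
		}
		writer <- response{component: req.component}
	}
}

func main() {
	reader := make(chan request)
	writer := make(chan response, 20)

	slowDiskCheck := func(string) { time.Sleep(5 * time.Minute) }

	go serve(reader, writer, slowDiskCheck)
	reader <- request{kind: "triggerComponent", component: "disk"}

	// The acknowledgement arrives immediately even though the check is still running.
	fmt.Println("ack for", (<-writer).component)
}
```

The serve goroutine only enqueues the acknowledgement and returns to reading; the slow check no longer decides when the next control-plane message gets written.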

@gyuho gyuho added this to the v0.8.0 milestone Sep 18, 2025
@gyuho gyuho self-assigned this Sep 18, 2025
@gyuho gyuho force-pushed the LEP-2083 branch 2 times, most recently from 60c0cfe to b442da6 on September 18, 2025 14:36

codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 69.44444% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.29%. Comparing base (70316cf) to head (84e2f4b).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| pkg/session/serve.go | 69.44% | 8 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1078      +/-   ##
==========================================
+ Coverage   67.27%   67.29%   +0.02%     
==========================================
  Files         316      316              
  Lines       26312    26342      +30     
==========================================
+ Hits        17702    17728      +26     
- Misses       7726     7729       +3     
- Partials      884      885       +1     

@gyuho gyuho added the `wip - do not merge` label Sep 23, 2025
@gyuho gyuho closed this Sep 23, 2025
@gyuho gyuho deleted the LEP-2083 branch September 23, 2025 13:41
gyuho added a commit that referenced this pull request Sep 23, 2025
…oid blocking the main serve loop (#1082)

A manual `triggerComponent`/`triggerComponentCheck` request against the
disk component blocked GPUd's control-plane session loop. While the disk
health probe sat in a long retry cycle, no responses were written back
to the control plane, so GPUd appeared healthy locally but the control
plane stayed stuck on stale health data.

**Issue**
- Control-plane machine object continued to show `nfs` health at
`2025-09-18T05:48:16Z`, even though GPUd's local disk state had fresh
timestamps (for example `2025-09-18T12:07:16Z`).
- GPUd logs repeated:
  - `session reader: error decoding response: EOF`
  - `session writer: error making request ... context canceled`
- `drained stale messages from reader channel` during keep-alive resets.

**Root Cause**
1. `Session.serve` processes every control-plane message in a single
goroutine. Before the fix it handled `triggerComponent` synchronously,
so it waited for `comp.Check()` to finish before writing any response.
2. `components/disk/component.go` retries `disk.GetPartitions` up to
**five** times per tracked mount, each wrapped in a **one-minute**
timeout. A flaky `/mnt/nfs-share` can therefore block `Check()` for
roughly five minutes per mount.
3. While `Check()` blocks, the writer goroutine cannot drain the 20-item
`s.writer` channel. The control plane eventually cancels the streaming
HTTP request, triggering the `session reader`/`session writer` error
messages and the `drained stale messages` warning when the keep-alive
loop tears everything down.
4. The disk component's background ticker keeps updating
`lastCheckResult`, so GPUd's local state reflects the latest degraded
result. However, the control plane never receives that update because
the synchronous response never flushes.

**One** manual `triggerComponent` aimed at the disk component is
sufficient to deadlock the session loop. The request is enqueued,
`Session.serve` synchronously calls the long-running `disk.Check()`, and
the writer never sends a response. When the control plane cancels the
stuck request, GPUd restarts the session and the control plane still
holds the last good health snapshot.

**Fix**
- `triggerComponent` will be processed asynchronously in a background
goroutine.
- The response will be written back to the control plane once complete,
with the same ReqID.

c.f., #1078

---------

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>