
Conversation

@gyuho commented Sep 23, 2025

A manual `triggerComponent`/`triggerComponentCheck` request against the disk component blocked GPUd's control-plane session loop. While the disk health probe sat in a long retry cycle, no responses were written back to the control plane, so GPUd appeared healthy locally but the control plane stayed stuck on stale health data.

**Issue**
- Control-plane machine object continued to show `nfs` health at `2025-09-18T05:48:16Z`, even though GPUd's local disk state had fresh timestamps (for example `2025-09-18T12:07:16Z`).
- GPUd logs repeated:
  - `session reader: error decoding response: EOF`
  - `session writer: error making request ... context canceled`
  - `drained stale messages from reader channel` during keep-alive resets.

**Root Cause**
1. `Session.serve` processes every control-plane message in a single goroutine. Before the fix it handled `triggerComponent` synchronously, so it waited for `comp.Check()` to finish before writing any response.
2. `components/disk/component.go` retries `disk.GetPartitions` up to **five** times per tracked mount, each wrapped in a **one-minute** timeout, so a flaky `/mnt/nfs-share` can block `Check()` for roughly five minutes per mount (see the retry sketch after this list).
3. While `Check()` blocks, the writer goroutine cannot drain the 20-item `s.writer` channel. The control plane eventually cancels the streaming HTTP request, triggering the `session reader`/`session writer` error messages and the `drained stale messages` warning when the keep-alive loop tears everything down.
4. The disk component's background ticker keeps updating `lastCheckResult`, so GPUd's local state reflects the latest degraded result. However, the control plane never receives that update because the synchronous response never flushes.
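To make the arithmetic in step 2 concrete, here is a minimal, self-contained sketch of that retry shape (`probeMount` and `getPartitionsWithRetry` are hypothetical names, not GPUd's actual API): five attempts per mount, each bounded by a one-minute `context` timeout, add up to roughly five minutes before `Check()` can return.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeMount stands in for one disk.GetPartitions probe of a single mount.
// A hung NFS mount blocks until the per-attempt context expires.
func probeMount(ctx context.Context, mount string) error {
	select {
	case <-time.After(2 * time.Minute): // simulate a hung /mnt/nfs-share
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// getPartitionsWithRetry mirrors the retry shape described above: up to five
// attempts per tracked mount, each bounded by a one-minute timeout, so a
// flaky mount can hold the caller for roughly five minutes.
func getPartitionsWithRetry(ctx context.Context, mount string) error {
	var lastErr error
	for attempt := 1; attempt <= 5; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, time.Minute)
		lastErr = probeMount(attemptCtx, mount)
		cancel()
		if lastErr == nil {
			return nil
		}
		if !errors.Is(lastErr, context.DeadlineExceeded) {
			return lastErr // a real error, not a timeout: stop retrying
		}
	}
	return fmt.Errorf("mount %s unhealthy after 5 attempts: %w", mount, lastErr)
}

func main() {
	// Worst case: five one-minute timeouts back to back (~5 minutes).
	err := getPartitionsWithRetry(context.Background(), "/mnt/nfs-share")
	fmt.Println(err)
}
```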

**One** manual `triggerComponent` aimed at the disk component is sufficient to deadlock the session loop. The request is enqueued, `Session.serve` synchronously calls the long-running `disk.Check()`, and the writer never sends a response. When the control plane cancels the stuck request, GPUd restarts the session and the control plane still holds the last good health snapshot.
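Concretely, the pre-fix serve loop reduces to the shape below, a minimal sketch with illustrative type and field names rather than the actual `pkg/session` code: one goroutine both dispatches requests and produces responses, so a single slow `Check()` stalls everything behind it.

```go
package sessionsketch

// Illustrative stand-ins for the session types (not GPUd's actual code).
type Request struct {
	ReqID     string
	Method    string
	Component string
}

type Response struct {
	ReqID  string
	Result string
}

type Component interface {
	Check() string // may block for minutes on a hung mount
}

type Session struct {
	reader     chan Request  // control-plane messages in
	writer     chan Response // responses out (buffered, e.g. 20 items)
	components map[string]Component
}

// serve shows the pre-fix shape: the check runs inline, so nothing else is
// read from s.reader, and nothing reaches s.writer, until Check() returns.
// The control plane eventually cancels the stream and the session restarts.
func (s *Session) serve() {
	for req := range s.reader {
		switch req.Method {
		case "triggerComponent":
			result := s.components[req.Component].Check() // blocks the whole loop
			s.writer <- Response{ReqID: req.ReqID, Result: result}
		}
	}
}
```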

**Fix**
- `triggerComponent` will be processed asynchronously in a background goroutine (sketched below).
- The response will be written back to the control plane once complete, with the same `ReqID`.
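Continuing the sketch above (same illustrative types), the fix amounts to moving the check off the serve goroutine and correlating the eventual response by `ReqID`:

```go
// Post-fix shape: serve returns to the reader immediately; the check runs in
// a background goroutine and its response carries the original ReqID so the
// control plane can match it to the request that triggered it.
func (s *Session) serveAsync() {
	for req := range s.reader {
		switch req.Method {
		case "triggerComponent":
			go func(req Request) {
				result := s.components[req.Component].Check() // may take minutes
				s.writer <- Response{ReqID: req.ReqID, Result: result}
			}(req)
			// Keep-alives and other control-plane messages keep flowing
			// while the check runs.
		}
	}
}
```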

cf. #1078

@gyuho gyuho added this to the v0.8.0 milestone Sep 23, 2025
@gyuho gyuho self-assigned this Sep 23, 2025
codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 81.14286% with 132 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.42%. Comparing base (3a92838) to head (7050935).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/session/logout.go | 0.00% | 26 Missing ⚠️ |
| pkg/session/package_status.go | 0.00% | 25 Missing ⚠️ |
| pkg/session/session_process_request.go | 82.40% | 17 Missing and 5 partials ⚠️ |
| pkg/session/update.go | 38.23% | 18 Missing and 3 partials ⚠️ |
| pkg/session/deregister.go | 63.33% | 9 Missing and 2 partials ⚠️ |
| pkg/session/health_states.go | 88.04% | 9 Missing and 2 partials ⚠️ |
| pkg/session/session_serve.go | 81.35% | 10 Missing and 1 partial ⚠️ |
| pkg/session/bootstrap.go | 86.95% | 2 Missing and 1 partial ⚠️ |
| pkg/session/inject_fault.go | 91.66% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1082      +/-   ##
==========================================
+ Coverage   67.12%   67.42%   +0.29%     
==========================================
  Files         315      328      +13     
  Lines       26164    26260      +96     
==========================================
+ Hits        17563    17706     +143     
+ Misses       7717     7668      -49     
- Partials      884      886       +2     

@gyuho gyuho merged commit 1b82a1c into main Sep 23, 2025
12 of 13 checks passed
@gyuho gyuho deleted the LEP-2083-2 branch September 23, 2025 15:33