[LEP-2083] fix(session): async triggerComponent response writes to avoid blocking the main serve loop #1082
Merged
+3,569
−1,310
eahydra reviewed Sep 23, 2025
eahydra approved these changes Sep 23, 2025
…oid blocking the main serve loop

A manual `triggerComponent`/`triggerComponentCheck` request against the disk component blocked GPUd's control-plane session loop. While the disk health probe sat in a long retry cycle, no responses were written back to the control plane, so GPUd appeared healthy locally but the control plane stayed stuck on stale health data.

**Issue**

- Control-plane machine object continued to show `nfs` health at `2025-09-18T05:48:16Z`, even though GPUd's local disk state had fresh timestamps (for example `2025-09-18T12:07:16Z`).
- GPUd logs repeated:
  - `session reader: error decoding response: EOF`
  - `session writer: error making request ... context canceled`
  - `drained stale messages from reader channel` during keep-alive resets.

**Root Cause**

1. `Session.serve` processes every control-plane message in a single goroutine. Before the fix it handled `triggerComponent` synchronously, so it waited for `comp.Check()` to finish before writing any response.
2. `components/disk/component.go` retries `disk.GetPartitions` up to **five** times per tracked mount, each wrapped in a **one-minute** timeout. A flaky `/mnt/nfs-share` can therefore block `Check()` for roughly five minutes per mount.
3. While `Check()` blocks, the writer goroutine cannot drain the 20-item `s.writer` channel. The control plane eventually cancels the streaming HTTP request, triggering the `session reader`/`session writer` error messages and the `drained stale messages` warning when the keep-alive loop tears everything down.
4. The disk component's background ticker keeps updating `lastCheckResult`, so GPUd's local state reflects the latest degraded result. However, the control plane never receives that update because the synchronous response never flushes.

**One** manual `triggerComponent` aimed at the disk component is sufficient to deadlock the session loop. The request is enqueued, `Session.serve` synchronously calls the long-running `disk.Check()`, and the writer never sends a response. When the control plane cancels the stuck request, GPUd restarts the session and the control plane still holds the last good health snapshot.

**Fix**

- `triggerComponent` will be processed asynchronously in a background goroutine.
- The response will be written back to the control plane once complete, with the same ReqID.

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
Codecov Report

| Coverage Diff | main | #1082 | +/- |
|---|---|---|---|
| Coverage | 67.12% | 67.42% | +0.29% |
| Files | 315 | 328 | +13 |
| Lines | 26164 | 26260 | +96 |
| Hits | 17563 | 17706 | +143 |
| Misses | 7717 | 7668 | -49 |
| Partials | 884 | 886 | +2 |
c.f., #1078