[LEP-2083] fix(session): async "triggerComponent" to prevent blocking checks #1078
60c0cfe to b442da6
A manual `triggerComponent`/`triggerComponentCheck` request against the disk component blocked GPUd's control-plane session loop. While the disk health probe sat in a long retry cycle, no responses were written back to the control plane, so GPUd appeared healthy locally but the control plane stayed stuck on stale health data.

**Issue**

- Control-plane machine object continued to show `nfs` health at `2025-09-18T05:48:16Z`, even though GPUd's local disk state had fresh timestamps (for example `2025-09-18T12:07:16Z`).
- GPUd logs repeated:
  - `session reader: error decoding response: EOF`
  - `session writer: error making request ... context canceled`
  - `drained stale messages from reader channel` during keep-alive resets.

**Root Cause**

1. `Session.serve` processes every control-plane message in a single goroutine. Before the fix it handled `triggerComponent` synchronously, so it waited for `comp.Check()` to finish before writing any response.
2. `components/disk/component.go` retries `disk.GetPartitions` up to **five** times per tracked mount, each wrapped in a **one-minute** timeout. A flaky `/mnt/nfs-share` can therefore block `Check()` for roughly five minutes per mount.
3. While `Check()` blocks, the writer goroutine cannot drain the 20-item `s.writer` channel. The control plane eventually cancels the streaming HTTP request, triggering the `session reader`/`session writer` error messages and the `drained stale messages` warning when the keep-alive loop tears everything down.
4. The disk component's background ticker keeps updating `lastCheckResult`, so GPUd's local state reflects the latest degraded result. However, the control plane never receives that update because the synchronous response never flushes.

**One** manual `triggerComponent` aimed at the disk component is sufficient to deadlock the session loop. The request is enqueued, `Session.serve` synchronously calls the long-running `disk.Check()`, and the writer never sends a response. When the control plane cancels the stuck request, GPUd restarts the session and the control plane still holds the last good health snapshot.

**Fix**

- `triggerComponent` / `triggerComponentCheck` now capture the target component names, enqueue an immediate acknowledgement (with empty `States` placeholders) back to the control plane, and launch the expensive `comp.Check()` work in a separate goroutine. This keeps the session writer responsive even when disk checks spend minutes retrying `disk.GetPartitions`.

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
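The acknowledge-then-check pattern described in the **Fix** section can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the `request`, `response`, `checker`, and `slowDisk` types, the `done` channel used to observe the background result, and the `runDemo` helper are all hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"time"
)

// request and response are hypothetical stand-ins for session messages.
type request struct {
	ReqID     string
	Component string
}

type response struct {
	ReqID  string
	States []string // left empty in the immediate acknowledgement
}

// checker is a hypothetical stand-in for a component with a slow Check().
type checker interface{ Check() string }

type slowDisk struct{ delay time.Duration }

func (d slowDisk) Check() string {
	time.Sleep(d.delay) // simulates retries against a flaky mount
	return "degraded"
}

// serve handles every request in one goroutine. For each known component it
// enqueues an empty acknowledgement first, then runs Check() in a separate
// goroutine so the loop can move on to the next message right away.
func serve(reqs <-chan request, writer chan<- response, comps map[string]checker, done chan<- string) {
	for req := range reqs {
		comp, ok := comps[req.Component]
		if !ok {
			continue // unknown component: nothing to trigger in this sketch
		}
		writer <- response{ReqID: req.ReqID, States: []string{}}
		go func() { done <- comp.Check() }()
	}
}

// runDemo reports whether the acknowledgement was readable before the slow
// check finished, plus the eventual background check result.
func runDemo(delayMillis int) (ackFirst bool, result string) {
	reqs := make(chan request)
	writer := make(chan response, 20)
	done := make(chan string, 1)
	comps := map[string]checker{
		"disk": slowDisk{delay: time.Duration(delayMillis) * time.Millisecond},
	}
	go serve(reqs, writer, comps, done)
	defer close(reqs)

	reqs <- request{ReqID: "req-1", Component: "disk"}
	ack := <-writer
	select {
	case result = <-done:
		return false, result // the check finished before we looked
	default:
	}
	result = <-done
	return ack.ReqID == "req-1" && len(ack.States) == 0, result
}

func main() {
	ackFirst, result := runDemo(50)
	fmt.Println("ack before check finished:", ackFirst)
	fmt.Println("background check result:", result)
}
```

The trade-off in this shape is that the acknowledgement carries no health data, so the caller has to learn the check result some other way; the sketch only uses the `done` channel to make that later result observable.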
**Codecov Report**

❌ Patch coverage is …

Coverage Diff

| | main | #1078 | +/- |
|---|---|---|---|
| Coverage | 67.27% | 67.29% | +0.02% |
| Files | 316 | 316 | |
| Lines | 26312 | 26342 | +30 |
| Hits | 17702 | 17728 | +26 |
| Misses | 7726 | 7729 | +3 |
| Partials | 884 | 885 | +1 |
    
gyuho added a commit that referenced this pull request on Sep 23, 2025
…oid blocking the main serve loop (#1082)

The commit message repeats the summary, **Issue**, and **Root Cause** text of this pull request verbatim; only its **Fix** section differs.

**Fix**

- `triggerComponent` will be processed asynchronously in a background goroutine.
- The response will be written back to the control plane once complete, with the same ReqID.

cf. #1078

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
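The variant described in this commit, where the response is deferred and written back under the original ReqID, might look something like the sketch below. It is an illustration under assumptions rather than the code from #1082: the `request` and `response` types, the `check` function, the `"states"` method name, and the `responseOrder` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// request and response are hypothetical stand-ins for session messages.
type request struct {
	ReqID  string
	Method string
}

type response struct {
	ReqID string
	State string
}

// check is a hypothetical stand-in for a slow comp.Check() call.
func check(delay time.Duration) string {
	time.Sleep(delay)
	return "degraded"
}

// serve reads requests in a single goroutine. A triggerComponent request is
// handed to a background goroutine, which writes its response once the check
// completes and tags it with the ReqID of the request that started it. Other
// requests are answered inline, so they are not held up by the slow check.
func serve(reqs <-chan request, writer chan<- response, checkDelay time.Duration) {
	var wg sync.WaitGroup
	for req := range reqs {
		switch req.Method {
		case "triggerComponent":
			wg.Add(1)
			go func(id string) {
				defer wg.Done()
				writer <- response{ReqID: id, State: check(checkDelay)}
			}(req.ReqID)
		default:
			writer <- response{ReqID: req.ReqID, State: "ok"}
		}
	}
	wg.Wait() // let in-flight checks finish before closing the writer
	close(writer)
}

// responseOrder sends a slow trigger followed by a fast request and returns
// the ReqIDs in the order their responses were written.
func responseOrder(delayMillis int) []string {
	reqs := make(chan request)
	writer := make(chan response, 20)
	go serve(reqs, writer, time.Duration(delayMillis)*time.Millisecond)

	reqs <- request{ReqID: "req-1", Method: "triggerComponent"}
	reqs <- request{ReqID: "req-2", Method: "states"}
	close(reqs)

	var order []string
	for resp := range writer {
		order = append(order, resp.ReqID)
	}
	return order
}

func main() {
	fmt.Println(responseOrder(50))
}
```

Because responses can now arrive out of request order (here the fast `req-2` answer is normally written before the slow `req-1` answer), the receiver has to match each response to its request by ReqID instead of by position in the stream.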
  