@gyuho
Member

@gyuho gyuho commented Aug 22, 2025

  • add "gpud run --plugins-init-fail-fast" flag (default false)
  • retry on init plugin failures
  • update spec comparison logic for empty spec edge cases

By default, GPUd proceeds to join the cluster (or initialize the server)
even if an init plugin returns an unhealthy state (e.g., its command exits with code 1).

Init plugin failures are now retried.

@gyuho gyuho added this to the v0.7.0 milestone Aug 22, 2025
@gyuho gyuho self-assigned this Aug 22, 2025
@codecov

codecov bot commented Aug 22, 2025

Codecov Report

❌ Patch coverage is 26.47059% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.68%. Comparing base (b787964) to head (13e10d9).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/server/server.go | 0.00% | 11 Missing ⚠️ |
| cmd/gpud/command/command.go | 0.00% | 4 Missing ⚠️ |
| pkg/session/serve.go | 0.00% | 3 Missing and 1 partial ⚠️ |
| pkg/session/session.go | 42.85% | 4 Missing ⚠️ |
| cmd/gpud/run/command.go | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1051      +/-   ##
==========================================
- Coverage   66.71%   66.68%   -0.04%     
==========================================
  Files         314      314              
  Lines       26560    26588      +28     
==========================================
+ Hits        17720    17729       +9     
- Misses       7955     7971      +16     
- Partials      885      888       +3     

@gyuho gyuho force-pushed the LEP-1814 branch 2 times, most recently from 08e8b9d to 6c25680 Compare August 22, 2025 09:40
@gyuho gyuho changed the title from "[LEP-1814] feat(gpud): add "gpud run --plugins-init-fail-fast" flag (default false)" to "[LEP-1814] feat(gpud): add "gpud run --plugins-init-fail-fast" flag (default false), retry on init plugin failures" Aug 22, 2025
@gyuho gyuho changed the title from "[LEP-1814] feat(gpud): add "gpud run --plugins-init-fail-fast" flag (default false), retry on init plugin failures" to "[LEP-1814] feat(gpud): make plugin specs update/install more reliable" Aug 22, 2025
- add "gpud run --plugins-init-fail-fast" flag (default false)
- retry on init plugin failures
- update spec comparison logic for empty spec edge cases

By default, GPUd proceeds to join the cluster (or initialize the server)
even if an init plugin returns an unhealthy state (e.g., its command exits with code 1).

Init plugin failures are now retried.

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>
@gyuho gyuho added the "wip - do not merge" (working in progress) label Aug 23, 2025
@gyuho gyuho changed the milestone from v0.7.0 to v0.8.0 Aug 23, 2025
@gyuho gyuho closed this Aug 23, 2025
@gyuho gyuho deleted the LEP-1814 branch August 23, 2025 04:10
gyuho added a commit that referenced this pull request Aug 25, 2025
…ec are none (#1052)

Cherry-picked from #1051.

Signed-off-by: Gyuho Lee <gyuhol@nvidia.com>