[29/n] blueprint planner logic for mupdate overrides #8456

sunshowers · 2025-06-26T05:26:07Z

Start implementing the blueprint planner logic for mupdate overrides.

There's still a lot of work to do:

- Ensure the logic is sound, especially around errors, missing inventory, etc.
- ~~Move zone image sources back to artifacts once we know they've been distributed to a sled. (How do we know they've been distributed to a sled?)~~ Done in [24/n] [reconfigurator-planning] support no-op image source updates #8486.
- Is the set of reasons for not proceeding with other steps accurate? To me it does seem like we need to wait until both conditions are met, but I'd like to check.
- Lots more tests.
- I don't think this can land before the logic to reset the mupdate override field in sled-agent lands. I also think this may have to land simultaneously with the code to redirect to the install dataset within sled-agent.
  - We've decided to land this first, with the understanding that blueprint planning will be a bit broken since sled-agent doesn't clear the mupdate override field yet. I plan to work on clearing the mupdate override field next, and hope to get both approved and land both roughly simultaneously.

Depends on:

- [24/n] [reconfigurator-planning] support no-op image source updates #8486
- [27/n] [reconfigurator-cli] allow specifying "latest" for collection ID #8536
- [28/n] [blippy/sled-agent] add remove_mupdate_override checks #8561
- everything before that in the stack

Created using spr 1.3.6-beta.1

sunshowers · 2025-06-26T05:50:22Z

nexus/reconfigurator/planning/src/planner.rs

+        let mut sleds_with_override = BTreeSet::new();
+        for sled_id in self.input.all_sled_ids(SledFilter::InService) {


Thinking about this, I'm wondering what would happen here if a sled goes away in the middle of this process, disappearing from inventory. In that case the remove_mupdate_override field never gets cleared from the blueprint.

We would do:

1. Expunge the sled in the first blueprint.
2. When executed, the sled policy will be updated to Expunged in the planning input.
3. Then next planning cycle it'll no longer be in the InService set.

So I think we'll eventually converge -- it'll just take a couple cycles. (A TODO is to add a test for this.)

Tested this with the new inventory-hidden and inventory-visible subcommands in reconfigurator-cli.
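To make the convergence argument concrete, here is a minimal sketch with simplified stand-in types. `SledPolicy`, `SledState`, and the free function `sleds_with_override` are hypothetical and are not the planner's real API; the only thing taken from the diff is the idea of filtering on in-service sleds before collecting the ones with an override.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical, simplified stand-ins for the real planner types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SledPolicy {
    InService,
    Expunged,
}

#[derive(Clone, Debug)]
struct SledState {
    policy: SledPolicy,
    // True if the blueprint still carries a remove_mupdate_override
    // for this sled.
    override_in_blueprint: bool,
}

/// Returns the in-service sleds whose blueprint entry still carries an
/// override. Expunged sleds are filtered out first, so a sled that vanished
/// from inventory stops blocking planning once its policy is Expunged.
fn sleds_with_override(sleds: &BTreeMap<u32, SledState>) -> BTreeSet<u32> {
    sleds
        .iter()
        .filter(|(_, s)| s.policy == SledPolicy::InService)
        .filter(|(_, s)| s.override_in_blueprint)
        .map(|(id, _)| *id)
        .collect()
}
```

With this shape, a sled whose override can never be cleared drops out of the set as soon as the planning input reports it as `Expunged`, which matches the "couple of cycles" expectation above.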

sunshowers · 2025-06-26T05:56:14Z

nexus/reconfigurator/planning/src/blueprint_editor/sled_editor.rs

+                    let old_image_source = self.zones.set_zone_image_source(
+                        &zone_id,
+                        BlueprintZoneImageSource::InstallDataset,
+                    )?;


RFD 556 says:

Wherever the planner uses the target release, it is instead ignored if its generation number is not greater than min_release_generation (if set).

As discussed in Tuesday's watercooler it's a bit more complex than that -- what we want to do is to only use the install dataset on sleds that have been mupdated, since on other sleds the install dataset may be woefully out of date.

I think this code may actually be sufficient as it stands. I don't think we need any redirects other than this one (which is admittedly edge-triggered), as long as we prevent new zones from being set up at all while the system is recovering from the mupdate.

We decided to not proceed with adding new zones until the mupdate override has been completely cleared.
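A rough sketch of the edge-triggered redirect plus the "no new zones while recovering" gate, under stated assumptions: `ImageSource`, `SledEditor`, `observe_inventory_override`, and `can_add_zones` are illustrative names, and the override is modeled as an opaque `u64` rather than the real ID type.

```rust
// Hypothetical stand-in for BlueprintZoneImageSource.
#[derive(Clone, Debug, PartialEq, Eq)]
enum ImageSource {
    InstallDataset,
    Artifact { hash: String },
}

// Hypothetical stand-in for the per-sled blueprint editor.
#[derive(Debug, Default)]
struct SledEditor {
    // Override recorded in the blueprint, if any (assumed: an opaque ID).
    blueprint_override: Option<u64>,
    zones: Vec<ImageSource>,
}

impl SledEditor {
    /// Edge-triggered: zones are redirected to the install dataset only at
    /// the moment inventory reports an override the blueprint has not seen.
    /// Returns the number of zones whose image source changed.
    fn observe_inventory_override(&mut self, inv_override: Option<u64>) -> usize {
        match inv_override {
            Some(id) if self.blueprint_override != Some(id) => {
                self.blueprint_override = Some(id);
                let mut changed = 0;
                for zone in &mut self.zones {
                    if *zone != ImageSource::InstallDataset {
                        *zone = ImageSource::InstallDataset;
                        changed += 1;
                    }
                }
                changed
            }
            // Same override as before, or none at all: nothing to redirect.
            _ => 0,
        }
    }
}

/// New zones are held back while any sled still has an override in place.
fn can_add_zones(sleds: &[SledEditor]) -> bool {
    sleds.iter().all(|s| s.blueprint_override.is_none())
}
```

Because only mupdated sleds ever take the `Some(id)` branch, sleds with a stale install dataset are never redirected, and the gate means no zone is created with a source that would later need redirecting.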

jgallagher · 2025-06-26T15:35:31Z

nexus/reconfigurator/planning/src/blueprint_editor/sled_editor.rs

+                // override that was set in the above branch. We can remove the
+                // override from the blueprint.
+                self.set_remove_mupdate_override(None);
+                // TODO: change zone sources from InstallDataset to Artifact


I'm not sure we'll need to do this here; the normal upgrade path should change zone sources already, right? (It just needs to not do that while a mupdate override is in place.)

Although maybe sled-agent should do something like "if I'm changing from install dataset with hash X to artifact with hash X, don't actually bounce the zone".

> I'm not sure we'll need to do this here; the normal upgrade path should change zone sources already, right? (It just needs to not do that while a mupdate override is in place.)

Good question -- we do this one zone at a time currently, and I guess this would be an opportunity to do a bulk replace. (But why not always do a level-triggered bulk replace?)

> Although maybe sled-agent should do something like "if I'm changing from install dataset with hash X to artifact with hash X, don't actually bounce the zone".

Yeah, this is reasonable.

Resolved in #8486, and TODO removed.
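For illustration, the "don't bounce if the hashes match" idea could look roughly like the following. This is a sketch, not sled-agent's actual code: `ZoneImage` and `needs_bounce` are hypothetical, and it assumes the hash of the install-dataset image is already known to the caller.

```rust
// Hypothetical model: both variants carry the hash of the image contents.
// (Assumption: the install dataset's hash has been looked up beforehand.)
#[derive(Clone, Debug, PartialEq, Eq)]
enum ZoneImage {
    InstallDataset { hash: String },
    Artifact { hash: String },
}

impl ZoneImage {
    fn hash(&self) -> &str {
        match self {
            ZoneImage::InstallDataset { hash } | ZoneImage::Artifact { hash } => {
                hash.as_str()
            }
        }
    }
}

/// A zone needs a restart only if the image contents change; a change of
/// source alone (install dataset -> artifact, same hash) does not count.
fn needs_bounce(old: &ZoneImage, new: &ZoneImage) -> bool {
    old.hash() != new.hash()
}
```

Comparing on content rather than on the source variant is what would make a bulk switch back to artifacts a no-op from the zone's point of view.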

jgallagher · 2025-06-26T15:48:22Z

nexus/reconfigurator/planning/src/blueprint_editor/sled_editor.rs

+                    );
+                }
+
+                // TODO: Do the same for RoT/SP/host OS.


I think this will be as simple as:

- clear any PendingMgsUpdates for this sled
- change the host phase 2 in the OmicronSledConfig to the equivalent of InstallDataset for zones (this doesn't exist yet but will be coming soon)

What I'm less sure about is what happens if there are PendingMgsUpdates in the current target blueprint concurrently with a mupdate happening to that sled. Maybe wicket and Nexus end up dueling? If the mupdate completes and changes the contents of any of the target slots Nexus's prechecks should start failing, but if the mupdate happens to not change the target slots, maybe the prechecks still pass and Nexus starts trying to update it again as soon as it comes online?

I haven't done this yet -- worth discussing in the watercooler tomorrow?
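As a sketch of the first bullet only, clearing a pending update for one sled can be done with the map entry API. `PendingMgsUpdate` here is a simplified stand-in, and keying the map by a baseboard string is an assumption made for the example.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

// Simplified stand-in; the real type carries much more detail.
#[derive(Debug, PartialEq, Eq)]
struct PendingMgsUpdate {
    artifact_hash: String,
}

/// Removes and returns the pending update for `baseboard`, if there is one.
/// (Assumption: pending updates are keyed by a baseboard identifier.)
fn clear_pending_update(
    pending: &mut BTreeMap<String, PendingMgsUpdate>,
    baseboard: &str,
) -> Option<Box<PendingMgsUpdate>> {
    match pending.entry(baseboard.to_string()) {
        Entry::Vacant(_) => None,
        Entry::Occupied(entry) => Some(Box::new(entry.remove())),
    }
}
```

Returning the removed value lets the caller report what was cleared. This does nothing about the dueling question, which depends on the prechecks rather than on the blueprint edit.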

jgallagher · 2025-06-26T15:54:41Z

nexus/reconfigurator/planning/src/planner.rs

+            // If do_plan_mupdate_override returns Waiting, we don't plan *any*
+            // additional steps until the system has recovered.
+            self.do_plan_add()?;
+            self.do_plan_decommission()?;


I think we could still decommission things if there's a mupdate override in place? This only acts on sleds or disks that an operator has explicitly told us are gone, and is basically a followup to do_plan_expunge(). (Maybe this step should be ordered before do_plan_add() anyway? I don't think there are any dependencies between them...)

Yep -- done.
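The resulting step ordering can be sketched as below. The enum names and `plan_steps` are hypothetical, and the position of the mupdate override step relative to decommission is an assumption; the one thing taken from this thread is that decommission runs unconditionally while add is gated.

```rust
// Hypothetical result of the mupdate override planning step.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MupdateOverrideStatus {
    Clear,
    Waiting,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Step {
    Expunge,
    Decommission,
    MupdateOverride,
    Add,
}

/// Lists the steps a planning pass would run. Decommission only follows up
/// on things an operator already expunged, so it is never gated; Add is
/// skipped while the system is still recovering from a mupdate.
fn plan_steps(status: MupdateOverrideStatus) -> Vec<Step> {
    let mut steps = vec![Step::Expunge, Step::Decommission, Step::MupdateOverride];
    if status == MupdateOverrideStatus::Clear {
        steps.push(Step::Add);
    }
    steps
}
```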

jgallagher · 2025-06-26T15:57:47Z

nexus/reconfigurator/planning/src/planner.rs

+                    // generation table -- one of the invariants of the target
+                    // release generation is that it only moves forward.
+                    //
+                    // In this case we warn but set the value.


This doesn't seem right; we should probably bail out of planning entirely in this case, right? This seems like an "I don't know what's going on in the world" kind of thing that in a simpler system we'd assert on?

Yeah -- done.
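A minimal sketch of "bail out instead of warn", assuming generations are plain `u64` values; `GenerationWentBackwards` and `advance_generation` are made-up names for the example, not the planner's error type.

```rust
use std::fmt;

// Hypothetical error for a violated "only moves forward" invariant.
#[derive(Debug, PartialEq, Eq)]
struct GenerationWentBackwards {
    current: u64,
    proposed: u64,
}

impl fmt::Display for GenerationWentBackwards {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "target release generation moved backwards: {} -> {}",
            self.current, self.proposed
        )
    }
}

impl std::error::Error for GenerationWentBackwards {}

/// Accepts the proposed generation only if it does not move backwards.
/// Returning an error lets the caller abort planning with `?` rather than
/// continuing from a state it does not understand.
fn advance_generation(
    current: u64,
    proposed: u64,
) -> Result<u64, GenerationWentBackwards> {
    if proposed < current {
        Err(GenerationWentBackwards { current, proposed })
    } else {
        Ok(proposed)
    }
}
```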

In #8456 we'll block the `do_plan_add` step on whether the system is currently recovering from a mupdate override. But there's no reason to block the `do_plan_decommission` step on that. This is easiest expressed by moving decommission to before add.


`HostPhase2DesiredContents` is analogous to `OmicronZoneImageSource`, but for OS images: either keep the current contents of the boot disk or set it to a specific artifact from the TUF repo depot. "Keep the current contents" should show up in three cases, just like `OmicronZoneImageSource::InstallDataset`:

1. It's the default value for deserializing, so we can load old configs that didn't have this value.
2. RSS uses it (no TUF repo depot involved at this point).
3. The planner will use this variant as a part of removing a mupdate override (this work is still in PR itself: #8456 (comment)).
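The shape described here might look like the sketch below. The variant names and the `desired_contents` helper are assumptions for illustration, and plain `Default` stands in for the serde default that the real config would use.

```rust
// Sketch only: variant names are assumed, not copied from the real type.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
enum HostPhase2DesiredContents {
    /// Keep whatever is currently on the boot disk. This is the default so
    /// that old configs without the field still load.
    #[default]
    CurrentContents,
    /// Write a specific artifact from the TUF repo depot.
    Artifact { hash: String },
}

/// Hypothetical loader: a missing field falls back to the default variant.
fn desired_contents(field: Option<String>) -> HostPhase2DesiredContents {
    field
        .map(|hash| HostPhase2DesiredContents::Artifact { hash })
        .unwrap_or_default()
}
```

The same default variant is what the planner would select when removing a mupdate override, mirroring `InstallDataset` for zones.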


sunshowers · 2025-07-10T03:19:37Z

@jgallagher this is ready for you to look at again -- have added clearing the pending MGS update. I'm going to try and land this simultaneously with the sled-agent changes to clear the mupdate override, though, because by itself it will cause the planner to not do anything after the mupdate occurs.


jgallagher · 2025-07-11T14:59:29Z

nexus/reconfigurator/planning/src/blueprint_editor/sled_editor.rs

+                    Entry::Occupied(entry) => Some(Box::new(entry.remove())),
+                };
+
+                // TODO: Do the same for host OS.


Could you add a reference to #8542 here? I'm assuming this will land before #8570, and that'll help me find all the spots I need to fixup there.

jgallagher · 2025-07-11T15:02:02Z

nexus/reconfigurator/planning/src/planner.rs

+        }
+
+        // Now we need to determine whether to also perform other actions like
+        // updating or adding zones. We have to be careful here:


This is an excellent comment; thanks!

[spr] initial version

ba8676a


sunshowers marked this pull request as draft June 26, 2025 05:26

clippy

97e3c59


sunshowers commented Jun 26, 2025


jgallagher reviewed Jun 26, 2025


plotnick mentioned this pull request Jun 26, 2025

Planner wait conditions #8453

Closed

This was referenced Jun 26, 2025

sled-agent should not bounce zones if the image source changes but their hashes match #8463

Closed

[20/n] [reconfigurator-planning] do decommission before add #8464

Merged

sunshowers changed the title ~~[wip] [20/n] blueprint planner logic for mupdate overrides~~ [wip] [23/n] blueprint planner logic for mupdate overrides Jun 27, 2025

sunshowers changed the title ~~[wip] [23/n] blueprint planner logic for mupdate overrides~~ [wip] [??/n] blueprint planner logic for mupdate overrides Jul 1, 2025

rebase, mostly ready for review

3638cf8


sunshowers marked this pull request as ready for review July 8, 2025 05:15

sunshowers changed the title ~~[wip] [??/n] blueprint planner logic for mupdate overrides~~ [28/n] blueprint planner logic for mupdate overrides Jul 8, 2025

jgallagher mentioned this pull request Jul 8, 2025

Add desired host phase 2 contents to OmicronSledConfig (PR 1/4) #8538

Merged

rebase on 28, comments, pending MGS updates

99668b1


sunshowers added 2 commits July 10, 2025 03:30

update comment

e506a84


clippy

ca33786


sunshowers changed the title ~~[28/n] blueprint planner logic for mupdate overrides~~ [29/n] blueprint planner logic for mupdate overrides Jul 10, 2025

sunshowers requested a review from andrewjstone July 10, 2025 22:49

jgallagher approved these changes Jul 11, 2025
