Ensure consistency across all code paths that write to the install dataset #8507

Open
@sunshowers

tl;dr: we should be consistent across all the different ways control plane zones can be deployed to the install dataset. This is not a blocker for the initial self-service update milestone, but is tech debt that we should repay at some point.


As part of RFD 556, we've created a couple of metadata files to manage the install dataset:

  1. The mupdate override file (mupdate-override.json) which indicates that the install dataset has recently been updated.
  2. The zone manifest (zones.json) which has a list of all the control plane zones and their hashes.

On production systems, the only way to update the install dataset is via a MUPdate, using Installinator and Wicket. This code path always writes both of these files out.

On test systems, there are at least three different ways to update the install dataset:

  1. Via a MUPdate using Installinator and Wicket, which always writes these files out.
  2. Through rackletteadm, which writes files to the install dataset directly in some situations. Today, direct writes to the install dataset do not cause either the zone manifest or the mupdate override file to be written out.
  3. Within the a4x2 setup, which also doesn't write either file out.

This issue is about making sure all of these code paths are consistent in how they write zones to the install dataset.

Why is this important?

As described in RFD 556, these two metadata files play a critical role in determining Reconfigurator planner behavior.

To recap: Let's say that an existing system has zones that are served from the artifact store. If a MUPdate happens at this point, the operator has indicated a desire to use the install dataset over the artifact store. When that happens:

  1. Sled Agent checks for the mupdate override file and, if it is present, honors the override, ignoring the artifact store.
  2. Sled Agent informs Nexus of the override as part of the next inventory collection.
  3. Within Nexus, the Reconfigurator planner updates the system blueprint to reflect reality, changing that sled's zone image sources to the install dataset. Simultaneously, it instructs Sled Agent to clear out the mupdate override file.
  4. Once Sled Agent has removed the override file, this fact is reported back to Nexus via an inventory collection.
  5. The Reconfigurator planner can then update its blueprint, setting that sled's zone image sources to using the artifact store, as long as the target release and install dataset hashes match.

After step 5, the system has recovered and Reconfigurator is back in charge.
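The five steps above can be sketched as a small simulation. This is a toy model for reasoning about the sequence, not Omicron's implementation: the `Sled` fields, the function names, and the use of a single hash string per install dataset are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ImageSource(Enum):
    INSTALL_DATASET = "install_dataset"
    ARTIFACT_STORE = "artifact_store"


@dataclass
class Sled:
    # State on the sled itself (hypothetical field names).
    override_on_disk: bool
    install_dataset_hash: str
    # What Nexus last learned from an inventory collection.
    override_in_inventory: bool = False
    # What the blueprint currently says about this sled.
    blueprint_source: ImageSource = ImageSource.ARTIFACT_STORE
    blueprint_wants_override_cleared: bool = False


def effective_source(sled):
    """Step 1: Sled Agent honors the override if the file is present."""
    if sled.override_on_disk:
        return ImageSource.INSTALL_DATASET
    return sled.blueprint_source


def collect_inventory(sled):
    """Steps 2 and 4: Sled Agent reports whether the override exists."""
    sled.override_in_inventory = sled.override_on_disk


def plan(sled, target_release_hash):
    """Steps 3 and 5: the planner reacts to the latest inventory."""
    if sled.override_in_inventory:
        # Step 3: reflect reality and ask for the override to be cleared.
        sled.blueprint_source = ImageSource.INSTALL_DATASET
        sled.blueprint_wants_override_cleared = True
    elif (sled.blueprint_source is ImageSource.INSTALL_DATASET
          and sled.install_dataset_hash == target_release_hash):
        # Step 5: hashes match, so hand control back to the artifact store.
        sled.blueprint_source = ImageSource.ARTIFACT_STORE


def execute(sled):
    """Sled Agent applies the blueprint, removing the override if asked."""
    if sled.blueprint_wants_override_cleared:
        sled.override_on_disk = False
        sled.blueprint_wants_override_cleared = False


def run_recovery(sled, target_release_hash, rounds=2):
    """Run a few inventory/plan/execute rounds and record the source used."""
    trace = [effective_source(sled)]
    for _ in range(rounds):
        collect_inventory(sled)
        plan(sled, target_release_hash)
        execute(sled)
        trace.append(effective_source(sled))
    return trace
```

In this model, a sled with the override on disk and matching hashes goes from the install dataset back to the artifact store after two rounds. A sled without the override never enters the sequence at all, which is the failure mode this issue is concerned with.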

In the above sequence, steps 1 through 4 depend on the presence of the mupdate override file, and step 5 depends on the zone manifest. If these files are missing:

  • If the mupdate override file is missing:
    • the above state machine will never be started, and the operator's desire to use the install dataset over the artifact store will not be honored
    • once #8486 lands, the planner will move to replace install-dataset zones with ones from the artifact store whenever the hashes match. This behavior is disabled when a mupdate override is in place, but it will happen if the mupdate override is missing.
  • If the zone manifest is missing, step 5 cannot take place.

In the interest of urgency, if the zone manifest is missing, we've chosen to synthesize one based on the existing zone images on disk rather than produce an error. That is a good enough workaround for now.
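As a rough illustration of what synthesizing a manifest from on-disk images could look like, here is a minimal sketch. It assumes zone images are `*.tar.gz` files at the top level of the install dataset and that hashes are SHA-256. The output layout (the `synthesized` and `zones` keys) is hypothetical and is not the actual `zones.json` schema.

```python
import hashlib
from pathlib import Path


def synthesize_zone_manifest(install_dataset: Path) -> dict:
    """Build a manifest-like dict by hashing each zone image found on disk.

    The key names here are illustrative assumptions, not the real schema.
    """
    zones = {}
    for image in sorted(install_dataset.glob("*.tar.gz")):
        data = image.read_bytes()
        zones[image.name] = {
            "sha256": hashlib.sha256(data).hexdigest(),
            "size": len(data),
        }
    # Mark the result so consumers can tell it was not written by a MUPdate.
    return {"synthesized": True, "zones": zones}
```

Because the hashes are computed from whatever is on disk, a synthesized manifest can only describe the current contents. It cannot say what the contents were meant to be, which is why the same approach does not work for the mupdate override file.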

So the remainder of this issue will talk about the mupdate override file, which cannot be synthesized.

Analysis of code paths

As mentioned above, there are at least three different code paths that write out zone images to the install dataset.

Installinator and Wicket

This code path is the only supported method on production systems. It is also used in some test environments:

  • On the dogfood rack, when following the dogfood script
  • Via rackletteadm, if:
    • doing an upgrade, or
    • a clean-slate install with --install-with-one-mupdate (source), on one scrimlet

With these code paths, these files are always written out, so everything works correctly.

Rackletteadm direct

This code path is used by rackletteadm when:

  • doing a clean-slate install, on all sleds (source)
  • doing a clean-slate install with --install-with-one-mupdate, on all sleds except for one scrimlet (source)

In these cases, the mupdate override file is not currently written out. But note that for a clean-slate install, rackletteadm runs RSS, which sets zone image sources to using the install dataset anyway. So step 1 (honoring the mupdate override) is not directly relevant.

However, behavior will diverge when all of the following conditions are met:

  • a clean-slate install is performed with --install-with-one-mupdate
  • #8486 lands
  • a target release is then set on the system to the same version the clean-slate install was performed with

In that case, the scrimlet that was MUPdated and the sleds that were not will be treated differently: the planner will set zone image sources to the artifact store on the other sleds, but not on the scrimlet.

This is a consideration worth noting, though likely not one that blocks our initial release.
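The divergence can be illustrated with a toy model of a single planning pass once #8486 lands. The `SledView` type, the `planned_source` function, and the string labels are hypothetical. The rule encoded is only the one stated in this issue: matching hashes lead to the artifact store unless a mupdate override is in place.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SledView:
    name: str
    mupdate_override_present: bool
    install_dataset_hash: str


def planned_source(sled: SledView, target_release_hash: Optional[str]) -> str:
    """Return the zone image source one planning pass would choose."""
    if sled.mupdate_override_present:
        # The replacement behavior is disabled while an override is in place.
        return "install_dataset"
    if target_release_hash is not None and sled.install_dataset_hash == target_release_hash:
        return "artifact_store"
    return "install_dataset"


# Clean-slate install with --install-with-one-mupdate: only the scrimlet
# went through Installinator, so only it has the override file.
sleds = [
    SledView("scrimlet", mupdate_override_present=True, install_dataset_hash="v1"),
    SledView("sled-a", mupdate_override_present=False, install_dataset_hash="v1"),
    SledView("sled-b", mupdate_override_present=False, install_dataset_hash="v1"),
]
result = {s.name: planned_source(s, "v1") for s in sleds}
```

With every sled running identical images, `result` still differs between the scrimlet and the other sleds, purely because of which code path wrote the install dataset.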

a4x2

This code path is used by a4x2. This path doesn't write out the mupdate override file either.

In the a4x2 environment, if the install dataset is changed after an online update, the install dataset may not be honored.

(This should be okay because I don't think we're ever changing the install dataset after online updates in a4x2—usually, we're doing clean-slate installs. But it's worth noting.)

Other code paths

(what else are we not covering here?)

Ideas

The difference in behavior is acceptable for now, but is tech debt that is worth addressing at some point. There are a few different ways to solve this:

  • Use MUPdates in more situations. The issue with this is that MUPdates are relatively slow since they must go over the UART (RFD 345). When possible, it would be nice to not have to do that.
  • Change code paths to also write these files out. The mupdate-override.json file in particular is quite straightforward, and can be written out from a shell script with a tiny amount of string interpolation.
  • Adapt Installinator to write out zones. Installinator can be used as a standalone binary to write out zones to install datasets (this is how we use it in integration tests). In principle, it could gain the ability to also be run in the rackletteadm and a4x2 environments.
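To give a sense of how small the second option could be, here is a minimal sketch of writing the override file from a helper script. The single `mupdate_uuid` field is an assumption for illustration. The real schema is defined in Omicron, may require additional fields, and should be checked before using anything like this.

```python
import json
import uuid
from pathlib import Path


def write_mupdate_override(install_dataset: Path) -> dict:
    """Write mupdate-override.json into the install dataset.

    The contents are a hypothetical stand-in for the real schema.
    """
    contents = {"mupdate_uuid": str(uuid.uuid4())}
    # Write to a temporary name first, then rename, so that a reader
    # never observes a partially written file.
    tmp = install_dataset / "mupdate-override.json.tmp"
    tmp.write_text(json.dumps(contents, indent=2) + "\n")
    tmp.replace(install_dataset / "mupdate-override.json")
    return contents
```

A fresh UUID per write lets consumers distinguish one update from the next. Whether the direct-write code paths should generate that identifier themselves or receive it from elsewhere is a design question this sketch does not settle.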
