Update bootstore README (#3779)

andrewjstone · web-flow · commit 92467bd41864 · 2023-07-27T19:04:51.000Z
diff --git a/bootstore/README.adoc b/bootstore/README.adoc
@@ -14,6 +14,9 @@ each described by a zero-sized struct that can be debug printed as necessary.
 You can see the scheme description, represented as `V0Scheme`, for trust
 quorum version 0 in link:./src/schemes/v0/mod.rs[].
 
+The bootstore also provides a mechanism for replicating and storing early
+networking configuration required to bring up the rest of the control plane.
+This is implemented in the `Node` type, at a layer above the LRTQ FSM.
 
 == Rack Secret Generation and Share distribution
 
@@ -23,20 +26,13 @@ all live under the link:./src/trust_quorum[] directory. We anticipate that
 the `RackSecret` generation algorithms will remain the same across all schemes
 for the foreseeable future. This will not be the case for the `packages` that
 distribute key shares and metadata to different sleds, and so we version these.
-We may decide to move these "packages" into the specific scheme subdirectories
-because of this, but that is not done in the current code as written. The
-code is divided across a somewhat artificial line of protocol, provided in the
-schemes, and data, provided by the packages.
-
-While not yet done, the rack secret will be used as input key material to the
-link:../key-manager[] so that disks can be decrypted across sleds to allow the
-rack to boot.
-
-Thus while the schemes to distribute key shares and rotate and reconstruct the
-rack secret evolve to become more secure over time, the code using the rack
-secret can be kept consistent as it is just fed the same shape of input key
-material with each iteration.
 
+The rack secret is used as input key material to the link:../key-manager[] so
+that disks can be decrypted across sleds to allow the rack to boot. This allows
+the schemes to distribute key shares and rotate and reconstruct the rack secret
+to evolve and become more secure over time. It also enables the code using the
+rack secret to be kept consistent as long as it is fed the same shape of input
+key material.
 
 == Scheme V0
 
@@ -50,45 +46,6 @@ performs no I/O. This allows us to write property based tests that simulate an
 entire cluster of sleds operating in concert and ensure that any test failures
 shrink correctly.
 
-=== Future work
-
-The code that implements network IO is not currently written. It will be a thin
-shim of async code on top of the v0 `Fsm`. For this protocol to work, peer sled
-bootstrap addresses will be discovered via DDM, and persistent TCP connections
-established using a "downstream" strategy, where sleds with higher IP addresses
-connect to those with lower IP addresses. The first thing peers do when they
-connect to each other is exchange `Hello` messages to identify the scheme and
-protocol version. This is described in link:./src/schemes/mod.rs[], although
-it is not relevant to the FSM which operates solely inside the V0 scheme.
-After protocol negotiation each side sends an `Identify` message as described
-in link:./src/schemes/v0/messages.rs[]. This serves to inform each sled of the
-identity of its peer. The `Baseboard` serves as the unique identity for a peer
-and does not change over the lifetime of a sled. This message is only sent over
-the network links and is used by the networking layer to map TCP connections
-to peer IDs. The protocol FSM itself, however, is agnostic to the network layer
-other than knowing it uses reliable stream based communication. Protocol
-messages are only addressed via `Baseboard` and then routed appropriately by the
-network layer.
-
-The persistence of FSM state, shares, and metadata is also not yet implemented.
-We anticipate making the `PersistentState` or a serialized version of
-it `Ledgerable` and reusing the `Ledger` code in 
-link:../sled-agent/src/ledger.rs[]. For this to work, we'll have to extract the
-ledger code from `sled-agent` and put it in `omicron-common`.
-
-Once the network layer and persistence are implemented, the bootstore itself
-will be complete for scheme v0. We will then need to plug it into the bootstrap
-agent so that sleds can communicate and recompute a rack secret. To decrypt
-the drives, we will need to create a `KeyRetriever` 
-(see link:../key-manager/src/lib.rs[]) and plug that into `KeyManager` 
-construction in the bootstrap agent. And of course, RSS will have to trigger the
-bootstore to generate the rack secret and distribute the key share packages
-in the first place.
-
-The previous 3 paragraphs describe future work, although smaller and simpler
-than the bootstore protocol itself (hopefully). The remainder of this section
-will discuss the v0 Scheme.
-
 === Threat Model and Security Goals
 
 The trust quorum v0 scheme is limited in capability by the fact that a lot of
@@ -133,7 +90,7 @@ provide a key for each dataset on day 1. The question then becomes where do we
 get the input key material from. We essentially had 4 options:
 
  . Hardcode it - this is what is currently committed as the
-`LocalSecretRetreiver`
+`HardcodedSecretRetreiver`
  . Store random values constructed at RSS time on the M.2s and use them locally
  . Derive keys from VPD data unique to a sled
  . Build a simplified trust quorum scheme over untrusted channels on the
@@ -220,12 +177,12 @@ Learners will rotate through known peers until they find one that has a share.
 
 === Testing strategy
 
-The primary method of testing is generative testing via  https://proptest-
-rs.github.io/proptest/intro.html[proptest]. There are two property based tests:
-one for running as the `rack_coordinator` and one for running as a `learner`.
-Once the initial setup is performed to either initialize the rack, or learn
-a `LearnedSharePkg`, the tests largely share the same behavior in terms of
-processing generated `Action`s.
+The primary method of testing is generative testing via 
+https://proptest-rs.github.io/proptest/intro.html[proptest]. There are two
+property based tests: one for running as the `rack_coordinator` and one for
+running as a `learner`. Once the initial setup is performed to either initialize
+the rack, or learn a `LearnedSharePkg`, the tests largely share the same
+behavior in terms of processing generated `Action`s.
 
 One important thing to note is in regard to message responses. We always send
 responses to a request from the system under test (SUT) peer when a peer is