add omdb read and write method for bootstore #8476

Draft: wants to merge 1 commit into main

Conversation

internet-diglett
Contributor

No description provided.

@andrewjstone
Contributor

andrewjstone commented Jun 28, 2025

Just got home from a concert, and noticed this PR. I'm curious what the use case is.

The bootstore is eventually consistent and should only be written to from nexus with data that is in crdb. It's a cache of nexus data. Two different values with the same version being written could be catastrophic. Since there is no way to get a consistent read and values are gossiped around, you can end up accidentally reusing versions and causing collisions, which is why we serialize all writes through nexus first inside a transaction.
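To illustrate the hazard described above, here is a minimal sketch. It assumes a hypothetical replica that accepts a gossiped update only when its version is strictly newer; `Versioned`, `Replica`, and `receive` are made-up names for illustration and are not the bootstore's actual types or merge rule.

```rust
/// Hypothetical stand-in for a versioned config value held by one sled.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Versioned {
    version: u64,
    value: String,
}

/// Hypothetical replica (an assumption, not the real bootstore):
/// it keeps an incoming update only if the version is strictly newer.
#[derive(Debug, Default)]
struct Replica {
    current: Option<Versioned>,
}

impl Replica {
    fn receive(&mut self, update: &Versioned) {
        let newer = match &self.current {
            None => true,
            Some(cur) => update.version > cur.version,
        };
        if newer {
            self.current = Some(update.clone());
        }
    }
}

/// Two writers pick the same version for different values. Each replica
/// keeps whichever update arrived first, so the replicas never agree.
fn diverged_after_colliding_writes() -> bool {
    let a = Versioned { version: 2, value: "config-A".to_string() };
    let b = Versioned { version: 2, value: "config-B".to_string() };

    let mut r1 = Replica::default();
    let mut r2 = Replica::default();
    r1.receive(&a);
    r1.receive(&b); // ignored: same version as what r1 already holds
    r2.receive(&b);
    r2.receive(&a); // ignored: same version as what r2 already holds

    r1.current != r2.current
}

/// When a single writer hands out versions, the same two values get
/// distinct versions and both replicas settle on the highest one,
/// regardless of arrival order.
fn converged_after_serialized_writes() -> bool {
    let a = Versioned { version: 2, value: "config-A".to_string() };
    let b = Versioned { version: 3, value: "config-B".to_string() };

    let mut r1 = Replica::default();
    let mut r2 = Replica::default();
    r1.receive(&a);
    r1.receive(&b);
    r2.receive(&b);
    r2.receive(&a); // ignored: older than what r2 already holds

    r1.current == r2.current
}

fn main() {
    println!("colliding writes diverge: {}", diverged_after_colliding_writes());
    println!("serialized writes converge: {}", converged_after_serialized_writes());
}
```

Under these assumptions, arrival order alone decides which value each replica keeps when versions collide, which is the failure that funnelling every write through one version-assigning writer is meant to rule out.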

We really shouldn't even allow reading from the bootstore outside of its local sled agent. The only reason we exposed that was as a stopgap for FCS, but I don't remember offhand why the read was required. Instead of adding new functionality we should make the nexus->sled agent interface write only. I'd be more than happy to spend some time next week cleaning up the existing usage to make it safer. Then we can discuss what other functionality you need.

I'm not saying you don't have a good reason to add these methods, but there is probably a safer way to do it.

Let's talk about it on Monday.

This is actually a great reminder to get back to this issue, and I am looking forward to working with you on it @internet-diglett. Kinda perfect timing actually, as trust quorum work is ramping up.

@internet-diglett
Contributor Author

> Just got home from a concert, and noticed this PR. I'm curious what the use case is.

@taspelund was brainstorming some ideas for some example "broken" a4x2 topologies for internal training and troubleshooting and asked me how one would go about seeing what network configuration the sleds had in a pre-nexus world. This was just something I whipped up for him while we were chatting to let him take a look at the early network config and poke at it for his labbing.

We also noted from one of our customer installs that having to do a clean slate to update an incorrect network config (i.e., a config that prevented nexus from coming up for the first time) was very time consuming, and it would be nice to have an additional mechanism to update that. However, I don't necessarily think this should be that, since it would probably be better for such features to live inside of wicket.

> We really shouldn't even allow reading from the bootstore outside of its local sled agent. The only reason we exposed that was as a stopgap for FCS, but I don't remember offhand why the read was required.

We also currently rely on the read functionality on the first cycle of the network synchronization RPW in nexus.

> Instead of adding new functionality we should make the nexus->sled agent interface write only.

Is it really a bad thing to have a way to observe what is going on in the bootstore, even if it's just for our internal use? I recall being able to see what was in the bootstore was very valuable for troubleshooting issues on dogfood. This just exposes that ability to omdb instead of someone manually crafting a curl command.

@internet-diglett
Contributor Author

Also, I didn't really mean to create a PR for this, it was just habit. I mainly pushed this branch to share it with @taspelund for his labbing.

@andrewjstone
Contributor

> @taspelund was brainstorming some ideas for some example "broken" a4x2 topologies for internal training and troubleshooting and asked me how one would go about seeing what network configuration the sleds had in a pre-nexus world. This was just something I whipped up for him while we were chatting to let him take a look at the early network config and poke at it for his labbing.
>
> We also noted from one of our customer installs that having to do a clean slate to update an incorrect network config (i.e., a config that prevented nexus from coming up for the first time) was very time consuming, and it would be nice to have an additional mechanism to update that. However, I don't necessarily think this should be that, since it would probably be better for such features to live inside of wicket.

Ah, that all makes sense. Pre-nexus is a good use case for this type of introspection and cleanup. I don't think it will actually work to do the cleanup this way, because you need to delete the bootstore files. Unfortunately, they'll immediately be replicated again unless you shut down the sled-agents, which wouldn't allow you to delete them. Fun circular dep there. I think we can do better with real trust quorum here and make clean slate first class.

> We also currently rely on the read functionality on the first cycle of the network synchronization RPW in nexus.

Right, yeah, this was the thing I was thinking of that we needed for FCS ship. I'd like to rework this so that the data gets pre-populated into nexus from RSS and remove this altogether from the RPW. Clearly it works as is, and I remember working with you on it before we shipped. We also discussed removing the reads at some point. Maybe that point is now 😄 ?

> Is it really a bad thing to have a way to observe what is going on in the bootstore, even if it's just for our internal use? I recall being able to see what was in the bootstore was very valuable for troubleshooting issues on dogfood. This just exposes that ability to omdb instead of someone manually crafting a curl command.

No, it's not bad. You are totally correct that this is useful, and I was being shortsighted in my comment. We definitely have logged in and manually read the files, and doing this from OMDB, which could also decode the JSON and pretty-print it, would be a big win. Clearly I jumped the gun here about removing the functionality altogether. We just have to keep in mind that a read from one sled-agent is only local to that sled-agent, but that's basically what OMDB is designed for anyway. I think my brain was fried from sitting in traffic all day yesterday lol.

> Also, I didn't really mean to create a PR for this, it was just habit. I mainly pushed this branch to share it with @taspelund for his labbing.

Don't let me dissuade you from opening PRs for stuff, even if it's just temporary! You marked it as draft, and it just immediately piqued my interest. I should have just pinged you directly on Monday in chat, but wanted to make sure you didn't spend time building something without some caution. Also, I realize everything that I wrote last night is stuff you basically know, but it's been about 2 years since we've discussed this, so I figured I'd lay it out again.

Also, I really am wondering if there are enhancements you'd like to see made to the bootstore, as it's something that I can bake in now that I'm working on trust-quorum again.

@jgallagher
Contributor

> We also noted from one of our customer installs that having to do a clean slate to update an incorrect network config (i.e., a config that prevented nexus from coming up for the first time) was very time consuming, and it would be nice to have an additional mechanism to update that. However, I don't necessarily think this should be that, since it would probably be better for such features to live inside of wicket.

> Ah, that all makes sense. Pre-nexus is a good use case for this type of introspection and cleanup. I don't think it will actually work to do the cleanup this way, because you need to delete the bootstore files. Unfortunately, they'll immediately be replicated again unless you shut down the sled-agents, which wouldn't allow you to delete them. Fun circular dep there. I think we can do better with real trust quorum here and make clean slate first class.

With apologies for the drive-by comment: I have no objection at all to reading from omdb, but writing seems kinda fishy to me, at least for the during-RSS-pre-Nexus phase. If we write directly to the bootstore, that's now out of sync with what RSS thinks it's trying to do, right? Even if that happens to work, it seems like a pretty bad place to be. I think we'd be much better off reworking RSS to accept different configs and have it push out any changes that are necessary. (This is on the roadmap as #6906, and there are a bunch of related issues in that area.)

3 participants