Conversation

@Sploder12 (Contributor) commented on Apr 24, 2025

This PR adds implementations for making VMs unavailable and available. It also makes unavailable VMs ignore commands that would change anything related to them. There are a few known race conditions that are outside the scope of AZs to fix.

MULTI-1789
MULTI-1941

codecov bot commented on Apr 24, 2025

Codecov Report

❌ Patch coverage is 51.06383% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.14%. Comparing base (acba722) to head (f00a058).

Files with missing lines                                    Patch %   Lines
...latform/backends/shared/base_availability_zone.cpp       15.38%    11 Missing ⚠️
src/daemon/daemon.cpp                                       41.17%    10 Missing ⚠️
...rc/platform/backends/qemu/qemu_virtual_machine.cpp       50.00%     1 Missing ⚠️
.../platform/backends/shared/base_virtual_machine.cpp       93.33%     1 Missing ⚠️
Additional details and impacted files
@@                  Coverage Diff                   @@
##           availability-zones    #4061      +/-   ##
======================================================
- Coverage               89.24%   89.14%   -0.10%     
======================================================
  Files                     246      246              
  Lines                   15756    15796      +40     
======================================================
+ Hits                    14061    14081      +20     
- Misses                   1695     1715      +20     

@Sploder12 Sploder12 marked this pull request as ready for review April 24, 2025 18:22
@Sploder12 Sploder12 force-pushed the az-implementation branch 3 times, most recently from 94e31e3 to d597d38 Compare May 14, 2025 09:09

if (vm->current_state() == VirtualMachine::State::unavailable)
{
mpl::log(mpl::Level::info, name, "Ignoring mount since instance unavailable.");

xmkg (Member) commented:

Hmm... I think we need a better way to represent "unavailability". My main concern is that this approach is error-prone and repetitive. The problem is that "unavailable" instances are still a part of the operative instances, which is wrong. An unavailable instance is not operative since it can't be put into an operative state unless it's made available, so an unavailable instance shouldn't be a part of the "operative" set of VMs in the first place.

Right now, we have two sets of VMs: "operative" and "deleted". I think what we need here is a third category, "unavailable". This way, we can just update the error messages to "does not exist or unavailable" and call it a day, until we overhaul the VM management logic. @ricab, what do you think?

ricab (Collaborator) commented:

Hmm, good point. I haven't thought a lot about it, but I think I agree with you @xmkg. We also need to define what happens at the intersection of deleted/unavailable, i.e., deleted instances that belong to an unavailable AZ. I guess we just declare that "deleted" takes precedence over "unavailable"? We need to answer a few questions:

  • does deleting an instance unsubscribe it from the AZ?
  • does enabling/disabling an AZ affect deleted instances? (I hope not)
  • what happens on recover of an instance whose AZ is unavailable?

xmkg (Member) commented:

Hi @ricab, good questions. Regarding the deleted + unavailable at the same time, I think it's reasonable to make deleted take precedence over unavailable. Another avenue might be to consolidate all instances to a single container and represent the "availability" state in a bit flag. This way, an instance can be deleted & unavailable at the same time.

Consolidating all instances into a single container carries a cost: lookup and iteration speed. This can be troublesome for users with many instances. Otherwise, std::ranges filtering would work just fine. If iteration speed matters, another option might be using a multi-index container, which allows for different indices over a set of values instead of just one.
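
To illustrate, a minimal sketch of the single-container idea (Instance, State, and operative_view are made-up names for the example, not the actual multipass types):

#include <map>
#include <memory>
#include <ranges>
#include <string>

enum class State { off, running, deleted, unavailable };

struct Instance
{
    State state = State::off;
};

std::map<std::string, std::shared_ptr<Instance>> instances;

// A lazy view over the single container: "deleted" and "unavailable" are
// just states to filter on, not separate collections.
auto operative_view()
{
    return instances | std::views::filter([](const auto& entry) {
        return entry.second->state != State::deleted &&
               entry.second->state != State::unavailable;
    });
}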

Regarding the last three questions, those should be answered in the AZ spec itself, if not already. @Sploder12 😉

Sploder12 (Contributor, author) commented:

@xmkg I fully agree we need a better way to represent "unavailability"; it is known that my current approach is terrible. But my issue with using operative instances and a new unavailable table is that we'd be moving even more responsibility into the daemon. Also, operative_instances is the most unsafe thing in our entire code base at the moment: it guarantees race conditions.

Sploder12 (Contributor, author) commented:

does deleting an instance unsubscribe it from the AZ?

Based on the spec and implementation, no, although it'd be nice if this were possible.

does enabling/disabling an AZ affect deleted instances? (I hope not)

Yes, it is necessary for recover in the current implementation :)

what happens on recover of an instance whose AZ is unavailable?

This is unaccounted for by the spec. The implementation recovers the instance in an unavailable state.

xmkg (Member) commented:

@Sploder12 understood. I reviewed the patch in my local environment as well, and it seems there is no common place where we can put the "unavailable" check. The commands are inconsistent in how they retrieve the instances: some use operative_instances directly, whereas others use the select_instances function, and so on. We desperately need to refactor this, because it can be implemented in a much simpler and safer way.

@ricab, shall we proceed as-is and refactor this later on?

ricab (Collaborator) commented:

I've been thinking quite a bit about this and wrote a bunch (sorry). We need to discuss, but I think the effect of what follows will be small for this PR. Still, better to have our ideas in order.

Deleted instances vs AZs

I see two possible approaches:

  1. instances remain subscribed on deletion

    • deleted instances' state is updated on AZ status change. The following changes are possible on a deleted instance
      • arbitrary state -> unavailable (available -> unavailable is a particular case)
      • unavailable -> available
    • recovering a deleted instance does not touch the state
  2. instances unsubscribe on deletion

    • states don't change while an instance is deleted
    • instances would need to keep track of their AZ in a different way
    • recovering a deleted instance does two new things
      • resubscribe it to its AZ
      • update its state according to AZ state

If this were being implemented from scratch, I would probably prefer 2 and drop the AZ manager entity entirely, as discussed with Andrei in the early AZ days. I still don't see the benefit, but it's not the time to U-turn now.

So, that leaves option 1. IIUC, that corresponds to what is implemented, correct @Sploder12?
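
For concreteness, a sketch of what option 1 could look like on the AZ side (Zone, VM, and make_available are stand-in names, not the real interfaces):

#include <memory>
#include <vector>

struct VM
{
    bool available = true;
    bool deleted = false;
    void make_available(bool a) { available = a; }
};

struct Zone
{
    bool available = true;
    std::vector<std::shared_ptr<VM>> subscribed_vms;

    // Option 1: instances stay subscribed after deletion, so an AZ
    // availability change updates every subscriber, deleted or not.
    void set_available(bool new_available)
    {
        available = new_available;
        for (const auto& vm : subscribed_vms) // deleted instances included
            vm->make_available(new_available);
        // serialize() would follow here in the real implementation
    }
};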

Which container

operative_instances is just a name. The idea was to contrast it with deleted_instances and indicate that those are disjoint collections. Previously it was called instances (IIRC), which sounded like a superset. If necessary we could rename again (e.g. retained_instances).

I would love to have a single collection, as @xmkg suggested, with ranges and "deleted" as just another state in the FSM (same representation). It would simplify a lot of code.

Consolidating all instances into a single container carries a cost: lookup and iteration speed.

At the sizes we're talking about, I am pretty sure it would be OK, if we did it right. If the deleted state was directly available in a plain field of the VM class, I would be willing to bet a user with 1000 deleted instances and 1 other instance would notice no degradation of performance with one vs two hash tables. 1000 iterations of "if X continue" are really fast, especially if X is already cached. It might even work the other way around due to cache locality. And I'd bet very few people have ever reached 100 instances anyway.

We would have to do it right though, in the sense that we would need to make sure we shortcut the state update when it was deleted, preferably minimizing virtual calls in that case (another case for template method). 1000 iterations of going to the backend to update the state would be very noticeable.
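
A minimal sketch of that shortcut (hypothetical names; the template method keeps the virtual call out of the deleted path):

enum class State { off, running, deleted };

class VMBase
{
public:
    virtual ~VMBase() = default;

    // Non-virtual entry point: the cheap "deleted" check short-circuits
    // before any backend round-trip.
    State current_state()
    {
        if (deleted)
            return State::deleted;
        return fetch_state(); // virtual call into the backend
    }

protected:
    virtual State fetch_state() = 0; // potentially expensive
    bool deleted = false;
};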

I suspect that perf did not play much of a role in the decision to have 2 collections in the first place. My guess is that it was just the immediate representation of a trash can in code. Idk...

All that said, this would be a very significant refactor. I wonder if AZs are the right place for it... Especially given that we don't have ranges yet.

@ricab, shall we proceed as-is and refactor this later on?

Meh, probably... :/


if (vm->current_state() == VirtualMachine::State::unavailable)
{
return fmt::to_string(errors);

xmkg (Member) commented:

Maybe push a message into "errors", stating that a VM is unavailable here?
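
Something like the following, perhaps, assuming errors is a fmt::memory_buffer as elsewhere in the daemon code:

if (vm->current_state() == VirtualMachine::State::unavailable)
{
    fmt::format_to(std::back_inserter(errors),
                   "instance \"{}\" is unavailable\n",
                   name);
    return fmt::to_string(errors);
}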

@sharder996 sharder996 requested review from ricab and xmkg June 27, 2025 15:27
Base automatically changed from az-integration to availability-zones July 10, 2025 17:57

ricab (Collaborator) left a comment:

Thanks @Sploder12. Overall this looks close to what we need, but I have a few comments for discussion.

@Sploder12 Sploder12 force-pushed the availability-zones branch 3 times, most recently from 2331482 to 3c8321a Compare August 25, 2025 21:55
@Sploder12 Sploder12 force-pushed the az-implementation branch 4 times, most recently from 751ae6a to 56ad56f Compare September 2, 2025 23:23
@Sploder12 Sploder12 force-pushed the availability-zones branch 2 times, most recently from f21b770 to 3098c8f Compare September 15, 2025 09:26
@Sploder12 Sploder12 force-pushed the az-implementation branch 2 times, most recently from dc1613a to dc819a6 Compare September 15, 2025 10:22
@sharder996 sharder996 requested review from ricab and tobe2098 October 9, 2025 14:07

ricab (Collaborator) left a comment:

Almost there!

Comment on lines +2515 to +2384
if (vm.current_state() == VirtualMachine::State::unavailable)
{
mpl::log(mpl::Level::info,
vm.vm_name,
"Ignoring suspend since instance is unavailable.");
return grpc::Status::OK;
}

ricab (Collaborator) commented:

Idem: shouldn't this be handled by the VM? It would go in the direction of relieving the daemon of such concerns.
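
That is, something along these lines, with the check living in the VM so the daemon can call suspend unconditionally (the placement is a sketch, not the actual method):

// In the VM class rather than in the daemon:
void BaseVirtualMachine::suspend()
{
    if (current_state() == State::unavailable)
    {
        mpl::log(mpl::Level::info, vm_name, "Ignoring suspend since instance is unavailable");
        return;
    }
    // ... existing suspend logic ...
}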

m.available = true;
serialize();

ricab (Collaborator) commented:

Couldn't we just do this at the end?

Sploder12 (Contributor, author) commented:

since the catch is rethrowing, control flow would never reach the end of the function

ricab (Collaborator) commented:

Right. You could use a scope guard, but no need really.
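
For reference, a minimal scope-guard sketch (Zone and serialize are stand-ins, not the real types); note the guard also fires during exception unwinding, which may or may not be the semantics wanted here:

template <typename F>
struct ScopeGuard
{
    F on_exit;
    ~ScopeGuard() { on_exit(); }
};
template <typename F>
ScopeGuard(F) -> ScopeGuard<F>; // deduction guide for C++17

struct Zone
{
    bool available = false;
    void serialize() { /* persist to disk */ }
};

void make_zone_available(Zone& zone)
{
    // Runs on every exit path, including when the catch below rethrows.
    ScopeGuard guard{[&] {
        zone.available = true;
        zone.serialize();
    }};

    try
    {
        // ... start subscribed instances; may throw ...
    }
    catch (...)
    {
        throw; // the guard still runs during stack unwinding
    }
}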

return;
}

was_running = state == State::running || state == State::starting;

ricab (Collaborator) commented:

What if the state was restarting?
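
Presumably something like this, if restarting should count (a sketch of one possible fix):

was_running = state == State::running || state == State::starting ||
              state == State::restarting;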

@Sploder12 Sploder12 force-pushed the az-implementation branch 2 times, most recently from 26948b9 to eeceaee Compare October 21, 2025 20:19
@Sploder12 Sploder12 force-pushed the az-implementation branch 2 times, most recently from b7c9b32 to 7104ebb Compare October 21, 2025 20:38
@Sploder12 Sploder12 force-pushed the availability-zones branch 6 times, most recently from 2af4bcb to acba722 Compare October 22, 2025 00:51
@Sploder12 Sploder12 requested a review from ricab October 22, 2025 15:54

ricab (Collaborator) left a comment:

I still have a few suggestions, but I will leave them to your consideration and pre-approve.

name);
"Cannot start the instance '{}' while {}.",
name,
vm.current_state() == VirtualMachine::State::suspending ? "suspending"

ricab (Collaborator) commented:

This will be easier once the Hyper-V branch is in: https://github.com/canonical/multipass/pull/4080/files#diff-d847b18fbc46d755306da3eb6cac52de71043c1cc87ff345cdf8b034ad141ffaR158

You could add a TODO here, or convince @xmkg to split that off 😉

Comment on lines +207 to +210
if (state == State::unavailable)
{
throw VMStateIdempotentException{"Ignoring shutdown since instance is unavailable."};
}

ricab (Collaborator) commented:

How about merging with the branch above here as well? Again, the formatter would help. If you want to simplify, I wouldn't be shocked if the message read "Ignoring shutdown since instance is already unavailable".

}
else if (present_state == State::unavailable)
{
mpl::log(mpl::Level::info, vm_name, "Ignoring suspend since instance is unavailable");

ricab (Collaborator) commented:

idem: merge with the above?

Otherwise, use mpl::info?

mpl::info(vm_name, "Ignoring suspend issued while stopped/suspended");
mpl::log(mpl::Level::info,
vm_name,
"Ignoring suspend issued while stopped/suspended/unavailable");

ricab (Collaborator) commented:

idem: state formatting would help

Comment on lines +134 to +140
// setting the state here breaks encapsulation, but it's already broken.
std::unique_lock vm_lock{vm.get().state_mutex};
if (vm.get().current_state() == VirtualMachine::State::unavailable)
{
vm.get().state = VirtualMachine::State::off;
vm.get().update_state();
}

ricab (Collaborator) commented:

Have you considered force-stop instead? That would avoid breaking encapsulation yet again and make sure there were no leftover processes.
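
A sketch of what that alternative could look like, assuming a force/poweroff-style shutdown entry point on the VM (the exact API here is an assumption):

std::unique_lock vm_lock{vm.get().state_mutex};
if (vm.get().current_state() == VirtualMachine::State::unavailable)
{
    // Let the VM tear itself down (and reap leftover processes) instead of
    // poking its state field from outside.
    vm.get().shutdown(VirtualMachine::ShutdownPolicy::Poweroff); // assumed overload
}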
