Conversation

@Sploder12 (Contributor) commented on Apr 24, 2025

This PR adds implementations for making VMs unavailable and available. It also makes unavailable VMs ignore commands that would change anything related to them. There are a few known race conditions that are outside the scope of AZs to fix.

MULTI-1789
MULTI-1941

codecov bot commented on Apr 24, 2025

Codecov Report

❌ Patch coverage is 51.06383% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.14%. Comparing base (acba722) to head (f00a058).

Files with missing lines                                    Patch %   Lines
...latform/backends/shared/base_availability_zone.cpp       15.38%    11 Missing ⚠️
src/daemon/daemon.cpp                                       41.17%    10 Missing ⚠️
...rc/platform/backends/qemu/qemu_virtual_machine.cpp       50.00%     1 Missing ⚠️
.../platform/backends/shared/base_virtual_machine.cpp       93.33%     1 Missing ⚠️
Additional details and impacted files
@@                  Coverage Diff                   @@
##           availability-zones    #4061      +/-   ##
======================================================
- Coverage               89.24%   89.14%   -0.10%     
======================================================
  Files                     246      246              
  Lines                   15756    15796      +40     
======================================================
+ Hits                    14061    14081      +20     
- Misses                   1695     1715      +20     

@Sploder12 Sploder12 marked this pull request as ready for review April 24, 2025 18:22
@Sploder12 Sploder12 force-pushed the az-implementation branch 3 times, most recently from 94e31e3 to d597d38 Compare May 14, 2025 09:09

if (vm->current_state() == VirtualMachine::State::unavailable)
{
mpl::log(mpl::Level::info, name, "Ignoring mount since instance unavailable.");

xmkg (Member) commented:

Hmm... I think we need a better way to represent "unavailability". My main concern is that this approach is error-prone and repetitive. The problem is that "unavailable" instances are still a part of the operative instances, which is wrong. An unavailable instance is not operative since it can't be put into an operative state unless it's made available, so an unavailable instance shouldn't be a part of the "operative" set of VMs in the first place.

Right now, we have two sets of VMs: "operative" and "deleted". I think what we need here is a third category, "unavailable". This way, we can just update the error messages to "does not exist or unavailable" and call it a day, until we overhaul the VM management logic. @ricab, what do you think?

ricab (Collaborator) commented:

Hmm, good point. I haven't thought a lot about it, but I think I agree with you @xmkg. We also need to define what happens at the intersection of deleted/unavailable, i.e., deleted instances that belong to an unavailable AZ. I guess we just declare that "deleted" takes precedence over "unavailable"? We need to answer a few questions:

  • does deleting an instance unsubscribe it from the AZ?
  • does enabling/disabling an AZ affect deleted instances? (I hope not)
  • what happens on recover of an instance whose AZ is unavailable?

xmkg (Member) commented:

Hi @ricab, good questions. Regarding the deleted + unavailable at the same time, I think it's reasonable to make deleted take precedence over unavailable. Another avenue might be to consolidate all instances to a single container and represent the "availability" state in a bit flag. This way, an instance can be deleted & unavailable at the same time.

Consolidating all instances into a single container carries a cost: lookup and iteration speed. This can be troublesome for users with many instances. Otherwise, std::ranges filtering would work just fine. If iteration speed matters, another option might be using a multi-index container, which allows for different indices over a set of values instead of just one.
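
To illustrate, a minimal sketch of the single-container idea (Instance, State, and operative_view are made-up names for the example, not the actual multipass types):

#include <map>
#include <memory>
#include <ranges>
#include <string>

enum class State { off, running, deleted, unavailable };

struct Instance
{
    State state = State::off;
};

std::map<std::string, std::shared_ptr<Instance>> instances;

// A lazy view over the single container: "deleted" and "unavailable" are
// just states to filter on, not separate collections.
auto operative_view()
{
    return instances | std::views::filter([](const auto& entry) {
        return entry.second->state != State::deleted &&
               entry.second->state != State::unavailable;
    });
}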

Regarding the last three questions, those should be answered in the AZ spec itself, if not already. @Sploder12 😉

Sploder12 (Contributor, author) commented:

@xmkg I fully agree we need a better way to represent "unavailability"; it is known that my current approach is terrible. But my issue with using operative instances and a new unavailable table is that we'd be moving even more responsibility into the daemon. Also, operative_instances is the most unsafe thing in our entire code base at the moment: it guarantees race conditions.

Sploder12 (Contributor, author) commented:

does deleting an instance unsubscribe it from the AZ?

Based on the spec and implementation, no, although it'd be nice if this were possible.

does enabling/disabling an AZ affect deleted instances? (I hope not)

Yes, it is necessary for recover in the current implementation :)

what happens on recover of an instance whose AZ is unavailable?

This is unaccounted for by the spec. The implementation recovers the instance in an unavailable state.

xmkg (Member) commented:

@Sploder12 understood. I reviewed the patch in my local environment as well, and it seems there is no common place where we can put the "unavailable" check. The commands are inconsistent in how they retrieve the instances: some use operative_instances directly, whereas others use the select_instances function, and so on. We desperately need to refactor this, because it can be implemented in a much simpler and safer way.

@ricab, shall we proceed as-is and refactor this later on?

ricab (Collaborator) commented:

I've been thinking quite a bit about this and wrote a bunch (sorry). We need to discuss, but I think the effect of what follows will be small for this PR. Still, better to have our ideas in order.

Deleted instances vs AZs

I see two possible approaches:

  1. instances remain subscribed on deletion

    • deleted instances' state is updated on AZ status change. The following changes are possible on a deleted instance
      • arbitrary state -> unavailable (available -> unavailable is a particular case)
      • unavailable -> available
    • recovering a deleted instance does not touch the state
  2. instances unsubscribe on deletion

    • states don't change while an instance is deleted
    • instances would need to keep track of their AZ in a different way
    • recovering a deleted instance does two new things
      • resubscribe it to its AZ
      • update its state according to AZ state

If this were being implemented from scratch, I would probably prefer 2 and drop the AZ manager entity entirely, as discussed with Andrei in the early AZ days. I still don't see the benefit, but it's not the time to U-turn now.

So, that leaves option 1. IIUC, that corresponds to what is implemented, correct @Sploder12?
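
For concreteness, a sketch of what option 1 could look like on the AZ side (Zone, VM, and make_available are stand-in names, not the real interfaces):

#include <memory>
#include <vector>

struct VM
{
    bool available = true;
    bool deleted = false;
    void make_available(bool a) { available = a; }
};

struct Zone
{
    bool available = true;
    std::vector<std::shared_ptr<VM>> subscribed_vms;

    // Option 1: instances stay subscribed after deletion, so an AZ
    // availability change updates every subscriber, deleted or not.
    void set_available(bool new_available)
    {
        available = new_available;
        for (const auto& vm : subscribed_vms) // deleted instances included
            vm->make_available(new_available);
        // serialize() would follow here in the real implementation
    }
};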

Which container

operative_instances is just a name. The idea was to contrast it with deleted_instances and indicate that those are disjoint collections. Previously it was called instances (IIRC), which sounded like a superset. If necessary we could rename again (e.g. retained_instances).

I would love to have a single collection, as @xmkg suggested, with ranges and "deleted" as just another state in the FSM (same representation). It would simplify a lot of code.

Consolidating all instances into a single container carries a cost: lookup and iteration speed.

At the sizes we're talking about, I am pretty sure it would be OK, if we did it right. If the deleted state was directly available in a plain field of the VM class, I would be willing to bet a user with 1000 deleted instances and 1 other instance would notice no degradation of performance with one vs two hash tables. 1000 iterations of "if X continue" are really fast, especially if X is already cached. It might even work the other way around due to cache locality. And I'd bet very few people have ever reached 100 instances anyway.

We would have to do it right though, in the sense that we would need to make sure we shortcut the state update when it was deleted, preferably minimizing virtual calls in that case (another case for template method). 1000 iterations of going to the backend to update the state would be very noticeable.
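
A minimal sketch of that shortcut (hypothetical names; the template method keeps the virtual call out of the deleted path):

enum class State { off, running, deleted };

class VMBase
{
public:
    virtual ~VMBase() = default;

    // Non-virtual entry point: the cheap "deleted" check short-circuits
    // before any backend round-trip.
    State current_state()
    {
        if (deleted)
            return State::deleted;
        return fetch_state(); // virtual call into the backend
    }

protected:
    virtual State fetch_state() = 0; // potentially expensive
    bool deleted = false;
};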

I suspect that perf did not play much of a role in the decision to have 2 collections in the first place. My guess is that it was just the immediate representation of a trash can in code. Idk...

All that said, this would be a very significant refactor. I wonder if AZs are the right place for it... Especially given that we don't have ranges yet.

@ricab, shall we proceed as-is and refactor this later on?

Meh, probably... :/


if (vm->current_state() == VirtualMachine::State::unavailable)
{
return fmt::to_string(errors);

xmkg (Member) commented:

Maybe push a message into "errors", stating that a VM is unavailable here?
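
Something like the following, perhaps, assuming errors is a fmt::memory_buffer as elsewhere in the daemon code:

if (vm->current_state() == VirtualMachine::State::unavailable)
{
    fmt::format_to(std::back_inserter(errors),
                   "instance \"{}\" is unavailable\n",
                   name);
    return fmt::to_string(errors);
}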

@sharder996 sharder996 requested review from ricab and xmkg June 27, 2025 15:27
Base automatically changed from az-integration to availability-zones July 10, 2025 17:57

ricab (Collaborator) left a comment:

Thanks @Sploder12. Overall this looks close to what we need, but I have a few comments for discussion.

@Sploder12 Sploder12 force-pushed the availability-zones branch 3 times, most recently from 2331482 to 3c8321a Compare August 25, 2025 21:55
@Sploder12 Sploder12 force-pushed the az-implementation branch 4 times, most recently from 751ae6a to 56ad56f Compare September 2, 2025 23:23
@Sploder12 Sploder12 force-pushed the availability-zones branch 2 times, most recently from f21b770 to 3098c8f Compare September 15, 2025 09:26
@Sploder12 Sploder12 force-pushed the az-implementation branch 2 times, most recently from dc1613a to dc819a6 Compare September 15, 2025 10:22
@sharder996 sharder996 requested review from ricab and tobe2098 October 9, 2025 14:07

ricab (Collaborator) left a comment:

Almost there!

Comment on lines +2515 to +2384
if (vm.current_state() == VirtualMachine::State::unavailable)
{
mpl::log(mpl::Level::info,
vm.vm_name,
"Ignoring suspend since instance is unavailable.");
return grpc::Status::OK;
}

ricab (Collaborator) commented:

Idem: shouldn't this be handled by the VM? It would go in the direction of relieving the daemon of such concerns.
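
That is, something along these lines, with the check living in the VM so the daemon can call suspend unconditionally (the placement is a sketch, not the actual method):

// In the VM class rather than in the daemon:
void BaseVirtualMachine::suspend()
{
    if (current_state() == State::unavailable)
    {
        mpl::log(mpl::Level::info, vm_name, "Ignoring suspend since instance is unavailable");
        return;
    }
    // ... existing suspend logic ...
}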

m.available = true;
serialize();

ricab (Collaborator) commented:

Couldn't we just do this at the end?

Sploder12 (Contributor, author) commented:

since the catch is rethrowing, control flow would never reach the end of the function

ricab (Collaborator) commented:

Right. You could use a scope guard, but no need really.
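
For reference, a minimal scope-guard sketch (Zone and serialize are stand-ins, not the real types); note the guard also fires during exception unwinding, which may or may not be the semantics wanted here:

template <typename F>
struct ScopeGuard
{
    F on_exit;
    ~ScopeGuard() { on_exit(); }
};
template <typename F>
ScopeGuard(F) -> ScopeGuard<F>; // deduction guide for C++17

struct Zone
{
    bool available = false;
    void serialize() { /* persist to disk */ }
};

void make_zone_available(Zone& zone)
{
    // Runs on every exit path, including when the catch below rethrows.
    ScopeGuard guard{[&] {
        zone.available = true;
        zone.serialize();
    }};

    try
    {
        // ... start subscribed instances; may throw ...
    }
    catch (...)
    {
        throw; // the guard still runs during stack unwinding
    }
}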

return;
}

was_running = state == State::running || state == State::starting;

ricab (Collaborator) commented:

What if the state was restarting?
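
Presumably something like this, if restarting should count (a sketch of one possible fix):

was_running = state == State::running || state == State::starting ||
              state == State::restarting;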

@Sploder12 Sploder12 force-pushed the az-implementation branch 2 times, most recently from 26948b9 to eeceaee Compare October 21, 2025 20:19
@Sploder12 Sploder12 force-pushed the az-implementation branch 2 times, most recently from b7c9b32 to 7104ebb Compare October 21, 2025 20:38
@Sploder12 Sploder12 force-pushed the availability-zones branch 6 times, most recently from 2af4bcb to acba722 Compare October 22, 2025 00:51
@Sploder12 Sploder12 requested a review from ricab October 22, 2025 15:54

ricab (Collaborator) left a comment:

I still have a few suggestions, but I will leave them to your consideration and pre-approve.

name);
"Cannot start the instance '{}' while {}.",
name,
vm.current_state() == VirtualMachine::State::suspending ? "suspending"

ricab (Collaborator) commented:

This will be easier once the Hyper-V branch is in: https://github.com/canonical/multipass/pull/4080/files#diff-d847b18fbc46d755306da3eb6cac52de71043c1cc87ff345cdf8b034ad141ffaR158

You could add a TODO here, or convince @xmkg to split that off 😉

Comment on lines +207 to +210
if (state == State::unavailable)
{
throw VMStateIdempotentException{"Ignoring shutdown since instance is unavailable."};
}

ricab (Collaborator) commented:

How about merging with the branch above here as well? Again, the formatter would help. If you want to simplify, I wouldn't be shocked if the message read "Ignoring shutdown since instance is already unavailable".

}
else if (present_state == State::unavailable)
{
mpl::log(mpl::Level::info, vm_name, "Ignoring suspend since instance is unavailable");

ricab (Collaborator) commented:

idem: merge with the above?

Otherwise, use mpl::info?

mpl::info(vm_name, "Ignoring suspend issued while stopped/suspended");
mpl::log(mpl::Level::info,
vm_name,
"Ignoring suspend issued while stopped/suspended/unavailable");

ricab (Collaborator) commented:

idem: state formatting would help

Comment on lines +134 to +140
// setting the state here breaks encapsulation, but it's already broken.
std::unique_lock vm_lock{vm.get().state_mutex};
if (vm.get().current_state() == VirtualMachine::State::unavailable)
{
vm.get().state = VirtualMachine::State::off;
vm.get().update_state();
}

ricab (Collaborator) commented:

Have you considered force-stop instead? That would avoid breaking encapsulation yet again and make sure there were no leftover processes.
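
A sketch of what that alternative could look like, assuming a force/poweroff-style shutdown entry point on the VM (the exact API here is an assumption):

std::unique_lock vm_lock{vm.get().state_mutex};
if (vm.get().current_state() == VirtualMachine::State::unavailable)
{
    // Let the VM tear itself down (and reap leftover processes) instead of
    // poking its state field from outside.
    vm.get().shutdown(VirtualMachine::ShutdownPolicy::Poweroff); // assumed overload
}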
