Az implementation #4061
base: availability-zones
Conversation
Codecov Report

@@                  Coverage Diff                   @@
##           availability-zones    #4061      +/-   ##
======================================================
- Coverage               89.24%   89.14%   -0.10%
======================================================
  Files                     246      246
  Lines                   15756    15796      +40
======================================================
+ Hits                    14061    14081      +20
- Misses                   1695     1715      +20

☔ View full report in Codecov by Sentry.
force-pushed from 3763140 to 6295079
force-pushed from 94e31e3 to d597d38
            
          
src/daemon/daemon.cpp (Outdated)

    if (vm->current_state() == VirtualMachine::State::unavailable)
    {
        mpl::log(mpl::Level::info, name, "Ignoring mount since instance unavailable.");
Hmm... I think we need a better way to represent "unavailability". My main concern is that this approach is error-prone and repetitive. The problem is that "unavailable" instances are still a part of the operative instances, which is wrong. An unavailable instance is not operative since it can't be put into an operative state unless it's made available, so an unavailable instance shouldn't be a part of the "operative" set of VMs in the first place.
Right now, we have two sets of VMs: "operative" and "deleted". I think what we need here is a third category, "unavailable". This way, we can just update the error messages to "does not exist or unavailable" and call it a day, until we overhaul the VM management logic. @ricab, what do you think?
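A minimal sketch of what that third category could look like (hypothetical container and helper names, not the actual multipass::Daemon members):

    // Illustrative only: a third, disjoint set for unavailable instances.
    #include <memory>
    #include <string>
    #include <unordered_map>

    #include <fmt/format.h>

    struct VirtualMachine; // stand-in for multipass::VirtualMachine

    using InstanceMap = std::unordered_map<std::string, std::shared_ptr<VirtualMachine>>;

    struct InstanceTables
    {
        InstanceMap operative_instances;   // instances commands may act on
        InstanceMap deleted_instances;     // the existing "trash can"
        InstanceMap unavailable_instances; // the proposed third category

        // Anything outside operative_instances maps to one of a few fixed errors,
        // so individual commands no longer need their own availability checks.
        std::string lookup_error(const std::string& name) const
        {
            if (unavailable_instances.count(name))
                return fmt::format("instance \"{}\" is unavailable", name);
            if (deleted_instances.count(name))
                return fmt::format("instance \"{}\" is deleted", name);
            return fmt::format("instance \"{}\" does not exist", name);
        }
    };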
Hmm good point. I haven't thought a lot about it, but I think I agree with you @xmkg. We also need to define what happens at the intersection of deleted/unavailable, i.e., deleted instances that belong to an unavailable AZ. I guess we just declare that "deleted" takes precedence over "unavailable"? We need to answer a few questions:
- does deleting an instance unsubscribe it from the AZ?
- does enabling/disabling an AZ affect deleted instances? (I hope not)
- what happens on recover of an instance whose AZ is unavailable?
Hi @ricab, good questions. Regarding instances that are deleted and unavailable at the same time, I think it's reasonable to make deleted take precedence over unavailable. Another avenue might be to consolidate all instances into a single container and represent the "availability" state in a bit flag. This way, an instance can be deleted & unavailable at the same time.
Consolidating all instances into a single container carries a cost: lookup and iteration speed. This can be troublesome for users with many instances. That concern aside, std::ranges filtering would work just fine. If iteration speed matters, another option might be a multi-index container, which allows for different indices over a set of values instead of just one.
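As a rough illustration of the single-container idea (hypothetical types and field names, assuming C++20 ranges), where "operative" becomes a filtered view rather than a separate map:

    // Illustrative only: one container for all instances, with deletion and
    // availability represented as plain flags instead of separate maps.
    #include <memory>
    #include <ranges>
    #include <string>
    #include <unordered_map>

    struct Instance // stand-in for the real VM record
    {
        std::string name;
        bool deleted = false;   // replaces membership in deleted_instances
        bool available = true;  // cleared when the instance's AZ goes down
    };

    using InstanceStore = std::unordered_map<std::string, std::shared_ptr<Instance>>;

    // An instance can now be deleted and unavailable at the same time; the
    // "operative" set is just a view over the single store.
    inline auto operative_view(const InstanceStore& store)
    {
        return store | std::views::values |
               std::views::filter([](const auto& vm) { return !vm->deleted && vm->available; });
    }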
Regarding the last three questions, those should be answered in the AZ spec itself, if not already. @Sploder12 😉
@xmkg I fully agree we need a better way to represent "unavailability"; it is known that my current approach is terrible. But my issue with using operative instances plus a new unavailable table is that we'd be moving even more responsibility into the daemon. Also, operative_instances is the most unsafe thing in our entire code base at the moment; it guarantees race conditions.
does deleting an instance unsubscribe it from the AZ?
Based on spec and implementation, no. Although it'd be nice if this were possible.
does enabling/disabling an AZ affect deleted instances? (I hope not)
Yes, it is necessary for recover in the current implementation :)
what happens on recover of an instance whose AZ is unavailable?
This is unaccounted for by the spec. The implementation recovers the instance in an unavailable state.
@Sploder12 understood. I reviewed the patch in my local environment as well, and it seems there is no common place where we can put the "unavailable" check. The commands are inconsistent in how they retrieve the instances: some of them use operative_instances directly, whereas others use the select_instances function, and so on. We desperately need to refactor this because it can be implemented in a much simpler and safer way.
@ricab, shall we proceed as-is and refactor this later on?
I've been thinking quite a bit about this and wrote a bunch (sorry). We need to discuss, but I think the effect of what follows will be small for this PR. Still, better to have our ideas in order.
Deleted instances vs AZs
I see two possible approaches:
1. instances remain subscribed on deletion
   - deleted instances' state is updated on AZ status change. The following changes are possible on a deleted instance:
     - arbitrary state -> unavailable (available -> unavailable is a particular case)
     - unavailable -> available
   - recovering a deleted instance does not touch the state
2. instances unsubscribe on deletion
   - states don't change while an instance is deleted
   - instances would need to keep track of their AZ in a different way
   - recovering a deleted instance does two new things:
     - resubscribe it to its AZ
     - update its state according to AZ state
If this was being implemented from scratch, I would probably prefer 2 and drop the AZ manager entity entirely, as discussed with Andrei in early AZ days. I still don't see the benefit, but it's not the time to U-turn now.
So, that leaves option 1. IIUC, that corresponds to what is implemented, correct @Sploder12?
Which container
operative_instances is just a name. The idea was to contrast it with deleted_instances and indicate that those are disjoint collections. Previously it was called instances (IIRC), which sounded like a superset. If necessary we could rename again (e.g. retained_instances).
I would love to have a single collection, as @xmkg suggested, with ranges and "deleted" as just another state in the FSM (same representation). It would simplify a lot of code.
Consolidating all instances into a single container carries a cost: lookup and iteration speed.
At the sizes we're talking about, I am pretty sure it would be OK, if we did it right. If the deleted state was directly available in a plain field of the VM class, I would be willing to bet a user with 1000 deleted instances and 1 other instance would notice no degradation of performance with one vs two hash tables. 1000 iterations of "if X continue" are really fast, especially if X is already cached. It might even work the other way around due to cache locality. And I'd bet very few people ever reached 100 instances anyway.
We would have to do it right though, in the sense that we would need to make sure we shortcut the state update when it was deleted, preferably minimizing virtual calls in that case (another case for template method). 1000 iterations of going to the backend to update the state would be very noticeable.
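A tiny sketch of that shortcut (purely illustrative; the real VirtualMachine interface is different): the "deleted" check is a plain field read in a non-virtual entry point, so a refresh over many deleted instances never touches the backend.

    #include <string>

    // Illustrative template-method example, not the actual Multipass classes.
    class ExampleVM
    {
    public:
        virtual ~ExampleVM() = default;

        // Non-virtual entry point: the cheap "deleted" shortcut lives here, so a
        // refresh over 1000 deleted instances is just 1000 cached-field checks.
        void update_state()
        {
            if (deleted)
                return;                    // no virtual call, no backend query
            state = query_backend_state(); // the expensive, backend-specific part
        }

        bool deleted = false;
        std::string state = "unknown"; // stand-in for VirtualMachine::State

    protected:
        virtual std::string query_backend_state() = 0;
    };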
I suspect that perf did not play much of a role in the decision to have 2 collections in the first place. My guess is that it was just the immediate representation of a trash can in code. Idk...
All that said, this would be a very significant refactor. I wonder if AZs are the right place for it... Especially given that we don't have ranges yet.
@ricab, shall we proceed as-is and refactor this later on?
Meh, probably... :/
        
          
src/daemon/daemon.cpp (Outdated)

    if (vm->current_state() == VirtualMachine::State::unavailable)
    {
        return fmt::to_string(errors);
Maybe push a message into "errors", stating that a VM is unavailable here?
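For instance, assuming errors is an fmt::memory_buffer (the fmt::to_string above suggests so), something along these lines, with the message text only illustrative:

    // Hedged sketch: record why the instance is being skipped before returning.
    if (vm->current_state() == VirtualMachine::State::unavailable)
    {
        fmt::format_to(std::back_inserter(errors),
                       "instance \"{}\" is unavailable\n",
                       name);
        return fmt::to_string(errors);
    }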
force-pushed from b8474ed to 89a39fd
force-pushed from d597d38 to f161bd1
            
          
Thanks @Sploder12. Overall this looks close to what we need, but I have a few comments for discussion.
force-pushed from 2331482 to 3c8321a
force-pushed from 751ae6a to 56ad56f
force-pushed from 93f4fc2 to c6a9e32
force-pushed from f21b770 to 3098c8f
force-pushed from dc1613a to dc819a6
force-pushed from 3098c8f to 6c91f0b
Almost there!
    if (vm.current_state() == VirtualMachine::State::unavailable)
    {
        mpl::log(mpl::Level::info,
                 vm.vm_name,
                 "Ignoring suspend since instance is unavailable.");
        return grpc::Status::OK;
    }
Idem: shouldn't this be handled by the VM? It would go in the direction of relieving the daemon from such concerns.
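A hedged sketch of what pushing the check into the VM could look like, loosely mirroring the shutdown guard further down; the class and hook names here are hypothetical:

    // Illustrative only: the availability guard lives in the VM's own suspend(),
    // so the daemon no longer repeats this check before every operation.
    void ExampleVirtualMachine::suspend()
    {
        if (current_state() == VirtualMachine::State::unavailable)
        {
            mpl::log(mpl::Level::info, vm_name, "Ignoring suspend since instance is unavailable.");
            return;
        }
        do_suspend(); // hypothetical backend-specific implementation hook
    }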
    m.available = true;
    serialize();
Couldn't we just do this at the end?
since the catch is rethrowing, control flow would never reach the end of the function
Right. You could use a scope guard, but no need really.
        return;
    }

    was_running = state == State::running || state == State::starting;
What if the state was restarting?
force-pushed from b20f21b to 4029149
force-pushed from 26948b9 to eeceaee
force-pushed from 3655de5 to a66e7a8
force-pushed from b7c9b32 to 7104ebb
force-pushed from 2af4bcb to acba722
force-pushed from 7104ebb to f00a058
I still have a few suggestions, but I will leave them to your consideration and pre-approve.
    name);
    "Cannot start the instance '{}' while {}.",
    name,
    vm.current_state() == VirtualMachine::State::suspending ? "suspending"
This will be easier once the Hyper-V branch is in: https://github.com/canonical/multipass/pull/4080/files#diff-d847b18fbc46d755306da3eb6cac52de71043c1cc87ff345cdf8b034ad141ffaR158
You could add a TODO here, or convince @xmkg to split that off 😉
    if (state == State::unavailable)
    {
        throw VMStateIdempotentException{"Ignoring shutdown since instance is unavailable."};
    }
How about merging with the above branch here as well? Again, the formatter would help. If you want to simplify, I wouldn't be shocked if the message read "Ignoring shutdown since instance is already unavailable".
    }
    else if (present_state == State::unavailable)
    {
        mpl::log(mpl::Level::info, vm_name, "Ignoring suspend since instance is unavailable");
idem: merge with the above?
Otherwise, use mpl::info?
    mpl::info(vm_name, "Ignoring suspend issued while stopped/suspended");
    mpl::log(mpl::Level::info,
             vm_name,
             "Ignoring suspend issued while stopped/suspended/unavailable");
idem: state formatting would help
    // setting the state here breaks encapsulation, but it's already broken.
    std::unique_lock vm_lock{vm.get().state_mutex};
    if (vm.get().current_state() == VirtualMachine::State::unavailable)
    {
        vm.get().state = VirtualMachine::State::off;
        vm.get().update_state();
    }
Have you considered force-stop instead? That would avoid breaking encapsulation yet again and make sure there were no leftover processes.
This PR adds implementations for making VMs unavailable and available. It also makes unavailable VMs ignore commands that would change anything related to them. There are a few known race conditions that are outside the scope of AZs to fix.
MULTI-1789
MULTI-1941