
Conversation

@townsend2010
Contributor

If no timeout is set, LXD uses a hardcoded 30 second timeout when waiting on
operations to complete and if the wait timeout occurs, it can lead to incorrect
behavior in the LXD backend.

Fixes #1777

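As a rough illustration of the behavior described above (hypothetical names, not LXD's or Multipass's actual code), the effect is that a hardcoded server-side default silently replaces a missing client timeout:

```python
# Illustrative only: a server-side default wait trips up a client that sets
# no timeout of its own. DEFAULT_WAIT_SECS mirrors the hardcoded 30 seconds
# described above; the names here are hypothetical.

DEFAULT_WAIT_SECS = 30  # server-side fallback when the client sends no timeout


def effective_wait(client_timeout=None):
    """Return the wait actually applied to an operation.

    With no explicit client timeout, the server's hardcoded default wins,
    so a shutdown taking longer than 30 seconds is reported as failed even
    though the instance may still be shutting down cleanly.
    """
    return client_timeout if client_timeout is not None else DEFAULT_WAIT_SECS
```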
@codecov

codecov bot commented Oct 30, 2020

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.95%. Comparing base (ccfdaaf) to head (a76ca22).
Report is 6600 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1821   +/-   ##
=======================================
  Coverage   76.95%   76.95%           
=======================================
  Files         229      229           
  Lines        8512     8512           
=======================================
  Hits         6550     6550           
  Misses       1962     1962           


@Saviq
Collaborator

Saviq commented Nov 4, 2020

The problem I see with this is that 60s is just as arbitrary as 30s… and the "Operation cancelled" error is somewhat devoid of detail (I know…). Can we try to convert an "Operation cancelled" into, say, "Likely timed out"? Is there no detail at all available about why it was cancelled?

@townsend2010
Contributor Author

townsend2010 commented Nov 4, 2020

Well, as mentioned, without setting any timeout here at all, LXD has a hardcoded timeout of 30 seconds, which is what is tripping us up on shutdown. It seems LXD has some sort of race between the hardcoded operation timeout and the actual timeout of trying to shut down an instance. I've looked through the LXD code in this area and I really don't understand why they did what they did.

That said, setting an explicit timeout for the state operation puts the onus on the state change, not on the operation wait, so it gets around this issue. I can easily make it 30 seconds, since that is the arbitrary wait we have in other backends, but I think allowing more time for an instance to shut down is better. In fact, we've had complaints in the past about trying to cleanly shut down busy instances, and 30 seconds is not enough time.
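For illustration, the LXD REST API lets the state-change request itself carry the timeout: a PUT to `/1.0/instances/<name>/state` with a body along these lines (the 300-second value is only an example, not the value this PR necessarily uses):

```json
{
  "action": "stop",
  "timeout": 300,
  "force": false
}
```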

Regarding "Operation cancelled", this should avoid that particular issue since we really won't be waiting on the operation itself. Also, no, LXD does not really offer anything helpful in its error messages. We could catch this particular error and say something like "Timeout occurred waiting on operation to complete."
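A minimal Python sketch of that idea (the names `wait_for_operation` and `OperationError` are hypothetical, not Multipass code): translate LXD's opaque cancellation into the clearer message, and report our own client-side deadline explicitly when that is what fired.

```python
# Hypothetical sketch: poll an operation under a client-side deadline and
# replace opaque error states with clearer messages. Not actual LXD/Multipass
# code; statuses are simplified to "Running", "Success", and "Cancelled".

import time


class OperationError(Exception):
    pass


def wait_for_operation(poll, deadline_secs, interval=0.01):
    """Poll `poll()` until it returns a terminal status or the deadline passes."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_secs:
        status = poll()
        if status == "Success":
            return status
        if status == "Cancelled":
            # LXD only reports "Operation cancelled"; add the likely cause.
            raise OperationError(
                "Operation cancelled (likely timed out waiting on the instance)")
        time.sleep(interval)
    # Our own deadline expired, so we can say so directly.
    raise OperationError(
        f"Timeout occurred waiting on operation to complete ({deadline_secs}s)")
```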

@Saviq
Collaborator

Saviq commented Nov 5, 2020

> That said, setting an explicit timeout for the state operation puts the onus on the state change, not on the operation wait, so it gets around this issue. I can easily make it 30 seconds, since that is the arbitrary wait we have in other backends, but I think allowing more time for an instance to shut down is better. In fact, we've had complaints in the past about trying to cleanly shut down busy instances, and 30 seconds is not enough time.

OK, so I was missing that context: that this is different from the internal LXD timeout. Should we explain in a comment?

@townsend2010
Contributor Author

> Should we explain in a comment?

Sure, I can add a comment, but it'll have to be verbose to explain why this was added. But really, adding an explicit timeout here puts us in control of the timeout and should be done regardless of working around LXD idiosyncrasies.

@townsend2010
Contributor Author

In some testing to check on some behaviors, I found some issues with this, so will continue working on it.

@townsend2010 townsend2010 marked this pull request as draft November 6, 2020 14:55
Base automatically changed from master to main March 3, 2021 13:41
@joes

joes commented Sep 22, 2022

When I try to delete a Multipass LXD instance, it quite often results in a timeout:

$ multipass delete k8-devnode-2
[2022-09-22T09:30:58.018] [error] [lxd request] Timeout getting response for GET operation on unix://multipass/var/snap/lxd/common/lxd/unix.socket@1.0/operations/36a26883-b3ec-4f5a-bd00-325cd5dfc150/wait?project=multipass
delete failed: Timeout getting response for GET operation on unix://multipass/var/snap/lxd/common/lxd/unix.socket@1.0/operations/36a26883-b3ec-4f5a-bd00-325cd5dfc150/wait?project=multipass

As for the VM, it only got stopped (not deleted):

$ multipass list
k8-devnode-2          Stopped           --               Ubuntu 20.04 LTS

Would this pull request fix this @townsend2010 or should I file a separate issue?

$ multipass version
multipass   1.10.1
multipassd  1.10.1
$ multipass get local.driver
lxd
$ lxd version
5.5

I also expose multipassd to the network and set the environment for this:

echo $MULTIPASS_SERVER_ADDRESS
multipass.intra:51005

@jibel
Collaborator

jibel commented Jan 16, 2025

Obsolete

@jibel jibel closed this Jan 16, 2025


Development

Successfully merging this pull request may close these issues.

[lxd] race-y deadlock on purge

5 participants