Wait for Crucible agent to do requested work (#3754)
Several endpoints of the Crucible agent do not perform work
synchronously with that endpoint's handler: changes are instead
requested and the actual work proceeds asynchronously. Nexus, however,
assumes that the work is synchronous.
This commit creates three functions in common_storage that properly deal
with an agent that does asynchronous work (`delete_crucible_region`,
`delete_crucible_snapshot`, and `delete_crucible_running_snapshot`), and
calls those.
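
As a rough illustration of the request-then-poll pattern described above,
here is a minimal, self-contained sketch. It does not use the real Crucible
agent client: `FakeAgent`, `RegionState`, `tick`, and
`delete_region_and_wait` are hypothetical names invented for this example,
and the poll limit is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical region states for an agent that tears regions down
/// asynchronously (an assumption for this sketch, not the real agent's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RegionState {
    Created,
    Tombstoned,
    Destroyed,
}

/// Stand-in for the agent: a delete request only marks the region, and the
/// actual work completes later.
struct FakeAgent {
    regions: HashMap<u32, RegionState>,
}

impl FakeAgent {
    /// Request deletion. Returns immediately; repeating the request is harmless.
    fn request_delete(&mut self, id: u32) {
        if let Some(state) = self.regions.get_mut(&id) {
            if *state == RegionState::Created {
                *state = RegionState::Tombstoned;
            }
        }
    }

    /// Simulates the agent's background worker making progress.
    fn tick(&mut self) {
        for state in self.regions.values_mut() {
            if *state == RegionState::Tombstoned {
                *state = RegionState::Destroyed;
            }
        }
    }

    fn get(&self, id: u32) -> Option<RegionState> {
        self.regions.get(&id).copied()
    }
}

/// Request deletion, then poll until the region is gone or destroyed.
/// Returns the number of polls that observed unfinished work.
fn delete_region_and_wait(
    agent: &mut FakeAgent,
    id: u32,
    max_polls: usize,
) -> Result<usize, String> {
    agent.request_delete(id);
    for poll in 0..max_polls {
        match agent.get(id) {
            // A missing or destroyed region both count as "done".
            None | Some(RegionState::Destroyed) => return Ok(poll),
            // Real code would sleep or back off here instead of ticking.
            Some(_) => agent.tick(),
        }
    }
    Err(format!("region {id} not destroyed after {max_polls} polls"))
}
```

Treating an already-destroyed or missing region as success is what makes
the caller safe to retry.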
Part of testing this commit was creating "disk" antagonists in the
omicron-stress tool. This uncovered cases where the transactions related
to disk creation and deletion failed to complete. One of these cases is
in `regions_hard_delete` - this transaction is now retried until it
succeeds. The `TransactionError::retry_transaction` function can be used
to see if CRDB is telling Nexus to retry the transaction. Another is
`decrease_crucible_resource_count_and_soft_delete_volume` - this was
broken into two steps, the first of which enumerates the read-only
resources to delete (because those don't currently change!).
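
The retry loop for `regions_hard_delete` might look roughly like the sketch
below. `TxnError` is a hypothetical stand-in for `TransactionError`; only the
name `retry_transaction` comes from this commit, and its signature here is an
assumption. The `max_attempts` bound and `run_until_committed` are also
assumptions added to keep the example finite.

```rust
/// Hypothetical stand-in for a transaction error type.
#[derive(Debug, PartialEq)]
enum TxnError {
    /// Models the database asking the client to restart the transaction.
    Retryable,
    /// Any other failure; retrying would not help.
    Fatal(String),
}

impl TxnError {
    /// True if the database is asking for the transaction to be retried.
    fn retry_transaction(&self) -> bool {
        matches!(self, TxnError::Retryable)
    }
}

/// Run `txn` until it commits, retrying only on retryable errors.
/// Returns the value together with the number of attempts made.
fn run_until_committed<T>(
    mut txn: impl FnMut() -> Result<T, TxnError>,
    max_attempts: u32,
) -> Result<(T, u32), TxnError> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match txn() {
            Ok(value) => return Ok((value, attempt)),
            // Retry only when asked to, and only while attempts remain.
            Err(e) if e.retry_transaction() && attempt < max_attempts => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Non-retryable errors are returned immediately, so a genuine failure is not
masked by the loop.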
Another fix that's part of this commit, exposed by the disk antagonist:
it should only be ok to delete a disk in state `Creating` if you're in
the disk create saga. Without this restriction, deleting a disk that is
still in the disk create saga can cause that saga to fail while
unwinding and remain stuck.
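
The rule above can be sketched as a small guard. The state set, the `Caller`
enum, and `may_delete` are hypothetical and simplified for illustration; only
the `Creating` rule is taken from this commit, and the handling of the other
states is an assumption.

```rust
/// Simplified, hypothetical disk states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DiskState {
    Creating,
    Detached,
    Attached,
}

/// Who is asking for the delete (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Caller {
    DiskCreateSagaUnwind,
    DiskDeleteRequest,
}

/// A disk in `Creating` may only be deleted by the create saga's own unwind.
fn may_delete(state: DiskState, caller: Caller) -> bool {
    match (state, caller) {
        (DiskState::Creating, Caller::DiskCreateSagaUnwind) => true,
        (DiskState::Creating, _) => false,
        // Assumption: a detached disk can be deleted, an attached one cannot.
        (DiskState::Detached, _) => true,
        (DiskState::Attached, _) => false,
    }
}
```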
This commit also bundles idempotency related fixes for the simulated
Crucible agent, as these were exposed with the retries that are
performed by the new `delete_crucible_*` functions.
What shook out of this is that `Nexus::volume_delete` was removed, as
this was not correct to call during the unwind of the disk and snapshot
create sagas: it would conflict with what those sagas were doing as part
of their unwind. The volume delete saga is still required during disk
and snapshot deletion, but there is enough context during the disk and
snapshot create sagas to properly unwind a volume, even in the case of
read-only parents. `Nexus::volume_delete` ultimately isn't safe, so it
was removed: instead, any future sagas should either embed the volume
delete saga as a sub saga, or there should be enough context in the
outputs of said future saga's nodes to properly unwind what was created.
Depends on oxidecomputer/crucible#838, which
fixes a few non-idempotency bugs in the actual Crucible agent.
Fixes #3698